|
2 | 2 | // Use of this source code is governed by a BSD-style |
3 | 3 | // license that can be found in the LICENSE file. |
4 | 4 |
|
5 | | -// Go execution tracer. |
| 5 | +// # Go execution tracer |
| 6 | +// |
6 | 7 | // The tracer captures a wide range of execution events like goroutine |
7 | 8 | // creation/blocking/unblocking, syscall enter/exit/block, GC-related events, |
8 | 9 | // changes of heap size, processor start/stop, etc and writes them to a buffer |
9 | 10 | // in a compact form. A precise nanosecond-precision timestamp and a stack |
10 | 11 | // trace is captured for most events. |
11 | 12 | // |
12 | | -// Tracer invariants (to keep the synchronization making sense): |
13 | | -// - An m that has a trace buffer must be on either the allm or sched.freem lists. |
14 | | -// - Any trace buffer mutation must either be happening in traceAdvance or between |
15 | | -// a traceAcquire and a subsequent traceRelease. |
16 | | -// - traceAdvance cannot return until the previous generation's buffers are all flushed. |
| 13 | +// ## Design |
| 14 | +// |
| 15 | +// The basic idea behind the execution tracer is to have per-M buffers that
| 16 | +// trace data may be written into. Each M maintains a seqlock indicating whether |
| 17 | +// its trace buffer is currently in use. |
| 18 | +// |
| 19 | +// Tracing is initiated by StartTrace, and proceeds in "generations," with each |
| 20 | +// generation being marked by a call to traceAdvance, to advance to the next |
| 21 | +// generation. Generations are a global synchronization point for trace data, |
| 22 | +// and we proceed to a new generation by advancing trace.gen. Each M reads
| 23 | +// trace.gen under its own seqlock to determine which generation it is writing |
| 24 | +// trace data for. To this end, each M has 2 slots for buffers: one slot for the |
| 25 | +// previous generation, one slot for the current one. It uses tl.gen to select |
| 26 | +// which buffer slot to write to. Simultaneously, traceAdvance uses the seqlock |
| 27 | +// to determine whether every thread is guaranteed to observe an updated |
| 28 | +// trace.gen. Once it is sure, it may then flush any buffers that are left over |
| 29 | +// from the previous generation safely, since it knows the Ms will not mutate |
| 30 | +// them.
| 31 | +// |
| 32 | +// Flushed buffers are processed by the ReadTrace function, which is called by |
| 33 | +// the trace reader goroutine. The first goroutine to call ReadTrace is designated |
| 34 | +// as the trace reader goroutine until tracing completes. (There may only be one at |
| 35 | +// a time.) |
| 36 | +// |
| 37 | +// Once all buffers are flushed, any extra post-processing is complete, and the flushed
| 38 | +// buffers have been processed by the trace reader goroutine, the tracer emits an
| 39 | +// EndOfGeneration event to mark the global synchronization point in the trace.
| 40 | +// |
| 41 | +// All other trace features, including CPU profile samples, stack information, |
| 42 | +// string tables, etc., revolve around this generation system, and typically
| 43 | +// appear in pairs: one for the previous generation, and one for the current one. |
| 44 | +// Like the per-M buffers, which of the two is written to is selected using trace.gen, |
| 45 | +// and anything managed this way must similarly be mutated only in traceAdvance or |
| 46 | +// under the M's seqlock. |
| 47 | +// |
| 48 | +// Trace events themselves are simple. They consist of a single byte for the event type, |
| 49 | +// followed by zero or more LEB128-encoded unsigned varints. They are decoded using |
| 50 | +// a pre-determined table for each trace version: internal/trace/tracev2.specs. |
| 51 | +// |
| 52 | +// To avoid relying on timestamps for correctness and validation, each G and P have |
| 53 | +// sequence counters that are written into trace events to encode a partial order. |
| 54 | +// The sequence counters reset on each generation. Ms do not need sequence counters |
| 55 | +// because they are the source of truth for execution: trace events, and even whole |
| 56 | +// buffers, are guaranteed to appear in order in the trace data stream, simply because |
| 57 | +// that's the order the thread emitted them in. |
| 58 | +// |
| 59 | +// See traceruntime.go for the API the tracer exposes to the runtime for emitting events. |
| 60 | +// |
| 61 | +// In each generation, we ensure that we enumerate all goroutines, such that each |
| 62 | +// generation's data is fully self-contained. This makes features like the flight |
| 63 | +// recorder easy to implement. To this end, we guarantee that every live goroutine is |
| 64 | +// listed at least once by emitting a status event for the goroutine, indicating its |
| 65 | +// starting state. These status events are emitted based on context, generally based |
| 66 | +// on the event that's about to be emitted. |
| 67 | +// |
| 68 | +// The traceEventWriter type encapsulates these details, and is the backbone of |
| 69 | +// the API exposed in traceruntime.go, though there are deviations where necessary. |
| 70 | +// |
| 71 | +// This is the overall design, but as always, there are many details. Beyond this, |
| 72 | +// look to the invariants and select corner cases below and the code itself for the |
| 73 | +// source of truth. |
| 74 | +// |
| 75 | +// See https://go.dev/issue/60773 for a link to a more complete design with rationale, |
| 76 | +// though parts of it are out-of-date. |
| 77 | +// |
| 78 | +// ## Invariants |
| 79 | +// |
| 80 | +// 1. An m that has a trace buffer MUST be on either the allm or sched.freem lists. |
| 81 | +// |
| 82 | +// Otherwise, traceAdvance might miss an M with a buffer that needs to be flushed. |
| 83 | +// |
| 84 | +// 2. Trace buffers MUST only be mutated in traceAdvance or under a traceAcquire/traceRelease. |
| 85 | +// |
| 86 | +// Otherwise, traceAdvance may race with Ms writing trace data when trying to flush buffers. |
| 87 | +// |
| 88 | +// 3. traceAdvance MUST NOT return until all of the current generation's buffers are flushed. |
| 89 | +// |
| 90 | +// Otherwise, callers cannot rely on all the data they need being available (for example, for |
| 91 | +// the flight recorder). |
17 | 92 | // |
18 | | -// See https://go.dev/issue/60773 for a link to the full design. |
| 93 | +// 4. P and goroutine state transition events MUST be emitted by an M that owns the ability
| 94 | +// to perform that transition.
| 95 | +// |
| 96 | +// What this means is that the M must either be the owner of the P, the owner of the goroutine, |
| 97 | +// or owner of a non-running goroutine's _Gscan bit. There are a lot of bad things that can |
| 98 | +// happen if this invariant isn't maintained, mostly around generating inconsistencies in the |
| 99 | +// trace due to racy emission of events. |
| 100 | +// |
| 101 | +// 5. Acquisition of a P (pidleget or takeP/gcstopP) MUST NOT be performed under a traceAcquire/traceRelease pair. |
| 102 | +// |
| 103 | +// Notably, it's important that traceAcquire/traceRelease not cover a state in which the |
| 104 | +// goroutine or P is not yet owned. For example, if traceAcquire is held across both wirep and |
| 105 | +// pidleget, then we could end up emitting an event in the wrong generation. Suppose T1 |
| 106 | +// traceAcquires in generation 1, a generation transition happens, T2 emits a ProcStop and |
| 107 | +// executes pidleput in generation 2, and finally T1 calls pidleget and emits ProcStart. |
| 108 | +// The ProcStart must follow the ProcStop in the trace to make any sense, but ProcStop was |
| 109 | +// emitted in a later generation.
| 110 | +// |
| 111 | +// 6. Goroutine state transitions, with the exception of transitions into _Grunning, MUST be |
| 112 | +// performed under the traceAcquire/traceRelease pair where the event is emitted. |
| 113 | +// |
| 114 | +// Otherwise, traceAdvance may observe a goroutine state that is inconsistent with the |
| 115 | +// events being emitted. traceAdvance inspects all goroutines' states in order to emit |
| 116 | +// a status event for any goroutine that did not have an event emitted for it already. |
| 117 | +// If the generation then advances in between that observation and the event being emitted, |
| 118 | +// then the trace will contain a status that doesn't line up with the event. For example, |
| 119 | +// if the event is emitted after the state transition _Gwaiting -> _Grunnable, then |
| 120 | +// traceAdvance may observe the goroutine in _Grunnable, emit a status event, advance the |
| 121 | +// generation, and the following generation contains a GoUnblock event. The trace parser |
| 122 | +// will get confused because it sees that goroutine in _Grunnable in the previous generation |
| 123 | +// trying to be transitioned from _Gwaiting into _Grunnable in the following one. Something |
| 124 | +// similar happens if the trace event is emitted before the state transition, so that does |
| 125 | +// not help either. |
| 126 | +// |
| 127 | +// Transitions to _Grunning do not have the same problem because traceAdvance is unable to |
| 128 | +// observe running goroutines directly. It must stop them, or wait for them to emit an event. |
| 129 | +// Note that it cannot even stop them with asynchronous preemption in any "bad" window between |
| 130 | +// the state transition to _Grunning and the event emission because async preemption cannot |
| 131 | +// stop goroutines in the runtime. |
| 132 | +// |
| 133 | +// 7. Goroutine state transitions into _Grunning MUST emit an event for the transition after |
| 134 | +// the state transition. |
| 135 | +// |
| 136 | +// This follows from invariants (4), (5), and the explanation of (6). |
| 137 | +// The relevant part of the previous invariant is that in order for the tracer to be unable to |
| 138 | +// stop a goroutine, it must be in _Grunning and in the runtime. So to close any windows between |
| 139 | +// event emission and the state transition, the event emission must happen *after* the transition |
| 140 | +// to _Grunning. |
| 141 | +// |
| 142 | +// ## Select corner cases |
| 143 | +// |
| 144 | +// ### CGO calls / system calls |
| 145 | +// |
| 146 | +// CGO calls and system calls are mostly straightforward, except for P stealing. For historical |
| 147 | +// reasons, this introduces a trace-level P state called ProcSyscall, which originally
| 148 | +// modeled _Psyscall (now _Psyscall_unused). This state indicates in the trace that a P
| 149 | +// is eligible for stealing as part of the parser's ordering logic. |
| 150 | +// |
| 151 | +// Another quirk of this corner case is the ProcSyscallAbandoned trace-level P state, which |
| 152 | +// is used only in status events to indicate a relaxation of verification requirements. It |
| 153 | +// means that if the execution trace parser can't find the corresponding thread that the P |
| 154 | +// was stolen from in the state it expects, it should accept the trace anyway. This is also
| 155 | +// historical. When _Psyscall still existed, a thread would steal the P and then emit a
| 156 | +// ProcSteal event, and there was no ordering between the ProcSteal and the subsequent
| 157 | +// GoSyscallEndBlocked. One clearly happened before the other, but since P stealing was a
| 158 | +// single atomic, there was no way to enforce the order. The GoSyscallEndBlocked thread
| 159 | +// could move on and end up in any state, and its GoSyscallEndBlocked event could land in
| 160 | +// a completely different generation than the ProcSteal. Today this is no longer possible,
| 161 | +// as the ProcSteal is always ordered before the GoSyscallEndBlocked event in the runtime.
| 162 | +// |
| 163 | +// Both ProcSyscall and ProcSyscallAbandoned are likely no longer necessary.
| 164 | +// |
| 165 | +// ### CGO callbacks |
| 166 | +// |
| 167 | +// When a C thread calls into Go, the execution tracer models that as the creation of a new |
| 168 | +// goroutine. When the thread exits back into C, that is modeled as the destruction of that |
| 169 | +// goroutine. These are the GoCreateSyscall and GoDestroySyscall events, which represent the |
| 170 | +// creation and destruction of a goroutine with its starting and ending states being _Gsyscall. |
| 171 | +// |
| 172 | +// This model is simple to reason about but contradicts the runtime implementation, which |
| 173 | +// doesn't do this directly for performance reasons. The runtime implementation instead caches |
| 174 | +// a G on the M created for the C thread. On Linux this M is then cached in the thread's TLS, |
| 175 | +// and on other systems, the M is put on a global list on exit from Go. We need to do some |
| 176 | +// extra work to make sure that this is modeled correctly in the tracer. For example,
| 177 | +// a C thread exiting Go may leave a P hanging off of its M (whether that M is kept in TLS |
| 178 | +// or placed back on a list). In order to correctly model goroutine creation and destruction, |
| 179 | +// we must behave as if the P was at some point stolen by the runtime, if the C thread |
| 180 | +// reenters Go with the same M (and thus, same P) once more. |
19 | 181 |
|
20 | 182 | package runtime |
21 | 183 |
|
@@ -192,38 +354,14 @@ func StartTrace() error { |
192 | 354 |
|
193 | 355 | // Stop the world. |
194 | 356 | // |
195 | | - // The purpose of stopping the world is to make sure that no goroutine is in a |
196 | | - // context where it could emit an event by bringing all goroutines to a safe point |
197 | | - // with no opportunity to transition. |
198 | | - // |
199 | | - // The exception to this rule are goroutines that are concurrently exiting a syscall. |
200 | | - // Those will all be forced into the syscalling slow path, and we'll just make sure |
201 | | - // that we don't observe any goroutines in that critical section before starting |
202 | | - // the world again. |
203 | | - // |
204 | | - // A good follow-up question to this is why stopping the world is necessary at all |
205 | | - // given that we have traceAcquire and traceRelease. Unfortunately, those only help |
206 | | - // us when tracing is already active (for performance, so when tracing is off the |
207 | | - // tracing seqlock is left untouched). The main issue here is subtle: we're going to |
208 | | - // want to obtain a correct starting status for each goroutine, but there are windows |
209 | | - // of time in which we could read and emit an incorrect status. Specifically: |
210 | | - // |
211 | | - // trace := traceAcquire() |
212 | | - // // <----> problem window |
213 | | - // casgstatus(gp, _Gwaiting, _Grunnable) |
214 | | - // if trace.ok() { |
215 | | - // trace.GoUnpark(gp, 2) |
216 | | - // traceRelease(trace) |
217 | | - // } |
218 | | - // |
219 | | - // More precisely, if we readgstatus for a gp while another goroutine is in the problem |
220 | | - // window and that goroutine didn't observe that tracing had begun, then we might write |
221 | | - // a GoStatus(GoWaiting) event for that goroutine, but it won't trace an event marking |
222 | | - // the transition from GoWaiting to GoRunnable. The trace will then be broken, because |
223 | | - // future events will be emitted assuming the tracer sees GoRunnable. |
| 357 | + // What we need to successfully begin tracing is to make sure that the next time |
| 358 | + // *any goroutine* hits a traceAcquire, it sees that the trace is enabled. |
224 | 359 | // |
225 | | - // In short, what we really need here is to make sure that the next time *any goroutine* |
226 | | - // hits a traceAcquire, it sees that the trace is enabled. |
| 360 | + // Stopping the world gets us most of the way there, since it makes sure that goroutines |
| 361 | + // stop executing. There is however one exception: goroutines without Ps concurrently |
| 362 | + // exiting a syscall. We handle this by checking trace.exitingSyscall to make sure
| 363 | + // that, after we update trace.gen, no goroutine is still calling traceAcquire on
| 364 | + // the syscall slow path. See the comment on the check below for more details.
227 | 365 | // |
228 | 366 | // Note also that stopping the world is necessary to make sure sweep-related events are |
229 | 367 | // coherent. Since the world is stopped and sweeps are non-preemptible, we can never start |
|