
Commit 251814e

mknyszek authored and gopherbot committed
runtime: document tracer invariants explicitly
This change is a documentation update for the execution tracer. Far too
much is left to small comments scattered around places. This change
accumulates the big important trace invariants, with rationale, into one
file: trace.go.

Change-Id: I5fd1402a3d8fdf14a0051e305b3a8fb5dfeafcb3
Reviewed-on: https://go-review.googlesource.com/c/go/+/708398
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
1 parent 7244e92 commit 251814e

4 files changed: +198, -56 lines changed

src/runtime/HACKING.md

Lines changed: 2 additions & 3 deletions
@@ -518,9 +518,8 @@ The parser for the execution trace format lives in the `internal/trace` package.
 If you plan on adding new trace events, consider starting with a [trace
 experiment](../internal/trace/tracev2/EXPERIMENTS.md).
 
-If you plan to add new trace instrumentation to the runtime, wrap whatever operation
-you're tracing in `traceAcquire` and `traceRelease` fully. These functions mark a
-critical section that appears atomic to the execution tracer (but nothing else).
+If you plan to add new trace instrumentation to the runtime, read the comment
+at the top of [trace.go](./trace.go), especially the invariants.
 
 debuglog
 ========
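
The old HACKING.md text summarized the instrumentation pattern; the new trace.go comment spells it out as a set of invariants. As a rough, self-contained model of that pattern (not runtime code; the types and names below are hypothetical, with a mutex standing in for the per-M seqlock), the key point is that the traced state change and the event describing it sit inside one acquire/release critical section, so they appear as a single step to the tracer:

```go
package main

import (
	"fmt"
	"sync"
)

// toyTracer stands in for the runtime tracer: a lock plus an event log.
// In the real runtime the "lock" is a per-M seqlock, not a mutex.
type toyTracer struct {
	mu     sync.Mutex
	events []string
}

func (t *toyTracer) acquire() { t.mu.Lock() }   // cf. traceAcquire
func (t *toyTracer) release() { t.mu.Unlock() } // cf. traceRelease

// block transitions a (hypothetical) goroutine state and emits the matching
// event inside the same critical section, so the transition and the event
// appear atomic to anyone else synchronizing on the tracer lock.
func block(t *toyTracer, state *string) {
	t.acquire()
	*state = "waiting"                     // the traced operation
	t.events = append(t.events, "GoBlock") // the event describing it
	t.release()
}

func main() {
	var tr toyTracer
	state := "running"
	block(&tr, &state)
	fmt.Println(state, tr.events)
}
```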

src/runtime/proc.go

Lines changed: 17 additions & 12 deletions
@@ -2170,8 +2170,6 @@ func forEachPInternal(fn func(*p)) {
 	// Force Ps currently in a system call into _Pidle and hand them
 	// off to induce safe point function execution.
 	for _, p2 := range allp {
-		// We need to be fine-grained about tracing here, since handoffp
-		// might call into the tracer, and the tracer is non-reentrant.
 		if atomic.Load(&p2.runSafePointFn) != 1 {
 			// Already ran it.
 			continue
@@ -4344,21 +4342,26 @@ func preemptPark(gp *g) {
 	casGToPreemptScan(gp, _Grunning, _Gscan|_Gpreempted)
 	dropg()
 
-	// Be careful about how we trace this next event. The ordering
-	// is subtle.
+	// Be careful about ownership as we trace this next event.
 	//
-	// The moment we CAS into _Gpreempted, suspendG could CAS to
-	// _Gwaiting, do its work, and ready the goroutine. All of
+	// According to the tracer invariants (trace.go) it's unsafe
+	// for us to emit an event for a goroutine we do not own.
+	// The moment we CAS into _Gpreempted, suspendG could CAS the
+	// goroutine to _Gwaiting, effectively taking ownership. All of
 	// this could happen before we even get the chance to emit
 	// an event. The end result is that the events could appear
 	// out of order, and the tracer generally assumes the scheduler
 	// takes care of the ordering between GoPark and GoUnpark.
 	//
 	// The answer here is simple: emit the event while we still hold
-	// the _Gscan bit on the goroutine. We still need to traceAcquire
-	// and traceRelease across the CAS because the tracer could be
-	// what's calling suspendG in the first place, and we want the
-	// CAS and event emission to appear atomic to the tracer.
+	// the _Gscan bit on the goroutine, since the _Gscan bit means
+	// ownership over transitions.
+	//
+	// We still need to traceAcquire and traceRelease across the CAS
+	// because the tracer could be what's calling suspendG in the first
+	// place. This also upholds the tracer invariant that we must hold
+	// traceAcquire/traceRelease across the transition. However, we
+	// specifically *only* emit the event while we still have ownership.
 	trace := traceAcquire()
 	if trace.ok() {
 		trace.GoPark(traceBlockPreempted, 0)
@@ -4598,7 +4601,8 @@ func reentersyscall(pc, sp, bp uintptr) {
 	if trace.ok() {
 		// Emit a trace event. Notably, actually emitting the event must happen before
 		// the casgstatus because it mutates the P, but the traceLocker must be held
-		// across the casgstatus so the transition is atomic with respect to the event.
+		// across the casgstatus since we're transitioning out of _Grunning
+		// (see trace.go invariants).
 		systemstack(func() {
 			trace.GoSysCall()
 		})
@@ -5249,7 +5253,8 @@ func newproc1(fn *funcval, callergp *g, callerpc uintptr, parked bool, waitreaso
 	}
 	gcController.addScannableStack(pp, int64(newg.stack.hi-newg.stack.lo))
 
-	// Get a goid and switch to runnable. Make all this atomic to the tracer.
+	// Get a goid and switch to runnable. This needs to happen under traceAcquire
+	// since it's a goroutine transition. See tracer invariants in trace.go.
	trace := traceAcquire()
 	var status uint32 = _Grunnable
 	if parked {
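
The preemptPark hunk above hinges on ownership: the GoPark event must be emitted while the M still owns the goroutine's transitions (via the _Gscan bit), so another thread that later takes ownership cannot emit its events first. A minimal stand-alone sketch of that rule, with a mutex standing in for ownership and all names hypothetical:

```go
package main

import (
	"fmt"
	"sync"
)

// g models a goroutine whose state transitions are protected by an
// ownership lock (standing in for the _Gscan bit / goroutine ownership).
type g struct {
	own    sync.Mutex // whoever holds this may transition the goroutine
	state  string
	events []string // the trace, in emission order
}

// parkThenHandOff does what preemptPark does, in miniature: it emits the
// "park" event *before* releasing ownership, so no one can observe or
// transition the goroutine between the state change and the event.
func parkThenHandOff(gp *g) {
	gp.own.Lock()
	gp.state = "preempted"
	gp.events = append(gp.events, "GoPark") // emitted while we still own gp
	gp.own.Unlock()                         // only now can others take over
}

// unpark is what another thread (e.g. suspendG's caller) might do once it
// gains ownership; its event is guaranteed to come after GoPark.
func unpark(gp *g) {
	gp.own.Lock()
	gp.state = "runnable"
	gp.events = append(gp.events, "GoUnpark")
	gp.own.Unlock()
}

func main() {
	gp := &g{state: "running"}
	parkThenHandOff(gp)
	unpark(gp)
	fmt.Println(gp.events) // always [GoPark GoUnpark]
}
```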

src/runtime/trace.go

Lines changed: 176 additions & 38 deletions
@@ -2,20 +2,182 @@
 // Use of this source code is governed by a BSD-style
 // license that can be found in the LICENSE file.
 
-// Go execution tracer.
+// # Go execution tracer
+//
 // The tracer captures a wide range of execution events like goroutine
 // creation/blocking/unblocking, syscall enter/exit/block, GC-related events,
 // changes of heap size, processor start/stop, etc and writes them to a buffer
 // in a compact form. A precise nanosecond-precision timestamp and a stack
 // trace is captured for most events.
 //
-// Tracer invariants (to keep the synchronization making sense):
-// - An m that has a trace buffer must be on either the allm or sched.freem lists.
-// - Any trace buffer mutation must either be happening in traceAdvance or between
-//   a traceAcquire and a subsequent traceRelease.
-// - traceAdvance cannot return until the previous generation's buffers are all flushed.
+// ## Design
+//
+// The basic idea behind the execution tracer is to have per-M buffers that
+// trace data may be written into. Each M maintains a seqlock indicating whether
+// its trace buffer is currently in use.
+//
+// Tracing is initiated by StartTrace, and proceeds in "generations," with each
+// generation being marked by a call to traceAdvance, to advance to the next
+// generation. Generations are a global synchronization point for trace data,
+// and we proceed to a new generation by moving forward trace.gen. Each M reads
+// trace.gen under its own seqlock to determine which generation it is writing
+// trace data for. To this end, each M has 2 slots for buffers: one slot for the
+// previous generation, one slot for the current one. It uses tl.gen to select
+// which buffer slot to write to. Simultaneously, traceAdvance uses the seqlock
+// to determine whether every thread is guaranteed to observe an updated
+// trace.gen. Once it is sure, it may then flush any buffers that are left over
+// from the previous generation safely, since it knows the Ms will not mutate
+// it.
+//
+// Flushed buffers are processed by the ReadTrace function, which is called by
+// the trace reader goroutine. The first goroutine to call ReadTrace is designated
+// as the trace reader goroutine until tracing completes. (There may only be one at
+// a time.)
+//
+// Once all buffers are flushed, any extra post-processing complete, and flushed
+// buffers are processed by the trace reader goroutine, the trace emits an
+// EndOfGeneration event to mark the global synchronization point in the trace.
+//
+// All other trace features, including CPU profile samples, stack information,
+// string tables, etc. all revolve around this generation system, and typically
+// appear in pairs: one for the previous generation, and one for the current one.
+// Like the per-M buffers, which of the two is written to is selected using trace.gen,
+// and anything managed this way must similarly be mutated only in traceAdvance or
+// under the M's seqlock.
+//
+// Trace events themselves are simple. They consist of a single byte for the event type,
+// followed by zero or more LEB128-encoded unsigned varints. They are decoded using
+// a pre-determined table for each trace version: internal/trace/tracev2.specs.
+//
+// To avoid relying on timestamps for correctness and validation, each G and P have
+// sequence counters that are written into trace events to encode a partial order.
+// The sequence counters reset on each generation. Ms do not need sequence counters
+// because they are the source of truth for execution: trace events, and even whole
+// buffers, are guaranteed to appear in order in the trace data stream, simply because
+// that's the order the thread emitted them in.
+//
+// See traceruntime.go for the API the tracer exposes to the runtime for emitting events.
+//
+// In each generation, we ensure that we enumerate all goroutines, such that each
+// generation's data is fully self-contained. This makes features like the flight
+// recorder easy to implement. To this end, we guarantee that every live goroutine is
+// listed at least once by emitting a status event for the goroutine, indicating its
+// starting state. These status events are emitted based on context, generally based
+// on the event that's about to be emitted.
+//
+// The traceEventWriter type encapsulates these details, and is the backbone of
+// the API exposed in traceruntime.go, though there are deviations where necessary.
+//
+// This is the overall design, but as always, there are many details. Beyond this,
+// look to the invariants and select corner cases below and the code itself for the
+// source of truth.
+//
+// See https://go.dev/issue/60773 for a link to a more complete design with rationale,
+// though parts of it are out-of-date.
+//
+// ## Invariants
+//
+// 1. An m that has a trace buffer MUST be on either the allm or sched.freem lists.
+//
+// Otherwise, traceAdvance might miss an M with a buffer that needs to be flushed.
+//
+// 2. Trace buffers MUST only be mutated in traceAdvance or under a traceAcquire/traceRelease.
+//
+// Otherwise, traceAdvance may race with Ms writing trace data when trying to flush buffers.
+//
+// 3. traceAdvance MUST NOT return until all of the current generation's buffers are flushed.
+//
+// Otherwise, callers cannot rely on all the data they need being available (for example, for
+// the flight recorder).
 //
-// See https://go.dev/issue/60773 for a link to the full design.
+// 4. P and goroutine state transition events MUST be emitted by an M that owns its ability
+// to transition.
+//
+// What this means is that the M must either be the owner of the P, the owner of the goroutine,
+// or owner of a non-running goroutine's _Gscan bit. There are a lot of bad things that can
+// happen if this invariant isn't maintained, mostly around generating inconsistencies in the
+// trace due to racy emission of events.
+//
+// 5. Acquisition of a P (pidleget or takeP/gcstopP) MUST NOT be performed under a traceAcquire/traceRelease pair.
+//
+// Notably, it's important that traceAcquire/traceRelease not cover a state in which the
+// goroutine or P is not yet owned. For example, if traceAcquire is held across both wirep and
+// pidleget, then we could end up emitting an event in the wrong generation. Suppose T1
+// traceAcquires in generation 1, a generation transition happens, T2 emits a ProcStop and
+// executes pidleput in generation 2, and finally T1 calls pidleget and emits ProcStart.
+// The ProcStart must follow the ProcStop in the trace to make any sense, but ProcStop was
+// emitted in a later generation.
+//
+// 6. Goroutine state transitions, with the exception of transitions into _Grunning, MUST be
+// performed under the traceAcquire/traceRelease pair where the event is emitted.
+//
+// Otherwise, traceAdvance may observe a goroutine state that is inconsistent with the
+// events being emitted. traceAdvance inspects all goroutines' states in order to emit
+// a status event for any goroutine that did not have an event emitted for it already.
+// If the generation then advances in between that observation and the event being emitted,
+// then the trace will contain a status that doesn't line up with the event. For example,
+// if the event is emitted after the state transition _Gwaiting -> _Grunnable, then
+// traceAdvance may observe the goroutine in _Grunnable, emit a status event, advance the
+// generation, and the following generation contains a GoUnblock event. The trace parser
+// will get confused because it sees that goroutine in _Grunnable in the previous generation
+// trying to be transitioned from _Gwaiting into _Grunnable in the following one. Something
+// similar happens if the trace event is emitted before the state transition, so that does
+// not help either.
+//
+// Transitions to _Grunning do not have the same problem because traceAdvance is unable to
+// observe running goroutines directly. It must stop them, or wait for them to emit an event.
+// Note that it cannot even stop them with asynchronous preemption in any "bad" window between
+// the state transition to _Grunning and the event emission because async preemption cannot
+// stop goroutines in the runtime.
+//
+// 7. Goroutine state transitions into _Grunning MUST emit an event for the transition after
+// the state transition.
+//
+// This follows from invariants (4), (5), and the explanation of (6).
+// The relevant part of the previous invariant is that in order for the tracer to be unable to
+// stop a goroutine, it must be in _Grunning and in the runtime. So to close any windows between
+// event emission and the state transition, the event emission must happen *after* the transition
+// to _Grunning.
+//
+// ## Select corner cases
+//
+// ### CGO calls / system calls
+//
+// CGO calls and system calls are mostly straightforward, except for P stealing. For historical
+// reasons, this introduces a new trace-level P state called ProcSyscall which used to model
+// _Psyscall (now _Psyscall_unused). This state is used to indicate in the trace that a P
+// is eligible for stealing as part of the parser's ordering logic.
+//
+// Another quirk of this corner case is the ProcSyscallAbandoned trace-level P state, which
+// is used only in status events to indicate a relaxation of verification requirements. It
+// means that if the execution trace parser can't find the corresponding thread that the P
+// was stolen from in the state it expects it to be, to accept the trace anyway. This is also
+// historical. When _Psyscall still existed, one would steal the P and then emit ProcSteal, and there
+// was no ordering between the ProcSteal and the subsequent GoSyscallEndBlocked. One clearly
+// happened before the other, but since P stealing was a single atomic, there was no way
+// to enforce the order. The GoSyscallEndBlocked thread could move on and end up in any
+// state, and the GoSyscallEndBlocked could be in a completely different generation to the
+// ProcSteal. Today this is no longer possible as the ProcSteal is always ordered before
+// the GoSyscallEndBlocked event in the runtime.
+//
+// Both ProcSyscall and ProcSyscallAbandoned are likely no longer necessary.
+//
+// ### CGO callbacks
+//
+// When a C thread calls into Go, the execution tracer models that as the creation of a new
+// goroutine. When the thread exits back into C, that is modeled as the destruction of that
+// goroutine. These are the GoCreateSyscall and GoDestroySyscall events, which represent the
+// creation and destruction of a goroutine with its starting and ending states being _Gsyscall.
+//
+// This model is simple to reason about but contradicts the runtime implementation, which
+// doesn't do this directly for performance reasons. The runtime implementation instead caches
+// a G on the M created for the C thread. On Linux this M is then cached in the thread's TLS,
+// and on other systems, the M is put on a global list on exit from Go. We need to do some
+// extra work to make sure that this is modeled correctly in the tracer. For example,
+// a C thread exiting Go may leave a P hanging off of its M (whether that M is kept in TLS
+// or placed back on a list). In order to correctly model goroutine creation and destruction,
+// we must behave as if the P was at some point stolen by the runtime, if the C thread
+// reenters Go with the same M (and thus, same P) once more.
 
 package runtime
 
@@ -192,38 +354,14 @@ func StartTrace() error {
 
 	// Stop the world.
 	//
-	// The purpose of stopping the world is to make sure that no goroutine is in a
-	// context where it could emit an event by bringing all goroutines to a safe point
-	// with no opportunity to transition.
-	//
-	// The exception to this rule are goroutines that are concurrently exiting a syscall.
-	// Those will all be forced into the syscalling slow path, and we'll just make sure
-	// that we don't observe any goroutines in that critical section before starting
-	// the world again.
-	//
-	// A good follow-up question to this is why stopping the world is necessary at all
-	// given that we have traceAcquire and traceRelease. Unfortunately, those only help
-	// us when tracing is already active (for performance, so when tracing is off the
-	// tracing seqlock is left untouched). The main issue here is subtle: we're going to
-	// want to obtain a correct starting status for each goroutine, but there are windows
-	// of time in which we could read and emit an incorrect status. Specifically:
-	//
-	//	trace := traceAcquire()
-	//	// <----> problem window
-	//	casgstatus(gp, _Gwaiting, _Grunnable)
-	//	if trace.ok() {
-	//		trace.GoUnpark(gp, 2)
-	//		traceRelease(trace)
-	//	}
-	//
-	// More precisely, if we readgstatus for a gp while another goroutine is in the problem
-	// window and that goroutine didn't observe that tracing had begun, then we might write
-	// a GoStatus(GoWaiting) event for that goroutine, but it won't trace an event marking
-	// the transition from GoWaiting to GoRunnable. The trace will then be broken, because
-	// future events will be emitted assuming the tracer sees GoRunnable.
+	// What we need to successfully begin tracing is to make sure that the next time
+	// *any goroutine* hits a traceAcquire, it sees that the trace is enabled.
 	//
-	// In short, what we really need here is to make sure that the next time *any goroutine*
-	// hits a traceAcquire, it sees that the trace is enabled.
+	// Stopping the world gets us most of the way there, since it makes sure that goroutines
+	// stop executing. There is however one exception: goroutines without Ps concurrently
+	// exiting a syscall. We handle this by making sure that, after we update trace.gen,
+	// there isn't a single goroutine calling traceAcquire on the syscall slow path by checking
+	// trace.exitingSyscall. See the comment on the check below for more details.
 	//
 	// Note also that stopping the world is necessary to make sure sweep-related events are
 	// coherent. Since the world is stopped and sweeps are non-preemptible, we can never start
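
To make the generation machinery described in the new trace.go comment concrete, here is a rough, self-contained model of the per-M double buffering (hypothetical names throughout; the real runtime uses a lock-free seqlock and binary buffers, not a mutex and strings):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// m models an M: a lock standing in for the per-M seqlock, and two buffer
// slots, one for the previous generation and one for the current one.
type m struct {
	seq  sync.Mutex // stand-in for the per-M trace seqlock
	bufs [2][]string
}

var gen atomic.Uint64 // stand-in for trace.gen, the current generation

// write appends an event to the buffer slot for the generation this M
// observes; the generation is re-read under the "seqlock" on every write.
func (mp *m) write(ev string) {
	mp.seq.Lock()
	g := gen.Load()
	mp.bufs[g%2] = append(mp.bufs[g%2], ev)
	mp.seq.Unlock()
}

// advance moves to the next generation and then flushes every M's buffer
// from the previous generation. Taking each M's lock guarantees the M is
// not mid-write and will observe the new generation on its next write,
// so the old slot can be flushed safely.
func advance(allm []*m) (flushed []string) {
	prev := gen.Add(1) - 1
	for _, mp := range allm {
		mp.seq.Lock()
		flushed = append(flushed, mp.bufs[prev%2]...)
		mp.bufs[prev%2] = nil
		mp.seq.Unlock()
	}
	return flushed
}

func main() {
	mp := &m{}
	mp.write("GoCreate")
	mp.write("GoStart")
	fmt.Println(advance([]*m{mp})) // [GoCreate GoStart]
}
```

The comment also notes that each event is a single type byte followed by LEB128-encoded unsigned varints. A small sketch of that encoding (not the runtime's encoder; the event type value and argument meanings below are made up):

```go
package main

import "fmt"

// appendUvarint appends x in LEB128 form (base-128, little-endian, high bit
// set on all but the last byte), the unsigned-varint scheme the trace format
// uses for event arguments.
func appendUvarint(buf []byte, x uint64) []byte {
	for x >= 0x80 {
		buf = append(buf, byte(x)|0x80)
		x >>= 7
	}
	return append(buf, byte(x))
}

func main() {
	// A hypothetical event: one type byte followed by two varint arguments.
	const evHypothetical = 0x01
	buf := []byte{evHypothetical}
	buf = appendUvarint(buf, 1_000_000) // e.g. a timestamp delta
	buf = appendUvarint(buf, 42)        // e.g. a goroutine ID
	fmt.Printf("% x\n", buf)
}
```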

src/runtime/traceruntime.go

Lines changed: 3 additions & 3 deletions
@@ -541,10 +541,10 @@ func (tl traceLocker) ProcSteal(pp *p) {
 	pp.trace.mSyscallID = -1
 
 	// Emit the status of the P we're stealing. We may be just about to do this when creating the event
-	// writer but it's not guaranteed, even if inSyscall is true. Although it might seem like from a
-	// syscall context we're always stealing a P for ourselves, we may have not wired it up yet (so
+	// writer but it's not guaranteed, even if we're stealing from a syscall. Although it might seem like
+	// from a syscall context we're always stealing a P for ourselves, we may have not wired it up yet (so
 	// it wouldn't be visible to eventWriter) or we may not even intend to wire it up to ourselves
-	// at all (e.g. entersyscall_gcwait).
+	// at all and plan to hand it back to the runtime.
 	if !pp.trace.statusWasTraced(tl.gen) && pp.trace.acquireStatus(tl.gen) {
 		// Careful: don't use the event writer. We never want status or in-progress events
 		// to trigger more in-progress events.
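
The ProcSteal hunk guards status emission with statusWasTraced/acquireStatus so a P's status is written at most once per generation. A rough stand-alone sketch of that "first caller wins, once per generation" idea (hypothetical type and field names; the runtime's actual bookkeeping differs):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// pTrace models per-P trace state: the last generation for which a status
// event was emitted for this P.
type pTrace struct {
	statusGen atomic.Uint64 // 0 means "never"; generations start at 1
}

// acquireStatus reports whether the caller should emit the status event for
// generation gen: it returns true for exactly one caller per generation,
// assuming generations only ever increase.
func (t *pTrace) acquireStatus(gen uint64) bool {
	old := t.statusGen.Load()
	if old == gen {
		return false // already traced in this generation
	}
	return t.statusGen.CompareAndSwap(old, gen)
}

func main() {
	var t pTrace
	fmt.Println(t.acquireStatus(1)) // true: emit the status
	fmt.Println(t.acquireStatus(1)) // false: someone already did
	fmt.Println(t.acquireStatus(2)) // true again in the next generation
}
```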
