
Commit 5058618

zhangfannie authored and mknyszek committed
runtime: use backoff and ISB instruction to reduce contention in (*lfstack).pop and (*spanSet).pop on arm64
When profiling CPU usage of LiveKit on AArch64/x86 (AWS), the graphs show CPU spikes that repeat in a semi-periodic manner, and the spikes occur when the GC (garbage collector) is active. Our analysis found that the getempty function accounted for 10.54% of the overhead, mainly caused by the work.empty.pop() function. Listing pop shows that the majority of that time, a 10.29% overhead, is spent on atomic.Cas64((*uint64)(head), old, next).

This patch adds a backoff approach to reduce the high overhead of the atomic operation, which primarily occurs when contention over a specific memory address increases, typically as the number of threads rises. Note that on platforms other than arm64, the initial value of backoff is zero.

This patch also rewrites the implementation of procyield() on arm64 as an Armv8.0-A compatible delay function using the counter-timer.

The garbage collector benchmark:

                             │   master    │                 opt                 │
                             │   sec/op    │   sec/op     vs base                │
Garbage/benchmem-MB=64-160     3.782m ± 4%   2.264m ± 2%  -40.12% (p=0.000 n=10)

                             │     master      │                    opt                  │
                             │ user+sys-sec/op │ user+sys-sec/op      vs base            │
Garbage/benchmem-MB=64-160       433.5m ± 4%       255.4m ± 2%  -41.08% (p=0.000 n=10)

Reference for backoff mechanism:
https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm

Change-Id: Ie8128a2243ceacbb82ab2a88941acbb8428bad94
Reviewed-on: https://go-review.googlesource.com/c/go/+/654895
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
1 parent 1ff59f3 commit 5058618

File tree

3 files changed: +80 -5 lines changed

src/runtime/asm_arm64.s

Lines changed: 52 additions & 5 deletions
@@ -1036,13 +1036,60 @@ aesloop:
 	VMOV	V0.D[0], R0
 	RET

+// The Arm architecture provides a user space accessible counter-timer which
+// is incremented at a fixed but machine-specific rate. Software can (spin)
+// wait until the counter-timer reaches some desired value.
+//
+// Armv8.7-A introduced the WFET (FEAT_WFxT) instruction, which allows the
+// processor to enter a low power state for a set time, or until an event is
+// received.
+//
+// However, WFET is not used here because it is only available on newer hardware,
+// and we aim to maintain compatibility with older Armv8-A platforms that do not
+// support this feature.
+//
+// As a fallback, we can instead use the ISB instruction to decrease processor
+// activity and thus power consumption between checks of the counter-timer.
+// Note that we do not depend on the latency of the ISB instruction, which is
+// implementation specific. The actual delay comes from comparing against a fresh
+// read of the counter-timer value.
+//
+// Read more in this Arm blog post:
+// https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm
+
 TEXT runtime·procyieldAsm(SB),NOSPLIT,$0-0
 	MOVWU	cycles+0(FP), R0
-	CBZ	R0, done
-again:
-	YIELD
-	SUBW	$1, R0
-	CBNZ	R0, again
+	CBZ	R0, done
+	// Prevent speculation of subsequent counter/timer reads and memory accesses.
+	ISB	$15
+	// If the delay is very short, just return.
+	// Hardcode 18ns as the first ISB delay.
+	CMP	$18, R0
+	BLS	done
+	// Adjust for overhead of initial ISB.
+	SUB	$18, R0, R0
+	// Convert the delay from nanoseconds to counter/timer ticks.
+	// Read the counter/timer frequency.
+	// delay_ticks = (delay * CNTFRQ_EL0) / 1e9
+	// With the below simplifications and adjustments,
+	// we are usually within 2% of the correct value:
+	// delay_ticks = (delay + delay / 16) * CNTFRQ_EL0 >> 30
+	MRS	CNTFRQ_EL0, R1
+	ADD	R0>>4, R0, R0
+	MUL	R1, R0, R0
+	LSR	$30, R0, R0
+	CBZ	R0, done
+	// start = current counter/timer value
+	MRS	CNTVCT_EL0, R2
delay:
+	// Delay using ISB for all ticks.
+	ISB	$15
+	// Subtract and compare to handle counter roll-over.
+	// counter_read() - start < delay_ticks
+	MRS	CNTVCT_EL0, R1
+	SUB	R2, R1, R1
+	CMP	R0, R1
+	BCC	delay
 done:
 	RET
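The nanoseconds-to-ticks conversion in the new assembly replaces a division by 1e9 with a shift by 30 plus a delay/16 correction. The sketch below checks that arithmetic in Go; the helper names are hypothetical, and a 1 GHz counter frequency is assumed for illustration (the real CNTFRQ_EL0 value is machine-specific):

```go
package main

import "fmt"

// approxTicks mirrors the assembly's integer approximation:
//	delay_ticks = (delay + delay/16) * CNTFRQ_EL0 >> 30
func approxTicks(delayNs, freqHz uint64) uint64 {
	return (delayNs + delayNs/16) * freqHz >> 30
}

// exactTicks computes the reference value: delay * CNTFRQ_EL0 / 1e9.
func exactTicks(delayNs, freqHz uint64) uint64 {
	return delayNs * freqHz / 1_000_000_000
}

func main() {
	// Assumed 1 GHz counter frequency for illustration only.
	const freq = 1_000_000_000
	// Since (17/16) / 2^30 * 1e9 ≈ 0.9895, the approximation runs
	// about 1-2% below the exact tick count, as the comment claims.
	for _, d := range []uint64{100, 1_000, 10_000} {
		fmt.Printf("delay=%dns exact=%d approx=%d\n",
			d, exactTicks(d, freq), approxTicks(d, freq))
	}
}
```

Undershooting slightly is a reasonable trade for avoiding a 64-bit division in a spin-wait path: the loop just re-checks the counter a touch early.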

src/runtime/lfstack.go

Lines changed: 15 additions & 0 deletions
@@ -34,6 +34,11 @@ func (head *lfstack) push(node *lfnode) {
 }

 func (head *lfstack) pop() unsafe.Pointer {
+	var backoff uint32
+	// TODO: tweak backoff parameters on other architectures.
+	if GOARCH == "arm64" {
+		backoff = 128
+	}
 	for {
 		old := atomic.Load64((*uint64)(head))
 		if old == 0 {
@@ -44,6 +49,16 @@ func (head *lfstack) pop() unsafe.Pointer {
 		if atomic.Cas64((*uint64)(head), old, next) {
 			return unsafe.Pointer(node)
 		}
+
+		// Use a backoff approach to reduce demand on the shared memory
+		// location; this decreases memory contention and allows other
+		// threads to make quicker progress.
+		// Read more in this Arm blog post:
+		// https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm
+		procyield(backoff)
+		// Increase backoff time.
+		backoff += backoff / 2
 	}
 }
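The backoff grows multiplicatively — backoff += backoff / 2 is a 1.5x increase per failed CAS, starting from the arm64 initial value of 128. A few iterations show the resulting delay sequence:

```go
package main

import "fmt"

func main() {
	// Successive backoff values passed to procyield after each failed
	// CAS, starting from the arm64 initial value of 128.
	backoff := uint32(128)
	var seq []uint32
	for i := 0; i < 6; i++ {
		seq = append(seq, backoff)
		backoff += backoff / 2 // 1.5x growth per retry
	}
	fmt.Println(seq) // prints [128 192 288 432 648 972]
}
```

So a thread that repeatedly loses the CAS race quickly spends most of its time delaying rather than hammering the contended cache line, which is what lets the winners make progress.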

src/runtime/mspanset.go

Lines changed: 13 additions & 0 deletions
@@ -149,6 +149,11 @@ retry:
 // pop is safe to call concurrently with other pop and push operations.
 func (b *spanSet) pop() *mspan {
 	var head, tail uint32
+	var backoff uint32
+	// TODO: tweak backoff parameters on other architectures.
+	if GOARCH == "arm64" {
+		backoff = 128
+	}
 claimLoop:
 	for {
 		headtail := b.index.load()
@@ -177,6 +182,14 @@ claimLoop:
 		if b.index.cas(headtail, makeHeadTailIndex(want+1, tail)) {
 			break claimLoop
 		}
+		// Use a backoff approach to reduce demand on the shared memory
+		// location; this decreases memory contention and allows other
+		// threads to make quicker progress.
+		// Read more in this Arm blog post:
+		// https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm
+		procyield(backoff)
+		// Increase backoff time.
+		backoff += backoff / 2
 		headtail = b.index.load()
 		head, tail = headtail.split()
 	}
