Commit 041f564

prattmic authored and gopherbot committed
internal/runtime/gc/scan: avoid memory destination on VPCOMPRESSQ

On AMD Genoa / Zen 4, VPCOMPRESSQ with a memory destination imposes a
severe performance penalty of around an order of magnitude compared to
a register destination.

We can trivially work around this penalty with a register destination
and an additional move to memory.

Benchmark results from:

    $ go test -bench=BenchmarkScanSpanPacked/.*/.*/.*/.*/impl=Platform internal/runtime/gc/scan

I've only included the summarized geomean here because there are ~2500
unique test cases.

AMD Genoa (Zen 4):

    cpu: AMD EPYC 9B14 96-Core Processor
            │    mem    │               reg               │
            │  sec/op   │   sec/op     vs base            │
    geomean   1.039µ      310.1n       -70.16%

            │    mem    │               reg               │
            │    B/s    │     B/s      vs base            │
    geomean   2.906Gi     10.99Gi      +278.27%

As expected, we see a massive performance improvement on Genoa.

AMD Turin (Zen 5):

    cpu: AMD EPYC 9B45 128-Core Processor
            │    mem    │               reg               │
            │  sec/op   │   sec/op     vs base            │
    geomean   231.9n      237.3n       +2.32%

            │    mem    │               reg               │
            │    B/s    │     B/s      vs base            │
    geomean   14.79Gi     14.43Gi      -2.50%

On Turin there is a minor regression. This is primarily due to a fairly
large regression (~15%) in very small microbenchmark cases where the
entire memory fits in L1 cache. This regression disappears as memory
access slows down with larger memories. The latter should be more
common in real workloads.

Intel Sapphire Rapids:

    cpu: Intel(R) Xeon(R) Platinum 8481C
            │    mem    │               reg               │
            │  sec/op   │   sec/op     vs base            │
    geomean   254.9n      246.8n       -3.18%

            │    mem    │               reg               │
            │    B/s    │     B/s      vs base            │
    geomean   13.65Gi     14.15Gi      +3.69%

On Sapphire Rapids there is a minor improvement. Here results are
fairly noisy. Most cases are a wash, but some are arbitrarily 20%
slower or 20% faster for unclear reasons.

For #73581.

Change-Id: I6a6a636cfd294a0dcdc4f34c9ece1bc9a6e5e4c7
Reviewed-on: https://go-review.googlesource.com/c/go/+/715362
Reviewed-by: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Junyang Shao <shaojunyang@google.com>
1 parent 81afd3a commit 041f564

1 file changed: +18 -1 lines changed

src/internal/runtime/gc/scan/scan_amd64.s

Lines changed: 18 additions & 1 deletion
@@ -86,7 +86,24 @@ loop:
 
 	// Collect just the pointers from the greyed objects into the scan buffer,
 	// i.e., copy the word indices in the mask from Z1 into contiguous memory.
-	VPCOMPRESSQ Z1, K1, (DI)(DX*8)
+	//
+	// N.B. VPCOMPRESSQ supports a memory destination. Unfortunately, on
+	// AMD Genoa / Zen 4, using VPCOMPRESSQ with a memory destination
+	// imposes a severe performance penalty of around an order of magnitude
+	// compared to a register destination.
+	//
+	// This workaround is unfortunate on other microarchitectures, where a
+	// memory destination is slightly faster than adding an additional move
+	// instruction, but no where near an order of magnitude. It would be
+	// nice to have a Genoa-only variant here.
+	//
+	// AMD Turin / Zen 5 fixes this issue.
+	//
+	// See
+	// https://lemire.me/blog/2025/02/14/avx-512-gotcha-avoid-compressing-words-to-memory-with-amd-zen-4-processors/.
+	VPCOMPRESSQ Z1, K1, Z2
+	VMOVDQU64 Z2, (DI)(DX*8)
+
 	// Advance the scan buffer position by the number of pointers.
 	MOVBQZX 128(AX), CX
 	ADDQ CX, DX
