internal/runtime/gc/scan: avoid memory destination on VPCOMPRESSQ

prattmic · gopherbot · commit 041f564b3e6f · 2025-10-28T13:58:12.000-07:00
On AMD Genoa / Zen 4, VPCOMPRESSQ with a memory destination imposes a
severe performance penalty of around an order of magnitude compared to a
register destination.

We can trivially work around this penalty with a register destination and
an additional move to memory.

Benchmark results from:

    $ go test -bench=BenchmarkScanSpanPacked/.*/.*/.*/.*/impl=Platform internal/runtime/gc/scan

I've only included the summarized geomean here because there are ~2500
unique test cases.

AMD Genoa (Zen 4):

    cpu: AMD EPYC 9B14 96-Core Processor
            │   mem    │             reg              │
            │  sec/op  │  sec/op    vs base           │
    geomean   1.039µ     310.1n     -70.16%

            │   mem    │             reg              │
            │   B/s    │   B/s      vs base           │
    geomean   2.906Gi    10.99Gi    +278.27%

As expected, we see a massive performance improvement on Genoa.

AMD Turin (Zen 5):

    cpu: AMD EPYC 9B45 128-Core Processor
            │   mem    │             reg              │
            │  sec/op  │  sec/op    vs base           │
    geomean   231.9n     237.3n     +2.32%

            │   mem    │             reg              │
            │   B/s    │   B/s      vs base           │
    geomean   14.79Gi    14.43Gi    -2.50%

On Turin there is a minor regression. This is primarily due to a fairly
large regression (~15%) in very small microbenchmark cases where the
entire memory fits in L1 cache. This regression disappears as memory
access slows down with larger memories. The latter should be more common
in real workloads.

Intel Sapphire Rapids:

    cpu: Intel(R) Xeon(R) Platinum 8481C
            │   mem    │             reg              │
            │  sec/op  │  sec/op    vs base           │
    geomean   254.9n     246.8n     -3.18%

            │   mem    │             reg              │
            │   B/s    │   B/s      vs base           │
    geomean   13.65Gi    14.15Gi    +3.69%

On Sapphire Rapids there is a minor improvement. Here results are fairly
noisy. Most cases are a wash, but some are arbitrarily 20% slower or 20%
faster for unclear reasons.

For #73581.

Change-Id: I6a6a636cfd294a0dcdc4f34c9ece1bc9a6e5e4c7
Reviewed-on: https://go-review.googlesource.com/c/go/+/715362
Reviewed-by: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Junyang Shao <shaojunyang@google.com>
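
For readers less familiar with the instruction, the following is a rough pure-Go model of the replacement sequence, not the runtime's code. The names `compressQ`, `storeCompressed`, and `lanes` are hypothetical and exist only for this sketch. It assumes the scan buffer always has at least eight words of room past the current position, since a full-width store writes all eight lanes, whereas a compress to memory writes only the selected ones. In this model the unused tail lanes are zero; in the assembly they would hold whatever the destination register contained before.

```go
package main

import "fmt"

// lanes is the number of 64-bit lanes in one 512-bit Z register.
const lanes = 8

// compressQ models the lane selection done by VPCOMPRESSQ with a register
// destination: lanes of src whose bit is set in mask are packed
// contiguously, in order, at the start of the result. n is the number of
// selected lanes.
func compressQ(src [lanes]uint64, mask uint8) (packed [lanes]uint64, n int) {
	for i, v := range src {
		if (mask>>i)&1 == 1 {
			packed[n] = v
			n++
		}
	}
	return packed, n
}

// storeCompressed models the two-instruction sequence: compress into a
// register, then store the whole register at buf[pos:]. Only the first n
// words are meaningful; the position advances by n, so the next store
// overwrites the unused tail.
//
// Assumption of this sketch: buf has at least lanes words available at pos.
func storeCompressed(buf []uint64, pos int, src [lanes]uint64, mask uint8) int {
	packed, n := compressQ(src, mask)
	copy(buf[pos:pos+lanes], packed[:]) // full-width store, like VMOVDQU64
	return pos + n
}

func main() {
	buf := make([]uint64, 2*lanes)
	pos := 0
	pos = storeCompressed(buf, pos, [lanes]uint64{10, 11, 12, 13, 14, 15, 16, 17}, 0b10100101)
	pos = storeCompressed(buf, pos, [lanes]uint64{20, 21, 22, 23, 24, 25, 26, 27}, 0b00000011)
	fmt.Println(buf[:pos])
}
```

The point of the sketch is that both forms leave the same meaningful prefix in the buffer; they differ only in how many bytes the store touches, which is why the swap is safe as long as the buffer has slack.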
diff --git a/src/internal/runtime/gc/scan/scan_amd64.s b/src/internal/runtime/gc/scan/scan_amd64.s
@@ -86,7 +86,24 @@ loop:
 
 	// Collect just the pointers from the greyed objects into the scan buffer,
 	// i.e., copy the word indices in the mask from Z1 into contiguous memory.
-	VPCOMPRESSQ Z1, K1, (DI)(DX*8)
+	//
+	// N.B. VPCOMPRESSQ supports a memory destination. Unfortunately, on
+	// AMD Genoa / Zen 4, using VPCOMPRESSQ with a memory destination
+	// imposes a severe performance penalty of around an order of magnitude
+	// compared to a register destination.
+	//
+	// This workaround is unfortunate on other microarchitectures, where a
+	// memory destination is slightly faster than adding an additional move
+	// instruction, but nowhere near an order of magnitude. It would be
+	// nice to have a Genoa-only variant here.
+	//
+	// AMD Turin / Zen 5 fixes this issue.
+	//
+	// See
+	// https://lemire.me/blog/2025/02/14/avx-512-gotcha-avoid-compressing-words-to-memory-with-amd-zen-4-processors/.
+	VPCOMPRESSQ Z1, K1, Z2
+	VMOVDQU64 Z2, (DI)(DX*8)
+
 	// Advance the scan buffer position by the number of pointers.
 	MOVBQZX 128(AX), CX
 	ADDQ CX, DX