ch perf: Implement comprehensive performance stabilization framework #4085
Conversation
@LiliDeng LGTM
anirudhrb left a comment
Based on my understanding and the information provided in the comments and PR description, I'm not convinced that the "anchor gate" thing actually achieves its stated goal. In fact, I'm not even sure if it adds any value to our perf tests. I'm open to changing my mind though if you have more data to show.
PS: I haven't yet fully reviewed this PR. Please wait for me before merging.
```python
# Retry once
self._log.debug("Retrying anchor gate...")
time.sleep(5)
```
Why do we need to sleep here?
```python
elapsed = time.time() - start

# Calculate throughput (GB/s) - 2048 MB = 2.048 GB
throughput = 2.048 / elapsed
```
elapsed is not really that accurate here. It also includes the time spent establishing the SSH connection and the round-trip time taken to send/receive data over the network.
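One way to avoid counting SSH and network time would be to take the elapsed time from dd's own summary on the guest. A minimal sketch, assuming a LISA-style node.execute() result that exposes stderr:

```python
import re

# Sketch only: drop status=none so GNU dd prints its summary line, e.g.
# "... copied, 1.2345 s, 1.7 GB/s", then parse the guest-side elapsed time
# from stderr instead of timing the call from the controller.
result = self.node.execute(
    "dd if=/dev/zero of=/dev/null bs=1M count=2048",
    shell=True,
)
match = re.search(r"copied,\s*([\d.]+)\s*s", result.stderr)
if match:
    elapsed = float(match.group(1))   # wall time on the guest only
    throughput = 2.048 / elapsed      # same 2.048 GB constant as above
```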
```python
# 5-8s CPU/mem anchor using dd
start = time.time()
self.node.execute(
    "dd if=/dev/zero of=/dev/null bs=1M count=2048 status=none",
```
There is no guarantee that this runs in 5-8s like the comment on top says. It might be better to use something else where we can explicitly specify the time.
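If an explicit duration is the goal, something like stress-ng's --timeout flag would bound the anchor's runtime directly. A sketch, assuming stress-ng is (or can be made) available on the images:

```python
# Sketch only: bound the anchor to a fixed wall-clock duration and use the
# reported bogo-ops as the stability metric instead of elapsed seconds.
# Assumes stress-ng is installed, which may not hold on minimal images.
ANCHOR_SECONDS = 6
result = self.node.execute(
    f"stress-ng --cpu 1 --timeout {ANCHOR_SECONDS}s --metrics-brief",
    shell=True,
)
# stress-ng prints a metrics summary (bogo ops, ops/s) on completion;
# that count would replace `elapsed` in the deviation check.
```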
| """ | ||
| Anchor gate: 5-8s CPU/mem warmup to validate system stability. | ||
| Validates against EWMA baseline (±5%). Retries once on failure. |
Why 5%? How was this chosen?
```python
Anchor gate: 5-8s CPU/mem warmup to validate system stability.
Validates against EWMA baseline (±5%). Retries once on failure.
Uses exponential weighted moving average (alpha=0.3) for stability.
```
Why 0.3? How did you arrive at this number?
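For context on how the two constants interact, an EWMA baseline of the shape the docstring describes typically updates as below; names and values are illustrative, not taken from this PR:

```python
# Illustrative sketch, not the PR's implementation.
ANCHOR_EWMA_ALPHA = 0.3            # weight of the newest sample
ANCHOR_DEVIATION_THRESHOLD = 0.05  # allow ±5% drift from the baseline

def update_baseline(baseline: float, sample: float) -> float:
    # Larger alpha tracks recent runs faster; smaller alpha smooths more.
    return ANCHOR_EWMA_ALPHA * sample + (1 - ANCHOR_EWMA_ALPHA) * baseline

def passes_gate(baseline: float, sample: float) -> bool:
    deviation = abs(sample - baseline) / baseline
    return deviation <= ANCHOR_DEVIATION_THRESHOLD
```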
```python
if deviation > self.ANCHOR_DEVIATION_THRESHOLD:
    self._log.debug(
        f"Anchor gate FAILED on retry: "
```
What would we do with this information? It doesn't give us any actionable insight. Again, it was calculated with only two data points.
Today: it annotates the run. If we later see elevated CV or an odd first iteration, we can correlate that to an anchor warning during triage.
```python
def _run_anchor_gate(self) -> None:
    """
    Anchor gate: 5-8s CPU/mem warmup to validate system stability.
```
Running dd twice and finding some deviation in the elapsed time doesn't sound like an effective benchmark to determine whether a system is stable.
Totally agree; it's a minimal signal, chosen because it has zero dependencies on our minimal images. dd by itself doesn't prove stability; it just provides a fast "don't-start-yet" signal.
We can replace it with something else in a follow-up, if we choose to.
| if "PERF_DISKS" not in self.env_vars: | ||
| self.env_vars["PERF_DISKS"] = "/dev/nvme0n1" | ||
|
|
||
| # Safe preconditioning via bounded file |
What is preconditioning? Is it like warmup?
It's different: preconditioning puts the SSD/storage into a known state before testing.
It runs one time during storage hygiene setup (optional, gated by CH_ALLOW_DESTRUCTIVE=1). However, it can be removed, since most CH metrics tests use their own test devices via DATADISK_NAME.
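For readers unfamiliar with the term, a bounded-file preconditioning pass usually looks something like the sketch below; the path, size, and fio options are illustrative, not the PR's actual values:

```python
# Sketch only: one sequential write pass over a bounded file so the SSD's
# FTL and caches reach a known steady state before measurements begin.
self.node.execute(
    "fio --name=precondition --filename=/mnt/perf/precond.bin "
    "--size=4G --rw=write --bs=1M --direct=1 --ioengine=libaio",
    shell=True,
    sudo=True,
)
```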
```python
# Ensure O_DIRECT is used (check env vars)
if "PERF_DISKS" not in self.env_vars:
    self.env_vars["PERF_DISKS"] = "/dev/nvme0n1"
```
Is this environment variable used inside the CH test suite somewhere to pick up the correct disk?
```python
if self._is_mq_random_write_test(testcase):
    warmup_seconds = max(warmup_seconds, self.MQ_WARMUP_MIN_SECONDS)

if warmup_seconds <= 0:
```
Is there any case where this would be true?
The `if warmup_seconds <= 0` check is purely defensive, to handle:
any future test harness override where warm-up is disabled globally (tool.perf_warmup_seconds = 0),
or developer/debug runs that skip perf stabilization for faster iteration; see the sketch below.
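A hypothetical shape for that flow, with the config key name invented purely for illustration:

```python
# Sketch only: a global override (or debug run) can push warmup to 0,
# which is the case the defensive check above guards against.
warmup_seconds = int(self.env_vars.get("PERF_WARMUP_SECONDS", "30"))

if self._is_mq_random_write_test(testcase):
    warmup_seconds = max(warmup_seconds, self.MQ_WARMUP_MIN_SECONDS)

if warmup_seconds <= 0:
    self._log.debug("Perf warmup disabled; skipping stabilization.")
    return
```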
```python
if result.exit_code == 0 and result.stdout.strip():
    self._log.info(f" Frequencies: {result.stdout.strip()}")

def _run_warmup_for_test(self, testcase: str) -> None:
```
The CH perf tests framework already runs each test for multiple iterations and calculates the mean, standard deviation, etc. This should already cancel out any effects of the system not being warmed up; in the worst-case scenario, it should be sufficiently warmed up after the first iteration.
Do you have comprehensive data showing that this warmup helps?
The current CH perf harness repeats tests, but the first measured iteration often runs in a cold state: CPU P-state/C-state residency, disk queue/cache priming, page cache policy, TCP slow-start/conn-tracking, and IRQ distribution all settle during that first iteration. That first-run skew either drags the mean or inflates the variance.
The warmup makes the first measured iteration comparable to subsequent ones, reducing autocorrelation between early and late samples so the mean/stdev reflect steady state rather than "time-to-steady-state".
Yes, I have data from these various experiments; experiment 12 is the latest, run with the current code:
In summary:
The results of the subsystem representative tests show significant improvements. Key findings:

| Subsystem test | Baseline CV% | Post-warmup CV% (Experiment 12) | Variance reduction |
| --- | --- | --- | --- |
| Block I/O – Multi-Queue (throughput) | 12% to 22% | 1% to 2% | approx. -85% to -95% |
| Block I/O – Random (4K IOPS) | 6% to 24% | 2% to 3% | approx. -70% to -90% |
| Sequential Block (Write MiB/s) | 22.1% | 1.3% | -94% |
| Network – VirtIO Multi-Queue RX/TX pps | 30.6% / 23.7% | 0.9% / 1.5% | approx. -95% (avg) |
| Boot/Init Latency (16 vCPU) | 56.2% | 1.2% | -98% |
| Median across 31 tests | 15% | 1.8% | -88% |
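For readers of the table above: CV% in results like these normally denotes the coefficient of variation, i.e. stdev/mean expressed as a percentage. A minimal sketch of computing it from raw iteration samples:

```python
import statistics

def cv_percent(samples: list[float]) -> float:
    """Coefficient of variation as a percentage: stdev / mean * 100."""
    return statistics.stdev(samples) / statistics.mean(samples) * 100

# e.g. five throughput samples from repeated iterations of one test
print(round(cv_percent([930.0, 1010.0, 955.0, 990.0, 970.0]), 1))  # ~3.2
```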
What does CV mean here? Are all the values in the table CV% values?
Have you considered implementing this warmup in the Cloud Hypervisor perf tests framework itself? It can be modified so that before running a test a "warmup iteration" of the same test would be run once. The data from this warmup iteration would be discarded. Implementing this there would help both MSHV & KVM.
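As a rough illustration of that suggestion (not the CH framework's actual API), the discarded-warmup-iteration pattern looks like:

```python
from typing import Callable, List

# Sketch only: run_once() stands in for whatever executes a single test
# iteration in the CH perf framework; its first result is thrown away.
def run_with_warmup(run_once: Callable[[], float], iterations: int) -> List[float]:
    run_once()  # warmup iteration, result intentionally discarded
    return [run_once() for _ in range(iterations)]  # measured iterations
```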
@anirudhrb, please request changes on the PR if you have more comments.
Yeah, I have done that.
ch perf: Implement comprehensive performance stabilization framework
Add advanced performance tuning and diagnostic capabilities to Cloud
Hypervisor tests for stable and reproducible benchmark results.
Performance Controls:
This framework enables fine-grained control over the performance test
environment, reducing variance and improving reliability of Cloud
Hypervisor performance benchmarks.