
Conversation

@vyadavmsft
Collaborator

ch perf: Implement comprehensive performance stabilization framework

Add advanced performance tuning and diagnostic capabilities to Cloud
Hypervisor tests for stable and reproducible benchmark results.

Performance Controls:

  • CPU/NUMA pinning via CH_NUMA_NODE and CH_PIN_VCPUS env vars
  • CPU governor management (performance/powersave modes)
  • Transparent hugepage (THP) configuration
  • IRQ affinity tuning for optimal interrupt handling
  • Performance knob logging for test reproducibility

This framework enables fine-grained control over the performance test
environment, reducing variance and improving reliability of Cloud
Hypervisor performance benchmarks.
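
For illustration only, a minimal sketch of how the pinning variables might be consumed; the helper name and the numactl/taskset invocation below are assumptions for this example, not the PR's actual wiring:

import os

def build_pinning_prefix() -> str:
    # Illustrative only: turn CH_NUMA_NODE / CH_PIN_VCPUS into a launch prefix.
    prefix = []
    numa_node = os.environ.get("CH_NUMA_NODE")
    if numa_node is not None:
        # Bind CPU scheduling and memory allocation to the requested NUMA node.
        prefix.append(f"numactl --cpunodebind={numa_node} --membind={numa_node}")
    pin_vcpus = os.environ.get("CH_PIN_VCPUS")
    if pin_vcpus:
        # Hypothetical: CH_PIN_VCPUS="0-7" would restrict the launched process to CPUs 0-7.
        prefix.append(f"taskset -c {pin_vcpus}")
    return " ".join(prefix)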

@squirrelsc
Member

@LiliDeng LGTM

Contributor

@anirudhrb left a comment

Based on my understanding and the information provided in the comments and PR description, I'm not convinced that the "anchor gate" thing actually achieves its stated goal. In fact, I'm not even sure if it adds any value to our perf tests. I'm open to changing my mind though if you have more data to show.

PS: I haven't yet fully reviewed this PR. Please wait for me before merging.


# Retry once
self._log.debug("Retrying anchor gate...")
time.sleep(5)
Contributor

Why do we need to sleep here?

elapsed = time.time() - start

# Calculate throughput (GB/s) - 2048 MB = 2.048 GB
throughput = 2.048 / elapsed
Contributor

elapsed is not really accurate here. It also includes the time spent establishing the SSH connection and the round-trip time taken to send/receive data over the network.
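
As an illustration of how that overhead could be excluded, here is a rough sketch that times the command on the node itself and parses the result; passing shell=True to node.execute is an assumption made for this example:

# Sketch only: measure elapsed time on the node, not around the remote call,
# so SSH session setup and network round trips are excluded.
cmd = (
    "start=$(date +%s.%N); "
    "dd if=/dev/zero of=/dev/null bs=1M count=2048 status=none; "
    "end=$(date +%s.%N); "
    "awk -v s=\"$start\" -v e=\"$end\" 'BEGIN {print e - s}'"
)
result = self.node.execute(cmd, shell=True)
elapsed = float(result.stdout.strip())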

# 5-8s CPU/mem anchor using dd
start = time.time()
self.node.execute(
"dd if=/dev/zero of=/dev/null bs=1M count=2048 status=none",
Contributor

There is no guarantee that this runs in 5-8s like the comment on top says. It might be better to use something else where we can explicitly specify the time.
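
A couple of ways to make the duration explicit, sketched here as examples only (stress-ng may not be present on the minimal images, so a timeout-wrapped dd variant is shown as well):

# Option 1 (assumes stress-ng is installed): run CPU+memory load for exactly 5s.
self.node.execute("stress-ng --cpu 1 --vm 1 --vm-bytes 256M --timeout 5s")

# Option 2: bound an otherwise open-ended dd with timeout(1).
# Exit code 124 just means the time budget was used up, which is expected here.
self.node.execute("timeout 5 dd if=/dev/zero of=/dev/null bs=1M")

Either way, the workload duration is then fixed by construction instead of inferred from elapsed wall time.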

"""
Anchor gate: 5-8s CPU/mem warmup to validate system stability.
Validates against EWMA baseline (±5%). Retries once on failure.
Contributor

Why 5%? How was this chosen?

Anchor gate: 5-8s CPU/mem warmup to validate system stability.
Validates against EWMA baseline (±5%). Retries once on failure.
Uses exponential weighted moving average (alpha=0.3) for stability.
Contributor

Why 0.3? How did you arrive at this number?
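
For reference, the EWMA update described in the docstring reduces to a one-line recurrence; the names below are illustrative, not taken from the PR:

ALPHA = 0.3  # smoothing factor from the docstring: new samples weigh 30%, history 70%

def update_baseline(baseline: float, sample: float, alpha: float = ALPHA) -> float:
    # Exponentially weighted moving average: the influence of old samples
    # decays geometrically by (1 - alpha) per update.
    return alpha * sample + (1 - alpha) * baseline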


if deviation > self.ANCHOR_DEVIATION_THRESHOLD:
self._log.debug(
f"Anchor gate FAILED on retry: "
Contributor

What would we do with this information? This doesn't give us any actionable insights. Again, it was calculated with only 2 data points.

Collaborator Author

Today: it annotates the run. If we later see elevated CV or an odd first iteration, we can correlate that to an anchor warning during triage.


def _run_anchor_gate(self) -> None:
"""
Anchor gate: 5-8s CPU/mem warmup to validate system stability.
Contributor

Running dd twice and finding some deviation in the elapsed time doesn't sound like an effective benchmark to determine whether a system is stable.

Collaborator Author

Totally agree, it’s a minimal signal chosen for zero dependencies on our minimal images. dd by itself doesn’t prove stability; it just provides a fast “don’t-start-yet” signal that we can replace with something else in a follow-up, if we choose to.

if "PERF_DISKS" not in self.env_vars:
self.env_vars["PERF_DISKS"] = "/dev/nvme0n1"

# Safe preconditioning via bounded file
Contributor

What is preconditioning? Is it like warmup?

Collaborator Author

It's different: preconditioning puts the SSD/storage into a known state before testing.
It runs one time during the storage hygiene setup (optional, gated by CH_ALLOW_DESTRUCTIVE=1). However, it can be removed, since most CH metrics tests use their own test devices via DATADISK_NAME.
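
For readers new to the term, a rough sketch of what bounded-file preconditioning typically looks like; the fio parameters, the file path, and reading the gate from self.env_vars are placeholders for this example, not the PR's exact command:

# Sketch only: sequentially fill a bounded file so the SSD's FTL and caches
# reach a known steady state before measurement. Gated behind the
# destructive-operations flag mentioned above.
if self.env_vars.get("CH_ALLOW_DESTRUCTIVE") == "1":
    self.node.execute(
        "fio --name=precondition --filename=/mnt/perf/precondition.bin "
        "--rw=write --bs=1M --size=8G --ioengine=libaio --iodepth=16 --direct=1"
    )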


# Ensure O_DIRECT is used (check env vars)
if "PERF_DISKS" not in self.env_vars:
self.env_vars["PERF_DISKS"] = "/dev/nvme0n1"
Contributor

Is this environment variable used inside the CH test suite somewhere to pick up the correct disk?

if self._is_mq_random_write_test(testcase):
warmup_seconds = max(warmup_seconds, self.MQ_WARMUP_MIN_SECONDS)

if warmup_seconds <= 0:
Contributor

Is there any case where this would be true?

Collaborator Author

The if warmup_seconds <= 0 check is purely defensive. It handles any future test-harness override where warm-up is disabled globally (tool.perf_warmup_seconds = 0), or developer/debug runs that skip perf stabilization for faster iteration.

if result.exit_code == 0 and result.stdout.strip():
self._log.info(f" Frequencies: {result.stdout.strip()}")

def _run_warmup_for_test(self, testcase: str) -> None:
Contributor

The CH perf tests framework already runs each test for multiple iterations and calculates the mean, standard deviation etc. This should already cancel out any effects of the system not being warmed up etc. In the worst-case scenario, it should be sufficiently warmed up after the first iteration.

Do you have comprehensive data showing that this warmup helps?

Collaborator Author

The current CH perf harness repeats tests, but the first measured iteration often runs in a cold state: CPU P-state/C-state residency, disk queue/cache priming, page cache policy, TCP slow-start/conn-tracking, and IRQ distribution all settle during that first iteration. That first-run skew either drags the mean or inflates the variance.

The warmup makes the first measured iteration comparable to subsequent ones, reducing autocorrelation between early/late samples so the mean/stdev reflect steady-state rather than “time-to-steady-state.”

Yes, I have data from various experiments; experiment 12 is the latest, run with the current code:

[image: table of per-test CV% across experiments]

In summary:
The subsystem-representative tests show significant improvements. Key findings:

  • Block I/O – Multi-Queue (throughput): baseline CV% ranged from 12% to 22%; after the warm-up in experiment 12, CV% dropped to 1–2%, a variance reduction of roughly -85% to -95%.
  • Block I/O – Random (4K IOPS): baseline CV% was 6% to 24%; post warm-up it fell to 2–3%, a variance reduction of roughly -70% to -90%.
  • Sequential Block (Write MiB/s): CV% dropped from 22.1% to 1.3%, a variance reduction of -94%.
  • Network – VirtIO Multi-Queue RX/TX pps: baseline CV% of 30.6% and 23.7% decreased to 0.9% and 1.5% respectively, an average variance reduction of -95%.
  • Boot/Init Latency (16 vCPU): CV% fell from 56.2% to 1.2%, a variance reduction of -98%.

Overall, the median CV% across 31 tests decreased from 15% to 1.8%, an overall variance reduction of -88%.

Contributor

What does CV mean here? Are all the values in the table CV% values?

Have you considered implementing this warmup in the Cloud Hypervisor perf tests framework itself? It could be modified so that, before running a test, a "warmup iteration" of the same test is run once; the data from this warmup iteration would be discarded. Implementing it there would help both MSHV & KVM.
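
The shape of that suggestion, sketched in Python purely for illustration (the CH perf framework is a separate codebase and these names are made up):

def run_test_with_warmup(run_iteration, iterations: int) -> list:
    # One throwaway warmup iteration: same workload, result discarded.
    run_iteration()
    # Only the subsequent iterations feed the mean/stddev calculation.
    return [run_iteration() for _ in range(iterations)]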

@squirrelsc
Member

Based on my understanding and the information provided in the comments and PR description, I'm not convinced that the "anchor gate" thing actually achieves its stated goal. In fact, I'm not even sure if it adds any value to our perf tests. I'm open to changing my mind though if you have more data to show.

PS: I haven't yet fully reviewed this PR. Please wait for me before merging.

@anirudhrb, please request changes on the PR, if you have more comments.

@anirudhrb
Contributor

Based on my understanding and the information provided in the comments and PR description, I'm not convinced that the "anchor gate" thing actually achieves its stated goal. In fact, I'm not even sure if it adds any value to our perf tests. I'm open to changing my mind though if you have more data to show.
PS: I haven't yet fully reviewed this PR. Please wait for me before merging.

@anirudhrb, please request changes on the PR, if you have more comments.

Yeah, I have done that.

@squirrelsc requested a review from LiliDeng November 7, 2025 18:11