ch perf: Implement comprehensive performance stabilization framework #4085
Conversation
@LiliDeng LGTM
anirudhrb left a comment
Based on my understanding and the information provided in the comments and PR description, I'm not convinced that the "anchor gate" thing actually achieves its stated goal. In fact, I'm not even sure if it adds any value to our perf tests. I'm open to changing my mind though if you have more data to show.
PS: I haven't yet fully reviewed this PR. Please wait for me before merging.
```python
# Retry once
self._log.debug("Retrying anchor gate...")
time.sleep(5)
```
Why do we need to sleep here?
```python
elapsed = time.time() - start

# Calculate throughput (GB/s) - 2048 MB = 2.048 GB
throughput = 2.048 / elapsed
```
elapsed is not really that accurate here. It also includes the time spent establishing the SSH connection and the round-trip time taken to send/receive data over the network.
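One way to avoid counting SSH and network time would be to take the elapsed time from dd's own summary on the guest. A minimal sketch, assuming a LISA-style node.execute() result that exposes stderr:

```python
import re

# Sketch only: drop status=none so GNU dd prints its summary line, e.g.
# "... copied, 1.2345 s, 1.7 GB/s", then parse the guest-side elapsed time
# from stderr instead of timing the call from the controller.
result = self.node.execute(
    "dd if=/dev/zero of=/dev/null bs=1M count=2048",
    shell=True,
)
match = re.search(r"copied,\s*([\d.]+)\s*s", result.stderr)
if match:
    elapsed = float(match.group(1))   # wall time on the guest only
    throughput = 2.048 / elapsed      # same 2.048 GB constant as above
```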
```python
# 5-8s CPU/mem anchor using dd
start = time.time()
self.node.execute(
    "dd if=/dev/zero of=/dev/null bs=1M count=2048 status=none",
```
There is no guarantee that this runs in 5-8s like the comment on top says. It might be better to use something else where we can explicitly specify the time.
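If an explicit duration is the goal, something like stress-ng's --timeout flag would bound the anchor's runtime directly. A sketch, assuming stress-ng is (or can be made) available on the images:

```python
# Sketch only: bound the anchor to a fixed wall-clock duration and use the
# reported bogo-ops as the stability metric instead of elapsed seconds.
# Assumes stress-ng is installed, which may not hold on minimal images.
ANCHOR_SECONDS = 6
result = self.node.execute(
    f"stress-ng --cpu 1 --timeout {ANCHOR_SECONDS}s --metrics-brief",
    shell=True,
)
# stress-ng prints a metrics summary (bogo ops, ops/s) on completion;
# that count would replace `elapsed` in the deviation check.
```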
| """ | ||
| Anchor gate: 5-8s CPU/mem warmup to validate system stability. | ||
| Validates against EWMA baseline (±5%). Retries once on failure. |
Why 5%? How was this chosen?
```python
Anchor gate: 5-8s CPU/mem warmup to validate system stability.
Validates against EWMA baseline (±5%). Retries once on failure.
Uses exponential weighted moving average (alpha=0.3) for stability.
```
Why 0.3? How did you arrive at this number?
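For context on how the two constants interact, an EWMA baseline of the shape the docstring describes typically updates as below; names and values are illustrative, not taken from this PR:

```python
# Illustrative sketch, not the PR's implementation.
ANCHOR_EWMA_ALPHA = 0.3            # weight of the newest sample
ANCHOR_DEVIATION_THRESHOLD = 0.05  # allow ±5% drift from the baseline

def update_baseline(baseline: float, sample: float) -> float:
    # Larger alpha tracks recent runs faster; smaller alpha smooths more.
    return ANCHOR_EWMA_ALPHA * sample + (1 - ANCHOR_EWMA_ALPHA) * baseline

def passes_gate(baseline: float, sample: float) -> bool:
    deviation = abs(sample - baseline) / baseline
    return deviation <= ANCHOR_DEVIATION_THRESHOLD
```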
```python
if deviation > self.ANCHOR_DEVIATION_THRESHOLD:
    self._log.debug(
        f"Anchor gate FAILED on retry: "
```
What would we do with this information? It doesn't give us any actionable insight. Again, it was calculated with only two data points.
Today: it annotates the run. If we later see elevated CV or an odd first iteration, we can correlate that to an anchor warning during triage.
```python
def _run_anchor_gate(self) -> None:
    """
    Anchor gate: 5-8s CPU/mem warmup to validate system stability.
```
Running dd twice and finding some deviation in the elapsed time doesn't sound like an effective benchmark to determine whether a system is stable.
Totally agree; it's a minimal signal, chosen because it has zero dependencies on our minimal images. dd by itself doesn't prove stability; it just provides a fast "don't-start-yet" signal.
We can replace it with something else in a follow-up, if we choose to.
| if "PERF_DISKS" not in self.env_vars: | ||
| self.env_vars["PERF_DISKS"] = "/dev/nvme0n1" | ||
|
|
||
| # Safe preconditioning via bounded file |
What is preconditioning? Is it like warmup?
It's different: preconditioning puts the SSD/storage into a known state before testing.
It runs one time during storage hygiene setup (optional, gated by CH_ALLOW_DESTRUCTIVE=1). However, it can be removed, since most CH metrics tests use their own test devices via DATADISK_NAME.
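For readers unfamiliar with the term, a bounded-file preconditioning pass usually looks something like the sketch below; the path, size, and fio options are illustrative, not the PR's actual values:

```python
# Sketch only: one sequential write pass over a bounded file so the SSD's
# FTL and caches reach a known steady state before measurements begin.
self.node.execute(
    "fio --name=precondition --filename=/mnt/perf/precond.bin "
    "--size=4G --rw=write --bs=1M --direct=1 --ioengine=libaio",
    shell=True,
    sudo=True,
)
```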
```python
# Ensure O_DIRECT is used (check env vars)
if "PERF_DISKS" not in self.env_vars:
    self.env_vars["PERF_DISKS"] = "/dev/nvme0n1"
```
Is this environment variable used inside the CH test suite somewhere to pick up the correct disk?
```python
if self._is_mq_random_write_test(testcase):
    warmup_seconds = max(warmup_seconds, self.MQ_WARMUP_MIN_SECONDS)

if warmup_seconds <= 0:
```
Is there any case where this would be true?
The `if warmup_seconds <= 0` check is purely defensive, to handle:
any future test harness override where warm-up is disabled globally (tool.perf_warmup_seconds = 0),
or developer/debug runs that skip perf stabilization for faster iteration; see the sketch below.
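A hypothetical shape for that flow, with the config key name invented purely for illustration:

```python
# Sketch only: a global override (or debug run) can push warmup to 0,
# which is the case the defensive check above guards against.
warmup_seconds = int(self.env_vars.get("PERF_WARMUP_SECONDS", "30"))

if self._is_mq_random_write_test(testcase):
    warmup_seconds = max(warmup_seconds, self.MQ_WARMUP_MIN_SECONDS)

if warmup_seconds <= 0:
    self._log.debug("Perf warmup disabled; skipping stabilization.")
    return
```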
```python
if result.exit_code == 0 and result.stdout.strip():
    self._log.info(f" Frequencies: {result.stdout.strip()}")

def _run_warmup_for_test(self, testcase: str) -> None:
```
The CH perf tests framework already runs each test for multiple iterations and calculates the mean, standard deviation, etc. This should already cancel out any effects of the system not being warmed up; in the worst-case scenario, it should be sufficiently warmed up after the first iteration.
Do you have comprehensive data showing that this warmup helps?
The current CH perf harness repeats tests, but the first measured iteration often runs in a cold state: CPU P-state/C-state residency, disk queue/cache priming, page cache policy, TCP slow-start/conn-tracking, and IRQ distribution all settle during that first iteration. That first-run skew either drags the mean or inflates the variance.
The warmup makes the first measured iteration comparable to subsequent ones, reducing autocorrelation between early and late samples so the mean/stdev reflect steady state rather than "time-to-steady-state".
Yes, I have data from these various experiments; experiment 12 is the latest, run with the current code:
In summary:
The results of the subsystem representative tests show significant improvements. Key findings:

| Subsystem test | Baseline CV% | Post-warmup CV% (Experiment 12) | Variance reduction |
| --- | --- | --- | --- |
| Block I/O – Multi-Queue (throughput) | 12% to 22% | 1% to 2% | approx. -85% to -95% |
| Block I/O – Random (4K IOPS) | 6% to 24% | 2% to 3% | approx. -70% to -90% |
| Sequential Block (Write MiB/s) | 22.1% | 1.3% | -94% |
| Network – VirtIO Multi-Queue RX/TX pps | 30.6% / 23.7% | 0.9% / 1.5% | approx. -95% (avg) |
| Boot/Init Latency (16 vCPU) | 56.2% | 1.2% | -98% |
| Median across 31 tests | 15% | 1.8% | -88% |
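For readers of the table above: CV% in results like these normally denotes the coefficient of variation, i.e. stdev/mean expressed as a percentage. A minimal sketch of computing it from raw iteration samples:

```python
import statistics

def cv_percent(samples: list[float]) -> float:
    """Coefficient of variation as a percentage: stdev / mean * 100."""
    return statistics.stdev(samples) / statistics.mean(samples) * 100

# e.g. five throughput samples from repeated iterations of one test
print(round(cv_percent([930.0, 1010.0, 955.0, 990.0, 970.0]), 1))  # ~3.2
```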
What does CV mean here? Are all the values in the table CV% values?
Have you considered implementing this warmup in the Cloud Hypervisor perf tests framework itself? It can be modified so that before running a test a "warmup iteration" of the same test would be run once. The data from this warmup iteration would be discarded. Implementing this there would help both MSHV & KVM.
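As a rough illustration of that suggestion (not the CH framework's actual API), the discarded-warmup-iteration pattern looks like:

```python
from typing import Callable, List

# Sketch only: run_once() stands in for whatever executes a single test
# iteration in the CH perf framework; its first result is thrown away.
def run_with_warmup(run_once: Callable[[], float], iterations: int) -> List[float]:
    run_once()  # warmup iteration, result intentionally discarded
    return [run_once() for _ in range(iterations)]  # measured iterations
```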
@anirudhrb, please request changes on the PR if you have more comments.
Yeah, I have done that.
ch perf: Implement comprehensive performance stabilization framework
Add advanced performance tuning and diagnostic capabilities to Cloud
Hypervisor tests for stable and reproducible benchmark results.
Performance Controls:
This framework enables fine-grained control over the performance test
environment, reducing variance and improving reliability of Cloud
Hypervisor performance benchmarks.