Commit c121576

tmshort and claude committed
✨ Add e2e profiling toolchain for heap and CPU analysis
Add comprehensive profiling infrastructure to collect, analyze, and compare heap and CPU profiles during e2e test execution.

**Two Profiling Workflows:**

1. **Start/Stop Workflow (Recommended)**
   - Start profiling with `make start-profiling` or `make start-profiling/<name>`
   - Run ANY test command (make test-e2e, make test-experimental-e2e, etc.)
   - Stop and analyze with `make stop-profiling`
   - Handles cluster teardown gracefully (auto-stops after 3 consecutive failures)
   - Works with tests that tear down clusters (like test-e2e)

2. **Automated Workflow**
   - Run integrated test with `./hack/tools/e2e-profiling/e2e-profile.sh run <name>`
   - Automatically handles profiling lifecycle
   - Best for scripted/automated profiling runs

**Features:**

- Automated heap and CPU profile collection from operator-controller and catalogd
- Real-time profile capture every 10 seconds during test execution
- CPU profiling with 10-second sampling windows running in parallel
- Configurable profile modes: both (default), heap-only, or CPU-only
- Multi-component profiling with separate analysis for each component
- Prometheus alert tracking integrated with profiling reports
- Side-by-side comparison of different test runs
- Graceful cluster teardown detection and auto-stop

**Tooling:**

- `profile-collector-daemon.sh`: Background collection process (extracted for clarity)
- `start-profiling.sh`: Start background profiling session
- `stop-profiling.sh`: Stop profiling, cleanup, and analyze
- `common.sh`: Shared library with logging, colors, config, and utilities
- `collect-profiles.sh`: Profile collection loop (used by start/run workflows)
- `analyze-profiles.sh`: Generate detailed analysis with top allocators, growth patterns, and CPU hotspots
- `compare-profiles.sh`: Compare two test runs to identify regressions
- `run-profiled-test.sh`: Orchestrate full profiled test runs (automated workflow)
- `e2e-profile.sh`: Main entry point with subcommands (run/analyze/compare)

**Architecture Improvements:**

- **Modular design**: Extracted 250-line background process to separate script for maintainability
- **Shared common library**: All scripts source `common.sh` for consistent logging, colors, and utilities
- **Deployment-based port-forwarding**: Uses `deployment/` references instead of pod names for automatic failover
- **Background execution**: Profiling runs in background using nohup, allowing any test command
- **Intelligent retry logic**: 30-second timeout with 2-second intervals, tests components independently
- **Robust cleanup (EXIT trap)**: Gracefully terminates processes, force-kills if stuck, removes empty profiles
- **Multi-component support**: Profiles operator-controller and catalogd simultaneously in separate directories
- **Cluster teardown detection**: Tracks consecutive failures, auto-stops after 3 failures when cluster is torn down

**Code Quality:**

- All scripts pass shellcheck with no warnings
- Fixed SC2155 warnings: Separated variable declarations from assignments (19 instances)
- Fixed SC2012 warnings: Replaced ls commands with find for better file handling (10 instances)
- Combined consecutive heredoc blocks to reduce command invocations
- 73% reduction in start-profiling.sh size (348 → 96 lines) via daemon extraction
- Condensed README from 760 to 234 lines (70% reduction) for better usability

**Makefile Integration:**

- `make start-profiling` - Auto-generated timestamp name
- `make start-profiling/<name>` - Custom profile name
- `make stop-profiling` - Stop and analyze
- All profiling targets in extended help (`make help-extended`)
- Updated help regex to support pattern targets

**Usage:**

Start/Stop Workflow:

```bash
make start-profiling
make start-profiling/baseline
make test-e2e                # Works! Handles cluster teardown
make test-experimental-e2e   # Works!
go test ./test/e2e/...       # Works!
make stop-profiling
```

Automated Workflow:

```bash
./hack/tools/e2e-profiling/e2e-profile.sh run baseline test-experimental-e2e
E2E_PROFILE_MODE=heap ./hack/tools/e2e-profiling/e2e-profile.sh run memory-test
./hack/tools/e2e-profiling/e2e-profile.sh analyze baseline
./hack/tools/e2e-profiling/e2e-profile.sh compare baseline optimized
```

**Configuration:**

Set `E2E_PROFILE_MODE` environment variable:

- `both` (default): Collect both heap and CPU profiles
- `heap`: Collect only heap profiles (reduces overhead by ~3%)
- `cpu`: Collect only CPU profiles

**Integration:**

- Automatic cleanup of empty profiles from cluster teardown
- Prometheus alert extraction from e2e test summaries
- Detailed markdown reports with memory growth, CPU usage analysis, and recommendations
- Claude Code slash command integration (`/e2e-profile start/stop/run/analyze/compare`)

**Key Implementation Details:**

- Background profiling: Entire collection runs in nohup with exported environment variables
- Fixed interval timing: INTERVAL now includes CPU profiling time, not adds to it
- Deployment wait polls until deployments are created before checking availability
- Component name sanitization: Hyphens converted to underscores for valid bash variable names
- PID tracking for both background process and port-forward cleanup
- Consecutive failure tracking: 3 failures triggers graceful auto-stop
- Silent error handling: curl errors suppressed when cluster is being torn down
- 10-second intervals accurately maintained across all profiling modes
- Port-forwards remain stable throughout entire test duration and survive pod restarts
- Conditional profile collection based on PROFILE_MODE setting
- Cleanup runs on EXIT/INT/TERM with graceful shutdown (2.5s timeout) and force-kill
- Code deduplication: Common functions extracted to shared library

**Testing:**

- All scripts pass shellcheck validation
- Verified end-to-end with make test-e2e
- Tested both default and named profiling patterns
- Validated graceful cleanup and error handling
- Confirmed help system correctly displays both patterns

**Real-World Results:**

This tooling was essential for identifying memory optimization opportunities and validating that alert thresholds are correctly calibrated. The OpenAPI caching optimization revealed through this tooling achieved 16.9% memory reduction and 73% reduction in OpenAPI-related allocations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Todd Short <tshort@redhat.com>
1 parent 05ee601 commit c121576

18 files changed

+2225
-4
lines changed

.gitignore

Lines changed: 10 additions & 2 deletions
@@ -38,8 +38,13 @@ vendor/
 \#*\#
 .\#*
 
-# AI temp files files
-.claude/
+# AI temp/local files
+.claude/settings.local.json
+.claude/history/
+.claude/cache/
+.claude/logs/
+.claude/.session*
+.claude/*.log
 
 # documentation website asset folder
 site
@@ -50,3 +55,6 @@ site
 
 # Temporary files and directories
 /test/regression/convert/testdata/tmp/*
+
+# E2E profiling artifacts
+e2e-profiles/

Makefile

Lines changed: 14 additions & 2 deletions
@@ -106,11 +106,11 @@ CATALOGS_MANIFEST := $(MANIFEST_HOME)/default-catalogs.yaml
 
 .PHONY: help
 help: #HELP Display essential help.
-	@awk 'BEGIN {FS = ":[^#]*#HELP"; printf "\nUsage:\n make \033[36m<target>\033[0m\n\n"} /^[a-zA-Z_0-9-]+:.*#HELP / { printf " \033[36m%-21s\033[0m %s\n", $$1, $$2 } ' $(MAKEFILE_LIST)
+	@awk 'BEGIN {FS = ":[^#]*#HELP"; printf "\nUsage:\n make \033[36m<target>\033[0m\n\n"} /^[a-zA-Z_0-9\/%-]+:.*#HELP / { printf " \033[36m%-21s\033[0m %s\n", $$1, $$2 } ' $(MAKEFILE_LIST)
 
 .PHONY: help-extended
 help-extended: #HELP Display extended help.
-	@awk 'BEGIN {FS = ":.*#(EX)?HELP"; printf "\nUsage:\n make \033[36m<target>\033[0m\n"} /^[a-zA-Z_0-9-]+:.*#(EX)?HELP / { printf " \033[36m%-25s\033[0m %s\n", $$1, $$2 } /^#SECTION / { printf "\n\033[1m%s\033[0m\n", substr($$0, 10) } ' $(MAKEFILE_LIST)
+	@awk 'BEGIN {FS = ":.*#(EX)?HELP"; printf "\nUsage:\n make \033[36m<target>\033[0m\n"} /^[a-zA-Z_0-9\/%-]+:.*#(EX)?HELP / { printf " \033[36m%-25s\033[0m %s\n", $$1, $$2 } /^#SECTION / { printf "\n\033[1m%s\033[0m\n", substr($$0, 10) } ' $(MAKEFILE_LIST)
 
 #SECTION Development
 
@@ -335,6 +335,18 @@ test-upgrade-experimental-e2e: $(TEST_UPGRADE_E2E_TASKS) #HELP Run upgrade e2e t
 e2e-coverage:
 	COVERAGE_NAME=$(COVERAGE_NAME) ./hack/test/e2e-coverage.sh
 
+.PHONY: start-profiling
+start-profiling: #EXHELP Start profiling in background with auto-generated name (timestamp). Use start-profiling/<name> for custom name.
+	./hack/tools/e2e-profiling/start-profiling.sh
+
+.PHONY: start-profiling/%
+start-profiling/%: #EXHELP Start profiling in background with specified name. Usage: make start-profiling/<name>
+	./hack/tools/e2e-profiling/start-profiling.sh $*
+
+.PHONY: stop-profiling
+stop-profiling: #EXHELP Stop profiling and generate analysis report
+	./hack/tools/e2e-profiling/stop-profiling.sh
+
 #SECTION KIND Cluster Operations
 
 .PHONY: kind-load
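
The help-regex change in this Makefile diff can be checked in isolation. The sketch below is not part of the commit: it applies the old and new target patterns from the `help-extended` rule to a sample rule line using `grep -E`. The helper names `matches_old` and `matches_new` and the sample line are illustrative, and the patterns are transcribed from the awk programs (awk's `\/` is written as a plain `/`).

```bash
#!/usr/bin/env bash
# Patterns transcribed from the help-extended awk rule before and after the change.
old_re='^[a-zA-Z_0-9-]+:.*#(EX)?HELP '
new_re='^[a-zA-Z_0-9/%-]+:.*#(EX)?HELP '

# Hypothetical helpers: succeed when the given Makefile line would be listed.
matches_old() { printf '%s\n' "$1" | grep -Eq -- "$old_re"; }
matches_new() { printf '%s\n' "$1" | grep -Eq -- "$new_re"; }

line='start-profiling/%: #EXHELP Start profiling in background with specified name.'
if matches_old "$line"; then echo "old: listed"; else echo "old: skipped"; fi
if matches_new "$line"; then echo "new: listed"; else echo "new: skipped"; fi
```

Under these assumptions the old pattern skips the pattern target because `/` and `%` fall outside its character class, while the new one lists it.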

hack/tools/e2e-profiling/README.md

Lines changed: 234 additions & 0 deletions
@@ -0,0 +1,234 @@
# E2E Profiling Tools

Automated profiling and analysis for operator-controller e2e tests. Collect heap and CPU profiles during test runs, analyze memory usage patterns, and compare optimizations.

## Quick Start

### Simple Start/Stop Workflow

```bash
# Start profiling (auto-generated name or specify custom name)
make start-profiling
make start-profiling/baseline

# Run your tests
make test-e2e

# Stop and analyze
make stop-profiling
```

The profiler:
- Waits for cluster components to be ready
- Collects profiles every 10 seconds (configurable)
- Handles cluster teardown automatically
- Generates analysis reports
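
The collection behavior listed above can be sketched roughly as follows. This is an illustration only, not the daemon script from the commit: `MAX_FAILURES`, `fetch_profile`, and `collect_loop` are hypothetical names, and the URL assumes a port-forward to local port 6060 as in the Troubleshooting section. The idea is to reset a failure counter on every successful capture and stop once three captures fail in a row, which is what the commit message describes for cluster teardown.

```bash
#!/usr/bin/env bash
# Sketch only: MAX_FAILURES, fetch_profile, and collect_loop are hypothetical names.
MAX_FAILURES=3
INTERVAL="${E2E_PROFILE_INTERVAL:-10}"

# Fetch one heap snapshot into file $1. curl errors are silenced because
# failures are expected once the cluster is being torn down.
fetch_profile() {
    curl -sf --max-time 5 -o "$1" "http://localhost:6060/debug/pprof/heap" 2>/dev/null
}

# Collect snapshots into directory $1 until MAX_FAILURES consecutive failures.
collect_loop() {
    local out_dir="$1" n=0 failures=0
    mkdir -p "$out_dir"
    while [ "$failures" -lt "$MAX_FAILURES" ]; do
        if fetch_profile "$out_dir/heap${n}.pprof"; then
            failures=0
            n=$((n + 1))
        else
            failures=$((failures + 1))
            rm -f "$out_dir/heap${n}.pprof"   # drop empty or partial files
        fi
        sleep "$INTERVAL"
    done
    echo "stopped after $failures consecutive failures; $n profiles kept"
}
```

A call such as `collect_loop e2e-profiles/baseline/operator-controller` would then run until the endpoint stops answering.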

### Automated Test Runner

```bash
# Run baseline
./hack/tools/e2e-profiling/e2e-profile.sh run baseline

# Make changes, then run optimized version
./hack/tools/e2e-profiling/e2e-profile.sh run optimized

# Compare results
./hack/tools/e2e-profiling/e2e-profile.sh compare baseline optimized
```

View reports:
```bash
cat e2e-profiles/baseline/analysis.md
cat e2e-profiles/comparisons/baseline-vs-optimized.md
```

## Commands

### `run <name> [test-target]`
Run e2e test with profiling.

```bash
./hack/tools/e2e-profiling/e2e-profile.sh run my-test [test-e2e|test-experimental-e2e|...]
```

Output:
- `e2e-profiles/<name>/operator-controller/*.pprof` - Profile snapshots
- `e2e-profiles/<name>/catalogd/*.pprof` - Catalogd profiles
- `e2e-profiles/<name>/analysis.md` - Analysis report

### `analyze <name>`
Analyze collected profiles.

```bash
./hack/tools/e2e-profiling/e2e-profile.sh analyze my-test
```

### `compare <test1> <test2>`
Compare two test runs.

```bash
./hack/tools/e2e-profiling/e2e-profile.sh compare baseline optimized
```

Output: `e2e-profiles/comparisons/<test1>-vs-<test2>.md`

### `collect`
Manually collect a single heap profile.

```bash
./hack/tools/e2e-profiling/e2e-profile.sh collect
```

## Configuration

```bash
# Namespace (default: olmv1-system)
export E2E_PROFILE_NAMESPACE=olmv1-system

# Collection interval in seconds (default: 10)
export E2E_PROFILE_INTERVAL=10

# CPU profiling duration in seconds (default: 10)
export E2E_PROFILE_CPU_DURATION=10

# Profile mode: both, heap, cpu (default: both)
export E2E_PROFILE_MODE=both

# Output directory (default: ./e2e-profiles)
export E2E_PROFILE_DIR=./e2e-profiles

# Test target (default: test-experimental-e2e)
export E2E_PROFILE_TEST_TARGET=test-experimental-e2e
```

**Note:** If `CPU_DURATION >= INTERVAL`, CPU profiling runs continuously.
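
One way to read the note, together with the commit message's statement that the interval includes CPU profiling time, is that each cycle sleeps only for whatever part of the interval the CPU capture did not use. The sketch below shows that arithmetic; `remaining_sleep` is a hypothetical helper, not a function from the committed scripts.

```bash
#!/usr/bin/env bash
# Hypothetical helper: seconds to sleep after a CPU capture so that each
# cycle lasts INTERVAL seconds in total. Clamped at 0 when the capture
# already fills (or exceeds) the interval.
remaining_sleep() {
    local interval="$1" cpu_duration="$2"
    local rest=$((interval - cpu_duration))
    if [ "$rest" -lt 0 ]; then
        rest=0
    fi
    echo "$rest"
}

remaining_sleep 10 4    # prints 6
remaining_sleep 10 10   # prints 0: CPU captures run back to back
remaining_sleep 10 15   # prints 0
```

With the defaults (interval 10, CPU duration 10) the sleep is zero, which matches the "runs continuously" case in the note.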

## Output Structure

```
e2e-profiles/
├── baseline/
│   ├── operator-controller/
│   │   ├── heap*.pprof    # Heap snapshots
│   │   └── cpu*.pprof     # CPU profiles
│   ├── catalogd/
│   │   ├── heap*.pprof
│   │   └── cpu*.pprof
│   ├── test.log
│   ├── collection.log
│   └── analysis.md
└── comparisons/
    └── baseline-vs-optimized.md
```
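
The commit message mentions that empty profiles left behind by cluster teardown are cleaned up automatically. A minimal sketch of such a cleanup over the layout above might look like this; `remove_empty_profiles` is a hypothetical name, and `-delete` assumes GNU or BSD `find`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: delete zero-byte *.pprof files under directory $1.
# Assumes a find implementation that supports -delete (GNU and BSD do).
remove_empty_profiles() {
    find "$1" -type f -name '*.pprof' -size 0 -delete
}

# Example: remove_empty_profiles e2e-profiles/baseline
```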

## Examples

### Profile Optimization

```bash
# Baseline
make start-profiling/baseline
make test-e2e
make stop-profiling

# Implement changes, then profile optimized version
make start-profiling/optimized
make test-e2e
make stop-profiling

# Compare
./hack/tools/e2e-profiling/e2e-profile.sh compare baseline optimized
```

### Heap-Only Profiling

```bash
# Reduced overhead for memory-focused analysis
E2E_PROFILE_MODE=heap make start-profiling/memory-test
make test-e2e
make stop-profiling
```

### Different Test Suites

```bash
./hack/tools/e2e-profiling/e2e-profile.sh run standard test-e2e
./hack/tools/e2e-profiling/e2e-profile.sh run upgrade test-upgrade-e2e
./hack/tools/e2e-profiling/e2e-profile.sh compare standard upgrade
```

## Interactive Analysis

```bash
cd e2e-profiles/my-test/operator-controller

# Top allocators
go tool pprof -top heap23.pprof

# Interactive mode
go tool pprof heap23.pprof
# Commands: top, list, web, pdf

# Compare snapshots
go tool pprof -base=heap0.pprof -top heap23.pprof

# Filter specific patterns
go tool pprof -text heap23.pprof | grep -i openapi
```

## Troubleshooting

**No profiles collected:**
- Check deployment is ready: `kubectl get deployment -n olmv1-system`
- Verify pprof endpoint: `curl http://localhost:6060/debug/pprof/`
- Review `collection.log` for connection errors

**Test exits early:**
- Run test manually first to verify it works
- Check `test.log` for errors

**Analysis fails:**
- Verify files exist: `find e2e-profiles -name "*.pprof"`
- Check `go tool pprof --help` works

**Port-forward issues:**
- Test manually: `kubectl port-forward -n olmv1-system deployment/operator-controller-controller-manager 6060:6060`
- Kill stuck processes: `pkill -f "kubectl port-forward.*6060"`

## Requirements

- kubectl (cluster access)
- go (for `go tool pprof`)
- make
- curl
- bash 4.0+

## Real-World Results

OpenAPI caching optimization:
- **Memory:** 49.6 MB → 41.2 MB (-16.9%)
- **OpenAPI allocations:** 13 MB → 3.5 MB (-73%)
- **Key insight:** Repeated schema fetching was #1 memory consumer
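
The percentages above follow directly from the before and after figures. As a quick check, the sketch below recomputes them with `awk`; `pct_reduction` is a hypothetical helper introduced only for this illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: percentage reduction from $1 to $2, one decimal place.
pct_reduction() {
    LC_ALL=C awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

pct_reduction 49.6 41.2   # total memory in MB, prints 16.9
pct_reduction 13 3.5      # OpenAPI allocations in MB, prints 73.1
```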

## Architecture

**Scripts:**
- `profile-collector-daemon.sh` - Background collection process
- `start-profiling.sh` - Start daemon mode profiling
- `stop-profiling.sh` - Stop daemon and cleanup
- `run-profiled-test.sh` - Orchestrate test + profiling
- `analyze-profiles.sh` - Generate analysis reports
- `compare-profiles.sh` - Create comparison reports
- `common.sh` - Shared utilities and logging
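
The commit message notes that component names are sanitized by converting hyphens to underscores so they can be used in bash variable names. A shared utility of that kind could be sketched as below; `sanitize_name` and the `PID_` prefix are assumptions made for illustration, not names taken from `common.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a component name to a valid bash identifier
# by replacing every hyphen with an underscore.
sanitize_name() {
    echo "${1//-/_}"
}

component="operator-controller"
var="PID_$(sanitize_name "$component")"   # PID_operator_controller
declare "$var=12345"                      # placeholder PID for illustration
echo "${!var}"                            # prints 12345
```

Indirect expansion (`${!var}`) then lets one loop track a separate PID per component without an associative array.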

**Key Features:**
- Deployment-based port-forwarding (survives pod restarts)
- Automatic retry with 30s timeout
- Graceful cleanup on exit/interrupt
- Multi-component support (operator-controller + catalogd)
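
The retry behavior (a 30-second timeout with 2-second intervals, per the commit message) can be sketched as a small generic helper. `retry_until` is a hypothetical name, and the committed scripts may structure this differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command repeatedly until it succeeds or
# $1 seconds have elapsed, pausing $2 seconds between attempts.
# Returns 0 on success and 1 on timeout.
retry_until() {
    local timeout="$1" interval="$2"
    shift 2
    local deadline=$((SECONDS + timeout))   # SECONDS is a bash builtin counter
    until "$@"; do
        if [ "$SECONDS" -ge "$deadline" ]; then
            return 1
        fi
        sleep "$interval"
    done
}

# Example: wait up to 30s, checking every 2s, for a local pprof endpoint.
# retry_until 30 2 curl -sf -o /dev/null http://localhost:6060/debug/pprof/
```

Calling the helper once per component keeps a slow operator-controller from masking a healthy catalogd, which is one reading of "tests components independently".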

## See Also

- [Go pprof documentation](https://pkg.go.dev/net/http/pprof)
- [Profiling Go Programs](https://go.dev/blog/pprof)
