Commit 6a5fd3f

🤖 feat: add hysteresis-based adaptive concurrency for terminal-bench
Implements adaptive concurrency control for terminal-bench using a burst-and-resume pattern that automatically adjusts parallelism based on system load average.

## Key Features

- **Hysteresis-based adjustment**: Double concurrency when load < threshold, halve when load > threshold
- **Burst-and-resume pattern**: Runs terminal-bench in bursts, using native resume capability to skip completed tasks between bursts
- **Clean container lifecycle**: No mid-task interruption, each burst completes naturally before adjusting
- **Configurable parameters**: Max concurrency, load threshold, check interval

## Implementation

- `benchmarks/terminal_bench/adaptive_bench.py`: Main wrapper implementing burst-and-resume logic with load monitoring
- `benchmarks/terminal_bench/adaptive_bench_test.py`: Unit tests for adaptive logic
- `Makefile`: New `benchmark-terminal-adaptive` target
- Documentation updates in `benchmarks/terminal_bench/README.md`

## Usage

```bash
# Start with concurrency=1, scale up to 16 based on load
TB_MAX_CONCURRENT=16 make benchmark-terminal-adaptive

# Conservative: max 8, higher load threshold
TB_MAX_CONCURRENT=8 TB_LOAD_THRESHOLD=2.0 make benchmark-terminal-adaptive

# Sample 5 tasks with adaptive concurrency
TB_SAMPLE_SIZE=5 TB_MAX_CONCURRENT=8 make benchmark-terminal-adaptive
```

## How It Works

1. Start with concurrency=1
2. Run terminal-bench burst with current concurrency
3. After burst completes, check 1-minute load average
4. Adjust concurrency: double if load < threshold, halve if load > threshold
5. Update tb.lock with new concurrency
6. Resume run (skips completed tasks automatically)
7. Repeat until all tasks complete

## Tradeoffs

- ✅ Automatically finds optimal concurrency for hardware
- ✅ Prevents system overload
- ✅ Uses terminal-bench native features (resume, tb.lock)
- ⚠️ Burst overhead ~2-5s (acceptable for 6+ minute avg task duration)
- ⚠️ Modifies tb.lock (semi-internal format, but stable)

## Design Rationale

Research showed terminal-bench uses fixed-size ThreadPoolExecutor that cannot be resized mid-run. Kill-and-restart approach would interrupt Docker containers mid-task. Burst-and-resume leverages terminal-bench's built-in resume capability for clean checkpointing and task skipping.

_Generated with `cmux`_
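The adjustment rule from steps 3 and 4 of "How It Works" can be sketched as a small pure function. This is an illustration, not the code in `adaptive_bench.py` (which is not shown here): the names `one_minute_load`, `next_concurrency`, and `trace` are hypothetical, and clamping the lower bound to 1 is an assumption; only the upper bound (max concurrency) is stated in the commit message.

```python
import os


def one_minute_load() -> float:
    """Return the 1-minute load average (os.getloadavg is Unix-only)."""
    return os.getloadavg()[0]


def next_concurrency(current: int, load: float, threshold: float,
                     max_concurrent: int) -> int:
    """Double below the threshold, halve above it.

    The result is clamped to [1, max_concurrent]; the floor of 1 is an
    assumption made for this sketch.
    """
    if load < threshold:
        return min(current * 2, max_concurrent)
    if load > threshold:
        return max(current // 2, 1)
    return current  # load == threshold: leave concurrency unchanged


def trace(loads, threshold=1.0, max_concurrent=16, start=1):
    """Concurrency used for each burst, given the load seen after each burst."""
    levels = [start]
    for load in loads:
        levels.append(next_concurrency(levels[-1], load, threshold, max_concurrent))
    return levels


if __name__ == "__main__":
    # Loads observed after four bursts, threshold 1.0, cap 16.
    print(trace([0.3, 0.5, 1.4, 0.8]))
```

With one threshold, the only load value that leaves concurrency unchanged is exactly the threshold, so in this sketch every burst boundary either doubles or halves the level until a clamp is hit.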
1 parent 7ca32f4 commit 6a5fd3f

4 files changed

+649
-1
lines changed

Makefile

Lines changed: 40 additions & 1 deletion
@@ -39,7 +39,7 @@ include fmt.mk
 .PHONY: dist dist-mac dist-win dist-linux
 .PHONY: docs docs-build docs-watch
 .PHONY: storybook storybook-build test-storybook chromatic
-.PHONY: benchmark-terminal
+.PHONY: benchmark-terminal benchmark-terminal-adaptive
 .PHONY: ensure-deps
 .PHONY: check-eager-imports check-bundle-size check-startup
 
@@ -329,6 +329,45 @@ benchmark-terminal: ## Run Terminal-Bench with the cmux agent (use TB_DATASET/TB
 		$$TASK_ID_FLAGS \
 		$${TB_ARGS}
 
+.PHONY: benchmark-terminal-adaptive
+benchmark-terminal-adaptive: ## Run Terminal-Bench with adaptive concurrency (use TB_MAX_CONCURRENT/TB_LOAD_THRESHOLD/TB_CHECK_INTERVAL)
+	@TB_DATASET=$${TB_DATASET:-terminal-bench-core==0.1.1}; \
+	TB_TIMEOUT=$${TB_TIMEOUT:-1800}; \
+	TB_MAX_CONCURRENT=$${TB_MAX_CONCURRENT:-16}; \
+	TB_LOAD_THRESHOLD=$${TB_LOAD_THRESHOLD:-1.0}; \
+	TB_CHECK_INTERVAL=$${TB_CHECK_INTERVAL:-60}; \
+	LIVESTREAM_FLAG=$${TB_LIVESTREAM:+--livestream}; \
+	TASK_ID_FLAGS=""; \
+	if [ -n "$$TB_SAMPLE_SIZE" ]; then \
+		echo "Ensuring dataset $$TB_DATASET is downloaded..."; \
+		uvx terminal-bench datasets download --dataset "$$TB_DATASET" 2>&1 | grep -v "already exists" || true; \
+		echo "Sampling $$TB_SAMPLE_SIZE tasks from $$TB_DATASET..."; \
+		TASK_IDS=$$(python3 benchmarks/terminal_bench/sample_tasks.py --dataset "$$TB_DATASET" --sample-size "$$TB_SAMPLE_SIZE" --format space) || { \
+			echo "Error: Failed to sample tasks" >&2; \
+			exit 1; \
+		}; \
+		if [ -z "$$TASK_IDS" ]; then \
+			echo "Error: Sampling returned no task IDs" >&2; \
+			exit 1; \
+		fi; \
+		for task_id in $$TASK_IDS; do \
+			TASK_ID_FLAGS="$$TASK_ID_FLAGS --task-id $$task_id"; \
+		done; \
+		echo "Selected task IDs: $$TASK_IDS"; \
+	fi; \
+	echo "Running adaptive terminal-bench (max concurrency: $$TB_MAX_CONCURRENT, load threshold: $$TB_LOAD_THRESHOLD)"; \
+	python3 benchmarks/terminal_bench/adaptive_bench.py \
+		--max-concurrent $$TB_MAX_CONCURRENT \
+		--load-threshold $$TB_LOAD_THRESHOLD \
+		--check-interval $$TB_CHECK_INTERVAL \
+		-- \
+		--dataset "$$TB_DATASET" \
+		--agent-import-path benchmarks.terminal_bench.cmux_agent:CmuxAgent \
+		--global-agent-timeout-sec $$TB_TIMEOUT \
+		$$LIVESTREAM_FLAG \
+		$$TASK_ID_FLAGS \
+		$${TB_ARGS}
+
 ## Clean
 clean: ## Clean build artifacts
 	@echo "Cleaning build artifacts..."
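The `benchmark-terminal-adaptive` recipe passes three wrapper flags, then a bare `--`, then the arguments intended for terminal-bench itself. The sketch below shows one way a wrapper could separate the two groups. It is an assumption about the approach, not the parser in `adaptive_bench.py`: the function names are hypothetical, and the defaults simply mirror the Makefile's (16, 1.0, 60).

```python
import argparse


def split_at_separator(argv):
    """Split argv at the first bare '--' into (wrapper args, passthrough args)."""
    if "--" in argv:
        i = argv.index("--")
        return argv[:i], argv[i + 1:]
    return list(argv), []


def parse_wrapper_args(argv):
    """Parse the wrapper's own flags; return (options, terminal-bench args)."""
    own, passthrough = split_at_separator(argv)
    parser = argparse.ArgumentParser(description="Adaptive concurrency wrapper (sketch)")
    # Flag names come from the Makefile recipe; defaults mirror its fallbacks.
    parser.add_argument("--max-concurrent", type=int, default=16)
    parser.add_argument("--load-threshold", type=float, default=1.0)
    parser.add_argument("--check-interval", type=int, default=60)
    return parser.parse_args(own), passthrough


if __name__ == "__main__":
    opts, tb_args = parse_wrapper_args([
        "--max-concurrent", "8", "--load-threshold", "2.0",
        "--", "--dataset", "terminal-bench-core==0.1.1",
    ])
    print(opts.max_concurrent, opts.load_threshold, opts.check_interval, tb_args)
```

Splitting on `--` before calling `argparse` keeps the wrapper from having to know any terminal-bench flag names, so everything after the separator is forwarded untouched.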

benchmarks/terminal_bench/README.md

Lines changed: 61 additions & 0 deletions
@@ -99,10 +99,71 @@ Based on analysis of the Oct 30 nightly run (15-minute timeout):
 
 **Impact of 30-minute timeout**: Expected to reduce false timeout failures by ~50% and improve pass rates by 10-15 percentage points (from ~42% to ~52-57%).
 
+## Adaptive Concurrency Mode
+
+The `benchmark-terminal-adaptive` target automatically adjusts concurrency based on system load using a **burst-and-resume pattern**:
+
+```bash
+# Start with concurrency=1, scale up to max 16 based on load
+TB_MAX_CONCURRENT=16 make benchmark-terminal-adaptive
+
+# More conservative: max 8, higher load threshold
+TB_MAX_CONCURRENT=8 TB_LOAD_THRESHOLD=2.0 make benchmark-terminal-adaptive
+
+# Faster adjustments: check every 30 seconds
+TB_CHECK_INTERVAL=30 TB_MAX_CONCURRENT=16 make benchmark-terminal-adaptive
+
+# Sample 5 tasks with adaptive concurrency
+TB_SAMPLE_SIZE=5 TB_MAX_CONCURRENT=8 make benchmark-terminal-adaptive
+```
+
+### How It Works
+
+1. **Runs terminal-bench in bursts** with current concurrency
+2. **Monitors system load** after each burst completes
+3. **Adjusts concurrency** using hysteresis:
+   - **Double** when 1-minute load avg < threshold
+   - **Halve** when 1-minute load avg > threshold
+4. **Resumes** the run with updated concurrency
+
+The burst-and-resume pattern leverages terminal-bench's native resume capability to skip completed tasks. Each burst runs to completion (no mid-task interruption), ensuring clean Docker container lifecycle.
+
+### Configuration
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `TB_MAX_CONCURRENT` | 16 | Maximum concurrency limit |
+| `TB_LOAD_THRESHOLD` | 1.0 | Load average threshold for adjusting concurrency |
+| `TB_CHECK_INTERVAL` | 60 | Seconds to wait between bursts |
+
+### When to Use Adaptive Mode
+
+**Use adaptive mode when:**
+- Running on shared hardware with variable load
+- Unsure of optimal concurrency for your system
+- Want to maximize throughput without overloading
+- Running long benchmark suites (full 80-task suite)
+
+**Use fixed concurrency when:**
+- Running on dedicated hardware
+- Know optimal concurrency for your setup
+- Running small task samples (< 10 tasks)
+- Burst overhead (2-5s) matters for very short tasks
+
+### Tradeoffs
+
+- ✅ Automatically finds optimal concurrency
+- ✅ Prevents system overload
+- ✅ Clean container lifecycle (no mid-task kills)
+- ⚠️ Burst overhead (~2-5s between bursts)
+- ⚠️ Adjustment latency = burst duration + check interval
+
 ## Files
 
 - `cmux_agent.py`: Main agent adapter implementing Terminal-Bench's agent interface
 - `cmux-run.sh`: Shell script that sets up environment and invokes cmux CLI
 - `cmux_payload.py`: Helper to package cmux app for containerized execution
 - `cmux_setup.sh.j2`: Jinja2 template for agent installation script
 - `sample_tasks.py`: Utility to randomly sample tasks from dataset
+- `adaptive_bench.py`: Adaptive concurrency wrapper using burst-and-resume pattern
+- `adaptive_bench_test.py`: Unit tests for adaptive_bench.py
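The README's "Adjustment latency = burst duration + check interval" tradeoff can be turned into a rough ramp-up estimate. The sketch below is a back-of-the-envelope model, not part of the commit. It assumes the load stays below the threshold at every check, that every burst takes the same `burst_seconds`, and it ignores the ~2-5s burst overhead. The function names are hypothetical, and the 360-second burst length is only an illustrative figure borrowed from the commit message's "6+ minute avg task duration".

```python
def doublings_to_reach(max_concurrent: int, start: int = 1) -> int:
    """Number of doubling steps needed to go from start to max_concurrent."""
    steps, level = 0, start
    while level < max_concurrent:
        level = min(level * 2, max_concurrent)  # doubling is capped at the max
        steps += 1
    return steps


def ramp_up_seconds(max_concurrent: int, burst_seconds: float,
                    check_interval: float = 60, start: int = 1) -> float:
    """Time until the burst that first runs at max_concurrent can begin.

    Each adjustment costs one burst plus one check interval, following the
    README's 'burst duration + check interval' latency.
    """
    return doublings_to_reach(max_concurrent, start) * (burst_seconds + check_interval)


if __name__ == "__main__":
    # Four doublings (1 -> 2 -> 4 -> 8 -> 16) at 360 s + 60 s each.
    print(doublings_to_reach(16), ramp_up_seconds(16, burst_seconds=360))
```

Under these assumptions, reaching a cap of 16 takes four adjustments, which is one reason the README recommends fixed concurrency for small task samples.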
