🤖 feat: add hysteresis-based adaptive concurrency for terminal-bench
Implements adaptive concurrency control for terminal-bench using a
burst-and-resume pattern that automatically adjusts parallelism based on
system load average.
## Key Features
- **Hysteresis-based adjustment**: Double concurrency when load < threshold,
halve when load > threshold
- **Burst-and-resume pattern**: Runs terminal-bench in bursts, using native
resume capability to skip completed tasks between bursts
- **Clean container lifecycle**: No mid-task interruption, each burst
completes naturally before adjusting
- **Configurable parameters**: Max concurrency, load threshold, check interval
## Implementation
- `benchmarks/terminal_bench/adaptive_bench.py`: Main wrapper implementing
burst-and-resume logic with load monitoring
- `benchmarks/terminal_bench/adaptive_bench_test.py`: Unit tests for adaptive
logic
- `Makefile`: New `benchmark-terminal-adaptive` target
- Documentation updates in `benchmarks/terminal_bench/README.md`
## Usage
```bash
# Start with concurrency=1, scale up to 16 based on load
TB_MAX_CONCURRENT=16 make benchmark-terminal-adaptive
# Conservative: max 8, higher load threshold
TB_MAX_CONCURRENT=8 TB_LOAD_THRESHOLD=2.0 make benchmark-terminal-adaptive
# Sample 5 tasks with adaptive concurrency
TB_SAMPLE_SIZE=5 TB_MAX_CONCURRENT=8 make benchmark-terminal-adaptive
```
## How It Works
1. Start with concurrency=1
2. Run terminal-bench burst with current concurrency
3. After burst completes, check 1-minute load average
4. Adjust concurrency: double if load < threshold, halve if load > threshold
5. Update tb.lock with new concurrency
6. Resume run (skips completed tasks automatically)
7. Repeat until all tasks complete
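
The adjustment in step 4 can be sketched as a small pure function. This is a minimal illustration, not the code in `adaptive_bench.py`: the name `next_concurrency`, the clamping to the range 1 to the maximum, and leaving concurrency unchanged when load equals the threshold are all assumptions.

```python
def next_concurrency(current: int, load: float, threshold: float, max_concurrent: int) -> int:
    """Return the concurrency to use for the next burst (hypothetical helper)."""
    if load < threshold:
        proposed = current * 2   # system has headroom: double
    elif load > threshold:
        proposed = current // 2  # system is overloaded: halve
    else:
        proposed = current       # assumption: no change exactly at the threshold
    # Assumption: never drop below 1 worker or exceed the configured maximum.
    return max(1, min(proposed, max_concurrent))
```

With a threshold of 1.0 and a maximum of 16, repeated low-load bursts would step through 1, 2, 4, 8, 16 and then stay at 16.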
## Tradeoffs
- ✅ Automatically finds optimal concurrency for hardware
- ✅ Prevents system overload
- ✅ Uses terminal-bench native features (resume, tb.lock)
- ⚠️ Burst overhead ~2-5s (acceptable for 6+ minute avg task duration)
- ⚠️ Modifies tb.lock (semi-internal format, but stable)
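
To put the burst overhead in perspective, the arithmetic can be worked through directly. The helper name below is hypothetical, and the 10-second task is an invented comparison point; only the 2-5s overhead and the 6-minute average come from the description above.

```python
def burst_overhead_fraction(overhead_s: float, work_s: float) -> float:
    """Overhead of one burst restart relative to the useful work in that burst."""
    if work_s <= 0:
        raise ValueError("work_s must be positive")
    return overhead_s / work_s

# Worst-case 5 s overhead against a single 6-minute (360 s) task.
long_task = burst_overhead_fraction(5, 360)
# The same overhead against a hypothetical 10 s task.
short_task = burst_overhead_fraction(5, 10)
print(f"{long_task:.1%} vs {short_task:.0%}")  # prints: 1.4% vs 50%
```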
## Design Rationale
Research showed that terminal-bench uses a fixed-size ThreadPoolExecutor
that cannot be resized mid-run. A kill-and-restart approach would interrupt
Docker containers mid-task. Burst-and-resume instead leverages
terminal-bench's built-in resume capability for clean checkpointing and
task skipping.
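
The rationale can be illustrated with a toy model. Python's `ThreadPoolExecutor` takes `max_workers` at construction time, so changing parallelism means starting a new pool. The sketch below is not terminal-bench code: `run_in_bursts`, the `completed` set standing in for resume, and the assumption that one burst handles at most `size` tasks are all hypothetical, since the source does not show how the wrapper decides where a burst ends.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Set

def run_in_bursts(
    tasks: Iterable[str],
    worker: Callable[[str], None],
    burst_sizes: Iterable[int],
) -> List[List[str]]:
    """Toy model: each burst gets a fresh pool sized for that burst."""
    task_list = list(tasks)
    completed: Set[str] = set()
    bursts: List[List[str]] = []
    for size in burst_sizes:
        # "Resume": skip anything finished in an earlier burst.
        pending = [t for t in task_list if t not in completed]
        if not pending:
            break
        batch = pending[:size]
        # max_workers is fixed for the lifetime of an executor, so a new
        # executor is created for every burst instead of resizing one.
        with ThreadPoolExecutor(max_workers=size) as pool:
            # Leaving the with-block waits for every task in the batch,
            # so no task is interrupted when the size changes.
            list(pool.map(worker, batch))
        completed.update(batch)
        bursts.append(batch)
    return bursts
```

For example, `run_in_bursts(["a", "b", "c", "d", "e"], print, [1, 2, 4])` would process `a` alone, then `b` and `c` together, then `d` and `e`.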
_Generated with `cmux`_
`benchmarks/terminal_bench/README.md`: 61 additions & 0 deletions
@@ -99,10 +99,71 @@ Based on analysis of the Oct 30 nightly run (15-minute timeout):
**Impact of 30-minute timeout**: Expected to reduce false timeout failures by ~50% and improve pass rates by 10-15 percentage points (from ~42% to ~52-57%).

## Adaptive Concurrency Mode

The `benchmark-terminal-adaptive` target automatically adjusts concurrency based on system load using a **burst-and-resume pattern**:

```bash
# Start with concurrency=1, scale up to max 16 based on load
TB_MAX_CONCURRENT=16 make benchmark-terminal-adaptive

# More conservative: max 8, higher load threshold
TB_MAX_CONCURRENT=8 TB_LOAD_THRESHOLD=2.0 make benchmark-terminal-adaptive

# Faster adjustments: check every 30 seconds
TB_CHECK_INTERVAL=30 TB_MAX_CONCURRENT=16 make benchmark-terminal-adaptive

# Sample 5 tasks with adaptive concurrency
TB_SAMPLE_SIZE=5 TB_MAX_CONCURRENT=8 make benchmark-terminal-adaptive
```

### How It Works

1. **Runs terminal-bench in bursts** with current concurrency
2. **Monitors system load** after each burst completes
3. **Adjusts concurrency** using hysteresis:
   - **Double** when 1-minute load avg < threshold
   - **Halve** when 1-minute load avg > threshold
4. **Resumes** the run with updated concurrency

The burst-and-resume pattern leverages terminal-bench's native resume capability to skip completed tasks. Each burst runs to completion (no mid-task interruption), ensuring a clean Docker container lifecycle.
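
One way the load check between bursts could be implemented is with the standard library's `os.getloadavg()`. This is a sketch under that assumption; the source does not say how `adaptive_bench.py` reads the load, and the helper name `one_minute_load` is hypothetical.

```python
import os
from typing import Optional

def one_minute_load() -> Optional[float]:
    """Return the 1-minute load average, or None where it is unavailable."""
    try:
        # os.getloadavg() returns the (1-, 5-, 15-minute) averages on Unix.
        return os.getloadavg()[0]
    except (AttributeError, OSError):
        # AttributeError: the platform has no getloadavg (e.g. Windows).
        # OSError: the load average could not be obtained.
        return None
```

Load average is not normalized by CPU count, so a threshold of 1.0 is reached at roughly one fully busy core. On machines with many cores, a higher `TB_LOAD_THRESHOLD` may be the more useful setting.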

### Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| `TB_MAX_CONCURRENT` | 16 | Maximum concurrency limit |
| `TB_LOAD_THRESHOLD` | 1.0 | Load average threshold for adjusting concurrency |
| `TB_CHECK_INTERVAL` | 60 | Seconds to wait between bursts |
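
As an illustration of how these variables and defaults might be consumed, the sketch below reads them from the environment. The `AdaptiveConfig` dataclass and `load_config` function are hypothetical names, and the actual wrapper may instead receive these values as command-line arguments from the Makefile.

```python
import os
from dataclasses import dataclass
from typing import Mapping

@dataclass(frozen=True)
class AdaptiveConfig:
    """Settings from the table above, with the documented defaults."""
    max_concurrent: int = 16
    load_threshold: float = 1.0
    check_interval: int = 60  # seconds to wait between bursts

def load_config(env: Mapping[str, str] = os.environ) -> AdaptiveConfig:
    """Build a config from environment variables, falling back to defaults."""
    return AdaptiveConfig(
        max_concurrent=int(env.get("TB_MAX_CONCURRENT", "16")),
        load_threshold=float(env.get("TB_LOAD_THRESHOLD", "1.0")),
        check_interval=int(env.get("TB_CHECK_INTERVAL", "60")),
    )
```

Passing the environment as a parameter keeps the function easy to test: `load_config({"TB_MAX_CONCURRENT": "8"})` overrides only the maximum and leaves the other two defaults in place.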

### When to Use Adaptive Mode

**Use adaptive mode when:**
- Running on shared hardware with variable load
- Unsure of optimal concurrency for your system
- Want to maximize throughput without overloading
- Running long benchmark suites (full 80-task suite)

**Use fixed concurrency when:**
- Running on dedicated hardware
- Know optimal concurrency for your setup
- Running small task samples (< 10 tasks)
- Burst overhead (2-5s) matters for very short tasks