Skip to content

Commit fbc0b46

Browse files
committed
🤖 fix: parse 'results' field instead of 'trials' in results.json
Terminal-bench's results.json uses 'results' field, not 'trials'. This caused get_run_status() to always return completed=0, leading to infinite loops where the script would keep resuming even after all tasks were done. Tested locally with 3 tasks - script now correctly detects completion and exits.
1 parent acf3998 commit fbc0b46

File tree

3 files changed

+25
-41
lines changed

3 files changed

+25
-41
lines changed

benchmarks/terminal_bench/README.md

Lines changed: 23 additions & 39 deletions
Original file line numberDiff line numberDiff line change
@@ -105,63 +105,47 @@ Based on analysis of the Oct 30 nightly run (15-minute timeout):
105105

106106
**Impact of 30-minute timeout**: Expected to reduce false timeout failures by ~50% and improve pass rates by 10-15 percentage points (from ~42% to ~52-57%).
107107

108-
## Adaptive Concurrency Mode
108+
## Adaptive Concurrency
109109

110-
The `benchmark-terminal-adaptive` target automatically adjusts concurrency based on system load using a **burst-and-resume pattern**:
111-
112-
```bash
113-
# Start with concurrency=1, scale up to max 16 based on load
114-
TB_MAX_CONCURRENT=16 make benchmark-terminal-adaptive
115-
116-
# More conservative: max 8, higher load threshold
117-
TB_MAX_CONCURRENT=8 TB_LOAD_THRESHOLD=2.0 make benchmark-terminal-adaptive
118-
119-
# Faster adjustments: check every 30 seconds
120-
TB_CHECK_INTERVAL=30 TB_MAX_CONCURRENT=16 make benchmark-terminal-adaptive
121-
122-
# Sample 5 tasks with adaptive concurrency
123-
TB_SAMPLE_SIZE=5 TB_MAX_CONCURRENT=8 make benchmark-terminal-adaptive
124-
```
110+
Terminal-bench uses **adaptive concurrency** that automatically scales from 1-16 concurrent tasks based on system load using a **burst-and-resume pattern**:
125111

126112
### How It Works
127113

128-
1. **Runs terminal-bench in bursts** with current concurrency
129-
2. **Monitors system load** after each burst completes
114+
1. **Starts with concurrency=1** and runs a burst
115+
2. **Monitors system load** (1-minute average) after each burst completes
130116
3. **Adjusts concurrency** using hysteresis:
131-
- **Double** when 1-minute load avg < threshold
132-
- **Halve** when 1-minute load avg > threshold
133-
4. **Resumes** the run with updated concurrency
117+
- **Double** when load < threshold (default: 1.0)
118+
- **Halve** when load > threshold
119+
- **Bounded to [1, 16]** for optimal performance
120+
4. **Resumes** the run with updated concurrency (skips completed tasks)
134121

135-
The burst-and-resume pattern leverages terminal-bench's native resume capability to skip completed tasks. Each burst runs to completion (no mid-task interruption), ensuring clean Docker container lifecycle.
122+
The burst-and-resume pattern leverages terminal-bench's native resume capability. Each burst runs to completion with no mid-task interruption, ensuring clean Docker container lifecycle.
136123

137124
### Configuration
138125

126+
```bash
127+
# Adjust load threshold (default: 1.0)
128+
TB_LOAD_THRESHOLD=2.0 make benchmark-terminal
129+
130+
# Faster adjustments (default: 60s between bursts)
131+
TB_CHECK_INTERVAL=30 make benchmark-terminal
132+
133+
# Sample 5 tasks with adaptive concurrency
134+
TB_SAMPLE_SIZE=5 make benchmark-terminal
135+
```
136+
139137
| Variable | Default | Description |
140138
|----------|---------|-------------|
141-
| `TB_MAX_CONCURRENT` | 16 | Maximum concurrency limit |
142139
| `TB_LOAD_THRESHOLD` | 1.0 | Load average threshold for adjusting concurrency |
143140
| `TB_CHECK_INTERVAL` | 60 | Seconds to wait between bursts |
144141

145-
### When to Use Adaptive Mode
146-
147-
**Use adaptive mode when:**
148-
- Running on shared hardware with variable load
149-
- Unsure of optimal concurrency for your system
150-
- Want to maximize throughput without overloading
151-
- Running long benchmark suites (full 80-task suite)
152-
153-
**Use fixed concurrency when:**
154-
- Running on dedicated hardware
155-
- Know optimal concurrency for your setup
156-
- Running small task samples (< 10 tasks)
157-
- Burst overhead (2-5s) matters for very short tasks
158-
159142
### Tradeoffs
160143

161-
- ✅ Automatically finds optimal concurrency
144+
- ✅ Automatically finds optimal concurrency for hardware
162145
- ✅ Prevents system overload
163146
- ✅ Clean container lifecycle (no mid-task kills)
164-
- ⚠️ Burst overhead (~2-5s between bursts)
147+
- ✅ Bounded to [1, 16] for safety
148+
- ⚠️ Burst overhead (~2-5s, negligible for 6+ min avg tasks)
165149
- ⚠️ Adjustment latency = burst duration + check interval
166150

167151
## Files

benchmarks/terminal_bench/adaptive_bench.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -69,7 +69,7 @@ def get_run_status(self) -> dict:
6969
results_data = json.load(f)
7070
# Count unique task_ids in results
7171
completed = len(
72-
set(r["task_id"] for r in results_data.get("trials", []))
72+
set(r["task_id"] for r in results_data.get("results", []))
7373
)
7474

7575
return {

benchmarks/terminal_bench/adaptive_bench_test.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -173,7 +173,7 @@ def test_get_run_status_with_results(self, mock_exists, mock_file):
173173

174174
# Mock results.json with 3 completed tasks
175175
results_data = {
176-
"trials": [
176+
"results": [
177177
{"task_id": "task1", "resolved": True},
178178
{"task_id": "task2", "resolved": False},
179179
{"task_id": "task3", "resolved": True},

0 commit comments

Comments
 (0)