You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
🤖 fix: parse 'results' field instead of 'trials' in results.json
Terminal-bench's results.json uses 'results' field, not 'trials'. This caused
get_run_status() to always return completed=0, leading to infinite loops where
the script would keep resuming even after all tasks were done.
Tested locally with 3 tasks - script now correctly detects completion and exits.
Copy file name to clipboardExpand all lines: benchmarks/terminal_bench/README.md
+23-39Lines changed: 23 additions & 39 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -105,63 +105,47 @@ Based on analysis of the Oct 30 nightly run (15-minute timeout):
105
105
106
106
**Impact of 30-minute timeout**: Expected to reduce false timeout failures by ~50% and improve pass rates by 10-15 percentage points (from ~42% to ~52-57%).
107
107
108
-
## Adaptive Concurrency Mode
108
+
## Adaptive Concurrency
109
109
110
-
The `benchmark-terminal-adaptive` target automatically adjusts concurrency based on system load using a **burst-and-resume pattern**:
111
-
112
-
```bash
113
-
# Start with concurrency=1, scale up to max 16 based on load
114
-
TB_MAX_CONCURRENT=16 make benchmark-terminal-adaptive
115
-
116
-
# More conservative: max 8, higher load threshold
117
-
TB_MAX_CONCURRENT=8 TB_LOAD_THRESHOLD=2.0 make benchmark-terminal-adaptive
118
-
119
-
# Faster adjustments: check every 30 seconds
120
-
TB_CHECK_INTERVAL=30 TB_MAX_CONCURRENT=16 make benchmark-terminal-adaptive
121
-
122
-
# Sample 5 tasks with adaptive concurrency
123
-
TB_SAMPLE_SIZE=5 TB_MAX_CONCURRENT=8 make benchmark-terminal-adaptive
124
-
```
110
+
Terminal-bench uses **adaptive concurrency** that automatically scales from 1-16 concurrent tasks based on system load using a **burst-and-resume pattern**:
125
111
126
112
### How It Works
127
113
128
-
1.**Runs terminal-bench in bursts**with current concurrency
129
-
2.**Monitors system load** after each burst completes
114
+
1.**Starts with concurrency=1**and runs a burst
115
+
2.**Monitors system load**(1-minute average) after each burst completes
130
116
3.**Adjusts concurrency** using hysteresis:
131
-
-**Double** when 1-minute load avg < threshold
132
-
-**Halve** when 1-minute load avg > threshold
133
-
4.**Resumes** the run with updated concurrency
117
+
-**Double** when load < threshold (default: 1.0)
118
+
-**Halve** when load > threshold
119
+
-**Bounded to [1, 16]** for optimal performance
120
+
4.**Resumes** the run with updated concurrency (skips completed tasks)
134
121
135
-
The burst-and-resume pattern leverages terminal-bench's native resume capability to skip completed tasks. Each burst runs to completion (no mid-task interruption), ensuring clean Docker container lifecycle.
122
+
The burst-and-resume pattern leverages terminal-bench's native resume capability. Each burst runs to completion with no mid-task interruption, ensuring clean Docker container lifecycle.
136
123
137
124
### Configuration
138
125
126
+
```bash
127
+
# Adjust load threshold (default: 1.0)
128
+
TB_LOAD_THRESHOLD=2.0 make benchmark-terminal
129
+
130
+
# Faster adjustments (default: 60s between bursts)
131
+
TB_CHECK_INTERVAL=30 make benchmark-terminal
132
+
133
+
# Sample 5 tasks with adaptive concurrency
134
+
TB_SAMPLE_SIZE=5 make benchmark-terminal
135
+
```
136
+
139
137
| Variable | Default | Description |
140
138
|----------|---------|-------------|
141
-
|`TB_MAX_CONCURRENT`| 16 | Maximum concurrency limit |
142
139
|`TB_LOAD_THRESHOLD`| 1.0 | Load average threshold for adjusting concurrency |
143
140
|`TB_CHECK_INTERVAL`| 60 | Seconds to wait between bursts |
144
141
145
-
### When to Use Adaptive Mode
146
-
147
-
**Use adaptive mode when:**
148
-
- Running on shared hardware with variable load
149
-
- Unsure of optimal concurrency for your system
150
-
- Want to maximize throughput without overloading
151
-
- Running long benchmark suites (full 80-task suite)
152
-
153
-
**Use fixed concurrency when:**
154
-
- Running on dedicated hardware
155
-
- Know optimal concurrency for your setup
156
-
- Running small task samples (< 10 tasks)
157
-
- Burst overhead (2-5s) matters for very short tasks
158
-
159
142
### Tradeoffs
160
143
161
-
- ✅ Automatically finds optimal concurrency
144
+
- ✅ Automatically finds optimal concurrency for hardware
162
145
- ✅ Prevents system overload
163
146
- ✅ Clean container lifecycle (no mid-task kills)
164
-
- ⚠️ Burst overhead (~2-5s between bursts)
147
+
- ✅ Bounded to [1, 16] for safety
148
+
- ⚠️ Burst overhead (~2-5s, negligible for 6+ min avg tasks)
0 commit comments