Changes from 56 commits
70 commits
c7d1454
Add GAIA eval_infer for unified evaluation workflow
openhands-agent Dec 2, 2025
2f430c2
Add remote workspace support to GAIA evaluation
simonrosenberg Dec 3, 2025
58f299a
Fix GAIA evaluation to use workspace_type from command-line argument
simonrosenberg Dec 3, 2025
54c13a6
Add LocalWorkspace support for GAIA evaluations
simonrosenberg Dec 3, 2025
02fd3ba
Add 'local' to workspace type choices in argument parser
simonrosenberg Dec 3, 2025
7254a69
Add 'local' to workspace_type Literal in EvalMetadata model
simonrosenberg Dec 3, 2025
a7d0246
Add GAIA agent server image build and use remote workspace
simonrosenberg Dec 3, 2025
35b3985
Add compatibility inputs to GAIA build workflow
simonrosenberg Dec 3, 2025
2f93f40
Fix SWE-bench to use DockerDevWorkspace for base_image/target
simonrosenberg Dec 3, 2025
1a56035
Fix pyright type error in load_hf_dataset.py
simonrosenberg Dec 3, 2025
33a83f1
Simplify GAIA build workflow for single-image architecture
simonrosenberg Dec 3, 2025
b2e1e54
Rename workflow to singular: build-gaia-image.yml
simonrosenberg Dec 3, 2025
6822653
Fix YAML syntax errors in build-gaia-image.yml
openhands-agent Dec 4, 2025
f794aba
Fix pickle error in GAIA build by replacing lambda with regular function
openhands-agent Dec 4, 2025
f679baa
Add fallback Docker buildx setup when Blacksmith fails
openhands-agent Dec 4, 2025
0d61966
Always set up docker-container builder as fallback
openhands-agent Dec 4, 2025
ae4d2ff
[TEMPORARY] Disable Tavily requirement for GAIA testing
openhands-agent Dec 4, 2025
b0afa1f
Add format_report.py formatters for GAIA and SWE-bench
openhands-agent Dec 4, 2025
def87bc
Update formatters to read output.jsonl and report.json directly
openhands-agent Dec 4, 2025
9468988
Fix GAIA timeout issues by pre-installing MCP server
openhands-agent Dec 4, 2025
89b77c6
Add next steps documentation for MCP fix
openhands-agent Dec 4, 2025
0120ad6
Merge main into feature branch - resolve conflicts
openhands-agent Dec 4, 2025
4db1e81
Revert temporary Tavily disable - restore full functionality
openhands-agent Dec 4, 2025
9e13bb2
Add comprehensive workflow status documentation
openhands-agent Dec 4, 2025
13af333
Add MCP-enhanced image build to GAIA workflow
openhands-agent Dec 4, 2025
7170493
Add workflow run summary documentation
openhands-agent Dec 4, 2025
e814424
Refresh workflow cache - add descriptive comment
openhands-agent Dec 4, 2025
a153769
Remove redundant build-gaia-mcp-image.yml workflow
openhands-agent Dec 4, 2025
5749490
Force workflow cache refresh
openhands-agent Dec 4, 2025
e8ac276
Rename workflow to avoid GitHub Actions cache issue
openhands-agent Dec 4, 2025
b362d9d
Fix YAML syntax error: collapse multi-line Python code
openhands-agent Dec 4, 2025
60d6e17
Fix YAML syntax: replace heredoc with direct string assignment
openhands-agent Dec 4, 2025
cbe5f81
Fix YAML syntax: use jq for comment body to avoid multi-line string i…
openhands-agent Dec 4, 2025
51bd224
Add fallback Docker Buildx setup when Blacksmith fails
openhands-agent Dec 4, 2025
c5cc86c
Replace Blacksmith with standard Docker Buildx setup
openhands-agent Dec 4, 2025
6614af7
Fix GAIA evaluation: Use binary target instead of binary-minimal to i…
openhands-agent Dec 5, 2025
98e0965
Remove unnecessary documentation files
openhands-agent Dec 5, 2025
02f31bc
Remove unused code and fix workflow options
openhands-agent Dec 5, 2025
852b64c
Merge main into openhands/multi-benchmark-eval-support
openhands-agent Dec 5, 2025
7ee9082
Fix trailing whitespace in format_report.py files
openhands-agent Dec 5, 2025
e076657
Revert swt_bench/run_infer.py to main version - no functional changes…
openhands-agent Dec 5, 2025
bc65fbc
Remove outdated workflow comment
openhands-agent Dec 5, 2025
2af2bf8
Add docker workspace support to GAIA evaluation
openhands-agent Dec 5, 2025
1a812ea
Fix GAIA workspace: keep docker mode behavior same as main, only add …
openhands-agent Dec 6, 2025
4c8a9d6
Update SDK submodule to latest main (693c3261) to match evaluation ru…
openhands-agent Dec 6, 2025
f5d612f
Fix Browser action deserialization by using OpenHandsModel
openhands-agent Dec 6, 2025
233b79e
improve: enhance agent output extraction and ffmpeg installation
openhands-agent Dec 7, 2025
9533cf2
Fix critical logging bug: Failed instances now properly tracked and r…
openhands-agent Dec 7, 2025
b5cf5d9
[TEMPORARY] Hardcode 2 failed GAIA task IDs for debugging
openhands-agent Dec 7, 2025
b3ab6f5
Add flexible instance_ids parameter to GAIA evaluation
openhands-agent Dec 8, 2025
12da617
Fix error counting in GAIA evaluation
openhands-agent Dec 8, 2025
5bbb4d3
Fix pre-commit issues: formatting and type checking
openhands-agent Dec 8, 2025
cf5d990
Add instance-ids support for SWE-bench evaluation
openhands-agent Dec 9, 2025
f32654f
Fix duplicate --instance-ids argument in GAIA evaluation
openhands-agent Dec 9, 2025
f6be32a
Add SWT-Bench image build workflow and supporting scripts
openhands-agent Dec 9, 2025
cff751d
Use useblacksmith/setup-docker-builder@v1 for Docker Buildx
openhands-agent Dec 9, 2025
b2829f3
Merge main into openhands/multi-benchmark-eval-support
openhands-agent Dec 10, 2025
7d2e267
Fix SWT-Bench workflow: correct folder path from swt_bench to swtbench
openhands-agent Dec 10, 2025
27300d8
Fix SWT-Bench workflow: use Blacksmith for Docker setup
openhands-agent Dec 10, 2025
9b4cf76
Fix SWT-Bench: use swebench Docker namespace (not swtbench)
openhands-agent Dec 10, 2025
70dab0e
Fix SWT-Bench workflow: use Blacksmith runner
openhands-agent Dec 10, 2025
3832fe4
Fix GAIA workflow: use Blacksmith runner for Docker caching
openhands-agent Dec 10, 2025
54e4ebf
Fix SWT-Bench evaluation: start Docker daemon automatically
openhands-agent Dec 10, 2025
386b58d
Fix Docker daemon startup: remove sudo (container runs as root)
openhands-agent Dec 10, 2025
70d8dda
Rename build-gaia-eval-image.yml back to build-gaia-image.yml
openhands-agent Dec 10, 2025
938f72b
Sync GAIA workflow with main branch
openhands-agent Dec 10, 2025
0fd0fc1
Revert "Sync GAIA workflow with main branch"
simonrosenberg Dec 10, 2025
d9be9a0
Add local copy of k8s deploy workflow
openhands-agent Dec 10, 2025
a7231d3
Skip gaia build if available
simonrosenberg Dec 11, 2025
9de1000
Report GAIA failures as errors
simonrosenberg Dec 11, 2025
Original file line number Diff line number Diff line change
@@ -9,14 +9,6 @@ on:
  description: 'Software Agent SDK commit/ref to use'
  required: true
  type: string
- target:
- description: 'Build target (default: binary-minimal)'
- required: false
- default: 'binary-minimal'
- type: choice
- options:
- - binary-minimal
- - source-minimal

  concurrency:
  group: build-gaia-${{ github.ref }}
@@ -65,7 +57,7 @@ jobs:
  git add vendor/software-agent-sdk
  echo "Updated SDK submodule to $SDK_SHA (from ${{ inputs.sdk-commit }})"

- - name: Set up Docker Buildx with Blacksmith
+ - name: Set up Docker Buildx
  uses: useblacksmith/setup-docker-builder@v1

  - name: Log in to GitHub Container Registry
@@ -88,7 +80,8 @@ jobs:
  run: |
  set -euo pipefail

- TARGET="${{ inputs.target || 'binary-minimal' }}"
+ # GAIA requires 'binary' target to include Chromium for browser operations
+ TARGET="binary"

  CMD="uv run benchmarks/gaia/build_images.py \
  --image ghcr.io/openhands/eval-agent-server \
@@ -101,6 +94,39 @@ jobs:
  DOCKER_BUILDKIT: 1
  BUILDKIT_PROGRESS: plain

+ - name: Build and push GAIA image with MCP pre-installed
+ run: |
+ set -euo pipefail
+
+ # Get the SDK commit SHA for tagging
+ SDK_SHA=$(git submodule status vendor/software-agent-sdk | awk '{print $1}' | sed 's/^[+-]//' | cut -c1-7)
+
+ # GAIA requires 'binary' target to include Chromium for browser operations
+ TARGET="binary"
+
+ # Compute base and MCP image tags
+ BASE_IMAGE="ghcr.io/openhands/eval-agent-server:${SDK_SHA}-gaia"
+ MCP_IMAGE="ghcr.io/openhands/eval-agent-server:${SDK_SHA}-gaia-with-mcp"
+
+ echo "Building MCP-enhanced image..."
+ echo " Base image: ${BASE_IMAGE}"
+ echo " MCP image: ${MCP_IMAGE}"
+
+ # Build the derived image with MCP pre-cached
+ docker build \
+ -f benchmarks/gaia/Dockerfile.gaia \
+ --build-arg SDK_IMAGE="${BASE_IMAGE}" \
+ -t "${MCP_IMAGE}" \
+ .
+
+ # Push the image
+ docker push "${MCP_IMAGE}"
+
+ echo "✅ MCP-enhanced image built and pushed: ${MCP_IMAGE}"
+ env:
+ DOCKER_BUILDKIT: 1
+ BUILDKIT_PROGRESS: plain
+
  - name: Archive build logs
  if: always()
  run: |
@@ -157,6 +183,7 @@ jobs:
  run: |
  # Get SDK version
  SDK_SHA=$(git submodule status vendor/software-agent-sdk | awk '{print $1}' | sed 's/^[+-]//')
+ SDK_SHA_SHORT=${SDK_SHA:0:7}

  # Read the single manifest file
  MANIFEST_FILE=$(find builds -name "manifest.jsonl" -type f 2>/dev/null | head -1 || true)
@@ -167,18 +194,16 @@ jobs:
  fi

  # Extract the image tag from the manifest
- IMAGE_TAG=$(cat "$MANIFEST_FILE" | python3 -c "
- import sys, json
- data = json.loads(sys.stdin.read())
- tags = data.get('tags', [])
- print(tags[0] if tags else 'unknown')
- ")
+ IMAGE_TAG=$(cat "$MANIFEST_FILE" | python3 -c "import sys, json; data = json.loads(sys.stdin.read()); tags = data.get('tags', []); print(tags[0] if tags else 'unknown')")

  if [ "$IMAGE_TAG" = "unknown" ]; then
  echo "No valid image tag found in manifest"
  exit 0
  fi

+ # Construct MCP image tag (always binary for GAIA)
+ MCP_IMAGE_TAG="ghcr.io/openhands/eval-agent-server:${SDK_SHA_SHORT}-gaia-with-mcp"
+
  # Determine trigger source
  if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
  TRIGGER="Manual trigger (workflow_dispatch)"
@@ -188,22 +213,23 @@ print(tags[0] if tags else 'unknown')
  TRIGGER="${{ github.event_name }}"
  fi

- # Post comment
- COMMENT_BODY=$(cat <<EOF
- ## GAIA Image Build Complete ✅
-
- **SDK Version:** [\`${SDK_SHA:0:7}\`](https://github.com/OpenHands/software-agent-sdk/commit/${SDK_SHA})
- **Image Tag:** \`${IMAGE_TAG}\`
- **Workflow Run:** [#${{ github.run_id }}](${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }})
- **Triggered by:** ${TRIGGER}
- EOF
- )
-
+ # Post comment using jq to properly handle multi-line content
+ jq -n \
+ --arg sdk_short "${SDK_SHA_SHORT}" \
+ --arg sdk_full "${SDK_SHA}" \
+ --arg image "${IMAGE_TAG}" \
+ --arg mcp_image "${MCP_IMAGE_TAG}" \
+ --arg run_id "${{ github.run_id }}" \
+ --arg server_url "${{ github.server_url }}" \
+ --arg repo "${{ github.repository }}" \
+ --arg trigger "${TRIGGER}" \
+ '{body: "## GAIA Image Build Complete ✅\n\n**SDK Version:** [`\($sdk_short)`](https://github.com/OpenHands/software-agent-sdk/commit/\($sdk_full))\n**Base Image:** `\($image)`\n**MCP Image:** `\($mcp_image)` ⚡ _(MCP server pre-cached)_\n**Workflow Run:** [#\($run_id)](\($server_url)/\($repo)/actions/runs/\($run_id))\n**Triggered by:** \($trigger)\n\nThe MCP-enhanced image includes pre-cached `mcp-server-fetch` to eliminate 1-18 minute startup delays."}' | \
  curl -L -X POST \
  -H "Accept: application/vnd.github+json" \
  -H "Authorization: Bearer ${{ secrets.GITHUB_TOKEN }}" \
  -H "X-GitHub-Api-Version: 2022-11-28" \
+ -H "Content-Type: application/json" \
  "${{ github.api_url }}/repos/${{ github.repository }}/issues/81/comments" \
- -d "$(jq -n --arg body "$COMMENT_BODY" '{body: $body}')"
+ -d @-
  env:
  GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
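The SHA handling that feeds both the image tags and the tracker comment can be tried in isolation. The following is a minimal sketch, not the workflow's actual code: the function name `sdk_sha_from_status` and the sample status line are hypothetical, and it uses `cut -c1-7` (as the MCP build step does) rather than the Bash-only `${SDK_SHA:0:7}` so that it also runs under a plain POSIX shell.

```shell
set -eu

# Extract the commit SHA from one line of `git submodule status` output.
# The line may start with ' ', '+', or '-' depending on the submodule state;
# awk drops leading whitespace and sed strips a leading '+' or '-'.
sdk_sha_from_status() {
  printf '%s\n' "$1" | awk '{print $1}' | sed 's/^[+-]//'
}

# Hypothetical sample line. A real run would instead capture:
#   git submodule status vendor/software-agent-sdk
status_line='+0123456789abcdef0123456789abcdef01234567 vendor/software-agent-sdk (heads/main)'

SDK_SHA=$(sdk_sha_from_status "$status_line")
SDK_SHA_SHORT=$(printf '%s\n' "$SDK_SHA" | cut -c1-7)

echo "full:  ${SDK_SHA}"
echo "short: ${SDK_SHA_SHORT}"
```

Keeping the full SHA for the commit link and the 7-character form for tags mirrors how the comment step uses `SDK_SHA` and `SDK_SHA_SHORT`.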
218 changes: 218 additions & 0 deletions .github/workflows/build-swt-bench-images.yml
@@ -0,0 +1,218 @@
name: Build SWT-Bench Images

on:
pull_request_target:
types: [labeled]
workflow_dispatch:
inputs:
dataset:
description: 'Dataset name (e.g., princeton-nlp/SWE-bench_Verified)'
required: true
default: 'princeton-nlp/SWE-bench_Verified'
type: string
split:
description: 'Dataset split'
required: true
default: 'test'
type: string
max-workers:
description: 'Maximum number of parallel workers'
required: false
default: '4'
type: string
n-limit:
description: 'Limit number of images to build (0 for all)'
required: false
default: '0'
type: string
sdk-commit:
description: 'Software Agent SDK commit/ref to use'
required: true
type: string

concurrency:
group: build-swt-bench-${{ github.ref }}
cancel-in-progress: false

jobs:
build-and-push:
if: >
github.event_name == 'workflow_dispatch' ||
(github.event_name == 'pull_request_target' &&
github.event.label.name == 'build-swt-bench')

runs-on:
labels: ubuntu-latest

permissions:
contents: read
packages: write
issues: write

steps:
- name: Determine checkout ref
id: checkout-ref
run: |
if [ -n "${{ github.event.pull_request.head.sha }}" ]; then
echo "ref=${{ github.event.pull_request.head.sha }}" >> "$GITHUB_OUTPUT"
echo "Using PR head SHA: ${{ github.event.pull_request.head.sha }}"
else
echo "ref=" >> "$GITHUB_OUTPUT"
echo "Using default ref (the commit that triggered this workflow)"
fi

- uses: actions/checkout@v4
with:
ref: ${{ steps.checkout-ref.outputs.ref }}
submodules: recursive

- name: Update SDK submodule
if: ${{ github.event_name == 'workflow_dispatch' && inputs.sdk-commit != '' }}
run: |
cd vendor/software-agent-sdk
git fetch origin ${{ inputs.sdk-commit }}
git checkout FETCH_HEAD
SDK_SHA=$(git rev-parse HEAD)
cd ../..
git add vendor/software-agent-sdk
echo "Updated SDK submodule to $SDK_SHA (from ${{ inputs.sdk-commit }})"

- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3

- name: Log in to GitHub Container Registry
uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}

- name: Install uv
uses: astral-sh/setup-uv@v7
with:
enable-cache: true

- name: Install dependencies
run: |
make build

- name: Build and push SWT-Bench images
run: |
set -euo pipefail

# Get inputs with defaults
DATASET="${{ inputs.dataset || 'princeton-nlp/SWE-bench_Verified' }}"
SPLIT="${{ inputs.split || 'test' }}"
MAX_WORKERS="${{ inputs.max-workers || '4' }}"
N_LIMIT="${{ inputs.n-limit || '0' }}"

# SWT-Bench uses source-minimal target (same as SWE-bench)
TARGET="source-minimal"

CMD="uv run benchmarks/swt_bench/build_images.py \
--dataset ${DATASET} \
--split ${SPLIT} \
--image ghcr.io/openhands/eval-agent-server \
--target ${TARGET} \
--max-workers ${MAX_WORKERS} \
--push"

# Add n-limit if specified
if [ "$N_LIMIT" != "0" ]; then
CMD="$CMD --n-limit ${N_LIMIT}"
fi

echo "Running: $CMD"
eval "$CMD"
env:
DOCKER_BUILDKIT: 1
BUILDKIT_PROGRESS: plain

- name: Archive build logs
if: always()
run: |
if [ -d builds ]; then
tar -czf build-logs.tar.gz builds/
echo "Build logs archived successfully"
else
echo "No builds directory found"
fi

- name: Upload build logs
if: always()
uses: actions/upload-artifact@v4
with:
name: build-logs-${{ github.run_id }}
path: build-logs.tar.gz
retention-days: 7
if-no-files-found: warn

- name: Display build summary
if: always()
run: |
MANIFEST_FILE=$(find builds -name "manifest.jsonl" -type f 2>/dev/null | head -1 || true)

if [ -z "$MANIFEST_FILE" ]; then
echo "## Build Summary" >> "$GITHUB_STEP_SUMMARY"
echo "❌ Build failed - no manifest found" >> "$GITHUB_STEP_SUMMARY"
exit 0
fi

# Count total images built
TOTAL_IMAGES=$(wc -l < "$MANIFEST_FILE")
# grep -c prints 0 itself when nothing matches (and exits 1), so only swallow the exit status
SUCCESS_COUNT=$(grep -c '"error":null' "$MANIFEST_FILE" || true)
FAIL_COUNT=$((TOTAL_IMAGES - SUCCESS_COUNT))

echo "## SWT-Bench Image Build Summary" >> "$GITHUB_STEP_SUMMARY"
echo "" >> "$GITHUB_STEP_SUMMARY"
echo "**Total Images:** $TOTAL_IMAGES" >> "$GITHUB_STEP_SUMMARY"
echo "**Successful:** $SUCCESS_COUNT ✅" >> "$GITHUB_STEP_SUMMARY"
echo "**Failed:** $FAIL_COUNT ❌" >> "$GITHUB_STEP_SUMMARY"

- name: Comment on tracker issue
if: success()
run: |
# Get SDK version
SDK_SHA=$(git submodule status vendor/software-agent-sdk | awk '{print $1}' | sed 's/^[+-]//')
SDK_SHA_SHORT=${SDK_SHA:0:7}

# Read build summary
MANIFEST_FILE=$(find builds -name "manifest.jsonl" -type f 2>/dev/null | head -1 || true)

if [ -z "$MANIFEST_FILE" ]; then
echo "No manifest file found"
exit 0
fi

TOTAL_IMAGES=$(wc -l < "$MANIFEST_FILE")
# grep -c prints 0 itself when nothing matches (and exits 1), so only swallow the exit status
SUCCESS_COUNT=$(grep -c '"error":null' "$MANIFEST_FILE" || true)

# Determine trigger source
if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
TRIGGER="Manual trigger (workflow_dispatch)"
elif [ "${{ github.event_name }}" = "pull_request_target" ]; then
TRIGGER="Pull request [#${{ github.event.pull_request.number }}](${{ github.event.pull_request.html_url }})"
else
TRIGGER="${{ github.event_name }}"
fi

# Post comment using jq to properly handle multi-line content
jq -n \
--arg sdk_short "${SDK_SHA_SHORT}" \
--arg sdk_full "${SDK_SHA}" \
--arg total "$TOTAL_IMAGES" \
--arg success "$SUCCESS_COUNT" \
--arg run_id "${{ github.run_id }}" \
--arg server_url "${{ github.server_url }}" \
--arg repo "${{ github.repository }}" \
--arg trigger "${TRIGGER}" \
'{body: "## SWT-Bench Image Build Complete ✅\n\n**SDK Version:** [`\($sdk_short)`](https://github.com/OpenHands/software-agent-sdk/commit/\($sdk_full))\n**Images Built:** \($success)/\($total)\n**Workflow Run:** [#\($run_id)](\($server_url)/\($repo)/actions/runs/\($run_id))\n**Triggered by:** \($trigger)\n\nSWT-Bench images have been built and pushed to ghcr.io/openhands/eval-agent-server."}' | \
curl -L -X POST \
-H "Accept: application/vnd.github+json" \
-H "Authorization: Bearer ${{ secrets.GITHUB_TOKEN }}" \
-H "X-GitHub-Api-Version: 2022-11-28" \
-H "Content-Type: application/json" \
"${{ github.api_url }}/repos/${{ github.repository }}/issues/81/comments" \
-d @-
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
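The manifest tally used by the summary and comment steps can be sketched on its own. This is an illustrative sketch under stated assumptions: each manifest line is taken to be one JSON object that contains the literal text `"error":null` on success, and `count_results` and the sample manifest are hypothetical names and data. One detail worth noting is that `grep -c` prints the count (including `0`) and exits with status 1 when nothing matches, so the sketch ignores the exit status with `|| true` instead of echoing a fallback value.

```shell
set -eu

# Count total, successful, and failed builds in a manifest file.
# Assumes one JSON object per line, with the literal "error":null on success.
count_results() {
  manifest=$1
  total=$(wc -l < "$manifest" | tr -d ' ')
  # grep -c always prints a number; `|| true` only hides its non-zero
  # exit status when the count is 0.
  ok=$(grep -c '"error":null' "$manifest" || true)
  echo "total=${total} ok=${ok} failed=$((total - ok))"
}

# Hypothetical manifest with two successes and one failure.
manifest=$(mktemp)
printf '%s\n' \
  '{"tags":["img:a"],"error":null}' \
  '{"tags":[],"error":"build failed"}' \
  '{"tags":["img:c"],"error":null}' > "$manifest"

count_results "$manifest"
```

A text match like this is sensitive to JSON spacing (for example `"error": null`), so a structured parse with `jq` would be more robust if the manifest format is not fixed.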
17 changes: 17 additions & 0 deletions benchmarks/gaia/Dockerfile.gaia
@@ -0,0 +1,17 @@
# Dockerfile for GAIA evaluation with MCP server pre-installed
# Extends the base SDK image to pre-cache mcp-server-fetch and eliminate startup delays

ARG SDK_IMAGE=ghcr.io/openhands/eval-agent-server:f715937-gaia-binary-minimal
FROM ${SDK_IMAGE}

# Switch to root to install packages
USER root

# Pre-install MCP server to avoid 1-18 minute startup delays during agent initialization
# This caches the mcp-server-fetch package so uvx can start it instantly at runtime
RUN uvx mcp-server-fetch --version 2>&1 || echo "MCP server cached"

# Switch back to openhands user
USER openhands

# Inherit all other settings from base image
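For reference, here is a dry-run sketch of how this Dockerfile is meant to be invoked with a non-default `SDK_IMAGE`. It only prints the command and never calls Docker. The tag scheme (`<sha>-gaia` and `<sha>-gaia-with-mcp`) is copied from the GAIA workflow step in this PR; the function name `print_mcp_build_cmd` and the SHA `abc1234` are placeholders, not real artifacts.

```shell
set -eu

# Print (without running) the docker build command for the derived image.
# $1 is the short SDK SHA used in both image tags.
print_mcp_build_cmd() {
  sdk_sha=$1
  base_image="ghcr.io/openhands/eval-agent-server:${sdk_sha}-gaia"
  mcp_image="ghcr.io/openhands/eval-agent-server:${sdk_sha}-gaia-with-mcp"
  echo "docker build -f benchmarks/gaia/Dockerfile.gaia --build-arg SDK_IMAGE=${base_image} -t ${mcp_image} ."
}

# 'abc1234' is a placeholder SHA, not a published tag.
print_mcp_build_cmd abc1234
```

Passing `SDK_IMAGE` explicitly matters because the Dockerfile's default value points at a `-gaia-binary-minimal` tag, while the workflow builds from the `-gaia` tag.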