Commit 612b23a

Merge pull request #1 from metauto-ai/release/initialize
release agent-as-a-judge

2 parents 163a00d + d625bb4

File tree

443 files changed: +45710 −5 lines changed

.config/pre-commit-config.yaml

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+repos:
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v4.0.1
+    hooks:
+      - id: trailing-whitespace
+      - id: end-of-file-fixer
+      - id: check-yaml
+      - id: check-json
+  - repo: https://github.com/pre-commit/mirrors-mypy
+    rev: v0.910
+    hooks:
+      - id: mypy
+        additional_dependencies: ['types-termcolor']
+        language: python
+        entry: poetry run mypy

.env.sample

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+DEFAULT_LLM="gpt-4o-2024-08-06"
+OPENAI_API_KEY="sk-***"
+PROJECT_DIR="{PATH_TO_THIS_PROJECT}"

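The project presumably loads these settings with a dotenv-style helper. As a rough illustration only (this is not the repo's actual loader), a minimal stdlib parser for such a `KEY="value"` file might look like:

```python
import os


def load_env(path: str) -> dict[str, str]:
    """Parse simple KEY="value" lines from a .env-style file into os.environ.

    Illustrative sketch: real dotenv loaders handle quoting, escapes,
    and interpolation that this deliberately skips.
    """
    values: dict[str, str] = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, and malformed lines
            key, _, raw = line.partition("=")
            values[key.strip()] = raw.strip().strip('"')
    os.environ.update(values)
    return values
```

With the `.env.sample` contents above, `load_env(".env")["DEFAULT_LLM"]` would yield `gpt-4o-2024-08-06`.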
.github/.codecov.yml

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+codecov:
+  notify:
+    wait_for_ci: true
+
+coverage:
+  status:
+    patch:
+      default:
+        threshold: 100%
+    project:
+      default:
+        threshold: 5%
+comment: false
+github_checks:
+  annotations: false

.github/dependabot.yml

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+version: 2
+updates:
+  - package-ecosystem: "pip"
+    directory: "/"
+    schedule:
+      interval: "weekly"
+    open-pull-requests-limit: 5
+    assignees:
+      - mczhuge
+    labels:
+      - "dependencies"

.gitignore

Lines changed: 13 additions & 3 deletions
@@ -106,10 +106,8 @@ ipython_config.py
 #pdm.lock
 # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
 # in version control.
-# https://pdm.fming.dev/latest/usage/project/#working-with-version-control
+# https://pdm.fming.dev/#use-with-ide
 .pdm.toml
-.pdm-python
-.pdm-build/
 
 # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
 __pypackages__/
@@ -160,3 +158,15 @@ cython_debug/
 # and can be added to the global gitignore or merged into this file. For a more nuclear
 # option (not recommended) you can uncomment the following to ignore the entire idea folder.
 #.idea/
+
+# log
+workspace/
+first_test/
+
+# Cache
+Dockerfile
+.DS_Store
+
+
+# pycache
+*/__pycache__/

README.md

Lines changed: 115 additions & 2 deletions
@@ -1,2 +1,115 @@
-# agent-as-a-judge
-🤠 Agent-as-a-Judge and DevAI dataset
+<div align="center">
+  <h1 align="center">Agents Evaluate Agents</h1>
+  <img src="assets/devai_logo.png" alt="DevAI Logo" width="150" height="150">
+  <p align="center">
+    <a href="https://devai.tech"><b>Project</b></a> |
+    <a href="https://huggingface.co/DEVAI-benchmark"><b>Dataset</b></a> |
+    <a href="https://arxiv.org/pdf/2410.10934"><b>Paper</b></a>
+  </p>
+</div>
+
+> [!NOTE]
+> Current evaluation techniques are often inadequate for advanced **agentic systems** because they focus on final outcomes and rely on labor-intensive manual review. To overcome this limitation, we introduce the **Agent-as-a-Judge** framework.
+>
+
+## 🤠 Features
+
+Agent-as-a-Judge offers two key advantages:
+
+- **Automated Evaluation**: Agent-as-a-Judge can evaluate tasks during or after execution, saving 97.72% of the time and 97.64% of the cost of evaluation by human experts.
+- **Reward Signals**: It provides continuous, step-by-step feedback that can be used as reward signals for further agentic training and improvement.
+
+<div align="center">
+  <img src="assets/demo.gif" alt="Demo GIF" style="width: 100%; max-width: 650px;">
+</div>
+<div align="center">
+  <img src="assets/judge_first.png" alt="AaaJ" style="width: 95%; max-width: 650px;">
+</div>
+
+
+
+## 🎮 Quick Start
+
+### 1. Install
+
+```bash
+git clone https://github.com/metauto-ai/agent-as-a-judge.git
+cd agent-as-a-judge/
+conda create -n aaaj python=3.11
+conda activate aaaj
+pip install poetry
+poetry install
+```
+
+### 2. Set the LLM and API keys
+
+Before running, rename `.env.sample` to `.env` in the main repo folder and fill in the **required API keys and settings** to enable LLM calls. The `LiteLLM` library supports a wide range of LLMs.
+
+### 3. Run
+
+> [!TIP]
+> See the more comprehensive [usage scripts](scripts/README.md).
+>
+
+
+#### Usage A: **Ask Anything** about Any Workspace
+
+```bash
+
+PYTHONPATH=. python scripts/run_ask.py \
+  --workspace $(pwd)/benchmark/workspaces/OpenHands/39_Drug_Response_Prediction_SVM_GDSC_ML \
+  --question "What does this workspace contain?"
+```
+
+You can find an [example](assets/ask_sample.md) of how **Ask Anything** works.
+
+
+#### Usage B: **Agent-as-a-Judge** for **DevAI**
+
+
+```bash
+
+PYTHONPATH=. python scripts/run_aaaj.py \
+  --developer_agent "OpenHands" \
+  --setting "black_box" \
+  --planning "efficient (no planning)" \
+  --benchmark_dir $(pwd)/benchmark
+```
+
+💡 There is an [example](assets/aaaj_sample.md) showing how **Agent-as-a-Judge** collects evidence for judging.
+
+
+
+## 🤗 DevAI Dataset
+
+
+
+<div align="center">
+  <img src="assets/dataset.png" alt="Dataset" style="width: 100%; max-width: 600px;">
+</div>
+
+> [!IMPORTANT]
+> As a **proof of concept**, we applied **Agent-as-a-Judge** to code generation tasks using **DevAI**, a benchmark of 55 realistic AI development tasks with 365 hierarchical user requirements. The results show that **Agent-as-a-Judge** significantly outperforms traditional evaluation methods, delivering reliable reward signals for scalable self-improvement in agentic systems.
+>
+> Check out the dataset on [Hugging Face 🤗](https://huggingface.co/DEVAI-benchmark).
+> See how to use this dataset in the [guidelines](benchmark/devai/README.md).
+
+
+<!-- <div align="center">
+  <img src="assets/sample.jpeg" alt="Sample" style="width: 100%; max-width: 600px;">
+</div> -->
+
+## Reference
+
+Feel free to cite us if you find the Agent-as-a-Judge concept useful for your work:
+
+```
+@article{zhuge2024agent,
+  title={Agent-as-a-Judge: Evaluate Agents with Agents},
+  author={Zhuge, Mingchen and Zhao, Changsheng and Ashley, Dylan and Wang, Wenyi and Khizbullin, Dmitrii and Xiong, Yunyang and Liu, Zechun and Chang, Ernie and Krishnamoorthi, Raghuraman and Tian, Yuandong and Shi, Yangyang and Chandra, Vikas and Schmidhuber, J{\"u}rgen},
+  journal={arXiv preprint arXiv:2410.10934},
+  year={2024}
+}
+```
+
+

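The README's claim that step-by-step judgments can serve as reward signals can be illustrated with a small hypothetical sketch. The class and function names below are invented for illustration and are not the repo's API: each user requirement receives a boolean verdict, and the fraction satisfied becomes a scalar reward.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """One judged requirement (hypothetical structure, not the repo's API)."""
    requirement: str
    satisfied: bool


def reward_signal(verdicts: list[Verdict]) -> float:
    """Collapse per-requirement judgments into a scalar reward in [0, 1]."""
    if not verdicts:
        return 0.0
    return sum(v.satisfied for v in verdicts) / len(verdicts)


# Example verdicts in the spirit of DevAI's hierarchical requirements:
verdicts = [
    Verdict("loads the GDSC dataset", True),
    Verdict("trains an SVM model", True),
    Verdict("saves predictions to a results file", False),
]
print(reward_signal(verdicts))  # fraction of requirements judged satisfied
```

A trainer could consume this scalar after each agent trajectory, which is what makes intermediate judgments usable as a training signal rather than only a pass/fail grade.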
agent_as_a_judge/__init__.py

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+from .llm.provider import LLM
+from .llm.cost import Cost
+
+__all__ = ["LLM", "Cost"]

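`__init__.py` exposes `LLM` and `Cost` from `agent_as_a_judge/llm/`. As a rough guess at what a cost accumulator like `Cost` might track (this sketch, including the example per-token rates, is invented and is not the repo's code), consider:

```python
from dataclasses import dataclass, field


@dataclass
class CostTracker:
    """Hypothetical sketch of an LLM cost accumulator, not the repo's Cost class."""
    prompt_price_per_1k: float = 0.0025      # assumed example rate, USD per 1K prompt tokens
    completion_price_per_1k: float = 0.01    # assumed example rate, USD per 1K completion tokens
    total_usd: float = field(default=0.0, init=False)

    def add_call(self, prompt_tokens: int, completion_tokens: int) -> float:
        """Accumulate the cost of one LLM call and return that call's cost."""
        call_cost = (prompt_tokens / 1000) * self.prompt_price_per_1k + (
            completion_tokens / 1000
        ) * self.completion_price_per_1k
        self.total_usd += call_cost
        return call_cost


tracker = CostTracker()
tracker.add_call(prompt_tokens=1200, completion_tokens=300)
print(round(tracker.total_usd, 4))  # accumulated spend so far
```

Tracking spend per call is what lets the paper report cost savings relative to human evaluation, so some accumulator of this shape is a natural part of the package's public surface.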