Our approach builds a self-evolving system for enhancing LLMs' general reasoning capabilities through three collaborative roles:
- **Proposer**: Generates new reasoning questions wrapped in `<question>...</question>`. Each question is evaluated for quality, difficulty, and format; only high-quality, learnable questions are kept for training.
- **Solver**: Answers the valid questions within `<answer>...</answer>`. Its performance helps measure task difficulty and provides feedback for both question generation and model improvement.
- **Judge**: Evaluates questions and answers, reasoning in `<think>...</think>` and producing numeric scores in `<score>...</score>`. These scores serve as rewards for the Proposer and Solver, enabling stable reinforcement learning.
All three roles share one underlying model and are updated together using Task-Relative REINFORCE++. The system forms a continuous self-improving loop that strengthens reasoning without external supervision.
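Conceptually, one iteration of this loop looks like the sketch below (all function names, arguments, and thresholds are illustrative stand-ins, not the repository's actual API):

```python
def self_play_step(model, propose, solve, judge, update, batch_size=64, quality_threshold=0.5):
    """One conceptual Multi-Agent Evolve iteration; all callables are illustrative stand-ins."""
    # 1. Proposer: the shared model generates candidate <question>...</question> items.
    questions = [propose(model) for _ in range(batch_size)]

    # 2. Judge: score each question for quality, difficulty, and format; keep learnable ones.
    valid = [q for q in questions if judge(model, question=q) >= quality_threshold]

    # 3. Solver: the same model answers each valid question in <answer>...</answer>.
    answers = [solve(model, q) for q in valid]

    # 4. Judge: reason in <think>...</think> and emit a numeric <score>...</score> per answer.
    rewards = [judge(model, question=q, answer=a) for q, a in zip(valid, answers)]

    # 5. Task-Relative REINFORCE++: one gradient update shared by all three roles,
    #    with advantages normalized per role/task rather than globally.
    update(model, valid, answers, rewards)
```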
| Model | ID Avg | OOD Avg | Total Avg |
|---|---|---|---|
| *w/o reference questions* | | | |
| Qwen2.5-3B-Instruct | 63.34 | 41.32 | 55.33 |
| AZR | 67.09 | 41.33 | 57.72 |
| MAE (zero) | 68.37 | 42.48 | 58.51 |
| *w/ reference questions* | | | |
| SFT | 63.28 | 37.41 | 53.87 |
| MAE (with reference) | 65.07 | 43.18 | 57.11 |
| MAE (no reference) | 67.51 | 41.86 | 58.18 |
| MAE (half reference) | 68.95 | 43.96 | 59.87 |
```bash
conda create -n mae python=3.10
conda activate mae
pip install -r requirements.txt
pip install -r flashattn_requirements.txt
```
```bash
python scripts/prepare_test_datasets.py
python -m absolute_zero_reasoner.data_construction.process_code_reasoning_data
```

If you plan to use NVIDIA's integrated LLM service (NIM) for evaluation, you can obtain free API key(s) by registering an account at https://build.nvidia.com/nim.
Steps to register and save your API key(s):
- Go to https://build.nvidia.com/nim and create an account (or sign in with your existing NVIDIA account).
- After signing in, navigate to the API_KEYS section and create a new API key. You may create multiple keys (e.g., through multiple accounts) if you want to distribute load.
- Copy the generated API key(s).
- In the root of this repository (the same directory as `README.md`), create a file named `api.json` and store your keys in the following format:
```json
{
  "api_keys": [
    "sk-xxxxxxx-your-first-key-xxxx",
    "sk-yyyyyyy-your-second-key-yyyy"
  ]
}
```
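Once `api.json` is in place, the keys can be used against NVIDIA's OpenAI-compatible endpoint. A minimal sketch of how they might be consumed (the loader and key rotation below are our illustration, not the repository's actual evaluation code; the model id is just an example):

```python
import itertools
import json

from openai import OpenAI

# Rotate across the keys stored in api.json to spread the request load.
with open("api.json") as f:
    keys = itertools.cycle(json.load(f)["api_keys"])

# NVIDIA's NIM catalog exposes an OpenAI-compatible API at this base URL.
client = OpenAI(base_url="https://integrate.api.nvidia.com/v1", api_key=next(keys))
resp = client.chat.completions.create(
    model="meta/llama-3.1-70b-instruct",  # example NIM model id
    messages=[{"role": "user", "content": "Hello from MAE evaluation."}],
)
print(resp.choices[0].message.content)
```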
Specializing the prompts can steer the model toward producing questions in a certain domain or scoring answers according to desired rules. Make sure your prompts follow a format similar to the default prompts we provide, and put them under `absolute_zero_reasoner/data_construction/initial_prompt_templates`.

Three resume modes are supported: `disable`, `auto`, and `resume_path`. `disable` trains from scratch; `auto` resumes the run from the latest checkpoint inside `resume_dir`; `resume_path` lets you resume from any checkpoint you choose.
```bash
# resume_dir must be set whenever resume_mode is not `disable`
# resume_from_path may point to any specific checkpoint you wish to resume from
trainer.resume_mode=auto \
trainer.resume_dir=<path_to_your_run_directory> \
trainer.resume_from_path=<path_to_your_checkpoint>
```

When resuming a run, you can also pass the original run's wandb id, i.e., `trainer.wandb_run_id=<run_id>`.
We use 8x80GB GPUs for 3B models; the scripts can be modified to keep the same overall accumulated batch size for reproduction.
```bash
bash scripts/selfplay/mae.sh
# To explore different reference-question settings, set `include_references` to 0 (no reference) or 1 (with reference)
```

Other models are also supported in the Multi-Agent Evolve framework; you can start training your own model by modifying `actor_rollout_ref.model.path` in the scripts, as sketched below.
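For instance, the relevant override inside the script might look like the following (the surrounding script layout is an assumption; only the `actor_rollout_ref.model.path` key is taken from the scripts, and the model id is just an example):

```bash
# Inside scripts/selfplay/mae.sh (illustrative): swap in your own base model,
# either a Hugging Face model id or a local path. The rest of the script is unchanged.
actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
```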
```bash
python -m absolute_zero_reasoner.utils.convert2hf \
  <veRL_ckpt_path>/actor \
  <veRL_ckpt_path>/actor/huggingface/ \
  <hf_ckpt_path>
```
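After conversion, the checkpoint is a standard Hugging Face model directory. A quick sanity check with `transformers` might look like this (the prompt is arbitrary, and `device_map="auto"` requires `accelerate`):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# <hf_ckpt_path> is the output directory produced by convert2hf above
tokenizer = AutoTokenizer.from_pretrained("<hf_ckpt_path>")
model = AutoModelForCausalLM.from_pretrained(
    "<hf_ckpt_path>", torch_dtype="auto", device_map="auto"
)

prompt = "Solve: If 3x + 5 = 20, what is x?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```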
The general benchmarks are evaluated during training. For a complete evaluation on the general benchmarks, run the following scripts with the resume checkpoint set:

```bash
bash scripts/evaluation/eval_ID.sh
bash scripts/evaluation/eval_OOD.sh
# To evaluate the base model, set resume_mode to `disable` in these scripts
```

We use evalplus for code evaluation. A separate conda env is needed for evalplus:
```bash
conda create -n evalplus python=3.11
pip install --upgrade "evalplus[vllm] @ git+https://github.com/evalplus/evalplus@d362e933265c3e7e3df8101c930a89c3c470cd9f"
```
Evaluation:
```bash
conda activate evalplus
bash evaluation/code_eval/scripts/run_evalplus.sh 0 <humaneval|mbpp> <hf_ckpt_path>
```

If you find Multi-Agent Evolve helpful, please cite us:
```bibtex
@misc{chen2025multiagentevolvellmselfimprove,
      title={Multi-Agent Evolve: LLM Self-Improve through Co-evolution},
      author={Yixing Chen and Yiding Wang and Siqi Zhu and Haofei Yu and Tao Feng and Muhan Zhan and Mostofa Patwary and Jiaxuan You},
      year={2025},
      eprint={2510.23595},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.23595},
}
```

This project is inspired by and partially adapted from the Absolute Zero Reasoner (AZR) project. We thank the AZR authors for their open-source contributions and ideas.
Feel free to contact Yixing Chen and Yiding Wang via email: polaris_dane@sjtu.edu.cn, yidingw@stu.pku.edu.cn.

