Our approach builds a self-evolving system for enhancing LLMs' general reasoning capabilities through three collaborative roles:
- **Proposer**: Generates new reasoning questions wrapped in `<question>...</question>`. Each question is evaluated for quality, difficulty, and format; only high-quality, learnable questions are kept for training.
- **Solver**: Answers the valid questions within `<answer>...</answer>`. Its performance helps measure task difficulty and provides feedback for both question generation and model improvement.
- **Judge**: Evaluates questions and answers, reasoning in `<think>...</think>` and producing numeric scores in `<score>...</score>`. These scores serve as rewards for the Proposer and Solver, enabling stable reinforcement learning.
All three roles share one underlying model and are updated together using Task-Relative REINFORCE++. The system forms a continuous self-improving loop that strengthens reasoning without external supervision.
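Conceptually, one iteration of this loop looks like the sketch below (all function names, arguments, and thresholds are illustrative stand-ins, not the repository's actual API):

```python
def self_play_step(model, propose, solve, judge, update, batch_size=64, quality_threshold=0.5):
    """One conceptual Multi-Agent Evolve iteration; all callables are illustrative stand-ins."""
    # 1. Proposer: the shared model generates candidate <question>...</question> items.
    questions = [propose(model) for _ in range(batch_size)]

    # 2. Judge: score each question for quality, difficulty, and format; keep learnable ones.
    valid = [q for q in questions if judge(model, question=q) >= quality_threshold]

    # 3. Solver: the same model answers each valid question in <answer>...</answer>.
    answers = [solve(model, q) for q in valid]

    # 4. Judge: reason in <think>...</think> and emit a numeric <score>...</score> per answer.
    rewards = [judge(model, question=q, answer=a) for q, a in zip(valid, answers)]

    # 5. Task-Relative REINFORCE++: one gradient update shared by all three roles,
    #    with advantages normalized per role/task rather than globally.
    update(model, valid, answers, rewards)
```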
| Model | ID Avg | OOD Avg | Total Avg |
|---|---|---|---|
| *w/o reference questions* | | | |
| Qwen2.5-3B-Instruct | 63.34 | 41.32 | 55.33 |
| AZR | 67.09 | 41.33 | 57.72 |
| MAE (zero) | 68.37 | 42.48 | 58.51 |
| *w/ reference questions* | | | |
| SFT | 63.28 | 37.41 | 53.87 |
| MAE (with reference) | 65.07 | 43.18 | 57.11 |
| MAE (no reference) | 67.51 | 41.86 | 58.18 |
| MAE (half reference) | 68.95 | 43.96 | 59.87 |
```bash
conda create -n mae python=3.10
conda activate mae
pip install -r requirements.txt
pip install -r flashattn_requirements.txt
```
```bash
python scripts/prepare_test_datasets.py
python -m absolute_zero_reasoner.data_construction.process_code_reasoning_data
```

If you plan to use NVIDIA's integrated LLM service (NIM) for evaluation, you can obtain free API key(s) by registering an account at https://build.nvidia.com/nim.
Steps to register and save your API key(s):
- Go to https://build.nvidia.com/nim and create an account (or sign in with your existing NVIDIA account).
- After signing in, navigate to the API_KEYS section and create a new API key. You may create multiple keys (e.g., through multiple accounts) if you want to distribute load.
- Copy the generated API key(s).
- In the root of this repository (the same directory as `README.md`), create a file named `api.json` and store your keys in the following format:
```json
{
  "api_keys": [
    "sk-xxxxxxx-your-first-key-xxxx",
    "sk-yyyyyyy-your-second-key-yyyy"
  ]
}
```
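Once `api.json` is in place, the keys can be used against NVIDIA's OpenAI-compatible endpoint. A minimal sketch of how they might be consumed (the loader and key rotation below are our illustration, not the repository's actual evaluation code; the model id is just an example):

```python
import itertools
import json

from openai import OpenAI

# Rotate across the keys stored in api.json to spread the request load.
with open("api.json") as f:
    keys = itertools.cycle(json.load(f)["api_keys"])

# NVIDIA's NIM catalog exposes an OpenAI-compatible API at this base URL.
client = OpenAI(base_url="https://integrate.api.nvidia.com/v1", api_key=next(keys))
resp = client.chat.completions.create(
    model="meta/llama-3.1-70b-instruct",  # example NIM model id
    messages=[{"role": "user", "content": "Hello from MAE evaluation."}],
)
print(resp.choices[0].message.content)
```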
Specializing the prompts can steer the model toward producing questions in a certain domain or scoring answers according to desired rules. Make sure your prompts follow a format similar to the default prompts we provide, and put them under `absolute_zero_reasoner/data_construction/initial_prompt_templates`.

Three resume modes are supported: `disable`, `auto`, and `resume_path`. `disable` trains from scratch; `auto` resumes the run from the latest checkpoint inside `resume_dir`; `resume_path` lets you resume from any checkpoint you choose.
```bash
# resume_dir must be set whenever resume_mode is not `disable`
# resume_from_path may point to any specific checkpoint you wish to resume from
trainer.resume_mode=auto \
trainer.resume_dir=<path_to_your_run_directory> \
trainer.resume_from_path=<path_to_your_checkpoint>
```

When resuming a run, you can also pass the original run's wandb id, i.e., `trainer.wandb_run_id=<run_id>`.
We use 8x80GB GPUs for 3B models; the scripts can be modified to keep the same overall accumulated batch size for reproduction.
```bash
bash scripts/selfplay/mae.sh
# To explore different reference-question settings, set `include_references` to 0 (no reference) or 1 (with reference)
```

Other models are also supported in the Multi-Agent Evolve framework; you can start training your own model by modifying `actor_rollout_ref.model.path` in the scripts, as sketched below.
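For instance, the relevant override inside the script might look like the following (the surrounding script layout is an assumption; only the `actor_rollout_ref.model.path` key is taken from the scripts, and the model id is just an example):

```bash
# Inside scripts/selfplay/mae.sh (illustrative): swap in your own base model,
# either a Hugging Face model id or a local path. The rest of the script is unchanged.
actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
```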
```bash
python -m absolute_zero_reasoner.utils.convert2hf \
  <veRL_ckpt_path>/actor \
  <veRL_ckpt_path>/actor/huggingface/ \
  <hf_ckpt_path>
```
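After conversion, the checkpoint is a standard Hugging Face model directory. A quick sanity check with `transformers` might look like this (the prompt is arbitrary, and `device_map="auto"` requires `accelerate`):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# <hf_ckpt_path> is the output directory produced by convert2hf above
tokenizer = AutoTokenizer.from_pretrained("<hf_ckpt_path>")
model = AutoModelForCausalLM.from_pretrained(
    "<hf_ckpt_path>", torch_dtype="auto", device_map="auto"
)

prompt = "Solve: If 3x + 5 = 20, what is x?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```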
The general benchmarks are evaluated during training. For a complete evaluation on the general benchmarks, run the following scripts with the resume checkpoint set:

```bash
bash scripts/evaluation/eval_ID.sh
bash scripts/evaluation/eval_OOD.sh
# To evaluate the base model, set resume_mode to `disable` in these scripts
```

We use evalplus for code evaluation. A separate conda env is needed for evalplus:
```bash
conda create -n evalplus python=3.11
pip install --upgrade "evalplus[vllm] @ git+https://github.com/evalplus/evalplus@d362e933265c3e7e3df8101c930a89c3c470cd9f"
```
Evaluation:
```bash
conda activate evalplus
bash evaluation/code_eval/scripts/run_evalplus.sh 0 <humaneval|mbpp> <hf_ckpt_path>
```

If you find Multi-Agent Evolve helpful, please cite us:
```bibtex
@misc{chen2025multiagentevolvellmselfimprove,
      title={Multi-Agent Evolve: LLM Self-Improve through Co-evolution},
      author={Yixing Chen and Yiding Wang and Siqi Zhu and Haofei Yu and Tao Feng and Muhan Zhan and Mostofa Patwary and Jiaxuan You},
      year={2025},
      eprint={2510.23595},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.23595},
}
```

This project is inspired by and partially adapted from the Absolute Zero Reasoner (AZR) project. We thank the AZR authors for their open-source contributions and ideas.
Feel free to contact Yixing Chen and Yiding Wang via email: polaris_dane@sjtu.edu.cn, yidingw@stu.pku.edu.cn.

