Merged
104 commits
4316425
wip
hjh0119 Aug 29, 2025
5d46eae
init wip
hjh0119 Sep 1, 2025
5828229
args wip
hjh0119 Sep 1, 2025
a82cec4
Merge remote-tracking branch 'origin/main' into mega-grpo
hjh0119 Sep 2, 2025
0689b76
reuse _prepare_rollout_engine
hjh0119 Sep 3, 2025
46593cf
merge main
hjh0119 Sep 11, 2025
3da8756
mega wip
hjh0119 Sep 12, 2025
2ca7ac1
Merge remote-tracking branch 'origin' into mega-grpo
hjh0119 Sep 17, 2025
d9ec029
wip
hjh0119 Sep 17, 2025
7c56f9f
override train_step wip
hjh0119 Sep 17, 2025
686fc74
remove override train_step to grpo
hjh0119 Sep 18, 2025
095bcbd
Merge remote-tracking branch 'origin' into mega-grpo
hjh0119 Sep 18, 2025
4d9457b
sync weight wip
hjh0119 Sep 18, 2025
f52d5e1
rollout wip
hjh0119 Sep 19, 2025
155d4fb
Merge remote-tracking branch 'origin' into mega-grpo
hjh0119 Sep 22, 2025
3c69c39
modify mini_batch_size to generation batch size
hjh0119 Sep 22, 2025
eebdd47
wip
hjh0119 Sep 24, 2025
de6ecfe
loss wip
hjh0119 Sep 28, 2025
4569e54
fix repeat n
hjh0119 Sep 28, 2025
f118935
Merge remote-tracking branch 'origin' into mega-grpo
hjh0119 Sep 29, 2025
9cb84e3
fix padding to multiple of tp_size
hjh0119 Sep 29, 2025
8627aa3
compute loss
hjh0119 Sep 29, 2025
2292cf8
fix logps
hjh0119 Sep 30, 2025
bbe5f39
logging & patch VL
hjh0119 Sep 30, 2025
6a2940c
fix rollout_group & rollout judgement
hjh0119 Oct 1, 2025
486c3d4
fix step
hjh0119 Oct 6, 2025
7e8e6b0
merge main
hjh0119 Oct 6, 2025
c68d976
move old base trainer to newer
hjh0119 Oct 7, 2025
6b1653c
fix
hjh0119 Oct 8, 2025
d4a9dcc
offload utils
hjh0119 Oct 8, 2025
9dc92a0
offload context
hjh0119 Oct 9, 2025
7bc3d61
Resolve merge conflict in megatron_args.py by removing duplicate fiel…
hjh0119 Oct 9, 2025
91f97ca
fix resolve
hjh0119 Oct 9, 2025
59f436c
fix logps
hjh0119 Oct 9, 2025
8dea6d7
fix old logps
hjh0119 Oct 9, 2025
abac696
reduce redundancy
hjh0119 Oct 9, 2025
3a3ff37
replace token
hjh0119 Oct 10, 2025
2cd89dc
fix offload model
hjh0119 Oct 10, 2025
50d5e6f
offload optimizer & ref
hjh0119 Oct 11, 2025
e1a06c6
support cp
hjh0119 Oct 11, 2025
ff9b667
fix pp+cp
hjh0119 Oct 11, 2025
ba4bfbf
lora wip
hjh0119 Oct 11, 2025
e5a6252
Merge remote-tracking branch 'origin' into mega-grpo
hjh0119 Oct 13, 2025
e22c790
arguments document
hjh0119 Oct 13, 2025
b3de262
wip lora&cp
hjh0119 Oct 14, 2025
d5bd92c
merge origin
hjh0119 Oct 14, 2025
fe3270f
remove unused patch
hjh0119 Oct 14, 2025
137704e
merge main
hjh0119 Oct 29, 2025
ca9c9bc
wip server
hjh0119 Oct 29, 2025
f258202
wip
hjh0119 Oct 29, 2025
85a035e
Merge remote-tracking branch 'origin' into mega-grpo
hjh0119 Oct 29, 2025
0a38c0c
server rollout wip
hjh0119 Oct 30, 2025
e0fc2e9
Merge remote-tracking branch 'origin' into mega-grpo
hjh0119 Nov 4, 2025
5f2f349
move vllm client init out of args
hjh0119 Nov 4, 2025
416feb2
server mode
hjh0119 Nov 4, 2025
85135bb
merge main
hjh0119 Nov 4, 2025
b93c031
remove old func
hjh0119 Nov 4, 2025
2f5d7b5
mcore bridge
hjh0119 Nov 4, 2025
edf3378
Merge remote-tracking branch 'origin' into mega-grpo
hjh0119 Nov 4, 2025
b3b37ce
merge main & flatten weight sync
hjh0119 Nov 5, 2025
1d930d8
dynamic sample
hjh0119 Nov 5, 2025
5f9e14a
fix dynamic sampling
hjh0119 Nov 5, 2025
b753911
merge main
hjh0119 Nov 6, 2025
d1460c2
fix cp: compute part of seq loss
hjh0119 Nov 7, 2025
db7000d
optimize weight sync & fix vllm_tp
hjh0119 Nov 7, 2025
bee6925
fix vllm tp
hjh0119 Nov 7, 2025
f92739a
fix vllm tp distribute outputs twice
hjh0119 Nov 7, 2025
bd3f9c9
length context for template encode
hjh0119 Nov 7, 2025
3f9b11d
fix sequence is wip"
hjh0119 Nov 9, 2025
e34d02d
merge main
hjh0119 Nov 9, 2025
1b4fd80
fix padding loss calculate
hjh0119 Nov 10, 2025
e9a307d
log completions
hjh0119 Nov 10, 2025
8b1e407
bug todo
hjh0119 Nov 10, 2025
b6c10eb
fix wip
hjh0119 Nov 10, 2025
b25c7d2
merge main
hjh0119 Nov 11, 2025
06b3693
Merge remote-tracking branch 'origin' into mega-grpo
hjh0119 Nov 11, 2025
46b8e2a
fix sp padding
hjh0119 Nov 12, 2025
87bbad7
fix pp
hjh0119 Nov 12, 2025
1927712
revert to full seq loss for cp
hjh0119 Nov 12, 2025
44920c0
fix server client init in first rank instead of last rank
hjh0119 Nov 13, 2025
7ebfdda
fix server mode
hjh0119 Nov 13, 2025
43fc27d
fix server pass prompt
hjh0119 Nov 13, 2025
1696ea9
dense script
hjh0119 Nov 13, 2025
a629fbd
check batch size params
hjh0119 Nov 13, 2025
852d0f0
dense server script
hjh0119 Nov 13, 2025
2f98eba
moe script
hjh0119 Nov 14, 2025
1936a83
docs
hjh0119 Nov 14, 2025
500408a
merge main
hjh0119 Nov 14, 2025
bdcaa51
update doc
hjh0119 Nov 14, 2025
027ca57
update doc & args check
hjh0119 Nov 14, 2025
360da42
clean up
hjh0119 Nov 14, 2025
6259a34
clean up
hjh0119 Nov 14, 2025
5fdde02
clean up
hjh0119 Nov 14, 2025
b5be0ce
clean up
hjh0119 Nov 14, 2025
d8c7c3b
clean up
hjh0119 Nov 14, 2025
2aaa1e5
align scale_rewards
hjh0119 Nov 14, 2025
7540743
merge main
hjh0119 Nov 14, 2025
29ecb32
aggressive_empty_cache before wake up weights
hjh0119 Nov 14, 2025
ad00c5c
docs
hjh0119 Nov 14, 2025
5977fe5
sleep level doc
hjh0119 Nov 14, 2025
5ab6d37
fix kl metrics
hjh0119 Nov 14, 2025
2841fb9
fix arxiv link & fix kl metric
hjh0119 Nov 14, 2025
2a97c64
revert script
hjh0119 Nov 14, 2025
40706d7
revert server_base_url doc
hjh0119 Nov 14, 2025

3 changes: 2 additions & 1 deletion README.md
@@ -65,7 +65,7 @@ You can contact us and communicate with us by adding our group:
- **Quantization Training**: Supports training quantized models like BNB, AWQ, GPTQ, AQLM, HQQ, EETQ.
- 🍊 **RLHF Training**: Supports human alignment training methods such as DPO, GRPO, RM, PPO, GKD, KTO, CPO, SimPO, ORPO for both pure text and multi-modal large models.
- 🍓 **Multi-Modal Training**: Supports training on different modalities like images, videos, and audio, for tasks like VQA, captioning, OCR, and grounding.
- 🥥 **Megatron Parallelism**: Supports accelerating CPT/SFT/DPO/KTO/RM using Megatron parallelism techniques, currently compatible with 200+ pure text large models, 100+ multi-modal large models.
- 🥥 **Megatron Parallelism**: Supports accelerating CPT/SFT/GRPO/DPO/KTO/RM using Megatron parallelism techniques, currently compatible with 200+ pure text large models, 100+ multi-modal large models.
- **Interface Training**: Provides capabilities for training, inference, evaluation, quantization through an interface, completing the whole large model pipeline.
- **Plugin and Extension**: Supports custom model and dataset extensions, as well as customization of components like loss, metric, trainer, loss-scale, callback, optimizer.
- 🍉 **Toolbox Capabilities**: Offers not only training support for large models and multi-modal large models but also covers the entire process of inference, evaluation, quantization, and deployment.
@@ -75,6 +75,7 @@ You can contact us and communicate with us by adding our group:


## 🎉 News
- 🎁 2025.11.14: Megatron GRPO is now available! Check out the [docs](./docs/source_en/Megatron-SWIFT/GRPO.md) and [examples](examples/megatron/grpo).
- 🎁 2025.11.04: Support for [Mcore-Bridge](docs/source_en/Megatron-SWIFT/Mcore-Bridge.md), making Megatron training as simple and easy to use as transformers.
- 🎁 2025.10.28: Ray [here](docs/source_en/Instruction/Ray.md).
- 🎁 2025.10.28: Support [use yaml](examples/yaml) to configure command line parameters.
3 changes: 2 additions & 1 deletion README_CN.md
@@ -62,7 +62,7 @@
- **Quantization Training**: Supports training quantized models like BNB, AWQ, GPTQ, AQLM, HQQ, EETQ.
- 🍊 **RLHF Training**: Supports human alignment training methods such as DPO, GRPO, RM, PPO, GKD, KTO, CPO, SimPO, ORPO for both pure text and multi-modal large models.
- 🍓 **Multi-Modal Training**: Supports training on different modalities like images, videos, and audio, for tasks like VQA, captioning, OCR, and grounding.
- 🥥 **Megatron Parallelism**: Supports accelerating CPT/SFT/DPO/KTO/RM using Megatron parallelism techniques, currently compatible with 200+ pure text large models and 100+ multi-modal large models.
- 🥥 **Megatron Parallelism**: Supports accelerating CPT/SFT/GRPO/DPO/KTO/RM using Megatron parallelism techniques, currently compatible with 200+ pure text large models and 100+ multi-modal large models.
- **Interface Training**: Provides capabilities for training, inference, evaluation, and quantization through an interface, covering the whole large model pipeline.
- **Plugin and Extension**: Supports custom model and dataset extensions, as well as customization of components like loss, metric, trainer, loss-scale, callback, optimizer.
- 🍉 **Toolbox Capabilities**: Offers not only training support for large models and multi-modal large models but also covers the entire process of inference, evaluation, quantization, and deployment.
@@ -71,6 +71,7 @@
- **Model Quantization**: Supports quantized export with AWQ, GPTQ, FP8, and BNB; the exported models support accelerated inference with vLLM/SGLang/LmDeploy and can be trained further.

## 🎉 News
- 🎁 2025.11.14: Megatron GRPO is now supported! See the [docs](./docs/source/Megatron-SWIFT/GRPO.md) and [examples](examples/megatron/grpo).
- 🎁 2025.11.04: Support for [Mcore-Bridge](docs/source/Megatron-SWIFT/Mcore-Bridge.md), making Megatron training as simple and easy to use as transformers.
- 🎁 2025.10.28: Ray is [now supported](docs/source/Instruction/Ray.md).
- 🎁 2025.10.28: Support for configuring command-line parameters [with yaml](examples/yaml).
14 changes: 8 additions & 6 deletions docs/source/Instruction/Command-line-parameters.md
@@ -566,13 +566,13 @@ Reward model parameters are used in PPO and GRPO.
- use_vllm: whether to use vLLM as the infer_backend for GRPO generation. Defaults to False.
- vllm_mode: vLLM integration mode; options are `server` and `colocate`. Server mode samples from a vLLM server launched with `swift rollout`, while colocate mode deploys vLLM inside the training process. When using server mode:
- vllm_mode server parameters
- vllm_server_host: vLLM server host address. Defaults to None.
- vllm_server_port: vLLM server port. Defaults to 8000.
- vllm_server_base_url: base URL of the vLLM server (e.g. http://local_host:8000). Defaults to None. When set, the host and port settings are ignored.
- vllm_server_timeout: timeout for connecting to the vLLM server. Defaults to 240 s.
- vllm_server_pass_dataset: pass extra dataset information through to the vLLM server, used for multi-turn training.
- async_generate: asynchronous rollout to speed up training. Note that when enabled, sampling uses the model from the previous update; multi-turn scenarios are not supported. Defaults to `false`.
- SWIFT_UPDATE_WEIGHTS_BUCKET_SIZE: environment variable that controls the bucket size used when transferring weights during weight synchronization; applies to full-parameter training in server mode. Unit is MB, default 512 MB.
- vllm_mode colocate parameters (for more supported parameters, see [vLLM arguments](#vLLM参数))
- vllm_gpu_memory_utilization: passed through to vLLM. Defaults to 0.9.
- vllm_max_model_len: passed through to vLLM. Defaults to None.
@@ -581,7 +581,7 @@ Reward model parameters are used in PPO and GRPO.
- vllm_enable_prefix_caching: passed through to vLLM. Defaults to True.
- vllm_tensor_parallel_size: tensor-parallel size. Defaults to `1`.
- vllm_enable_lora: allow the vLLM engine to load LoRA adapters. Defaults to False. Used to speed up weight synchronization in LoRA training; see the [documentation](./GRPO/GetStarted/GRPO.md#权重同步加速).
- sleep_level: release vLLM GPU memory during training; options are [0, 1]. Defaults to 0, i.e. no release.
- sleep_level: release vLLM GPU memory during training; options are [0, 1, 2]. Defaults to 0, i.e. no release.
- offload_optimizer: whether to offload optimizer parameters during vLLM inference. Defaults to False.
- offload_model: whether to offload the model during vLLM inference. Defaults to False.
- completion_length_limit_scope: the scope of the `max_completion_length` limit in multi-turn conversations.
@@ -593,7 +593,7 @@ Reward model parameters are used in PPO and GRPO.
- max_resample_times: limits the number of resampling attempts when dynamic_sample is enabled. Defaults to 3.
- overlong_filter: skip samples that were truncated for exceeding the maximum length so they do not contribute to the loss. Defaults to False.
- delta: upper clipping bound for two-sided GRPO from the [INTELLECT-2 tech report](https://huggingface.co/papers/2505.07291). If set, it is recommended to be greater than 1 + epsilon. Defaults to None.
- importance_sampling_level: controls how the importance sampling ratio is computed; options are `token` and `sequence`. In `token` mode the original per-token log-probability ratios are kept; in `sequence` mode the log-probability ratios of all valid tokens in the sequence are averaged. The [GSPO paper](https://arxiv.org/abs/2507.18071) uses sequence-level computation to stabilize training. Defaults to `token`.
- advantage_estimator: advantage estimation function. Defaults to `grpo`, i.e. group-relative advantages; options are `grpo`, [`rloo`](./GRPO/AdvancedResearch/RLOO.md), [`reinforce_plus_plus`](./GRPO/AdvancedResearch/REINFORCEPP.md).
- kl_in_reward: controls where the KL divergence regularization term is applied; `false` keeps it as a separate regularization term in the loss, `true` folds the KL directly into the reward (subtracted from the reward). The default is tied to advantage_estimator: `false` for `grpo`, `true` for `rloo` and `reinforce_plus_plus`.
- scale_rewards: specifies the reward scaling strategy. Options are `group` (scale by the within-group standard deviation), `batch` (scale by the standard deviation of the whole batch), and `none` (no scaling). In ms-swift < 3.10 this was a boolean, where `true` corresponds to `group` and `false` to `none`. The default is tied to `advantage_estimator`: `group` for `grpo`, `none` for `rloo`, and `batch` for `reinforce_plus_plus`.
@@ -606,6 +606,8 @@ Reward model parameters are used in PPO and GRPO.
- top_entropy_quantile: only tokens whose entropy falls in the top specified quantile participate in the loss. Defaults to 1.0, i.e. low-entropy tokens are not filtered; see the [documentation](./GRPO/AdvancedResearch/entropy_mask.md).
- log_entropy: log the entropy dynamics during training. Defaults to False; see the [documentation](./GRPO/GetStarted/GRPO.md#logged-metrics).

##### Reward function parameters
For the built-in reward functions, see the [documentation](./GRPO/DeveloperGuide/reward_function.md).
Cosine reward parameters
- cosine_min_len_value_wrong: reward assigned at the minimum length when the generated answer is wrong. Defaults to -0.5.
- cosine_max_len_value_wrong: reward assigned at the maximum length when the generated answer is wrong. Defaults to 0.0.
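To make the `advantage_estimator=grpo` and `scale_rewards` options above concrete, here is a minimal sketch of group-relative advantage computation. It is an illustration under stated assumptions (the function name, tensor shapes, and the epsilon are invented for the example), not the ms-swift implementation.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, num_generations: int,
                    scale_rewards: str = "group") -> torch.Tensor:
    """rewards: flat tensor of shape (num_prompts * num_generations,), grouped by prompt."""
    grouped = rewards.view(-1, num_generations)                   # (num_prompts, num_generations)
    advantages = grouped - grouped.mean(dim=1, keepdim=True)      # subtract the group-mean baseline
    if scale_rewards == "group":                                  # normalize by each group's std
        advantages = advantages / (grouped.std(dim=1, keepdim=True) + 1e-4)
    elif scale_rewards == "batch":                                # normalize by the std of the whole batch
        advantages = advantages / (rewards.std() + 1e-4)
    elif scale_rewards != "none":
        raise ValueError(f"unknown scale_rewards: {scale_rewards}")
    return advantages.view(-1)

# Example: 2 prompts x 4 completions each
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.5, 0.5, 1.0, 0.0])
advantages = grpo_advantages(rewards, num_generations=4, scale_rewards="group")
```

Group scaling normalizes within each prompt's completions, while batch scaling uses a single standard deviation for the whole batch, which the parameter description above pairs with `reinforce_plus_plus` by default.
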
4 changes: 2 additions & 2 deletions docs/source/Instruction/GRPO/AdvancedResearch/GSPO.md
@@ -2,7 +2,7 @@

**Version requirement**: ms-swift>=3.7

[Group Sequence Policy Optimization](https://arxiv.org/abs/2507.18071) points out that GRPO computes importance sampling weights at the token level. However, because each token is only sampled once, this cannot provide effective distribution correction; instead it introduces high-variance noise into training, easily destabilizes the gradient estimates, and can ultimately cause training to collapse. The paper therefore argues that the unit of the optimization objective should match the unit of the reward. Since rewards are usually given at the sequence level (i.e. for the complete generated response), it is more reasonable to lift the off-policy correction and optimization to the sequence level as well, rather than the token level. The three computation strategies are compared below:

1. GRPO
Computes the importance sampling ratio independently for each token; the formula is
@@ -54,7 +54,7 @@ importance_weights = torch.exp(log_importance_weights)
- `importance_sampling_level sequence` (GSPO)
- `importance_sampling_level sequence_token` (GSPO-token)

where sequence_token requires ms-swift > 3.7 (installed from source)
where sequence_token requires ms-swift >= 3.8
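
As a hedged illustration of how these three options differ (the function below is a sketch with assumed names and shapes, not the actual ms-swift code), the ratio can be formed per token, per sequence, or with the GSPO-token trick of a sequence-level value carrying token-level gradients:

```python
import torch

def importance_weights(log_ratio: torch.Tensor, mask: torch.Tensor,
                       level: str = "token") -> torch.Tensor:
    """log_ratio = logp_new - logp_old, shape (batch, seq_len); mask marks valid completion tokens."""
    if level == "token":             # GRPO: one independent ratio per token
        log_w = log_ratio
    elif level == "sequence":        # GSPO: average log-ratio over valid tokens, one weight per sequence
        log_w = ((log_ratio * mask).sum(-1) / mask.sum(-1).clamp(min=1)).unsqueeze(-1)
    elif level == "sequence_token":  # GSPO-token: sequence-level value, token-level gradient
        seq_log_w = ((log_ratio * mask).sum(-1) / mask.sum(-1).clamp(min=1)).unsqueeze(-1)
        log_w = seq_log_w.detach() + log_ratio - log_ratio.detach()
    else:
        raise ValueError(f"unknown importance_sampling_level: {level}")
    return torch.exp(log_w)
```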

Other hyperparameters used in the paper
2 changes: 1 addition & 1 deletion docs/source/Instruction/Use-tuners.md
@@ -15,7 +15,7 @@ A tuner is an additional structure attached to the model, used to reduce the number of trainable parameters
- Adapter: [Parameter-Efficient Transfer Learning for NLP](http://arxiv.org/abs/1902.00751)
- Vision Prompt Tuning: [Visual Prompt Tuning](https://arxiv.org/abs/2203.12119)
- Side: [Side-Tuning: A Baseline for Network Adaptation via Additive Side Networks](https://arxiv.org/abs/1912.13503)
- Res-Tuning: [Res-Tuning: A Flexible and Efficient Tuning Paradigm via Unbinding Tuner from Backbone](https://arxiv.org/abs/2310.19859) < [arXiv](https://arxiv.org/abs/2310.19859) | [Project Page](https://res-tuning.github.io/) | [Usage](ResTuning.md) >
- Res-Tuning: [Res-Tuning: A Flexible and Efficient Tuning Paradigm via Unbinding Tuner from Backbone](https://arxiv.org/abs/2310.19859) < [arXiv](https://arxiv.org/abs/2310.19859) | [Project Page](https://res-tuning.github.io/) >
- Tuners provided by [PEFT](https://github.com/huggingface/peft), such as AdaLoRA, DoRA, Fourierft, etc.

## Interface List