[megatron] support megatron MTP #6496

Jintao-Huang · 2025-11-08T07:52:13Z

No description provided.

gemini-code-assist · 2025-11-08T07:52:34Z

Summary of Changes

Hello @Jintao-Huang, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request adds support for Multi-Token Prediction (MTP) to the GPT model in the Megatron integration. It adds configurable parameters for the number of MTP layers and for MTP loss scaling, and it refactors the GPT model's forward pass. The refactoring moves input preprocessing and output postprocessing into separate methods, which makes the forward pass more modular and lets MTP be added as an additional training objective.

Highlights

- Multi-Token Prediction (MTP) Support: Introduced new command-line parameters (mtp_num_layers and mtp_loss_scaling_factor) to enable and configure Multi-Token Prediction, which extends the prediction scope to multiple future tokens.
- GPTModel Forward Pass Refactoring: The GPTModel's forward method has been refactored into _preprocess and _postprocess helper methods. This modularization improves code organization and facilitates the integration of new features like MTP.
- MTP Integration in GPTModel: MTP logic is now integrated into the _postprocess step of the GPTModel, allowing it to function as an additional training objective by computing and scaling MTP losses.
- Documentation Updates: Updated both Chinese and English documentation for command-line parameters to include the newly added MTP configuration options.
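The split described in the highlights can be pictured with a toy model. This is only a sketch under stated assumptions: ToyGPTModel, _decode, and the arithmetic inside each method are hypothetical stand-ins for illustration, not the code in swift/megatron/model/gpt_model.py. Only the names _preprocess, _postprocess, and mtp_num_layers come from the summary above.

```python
from typing import Dict, List, Optional


class ToyGPTModel:
    """Hypothetical stand-in showing a forward pass split into helper methods."""

    def __init__(self, mtp_num_layers: Optional[int] = None):
        # None (the documented default) means the MTP branch is skipped.
        self.mtp_num_layers = mtp_num_layers

    def _preprocess(self, input_ids: List[int], position_ids: List[int]) -> List[int]:
        # Stand-in for embedding lookup: combine each token id with its position.
        return [tok + pos for tok, pos in zip(input_ids, position_ids)]

    def _decode(self, hidden: List[int]) -> List[int]:
        # Stand-in for the transformer decoder stack (hypothetical helper).
        return [h * 2 for h in hidden]

    def _postprocess(self, hidden: List[int]) -> Dict[str, list]:
        # Stand-in for the output layer; the MTP branch only runs when configured.
        outputs: Dict[str, list] = {"logits": hidden}
        if self.mtp_num_layers:
            outputs["mtp_depths"] = list(range(1, self.mtp_num_layers + 1))
        return outputs

    def forward(self, input_ids, position_ids, attention_mask=None):
        hidden = self._preprocess(input_ids, position_ids)
        hidden = self._decode(hidden)
        return self._postprocess(hidden)


plain = ToyGPTModel().forward([1, 2], [0, 1])
with_mtp = ToyGPTModel(mtp_num_layers=2).forward([1, 2], [0, 1])
```

Keeping the optional MTP work inside _postprocess means forward itself stays a short three-step pipeline whether or not MTP is enabled.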

1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces support for Multi-Token Prediction (MTP) in Megatron, including new command-line arguments and documentation. The core change is a significant refactoring of the GPTModel.forward method to accommodate the MTP logic. My review focuses on the correctness of this refactoring and the consistency of the documentation. I've identified a critical breaking API change in the forward method signature that needs to be addressed. Additionally, there are some language inconsistencies in the documentation files that should be corrected for clarity.

gemini-code-assist · 2025-11-08T07:53:50Z

swift/megatron/model/gpt_model.py

+        self,
+        input_ids: torch.Tensor,
+        position_ids: torch.Tensor,
+        attention_mask: torch.Tensor,


The signature of the forward method has changed: attention_mask is now a required argument with no default value. This is a breaking API change, because any caller that omits attention_mask will now raise a TypeError. For example, MultimodalGPTModel.forward in swift/megatron/model/mm_gpt_model.py will break. To maintain backward compatibility, attention_mask should remain optional, with a default of None.

Suggested change

-        attention_mask: torch.Tensor,
+        attention_mask: torch.Tensor = None,
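The compatibility concern can be reproduced with plain Python functions. The sketch below is illustrative only: forward_required, forward_optional, and call_without_mask are hypothetical toy names, and lists stand in for tensors. It shows that dropping the default breaks callers that omit the argument, while callers that pass it by keyword are unaffected.

```python
from typing import Optional


def forward_required(input_ids, position_ids, attention_mask):
    """Toy stand-in: attention_mask has no default, so every caller must pass it."""
    return attention_mask


def forward_optional(input_ids, position_ids, attention_mask: Optional[object] = None):
    """Toy stand-in: attention_mask defaults to None, so older call sites keep working."""
    return attention_mask


def call_without_mask(fn) -> bool:
    """Return True if fn accepts a call that omits attention_mask, False on TypeError."""
    try:
        fn([1, 2, 3], [0, 1, 2])
        return True
    except TypeError:
        return False


required_ok = call_without_mask(forward_required)   # caller omits the mask
optional_ok = call_without_mask(forward_optional)   # same call, default fills in
# Passing the mask by keyword works even when it is required.
keyword_ok = forward_required([1], [0], attention_mask="mask") == "mask"
```

Only call sites that leave attention_mask out entirely are at risk, so a quick search for such calls is enough to judge the impact of the signature change.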

gemini-code-assist · 2025-11-08T07:53:50Z

docs/source/Megatron-SWIFT/Command-line-parameters.md

+- mtp_num_layers: Number of Multi-Token Prediction (MTP) layers. MTP extends the prediction scope at each position to multiple future tokens. This MTP implementation uses D sequential modules to sequentially predict D additional tokens. Default is None.
+- mtp_loss_scaling_factor: Scaling factor of Multi-Token Prediction (MTP) loss. We compute the average of MTP losses across all depths, then multiply it by this scaling factor to obtain the overall MTP loss, which serves as an additional training objective. Default is 0.1.


The descriptions for the new MTP parameters are in English, but this documentation file is in Chinese. For consistency, please translate these descriptions into Chinese.

Suggested change

- mtp_num_layers: Number of Multi-Token Prediction (MTP) layers. MTP extends the prediction scope at each position to multiple future tokens. This MTP implementation uses D sequential modules to sequentially predict D additional tokens. Default is None.

- mtp_loss_scaling_factor: Scaling factor of Multi-Token Prediction (MTP) loss. We compute the average of MTP losses across all depths, then multiply it by this scaling factor to obtain the overall MTP loss, which serves as an additional training objective. Default is 0.1.

- mtp_num_layers: 多token预测（MTP）层的数量。MTP将每个位置的预测范围扩展到多个未来token。此MTP实现使用D个顺序模块依次预测D个额外的token。默认为None。

- mtp_loss_scaling_factor: 多token预测（MTP）损失的缩放因子。我们计算所有深度上MTP损失的平均值，然后乘以该缩放因子得到总体MTP损失，它将作为一个额外的训练目标。默认为0.1。

gemini-code-assist · 2025-11-08T07:53:50Z

docs/source_en/Megatron-SWIFT/Command-line-parameters.md

+- mtp_num_layers: 多token预测（MTP）层的数量。MTP将每个位置的预测范围扩展到多个未来token。此MTP实现使用D个顺序模块依次预测D个额外的token。默认为None。
+- mtp_loss_scaling_factor: 多token预测（MTP）损失的缩放因子。我们计算所有深度上MTP损失的平均值，然后乘以该缩放因子得到总体MTP损失，它将作为一个额外的训练目标。默认为0.1。


The descriptions for the new MTP parameters are in Chinese, but this documentation file is in English. For consistency, please translate these descriptions into English.

Suggested change

- mtp_num_layers: 多token预测（MTP）层的数量。MTP将每个位置的预测范围扩展到多个未来token。此MTP实现使用D个顺序模块依次预测D个额外的token。默认为None。

- mtp_loss_scaling_factor: 多token预测（MTP）损失的缩放因子。我们计算所有深度上MTP损失的平均值，然后乘以该缩放因子得到总体MTP损失，它将作为一个额外的训练目标。默认为0.1。

- mtp_num_layers: Number of Multi-Token Prediction (MTP) layers. MTP extends the prediction scope at each position to multiple future tokens. This MTP implementation uses D sequential modules to sequentially predict D additional tokens. Default is None.

- mtp_loss_scaling_factor: Scaling factor of Multi-Token Prediction (MTP) loss. We compute the average of MTP losses across all depths, then multiply it by this scaling factor to obtain the overall MTP loss, which serves as an additional training objective. Default is 0.1.
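The loss scaling described in the mtp_loss_scaling_factor entry can be sketched in a few lines. This is a minimal illustration, not the PR's implementation: scaled_mtp_loss is a hypothetical helper, the loss values are made up, and adding the result to the main loss is an assumption, since the parameter description only says the MTP loss "serves as an additional training objective".

```python
from typing import Sequence


def scaled_mtp_loss(per_depth_losses: Sequence[float],
                    mtp_loss_scaling_factor: float = 0.1) -> float:
    """Average the MTP losses over all depths, then apply the scaling factor."""
    if not per_depth_losses:
        # No MTP layers configured (mtp_num_layers is None): nothing to add.
        return 0.0
    mean_loss = sum(per_depth_losses) / len(per_depth_losses)
    return mtp_loss_scaling_factor * mean_loss


main_loss = 2.0            # illustrative next-token loss
mtp_losses = [2.4, 2.8]    # illustrative losses for D = 2 MTP depths
mtp_term = scaled_mtp_loss(mtp_losses)
# Assumption: the two objectives are simply summed.
total_loss = main_loss + mtp_term
```

With the default factor of 0.1, the MTP term contributes one tenth of the mean per-depth loss, so it nudges training without dominating the main objective.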

support megatron MTP

506ae6a

gemini-code-assist bot reviewed Nov 8, 2025

Jintao-Huang added 4 commits November 9, 2025 18:42

Merge branch 'main' into support_megatron_mtp

d918efe

Merge branch 'main' into support_megatron_mtp

0bd955a

Merge branch 'main' into support_megatron_mtp

6c29553

update

6738b66

hjh0119 mentioned this pull request Nov 12, 2025

MTP Training Support #6551

Open
