Commit 5e0df9a

Fix comments
Signed-off-by: Vladimir Suvorov <suvorovv@google.com>
1 parent 6175903 commit 5e0df9a

File tree

1 file changed: +1 −6 lines changed


docs/tutorials/grpo_with_pathways.md

Lines changed: 1 addition & 6 deletions
@@ -69,9 +69,4 @@ python3 -m src.MaxText.rl.train_rl src/MaxText/configs/rl.yml \
 --hf_access_token=$HF_TOKEN"
 ```
 
-The overview of the demo script `~/maxtext/src/MaxText/examples/grpo_llama3_1_70b_demo_pw.py` is as follows:
-
-1. We load a policy model and a reference model. Both are copies of `Llama3.1-70b-Instruct`.
-2. Evaluate the policy model's performance on the GSM8K math reasoning benchmark.
-3. Train the policy model using GRPO, with potentially different meshes for trainer and rollout depending on the parameters `TRAINER_DEVICES_FRACTION` and `SAMPLER_DEVICES_FRACTION`. If we set both of these to `1.0`, the entire (same) mesh will be used for both trainer and rollout. If we set, say, `TRAINER_DEVICES_FRACTION=0.5` and `SAMPLER_DEVICES_FRACTION=0.5`, the first half of the devices will be used for the trainer and the second half for rollout.
-4. Evaluate the policy model's performance on the GSM8K math reasoning benchmark after post-training with GRPO.
+For an interactive walkthrough, open `src/MaxText/examples/grpo_llama3_1_8b_demo.ipynb`. The notebook now delegates to `rl_train`, so you can reuse the same configuration flags shown above (including `trainer_devices_fraction` and `sampler_devices_fraction`) when scaling to multi-host Pathways or to larger checkpoints such as Llama3.1-70B.
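
The `trainer_devices_fraction` and `sampler_devices_fraction` flags referenced in the new text control how the available devices are divided between the GRPO trainer and the rollout sampler, the behavior described in the removed overview above. The snippet below is a minimal, conceptual JAX sketch of that split, not the MaxText implementation; the function name `split_devices` and the single `data` mesh axis are illustrative assumptions.

```python
# Conceptual sketch only -- not the MaxText implementation.
# Shows how trainer/sampler device fractions could partition jax.devices()
# into a trainer mesh and a rollout (sampler) mesh.
import jax
import numpy as np
from jax.sharding import Mesh


def split_devices(trainer_fraction: float, sampler_fraction: float):
    devices = jax.devices()
    if trainer_fraction == 1.0 and sampler_fraction == 1.0:
        # Both fractions at 1.0: trainer and rollout share the same (entire) mesh.
        trainer_devices = sampler_devices = devices
    else:
        # Otherwise the first slice of devices goes to the trainer and the
        # following slice to the rollout sampler, e.g. 0.5/0.5 splits the mesh in half.
        n_trainer = max(1, int(len(devices) * trainer_fraction))
        n_sampler = max(1, int(len(devices) * sampler_fraction))
        trainer_devices = devices[:n_trainer]
        sampler_devices = devices[n_trainer:n_trainer + n_sampler]
    trainer_mesh = Mesh(np.asarray(trainer_devices), axis_names=("data",))
    sampler_mesh = Mesh(np.asarray(sampler_devices), axis_names=("data",))
    return trainer_mesh, sampler_mesh


# With 8 devices, 0.5/0.5 assigns devices 0-3 to the trainer mesh and 4-7 to
# the sampler mesh; 1.0/1.0 reuses the full mesh for both.
if len(jax.devices()) >= 2:
    trainer_mesh, sampler_mesh = split_devices(0.5, 0.5)
else:
    trainer_mesh, sampler_mesh = split_devices(1.0, 1.0)
```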
