| 1 | +# GRPO with Language Reward |
| 2 | + |
| 3 | +This sandbox app demonstrates GRPO training with a language reward that encourages the model to think in a specific target language. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +This app extends the standard GRPO training app (`apps/grpo/`) by adding a `LanguageReward`, which checks whether the model's thinking (the text inside `<think></think>` tags) is written in the target language. |
| 8 | + |
| 9 | +## Key Features |
| 10 | + |
| 11 | +- **Multi-objective training**: Combines three rewards: |
| 12 | + - `MathReward`: Evaluates correctness of math answers |
| 13 | + - `ThinkingReward`: Encourages use of thinking tags |
| 14 | + - `LanguageReward`: Rewards thinking in the target language (Japanese by default) |
| 15 | + |
| 16 | +- **Language detection**: Uses `langid` to detect the language of thinking blocks |
| 17 | + |
| 18 | +- **Configurable target language**: This app defaults to Japanese (`ja`), but `LanguageReward` can be configured with any ISO 639-1 language code that `langid` can detect |
| 19 | + |
| 20 | +## Requirements |
| 21 | + |
| 22 | +Before running this app, install the required language detection library: |
| 23 | + |
| 24 | +```bash |
| 25 | +pip install langid |
| 26 | +``` |
| 27 | + |
| 28 | +## Usage |
| 29 | + |
| 30 | +```bash |
| 31 | +python -m sandbox.grpo_language.main --config sandbox/grpo_language/qwen3_1_7b.yaml |
| 32 | +``` |
| 33 | + |
| 34 | +## How It Works |
| 35 | + |
| 36 | +1. The model receives a math problem and is instructed to use `<think>` tags for reasoning |
| 37 | +2. During training, the model generates responses with thinking blocks |
| 38 | +3. Three rewards are computed: |
| 39 | + - Math correctness (did it get the right answer?) |
| 40 | + - Thinking usage (did it use thinking tags properly?) |
| 41 | + - Language usage (did it think in Japanese?) |
| 42 | +4. The model is trained to maximize all three rewards |
| 43 | + |
| 44 | +## Configuration |
| 45 | + |
| 46 | +The target language is hardcoded as Japanese in `main.py` (line 321): |
| 47 | + |
| 48 | +```python |
| 49 | +LanguageReward(target_language="ja") |
| 50 | +``` |
| 51 | + |
| 52 | +To use a different language, modify this line with the appropriate ISO 639-1 code: |
| 53 | +- English: `"en"` |
| 54 | +- Chinese: `"zh"` |
| 55 | +- Spanish: `"es"` |
| 56 | +- French: `"fr"` |
| 57 | +- etc. |
| 58 | + |
| 59 | +## Expected Behavior |
| 60 | + |
| 61 | +Over the course of training, the model should learn to: |
| 62 | +1. Solve math problems correctly |
| 63 | +2. Use `<think></think>` tags for its reasoning |
| 64 | +3. Write its thinking in Japanese (or the configured target language) |
| 65 | + |
| 66 | +## Metrics |
| 67 | + |
| 68 | +The following metrics are logged to W&B: |
| 69 | +- `reward/evaluate_response/avg_LanguageReward_reward`: Average language reward |
| 70 | +- `reward/evaluate_response/avg_MathReward_reward`: Average math reward |
| 71 | +- `reward/evaluate_response/avg_ThinkingReward_reward`: Average thinking reward |
| 72 | +- `reward/evaluate_response/avg_total_reward`: Average of all rewards |
| 73 | + |
| 74 | +## Differences from Standard GRPO |
| 75 | + |
| 76 | +This app is a modified version of `apps/grpo/main.py`, with these changes: |
| 77 | +1. Added the import `from forge.data.rewards import LanguageReward` |
| 78 | +2. Added `LanguageReward(target_language="ja")` to the list of reward functions |
| 79 | +3. Updated the config to use a different W&B group name |
| 80 | + |
| 81 | +All other training logic remains the same. |
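
The language check that the README describes under "How It Works" can be sketched roughly as follows. This is only an illustration, not the actual `LanguageReward` from `forge.data.rewards`, whose scoring may differ (for example, it may give partial credit). The function name `language_reward` and the `detect` parameter are hypothetical; the detector is passed in so the sketch runs without `langid` installed.

```python
import re
from typing import Callable

# Matches the text inside <think>...</think>, including across newlines.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL | re.IGNORECASE)


def language_reward(
    response: str,
    target_language: str,
    detect: Callable[[str], str],
) -> float:
    """Hypothetical sketch: 1.0 if the thinking text is detected as
    target_language, otherwise 0.0.

    `detect` is any callable that maps text to an ISO 639-1 code.
    """
    # Collect every non-empty thinking block from the response.
    blocks = [block.strip() for block in THINK_BLOCK.findall(response)]
    thinking = " ".join(block for block in blocks if block)

    # No thinking text means there is nothing to reward.
    if not thinking:
        return 0.0

    return 1.0 if detect(thinking) == target_language else 0.0
```

With `langid` installed, a detector could be supplied as `lambda text: langid.classify(text)[0]`, since `langid.classify` returns a `(language, score)` pair.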