Commit 033fa60

Add LanguageReward for training models to think in target language
This commit introduces a new reward function that encourages models to think in a specific target language within their <think> tags.

Key changes:
- Add LanguageReward class to src/forge/data/rewards.py
  - Uses langid for language detection
  - Configurable target language (ISO 639-1 codes)
  - Returns full_reward for language match, no_match_reward otherwise
  - Raises helpful error if langid not installed
- Add comprehensive unit tests in tests/unit_tests/rl/test_language_reward.py
  - Tests for multiple languages (English, Japanese, Chinese, Spanish, etc.)
  - Tests for edge cases and error handling
  - All 28 tests passing
- Create sandbox/grpo_language/ app for experimentation
  - Extends apps/grpo/ with LanguageReward
  - Hardcoded to Japanese (ja) as default target language
  - Includes README with usage instructions
  - Config file for Qwen3-1.7B model

Implementation details:
- Extracts text from <think></think> tags for analysis
- Concatenates multiple thinking blocks for language detection
- Compatible with existing MathReward and ThinkingReward
- Does not add langid to requirements.txt (optional dependency)

Usage:
python -m sandbox.grpo_language.main --config sandbox/grpo_language/qwen3_1_7b.yaml

Note: Requires 'pip install langid' before use
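For reference, a minimal sketch of the behavior described above; the real class in src/forge/data/rewards.py may differ, and the `__call__` interface and default reward values shown here are assumptions:

```python
import re


class LanguageReward:
    """Reward responses whose <think>...</think> text is in the target language."""

    def __init__(self, target_language: str, full_reward: float = 1.0, no_match_reward: float = 0.0):
        try:
            import langid  # optional dependency: pip install langid
        except ImportError as err:
            raise ImportError(
                "LanguageReward requires the 'langid' package; install it with: pip install langid"
            ) from err
        self._langid = langid
        self.target_language = target_language  # ISO 639-1 code, e.g. "ja"
        self.full_reward = full_reward
        self.no_match_reward = no_match_reward

    def __call__(self, response: str) -> float:
        # Pull out all <think>...</think> blocks and concatenate them for detection.
        blocks = re.findall(r"<think>(.*?)</think>", response, flags=re.DOTALL)
        thinking = " ".join(block.strip() for block in blocks).strip()
        if not thinking:
            return self.no_match_reward
        detected_language, _score = self._langid.classify(thinking)
        return self.full_reward if detected_language == self.target_language else self.no_match_reward
```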
1 parent 84d02f6 commit 033fa60

File tree

5 files changed, +1096 -0 lines changed


sandbox/grpo_language/README.md

Lines changed: 81 additions & 0 deletions
# GRPO with Language Reward

This sandbox app demonstrates GRPO training with a language reward that encourages the model to think in a specific target language.

## Overview

This app extends the standard GRPO training (from `apps/grpo/`) by adding a `LanguageReward` that evaluates whether the model's thinking (text within `<think></think>` tags) is in the target language.

## Key Features

- **Multi-objective training**: Combines three rewards:
  - `MathReward`: Evaluates correctness of math answers
  - `ThinkingReward`: Encourages use of thinking tags
  - `LanguageReward`: Rewards thinking in the target language (Japanese by default)
- **Language detection**: Uses `langid` to detect the language of thinking blocks
- **Configurable target language**: While this app defaults to Japanese (`ja`), the `LanguageReward` can be configured for any ISO 639-1 language code

## Requirements

Before running this app, install the required language detection library:

```bash
pip install langid
```
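A quick way to confirm the install and see what `langid` returns (its `classify` function yields a `(language, score)` pair):

```python
import langid

# Should print a pair like ('ja', <score>) for Japanese input.
print(langid.classify("これは日本語の文章です。"))
```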
## Usage

```bash
python -m sandbox.grpo_language.main --config sandbox/grpo_language/qwen3_1_7b.yaml
```

## How It Works

1. The model receives a math problem and is instructed to use `<think>` tags for reasoning
2. During training, the model generates responses with thinking blocks
3. Three rewards are computed:
   - Math correctness (did it get the right answer?)
   - Thinking usage (did it use thinking tags properly?)
   - Language usage (did it think in Japanese?)
4. The model is trained to maximize all three rewards
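How the three scores feed into step 4 is handled by the training loop in `apps/grpo/`; as a rough illustration only (simple averaging is an assumption based on the `avg_total_reward` metric listed under Metrics below), the combined signal might look like:

```python
def total_reward(math_score: float, thinking_score: float, language_score: float) -> float:
    """Combine the three per-response rewards into a single scalar (simple mean)."""
    scores = [math_score, thinking_score, language_score]
    return sum(scores) / len(scores)

# Example: correct answer and proper <think> usage, but thinking not in Japanese.
print(total_reward(1.0, 1.0, 0.0))  # ~0.67
```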
## Configuration

The target language is hardcoded as Japanese in `main.py` (line 321):

```python
LanguageReward(target_language="ja")
```

To use a different language, modify this line with the appropriate ISO 639-1 code:

- English: `"en"`
- Chinese: `"zh"`
- Spanish: `"es"`
- French: `"fr"`
- etc.
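For example, to reward thinking in Spanish instead, the same line becomes:

```python
LanguageReward(target_language="es")
```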
## Expected Behavior

Over the course of training, the model should learn to:

1. Solve math problems correctly
2. Use `<think></think>` tags for its reasoning
3. Write its thinking in Japanese (or the configured target language)

## Metrics

The following metrics are logged to W&B:

- `reward/evaluate_response/avg_LanguageReward_reward`: Average language reward
- `reward/evaluate_response/avg_MathReward_reward`: Average math reward
- `reward/evaluate_response/avg_ThinkingReward_reward`: Average thinking reward
- `reward/evaluate_response/avg_total_reward`: Average of all rewards

## Differences from Standard GRPO

This is a modified version of `apps/grpo/main.py` with:

1. Added import: `from forge.data.rewards import LanguageReward`
2. Modified reward functions list to include `LanguageReward(target_language="ja")`
3. Updated config to use a different W&B group name

All other training logic remains the same.
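For reference, item 2 above amounts to something like the following in the sandbox's `main.py` (the import path for `MathReward` and `ThinkingReward`, the variable name, and their no-argument constructors are assumptions):

```python
from forge.data.rewards import LanguageReward, MathReward, ThinkingReward

# Same rewards as apps/grpo, with LanguageReward targeting Japanese added.
reward_functions = [
    MathReward(),
    ThinkingReward(),
    LanguageReward(target_language="ja"),
]
```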
