# Training LLMs to Play BlackJack with GRPO + OpenEnv

This example demonstrates how to train language models to play BlackJack using **GRPO (Group Relative Policy Optimization)** and **OpenEnv**.

## 🎯 What This Example Shows

- **OpenEnv**: Universal RL environment interface for 70+ environments
- **GRPO**: An efficient RL algorithm (used by DeepSeek R1) that needs only two models (policy and reference) instead of PPO's three
- **Forge**: PyTorch-native agentic RL library for production training
- **End-to-End Training**: From a random policy (~35% win rate) to a trained agent

## 📁 Files

- `grpo_blackjack_tutorial.ipynb` - Interactive tutorial notebook (recommended starting point)
- `grpo_utils.py` - Production GRPO utilities and helper functions
- `blackjack.yaml` - Training configuration file
- `README.md` - This file

## 🚀 Quick Start

### Prerequisites

1. **Install OpenEnv**:
```bash
# Clone OpenEnv repo
git clone https://github.com/meta-pytorch/OpenEnv.git
cd OpenEnv
pip install -e .
```

2. **Install Forge** (PyTorch's agentic RL library):
```bash
git clone https://github.com/meta-pytorch/torchforge.git
cd torchforge
pip install -e .
```

3. **Start OpenEnv BlackJack Server**:
```bash
# In a separate terminal
export OPENENV_PATH="/path/to/OpenEnv/src"
export PYTHONPATH="${OPENENV_PATH}:${PYTHONPATH}"

OPENSPIEL_GAME=blackjack python -m envs.openspiel_env.server.app --port 8004
```

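Optionally, you can confirm that something is listening on the server port before moving on. This is just a generic TCP check, not part of the OpenEnv API:

```python
# Generic connectivity check: verifies a process is listening on port 8004.
import socket

with socket.create_connection(("localhost", 8004), timeout=5):
    print("BlackJack environment server is reachable on port 8004")
```
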
### Run the Tutorial

Open the Jupyter notebook:
```bash
jupyter notebook grpo_blackjack_tutorial.ipynb
```

Follow the cells to:
1. **Explore OpenEnv** - Connect to the BlackJack environment
2. **Benchmark the baseline** - Test random-policy performance
3. **Learn about GRPO** - Understand the training algorithm
4. **Train with Forge** - Run production GRPO training
5. **Switch environments** - See how to train on other games

## 📚 What You'll Learn

### OpenEnv: Universal RL Environment Spec

OpenEnv is **not a game engine** - it's a **specification** that wraps ANY RL environment:

```python
# The same interface works for 70+ environments
result = env.reset()       # Start an episode
result = env.step(action)  # Take an action
state = env.state()        # Get the current state
env.close()                # Clean up
```

Change one environment variable → train on a different game!
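
Written against that generic interface, a random-policy baseline rollout looks roughly like the sketch below. The client constructor and the observation/action field names here are placeholders, not the confirmed OpenEnv API; the tutorial notebook shows the exact names:

```python
# Placeholder sketch: the helper and field names below are illustrative only.
# See grpo_blackjack_tutorial.ipynb for the real client class and fields.
import random

env = make_blackjack_client("http://localhost:8004")  # hypothetical constructor for the Quick Start server

result = env.reset()                              # start a new BlackJack hand
while not result.done:                            # placeholder episode-done flag
    action = random.choice(result.legal_actions)  # placeholder list of legal actions
    result = env.step(action)                     # same step() call for every OpenEnv environment
print("Episode reward:", result.reward)           # +1 win, -1 loss, 0 push
env.close()
```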

### GRPO: Group Relative Policy Optimization

GRPO is more efficient than PPO (the algorithm behind ChatGPT):

| Algorithm | Models Needed | Memory | Speed |
|-----------|---------------|--------|-------|
| PPO (ChatGPT) | 3 (Policy, Reference, Value) | High | Slower |
| **GRPO (DeepSeek R1)** | **2 (Policy, Reference)** | **Lower** | **Faster** |

Key insight: sample the model multiple times per question and compute group statistics → no value model needed!
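
The group statistic is just a per-prompt normalization of the sampled rewards. A minimal sketch of the idea (illustrative, not the exact code from `grpo_utils.py`):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Normalize rewards within one group of completions sampled for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 completions sampled for the same BlackJack prompt
rewards = torch.tensor([2.0, -1.0, -1.0, 0.5])
print(group_relative_advantages(rewards))  # wins get positive advantages, losses negative
```

Because the baseline comes from the group itself, there is no separate value model to train, which is where GRPO's memory savings come from.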

### Forge: PyTorch-Native Agentic RL

Forge handles all of the distributed-systems complexity:
- **Generator (vLLM)**: Fast LLM inference
- **RLTrainer**: Distributed training with FSDP
- **ReplayBuffer**: Off-policy learning
- **ReferenceModel**: KL penalty computation
- **Torchstore**: Distributed weight management

You just write:
```python
trainer = await setup_forge_training("blackjack.yaml")
await trainer.run(steps=100)
```

Everything else is automated!

## 🎓 Educational Resources

This tutorial is inspired by the excellent [Unsloth RL Guide](https://docs.unsloth.ai/get-started/reinforcement-learning-rl-guide). We highly recommend reading it for deeper insights!

### Further Reading

- **OpenEnv**: [GitHub](https://github.com/meta-pytorch/OpenEnv)
- **GRPO Paper**: [arXiv:2402.03300](https://arxiv.org/abs/2402.03300)
- **Forge**: [GitHub](https://github.com/meta-pytorch/torchforge) | [Docs](https://meta-pytorch.org/torchforge/)
- **Unsloth RL Guide**: [docs.unsloth.ai](https://docs.unsloth.ai/get-started/reinforcement-learning-rl-guide)

## 💡 Key Concepts

### "Patience Is All You Need" for RL

RL works by patience: as long as the correct answer has *any* non-zero probability, sampling will eventually find it. In the meantime:
1. Learn from **bad answers** → decrease their probability
2. When **good answers** are found → increase their probability

Over time, the model learns not just *what* to do, but *why* (the reasoning process).

### Reward Functions

Reward functions tell the model what is good and what is bad. For BlackJack:

```python
def evaluate_response(prompt, response, game_reward):
    reward = float(game_reward)  # +1 (win), -1 (loss), 0 (push)

    # Reward shaping
    if game_reward > 0:
        reward = 2.0  # Wins more valuable
    elif game_reward == 0:
        reward = 0.5  # Pushes better than losses

    return reward
```
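
With this shaping, the three possible BlackJack outcomes map to:

```python
print(evaluate_response("<prompt>", "<response>", game_reward=1))   # 2.0  (win)
print(evaluate_response("<prompt>", "<response>", game_reward=0))   # 0.5  (push)
print(evaluate_response("<prompt>", "<response>", game_reward=-1))  # -1.0 (loss)
```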

The key: **Reward functions must be verifiable**. You can verify "is the answer correct?" but not "is this creative?"

## 🔄 Switching to Other Games

The beauty of OpenEnv: **the same code works for any environment!**

### Try Tic-Tac-Toe
```bash
OPENSPIEL_GAME=tic_tac_toe python -m envs.openspiel_env.server.app --port 8005
```
Update the config: `server_url = "http://localhost:8005"`

### Try Chess
```bash
OPENSPIEL_GAME=chess python -m envs.openspiel_env.server.app --port 8006
```

### Try Atari
```bash
python -m envs.atari_env.server.app --game pong --port 8007
```

Everything else stays the same! Same GRPO code, same Forge infrastructure.
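
Concretely, the training entry point does not change; you only point it at a config whose `server_url` (and, if needed, prompt template) matches the new environment. For example, with a hypothetical `tic_tac_toe.yaml` copied from `blackjack.yaml`:

```python
# Same call as before - only the config file (and the server it points at) differs.
trainer = await setup_forge_training("tic_tac_toe.yaml")  # hypothetical config name
await trainer.run(steps=100)
```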

## 🛠️ Customization

All of the code is in `grpo_utils.py`:
- Modify `BlackJackReward.evaluate_response()` for reward shaping
- Adjust `ComputeAdvantages.compute()` for advantage computation
- Tweak `simple_grpo_loss()` for the KL penalty (the `beta` parameter) - see the sketch after this list
- Change `format_prompt()` for different prompt templates
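
For orientation, here is a rough sketch of what a GRPO-style loss with a KL penalty looks like. It is illustrative only, not the exact `simple_grpo_loss()` implementation in `grpo_utils.py`:

```python
import torch

def grpo_loss_sketch(logprobs, ref_logprobs, advantages, beta: float = 0.1):
    """Illustrative GRPO-style objective (not the grpo_utils.py code).

    logprobs:     (B, T) token log-probs under the current policy
    ref_logprobs: (B, T) token log-probs under the frozen reference model
    advantages:   (B,)   group-relative advantage per sampled completion
    beta:         weight of the KL penalty keeping the policy near the reference
    """
    # Policy-gradient term: raise the probability of tokens from high-advantage completions
    pg = -(advantages.unsqueeze(-1) * logprobs).mean()

    # KL penalty (the "k3" estimator commonly used in GRPO implementations)
    log_ratio = ref_logprobs - logprobs
    kl = (log_ratio.exp() - 1.0 - log_ratio).mean()

    return pg + beta * kl
```

Raising `beta` keeps the policy closer to the reference model; lowering it lets the policy move further in pursuit of reward.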

Edit `blackjack.yaml` for:
- Different model sizes (1B to 70B+)
- More training steps
- Larger group sizes
- Parallel rollout collection

## 📊 Expected Results

- **Random policy**: ~35% win rate
- **After GRPO training**: Improves toward optimal BlackJack strategy (~43% win rate)
- **Training time**: Varies with model size and number of training steps

The model learns both the strategy AND the reasoning process (similar to DeepSeek R1's `<think>` tokens).

## 🤝 Credits

- **OpenEnv**: Meta PyTorch team
- **Forge**: Meta PyTorch team
- **GRPO**: DeepSeek research team
- **Tutorial inspiration**: Unsloth team

## 📝 License

This example follows the same license as the parent OpenEnv repository.

## 🙏 Acknowledgments

Big thanks to the **Unsloth team** for their educational approach to RL! This tutorial's GRPO section is heavily inspired by their excellent guide.
