1 parent 17a71ee commit ad78967
examples/grpo_blackjack/README.md
@@ -74,17 +74,6 @@ env.close() # Cleanup

 Change one environment variable → train on different games!

-### GRPO: Group Relative Policy Optimization
-
-GRPO is more efficient than PPO (used by ChatGPT):
-
-| Algorithm | Models Needed | Memory | Speed |
-|-----------|---------------|--------|-------|
-| PPO (ChatGPT) | 3 (Policy, Reference, Value) | High | Slower |
-| **GRPO (DeepSeek R1)** | **2 (Policy, Reference)** | **Lower** | **Faster** |
-
-Key insight: Sample the model multiple times per question, compute group statistics → no Value Model needed!

 ### Forge: PyTorch-Native Agentic RL

 Forge handles all distributed systems complexity:
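The removed section's key insight — sample several completions per prompt and normalize rewards against the group's own statistics, so no learned value model is required — can be sketched in a few lines. This is an illustrative sketch only; the function and variable names are hypothetical and not part of Forge's or DeepSeek's API.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Compute group-relative advantages for one prompt's sampled completions.

    rewards: list of scalar rewards, one per completion sampled for the
             SAME prompt. Each reward is normalized by the group's mean
             and standard deviation, replacing a learned value baseline.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    # eps guards against division by zero when all rewards are identical
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four completions sampled for one blackjack hand,
# two wins (reward 1.0) and two losses (reward 0.0)
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Winning completions get a positive advantage and losing ones a negative advantage, purely from within-group comparison — which is why the table above lists GRPO as needing only two models instead of PPO's three.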