Add batch invariant RL post #113
Conversation
Downloaded GitHub-hosted images to local assets directory and updated all image references to use local paths. Converted standalone images to markdown syntax while keeping centered images as HTML img tags for proper rendering. Signed-off-by: Bram Wasti <bwasti@meta.com>
Added Jekyll frontmatter with layout, title, and author metadata to properly render the blog post. Signed-off-by: Bram Wasti <bwasti@meta.com>
Restored the original width and height attributes (340x130 and 480x355) for the two centered images to maintain their fixed sizing. Signed-off-by: Bram Wasti <bwasti@meta.com>
Added horizontal rule and italic formatting to the acknowledgements section for better visual separation and styling. Signed-off-by: Bram Wasti <bwasti@meta.com>
> In the septillions of flops used to pre-train models, this mismatch between values has largely been avoidable. Pre-training typically runs at a fixed batch size, which induces the same reduction kernels to be run, often side-stepping the issue entirely.
>
> Reinforcement learning, on the other hand, seems to almost exclusively run different reduction algorithms due to its inference-heavy (and thus largely latency- and memory-bound) nature. Kernels optimized for low-batch-size inference typically run reductions all at once, whereas kernels for training models parallelize heavily to reuse data and amp up compute utilization. That means the generators and the trainers are typically running completely different kernels!
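To make that mismatch concrete, here is a minimal NumPy sketch (mine, not from the post): float32 addition is not associative, so a strict left-to-right reduction and a two-level reduction over the same values can round differently and disagree in the low bits.

```python
import numpy as np

x = np.random.default_rng(0).standard_normal(100_000).astype(np.float32)

# Strict left-to-right reduction, one element at a time.
seq = np.float32(0.0)
for v in x:
    seq = seq + v

# Two-level reduction: sum 100 chunks of 1000, then sum the partial results.
chunked = x.reshape(100, 1000).sum(axis=1).sum()

print(seq, chunked, seq == chunked)  # the sums typically differ in the low bits
```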
> Kernels optimized for low-batch-size inference typically run reductions all at once
I don't understand this part. Are you talking about reductions like in RMS norm?
In the kernels, they don't tile. Let me use the word "tile" for clarity.
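To illustrate that "tile" distinction, here is a hedged sketch; the dispatch rule below is hypothetical and only stands in for real kernel selection. An inference-style kernel reduces a row all at once in one sequential pass, while a training-style kernel splits the row into tiles, reduces each tile, and combines the partial sums, so the same row need not reduce to the same float32 value at different batch sizes.

```python
import numpy as np

def reduce_row(row: np.ndarray, batch_size: int) -> np.float32:
    # Hypothetical dispatch rule, for illustration only: batch size 1 gets a
    # single all-at-once pass; larger batches get a tiled (split) reduction.
    tile = len(row) if batch_size == 1 else 256
    partials = []
    for start in range(0, len(row), tile):
        acc = np.float32(0.0)
        for v in row[start:start + tile]:  # reduce one tile sequentially
            acc = acc + v
        partials.append(acc)
    total = np.float32(0.0)
    for p in partials:  # combine the per-tile partial sums
        total = total + p
    return total

row = np.random.default_rng(0).standard_normal(4096).astype(np.float32)
print(reduce_row(row, batch_size=1) == reduce_row(row, batch_size=64))  # often False
```

On this reading, a batch-invariant kernel would pin one reduction strategy regardless of batch size, so the generator and the trainer produce bitwise-identical values.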
As in the title; the text can be found in the PR content.