
Conversation


@bwasti bwasti commented Nov 10, 2025

As in the title; the text can be found in the PR content.

Signed-off-by: Bram Wasti <bwasti@fb.com>
Downloaded GitHub-hosted images to local assets directory and updated
all image references to use local paths. Converted standalone images to
markdown syntax while keeping centered images as HTML img tags for
proper rendering.

Signed-off-by: Bram Wasti <bwasti@meta.com>
Added Jekyll frontmatter with layout, title, and author metadata to
properly render the blog post.

Signed-off-by: Bram Wasti <bwasti@meta.com>
Restored the original width and height attributes (340x130 and 480x355)
for the two centered images to maintain their fixed sizing.

Signed-off-by: Bram Wasti <bwasti@meta.com>
Added horizontal rule and italic formatting to the acknowledgements
section for better visual separation and styling.

Signed-off-by: Bram Wasti <bwasti@meta.com>

Across the septillions of flops used to pre-train models, this mismatch between values has largely been avoidable. Pre-training typically runs at a fixed batch size, which causes the same reduction kernels to be run and often side-steps the issue entirely.

Reinforcement learning, on the other hand, seems to almost exclusively run different reduction algorithms due to its inference-heavy (and thus largely latency- and memory-bound) nature. Kernels optimized for low-batch-size inference typically run reductions all at once, whereas kernels for training models parallelize heavily to reuse data and amp up compute utilization. That means the generators and the trainers are typically running completely different kernels!
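To make the kernel mismatch concrete, here is a minimal, hypothetical PyTorch sketch (not code from this PR): reducing the same values all at once versus tile-by-tile generally yields bitwise-different results, because floating-point addition is not associative.

```python
import torch

# Hypothetical sketch (not code from this PR): float addition is not associative,
# so the order a reduction kernel uses changes the result at the bit level.
torch.manual_seed(0)
x = torch.randn(16384, dtype=torch.float32)

# "All at once" style reduction, as a low-batch-size inference kernel might do.
flat_sum = x.sum()

# Tiled reduction, as a training kernel might do: partial sums per tile,
# then a reduction over the partial sums.
tiled_sum = x.view(128, 128).sum(dim=1).sum()

# The two sums usually differ in the last bits (not guaranteed on every backend).
print(flat_sum.item(), tiled_sum.item(), flat_sum.item() == tiled_sum.item())
```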
Collaborator


> Kernels optimized for low-batch size inference typically run reductions all at once

I don't understand this part. Are you talking about reductions like in RMS norm?

Author


In the kernels, they don't tile. Let me use the word "tile" for clarity.

Signed-off-by: Bram Wasti <bwasti@fb.com>