add inplace_copy_batch_to_gpu in TrainPipeline (#3526)
Summary:
Pull Request resolved: #3526
This diff adds support for pre-allocated, in-place copies for host-to-device data transfer in TorchRec train pipelines, addressing the CUDA memory overhead identified in production RecSys models.
## Context
As described in the [RFC on Workplace](https://fb.workplace.com/groups/429376538334034/permalink/1497469664858044/), most RecSys model training pipelines carry an extra CUDA memory overhead of 3-6 GB per rank on top of the active memory snapshot. This overhead stems from PyTorch's caching allocator behavior when a side CUDA stream is used for non-blocking host-to-device transfers: the allocator associates the transferred tensors' memory with the side stream, preventing that memory from being reused on the main stream and causing up to 13 GB of extra memory footprint per rank in production models.
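For illustration, here is a minimal sketch (not the TorchRec implementation) of the standard transfer pattern described above. The tensor names and sizes are made up; the point is that the destination tensor is allocated while the side stream is current, so the caching allocator ties its blocks to that stream:

```python
import torch

copy_stream = torch.cuda.Stream()  # side stream for H2D transfers
host_batch = torch.randn(1024, 1024).pin_memory()

with torch.cuda.stream(copy_stream):
    # Allocation and copy both happen with the side stream current, so the
    # caching allocator associates the new blocks with copy_stream and will
    # not hand them back to later allocations made on the main stream.
    device_batch = host_batch.to("cuda", non_blocking=True)

# Make the main stream wait for the copy, and record the tensor on the
# consumer stream so its memory is not reclaimed while still in use there.
torch.cuda.current_stream().wait_stream(copy_stream)
device_batch.record_stream(torch.cuda.current_stream())
```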
The solution proposed in [D86068070](https://www.internalfb.com/diff/D86068070) pre-allocates memory on the main stream and uses an in-place copy to reduce this overhead. In local train pipeline benchmarks with a 1-GB ModelInput (2 KJTs + float features), this approach reduced the memory footprint by ~6 GB per rank. The optimization unblocks memory-constrained use cases across platforms including APS, Pyper, and MVAI.
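A minimal sketch of the pre-allocation idea, using the same made-up names as above: the destination is allocated while the main stream is current and only the copy kernel runs on the side stream, so the blocks remain reusable by main-stream allocations:

```python
import torch

copy_stream = torch.cuda.Stream()
host_batch = torch.randn(1024, 1024).pin_memory()

# Pre-allocate the destination on the main (default) stream; the caching
# allocator now associates these blocks with the main stream.
device_batch = torch.empty_like(host_batch, device="cuda")

# Order the side stream after any pending main-stream work on the buffer,
# then run only the copy kernel on the side stream.
copy_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(copy_stream):
    device_batch.copy_(host_batch, non_blocking=True)

# The main stream waits for the in-place copy before consuming the batch.
torch.cuda.current_stream().wait_stream(copy_stream)
```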
## Key Changes
1. **Added `inplace_copy_batch_to_gpu` parameter**: A new boolean flag threaded through the train pipeline infrastructure that switches between standard batch copying (direct allocation on the side stream) and in-place copying (pre-allocation on the main stream).
2. **New `inplace_copy_batch_to_gpu()` method**: Implemented in the `TrainPipeline` class to handle the new data transfer pattern with proper stream synchronization, using `_to_device()` with the optional `data_copy_stream` parameter.
3. **Extended `Pipelineable.to()` interface**: Added an optional `data_copy_stream` parameter to the abstract method, allowing implementations to specify which stream should execute the data copy (see #3510); a sketch of one possible implementation follows this list.
4. **Updated benchmark configuration** (`sparse_data_dist_base.yml`; resulting values shown after this list):
- Increased `num_batches` from 5 to 10
- Changed `feature_pooling_avg` from 10 to 30
- Reduced `num_benchmarks` from 2 to 1
- Added `num_profiles: 1` for profiling
5. **Enhanced table configuration**: Added `base_row_size` parameter (default: 100,000) to `EmbeddingTablesConfig` for more flexible embedding table sizing.
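As referenced in item 3 above, here is a sketch of what an implementation of the extended `Pipelineable.to()` interface might look like. `MyBatch` is a hypothetical batch type, and the exact TorchRec signature may differ; see the diff for the authoritative version:

```python
from typing import Optional

import torch


class MyBatch:  # hypothetical Pipelineable implementation
    def __init__(self, features: torch.Tensor) -> None:
        self.features = features

    def to(
        self,
        device: torch.device,
        non_blocking: bool = False,
        data_copy_stream: Optional[torch.cuda.Stream] = None,
    ) -> "MyBatch":
        if data_copy_stream is None:
            # Standard path: allocate and copy on whatever stream is current.
            return MyBatch(self.features.to(device, non_blocking=non_blocking))
        # In-place path: pre-allocate on the current (main) stream, then run
        # only the copy kernel on the supplied side stream. Cross-stream
        # synchronization is left to the pipeline, as in the sketches above.
        dst = torch.empty_like(self.features, device=device)
        with torch.cuda.stream(data_copy_stream):
            dst.copy_(self.features, non_blocking=non_blocking)
        return MyBatch(dst)
```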
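The benchmark configuration changes from item 4 correspond to these values (field names taken from the list above; surrounding keys in `sparse_data_dist_base.yml` omitted):

```yaml
num_batches: 10          # was 5
feature_pooling_avg: 30  # was 10
num_benchmarks: 1        # was 2
num_profiles: 1          # newly added, enables profiling
```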
These changes enable performance and memory comparisons between the standard and in-place copy strategies, with proper benchmarking infrastructure to measure and trace the differences.
Reviewed By: aporialiao
Differential Revision: D86208714
fbshipit-source-id: c7bd9d46d1a9f98a68446b9d4be0f63208b626bf