diff --git a/docs/source/architecture.md b/docs/source/architecture.md
new file mode 100644
index 000000000..b2c3500ff
--- /dev/null
+++ b/docs/source/architecture.md
@@ -0,0 +1,200 @@
+# Architecture
+
+This guide provides a deep dive into TorchForge's architecture, explaining how Monarch, Services, and TorchStore work together to enable distributed RL.
+
+## The Foundation: Monarch
+
+At TorchForge's core is **Monarch**, a PyTorch-native distributed programming framework that brings single-controller orchestration to entire GPU clusters.
+
+### Single-Controller vs SPMD
+
+Traditional distributed training uses **SPMD (Single Program, Multiple Data)** - where multiple copies of the same script run across different machines, each with only a local view of the workflow. This works well for simple data-parallel training, but becomes notoriously difficult for complex RL workflows with:
+- Asynchronous generation and training
+- Multiple heterogeneous components (policy, reward model, reference model)
+- Dynamic resource allocation
+- Fault tolerance across components
+
+**Monarch's single-controller model** changes this entirely. You write one Python script that orchestrates all distributed resources, making them feel almost local. The code looks and feels like a single-machine program, but can scale across thousands of GPUs.
+
+### Actor Meshes
+
+Monarch organizes resources into multidimensional arrays called **meshes**:
+
+**Process Mesh**
+: An array of processes spread across many hosts, typically one process per GPU
+
+**Actor Mesh**
+: An array of actors, each running inside a separate process
+
+Like array programming in NumPy or PyTorch, meshes make it simple to dispatch operations efficiently across large systems. You can slice meshes, broadcast operations, and operate on entire meshes with simple APIs.
+
+```python
+from monarch.actor import Actor, endpoint, this_host
+
+# Create a process mesh with 8 processes (one per GPU)
+procs = this_host().spawn_procs({"gpus": 8})
+
+# Define an actor
+class PolicyActor(Actor):
+    @endpoint
+    def generate(self, prompt):
+        return self.model.generate(prompt)
+
+# Spawn actors across the mesh
+actors = procs.spawn("policy", PolicyActor)
+
+# Call methods on the entire mesh
+results = actors.generate.call_all("Hello world")
+```
+
+### Fault Tolerance
+
+Monarch provides **progressive fault handling** - you write your code as if nothing fails. When something does fail, Monarch fails fast by default, stopping the whole program like an uncaught exception.
+
+But you can progressively add fine-grained fault handling exactly where you need it:
+
+```python
+try:
+    result = await policy.generate.route(prompt)
+except ActorFailure:
+    # Handle failure - maybe retry with different replica
+    result = await policy.generate.route(prompt)
+```
+
+For long-running RL training, this is crucial. Hardware failures are common at scale - in Meta's Llama 3 training, there were 419 interruptions across 54 days on a 16K GPU job (roughly one failure every 3 hours).
+
+### RDMA and Data Plane
+
+Monarch separates the **control plane** (messaging) from the **data plane** (bulk data transfers). This enables direct GPU-to-GPU memory transfers across your cluster using RDMA (Remote Direct Memory Access).
+
+Control commands go through one optimized path, while large data transfers (like model weights) go through another path optimized for bandwidth.
+
+## Services: RL-Friendly Actor Abstraction
+
+**Services** wrap Monarch's ActorMesh with patterns common in RL. 
A service is a managed group of actor replicas with built-in load balancing, fault tolerance, and routing primitives. + +```python +# Create a policy service with 16 replicas, each using 8 processes +policy = PolicyActor.options( + procs=8, + with_gpus=True, + num_replicas=16 +).as_service() +``` + +### Service Adverbs + +Services provide intuitive operations called "adverbs": + +**route()** +: Load-balanced request to one replica +```python +response = await policy.generate.route(prompt) +``` + +**fanout()** +: Broadcast to ALL replicas in parallel +```python +await policy.update_weights.fanout(version) +``` + +**session()** +: Sticky sessions for stateful operations (maintains KV cache consistency) +```python +async with policy.session(): + response1 = await policy.generate.route(prompt1) + response2 = await policy.generate.route(prompt2) # Same replica +``` + +### Why Services Matter for RL + +Services solve critical infrastructure challenges: + +**Heterogeneous Scaling** +: Different components need different resources. Your policy might need 16 replicas × 8 processes for high-throughput vLLM inference. Your reward model might need 4 replicas × 4 processes. Your coding environment might need 16 lightweight CPU-only replicas. Services let each component scale independently. + +**Load Balancing** +: In async RL, multiple `continuous_rollouts()` tasks run concurrently. Services automatically distribute these rollouts across available replicas - no manual worker pool management. + +**Fault Tolerance** +: If a replica fails during a rollout, services detect it, mark it unhealthy, and route subsequent requests to healthy replicas. The failed replica gets restarted automatically. Your RL code never sees the failure. + +**Ephemeral Infrastructure** +: Services are created with your job and torn down when finished. Want to try a new reward model? Change your Python code. No standing deployments to maintain, no infrastructure to provision ahead of time. + +## TorchStore: Distributed Weight Storage + +In async RL, every training step produces new policy weights that must propagate to all inference replicas. For a 70B parameter model across 16 replicas, this means moving hundreds of gigabytes of data. **TorchStore** makes this efficient. + +### The Weight Synchronization Challenge + +Traditionally, you have two options: +1. **Build complex p2p mappings** between training and inference sharding strategies (fast but extremely complex) +2. **Use network filesystem** like NFS (simple but slow, with high infrastructure cost) + +TorchStore combines the **UX of central storage** with the **performance of in-memory p2p operations**. + +### How TorchStore Works + +TorchStore is a distributed, in-memory key-value store for PyTorch tensors, built on Monarch primitives: + +```python +import torchstore as ts +from torch.distributed._tensor import distribute_tensor, Shard +from torch.distributed.device_mesh import init_device_mesh + +# Training process: store sharded weights +async def store_weights(): + device_mesh = init_device_mesh("cuda", (4,)) + tensor = model.state_dict()['layer.weight'] + dtensor = distribute_tensor(tensor, device_mesh, [Shard(0)]) + + # Each rank stores its shard + await ts.put("policy_weights_v123", dtensor) + +# Inference process: fetch with different sharding +async def load_weights(): + device_mesh = init_device_mesh("cuda", (2, 2)) # Different topology! 
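+    # Assumption in this sketch: inference also has 4 GPUs, but arranged
+    # as a 2x2 mesh rather than the 1-D 4-way mesh used by the trainer.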
+ tensor = torch.empty_like(model.state_dict()['layer.weight']) + dtensor = distribute_tensor(tensor, device_mesh, [Shard(0)]) + + # TorchStore handles resharding automatically + await ts.get("policy_weights_v123", dtensor) +``` + +**Key Features:** + +**Automatic Resharding** +: Handles complex weight transfer between different sharding strategies transparently + +**DTensor Native** +: Works seamlessly with PyTorch's distributed tensors + +**RDMA Transfers** +: Uses RDMA for high-bandwidth data movement without blocking GPUs + +**Asynchronous Updates** +: Training and inference can read/write weights independently, enabling true async RL + +**Flexible Storage** +: Store tensors co-located with trainers, on their own storage tier, sharded or replicated - change with minimal code modifications + +### Why TorchStore Matters + +Weight synchronization becomes a bottleneck in async RL. Traditional approaches either: +- Require synchronous GPU-to-GPU transfers (blocking training) +- Use slow network filesystems (minutes per update) +- Demand complex manual resharding logic (error-prone, hard to maintain) + +TorchStore solves all of these, keeping data distributed across the cluster until requested and moving it efficiently with RDMA. + +## Distributed Training Strategies + +TorchForge leverages multiple parallelism strategies through TorchTitan. [See their docs for more](https://github.com/pytorch/torchtitan). + +## See Also + +- {doc}`concepts` - Core philosophy and key abstractions +- {doc}`technology_stack` - Understanding the dependency stack +- {doc}`rl_workflows` - Writing RL algorithms with these components +- {doc}`getting_started` - Installation and setup diff --git a/docs/source/concepts.md b/docs/source/concepts.md new file mode 100644 index 000000000..15fc3cece --- /dev/null +++ b/docs/source/concepts.md @@ -0,0 +1,150 @@ +# Concepts + +This guide introduces the fundamental principles and concepts behind TorchForge, helping you understand the philosophy that drives the system. + +## The Core Philosophy + +TorchForge is built on one principle: **researchers should write algorithms, not infrastructure**. + +The traditional approach to distributed RL requires you to write complex coordination logic, retry mechanisms, resource management, and synchronization code. TorchForge abstracts all of this away, letting you express RL algorithms as naturally as pseudocode while powerful infrastructure handles the distributed complexity underneath. + +## Key Abstractions + +Understanding these core abstractions helps you use TorchForge effectively: + +### Actor + +A component that encapsulates a model along with its execution logic. Actors provide: +- **Isolation**: Independent resources and failure domains +- **Flexibility**: Different parallelism strategies per actor +- **Composability**: Combine actors to create complex pipelines + +### Service + +A managed group of actor replicas with built-in routing, load balancing, and fault tolerance. Services handle operational complexity so your RL code stays clean. Think of services as horizontally scaled actors with automatic load distribution. + +### DTensor (Distributed Tensor) + +A tensor sharded across multiple devices. TorchStore handles resharding DTensors between different topologies automatically, making distributed tensor operations transparent. 
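+
+As a minimal sketch of the idea (assuming a single 4-GPU host and a process group launched via `torchrun`; the shapes and names are illustrative only), a DTensor is created by distributing a regular tensor over a device mesh:
+
+```python
+import torch
+from torch.distributed._tensor import distribute_tensor, Shard
+from torch.distributed.device_mesh import init_device_mesh
+
+# One-dimensional mesh over 4 local GPUs
+mesh = init_device_mesh("cuda", (4,))
+
+# Shard a weight matrix along dim 0: each rank holds a 256 x 1024 slice
+weight = torch.randn(1024, 1024)
+dweight = distribute_tensor(weight, mesh, [Shard(0)])
+```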
+ +### Episode + +A complete RL interaction sequence containing: +- **Prompt**: Input to the policy +- **Response**: Generated output +- **Reward**: Feedback signal +- **Metadata**: Additional context (timestamps, model versions, etc.) + +Episodes flow through your system from generation to replay buffer to training. + +### Replay Buffer + +Stores episodes for training. Can be implemented with various strategies: +- **FIFO**: Simple queue for on-policy algorithms +- **Prioritized**: Importance sampling for off-policy learning +- **Reservoir**: Uniform sampling from history +- **Hybrid**: Mix multiple strategies + +Integrates with TorchStore for efficient distributed storage. + +## Design Principles + +### Single-Controller Model + +Traditional distributed training uses **SPMD (Single Program, Multiple Data)** - where multiple copies of the same script run across different machines, each with only a local view of the workflow. This works well for simple data-parallel training, but becomes notoriously difficult for complex RL workflows with: +- Asynchronous generation and training +- Multiple heterogeneous components (policy, reward model, reference model) +- Dynamic resource allocation +- Fault tolerance across components + +TorchForge adopts **Monarch's single-controller model**: You write one Python script that orchestrates all distributed resources, making them feel almost local. The code looks and feels like a single-machine program, but can scale across thousands of GPUs. + +### Composable Components + +Write your core logic once, compose it into any paradigm: +- **Synchronous on-policy** (PPO, GRPO) +- **Asynchronous off-policy** (continuous rollouts + training) +- **Hybrid approaches** (batch collection with async training) + +The same `generate_episode()` function works everywhere. Just change how you compose it. + +### Ephemeral Infrastructure + +Services are created with your job and torn down when finished: +- No standing deployments to maintain +- No infrastructure to provision ahead of time +- Want to try a new reward model? Change your Python code and rerun + +This dramatically reduces operational overhead and enables rapid experimentation. + +### Progressive Fault Tolerance + +Write code as if nothing fails. When failures do occur: +- Monarch fails fast by default (like uncaught exceptions) +- Add fine-grained fault handling exactly where you need it +- Services automatically route around failed replicas +- Failed actors restart automatically + +You choose your fault tolerance granularity based on your needs. 
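+
+In practice, adding that fine-grained handling is just ordinary exception handling around a service call. This mirrors the sketch in the architecture guide; retrying on the same service (which routes to a healthy replica) is only one possible policy:
+
+```python
+try:
+    result = await policy.generate.route(prompt)
+except ActorFailure:
+    # The failed replica is marked unhealthy, so the retry is routed
+    # to a healthy one.
+    result = await policy.generate.route(prompt)
+```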
+ +## Best Practices + +### Model Selection + +- Start with smaller models for prototyping +- Use pre-configured model setups when available +- Validate configurations before large-scale training + +### Data Preparation + +- Ensure balanced and diverse training data +- Implement proper train/validation splits +- Monitor data quality throughout training +- Verify token distributions match expectations + +### Training Strategy + +- Begin with SFT before attempting GRPO +- Use gradient accumulation for larger effective batch sizes +- Monitor KL divergence to prevent policy collapse +- Implement regular checkpointing for fault tolerance +- Apply warmup schedules for stable training + +### Resource Optimization + +- Profile memory usage to identify bottlenecks +- Tune batch sizes for your hardware configuration +- Consider mixed precision training to reduce memory +- Use appropriate parallelism strategies for your model size + +### Debugging + +- Start with single-GPU training to isolate issues +- Enable verbose logging for distributed runs +- Use profiling tools to identify bottlenecks +- Validate data pipelines before full training +- Monitor loss curves and generation quality + +## Validation + +TorchForge has been validated in real-world deployments: + +- **Stanford Collaboration**: Integration with the Weaver weak verifier project, training models that hill-climb on challenging reasoning benchmarks (MATH, GPQA) +- **CoreWeave**: Large-scale training runs on 512 H100 GPU clusters with smooth, efficient performance +- **Scale**: Tested across hundreds of GPUs with continuous rollouts and asynchronous training + +## Learn More + +Dive deeper into specific topics: + +```{toctree} +:maxdepth: 1 + +architecture +technology_stack +rl_workflows +``` + +**Related Documentation:** +- {doc}`getting_started` - Installation and first training run +- {doc}`api` - Complete API reference diff --git a/docs/source/index.md b/docs/source/index.md index 074fa228f..5167ddaaa 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -140,6 +140,15 @@ Installation, prerequisites, verification, and your first training run. **Time to first run: ~15 minutes** ::: +:::{grid-item-card} 🧠 Core Concepts +:link: concepts +:link-type: doc + +Architecture, Monarch integration, Services, TorchStore, and how everything works together. + +**For understanding the system** +::: + :::{grid-item-card} 💻 Tutorials :link: tutorials :link-type: doc @@ -194,6 +203,7 @@ Before starting significant work, signal your intention in the issue tracker to :caption: Documentation getting_started +concepts tutorials api ``` diff --git a/docs/source/rl_workflows.md b/docs/source/rl_workflows.md new file mode 100644 index 000000000..9a828e271 --- /dev/null +++ b/docs/source/rl_workflows.md @@ -0,0 +1,375 @@ +# RL Workflows + +This guide shows you how to write RL algorithms with TorchForge, from simple episode generation to complex asynchronous training loops. 
+ +## Writing RL Algorithms + +With TorchForge's foundations (Monarch, Services, TorchStore), here's what RL code looks like: + +### Episode Generation + +```python +async def generate_episode(dataloader, policy, reward, replay_buffer): + # Sample a prompt + prompt, target = await dataloader.sample.route() + + # Generate response (vLLM handles this efficiently) + response = await policy.generate.route(prompt) + + # Score the response + reward_value = await reward.evaluate_response.route( + prompt=prompt, + response=response.text, + target=target + ) + + # Store for training + await replay_buffer.add.route( + Episode( + prompt_ids=response.prompt_ids, + response_ids=response.token_ids, + reward=reward_value + ) + ) +``` + +Notice what's **not** here: +- No retry logic +- No resource allocation +- No synchronization code +- No infrastructure complexity + +Just your algorithm. + +### Asynchronous RL + +Compose this into fully async, off-policy training: + +```python +async def async_rl_loop(num_rollout_loops: int): + # Multiple concurrent rollout generators + rollout_tasks = [ + asyncio.create_task(continuous_rollouts()) + for _ in range(num_rollout_loops) + ] + + # Continuous training + training_task = asyncio.create_task(continuous_training()) + + await asyncio.gather(*rollout_tasks, training_task) + +async def continuous_rollouts(): + """Generate rollouts continuously using latest policy.""" + while True: + await generate_episode(dataloader, policy, reward, replay_buffer) + +async def continuous_training(): + """Train continuously on available experience.""" + training_step = 0 + while True: + batch = await replay_buffer.sample.route( + curr_policy_version=training_step + ) + + if batch is None: + await asyncio.sleep(0.1) # Wait for more experience + else: + loss = await trainer.train_step.route(batch) + training_step += 1 + + # Push updated weights (TorchStore handles this) + await trainer.push_weights.route(training_step) + # Broadcast to all policy replicas + await policy.update_weights.fanout(training_step) +``` + +### Synchronous RL + +The same `generate_episode()` function works for on-policy algorithms like PPO - just compose it differently: + +```python +async def synchronous_rl(batch_size: int): + """Synchronous on-policy RL: collect batch, then train.""" + version = 0 + + while True: + # Collect a full batch with current policy version + for _ in range(batch_size): + await generate_episode(dataloader, policy, reward, replay_buffer) + + # Sample the batch we just collected + batch = await replay_buffer.sample.route( + curr_policy_version=version, + batch_size=batch_size + ) + + # Train on the complete batch + loss = await trainer.train_step.route(batch) + + # Update weights in lockstep + await trainer.push_weights.route(version + 1) + await policy.update_weights.fanout(version + 1) + version += 1 +``` + +**The Power of Composition**: Write your rollout logic once, compose it into any paradigm - on-policy, off-policy, or anywhere in between. + +## Extensible Environments + +RL often requires interacting with environments beyond text generation - executing code, using tools, running simulations. TorchForge makes these first-class citizens through the same service abstraction. 
+ +### Code Execution + +For RL on coding tasks (RLVR - Reinforcement Learning with Verifiable Rewards): + +```python +# Lightweight CPU-only service for parallel execution +coder = SandboxedPythonCoder.options( + procs=1, + with_gpus=False, + num_replicas=16 +).as_service() + +# In your RL code +async def generate_episode(): + prompt = await dataloader.sample.route() + code = await policy.generate.route(prompt) + + # Execute safely in sandbox + stdout, stderr = await coder.execute.route(code) + reward = 1.0 if stderr == "" else 0.0 # Reward based on execution + + await replay_buffer.add.route(Episode(...)) +``` + +### Tool Integration + +Services make tools ephemeral - spawn them with your job, scale them independently, tear down when finished. The same coordination primitives work for any environment type. + +```python +# Create a web browsing environment +browser = WebBrowsingEnv.options( + procs=1, + with_gpus=False, + num_replicas=8 +).as_service() + +# Use it in your RL loop +async def generate_episode(): + task = await dataloader.sample.route() + + # Agent decides on actions + action = await policy.generate.route(task) + + # Execute action in browser + result = await browser.execute_action.route(action) + + # Evaluate outcome + reward = await reward_model.evaluate.route(task, result) + + await replay_buffer.add.route(Episode(...)) +``` + +This pattern extends naturally to **agentic workflows** - agents that interact with tools, query APIs, and navigate complex environments while learning from outcomes. + +### Custom Environments + +Build your own environment service: + +```python +from monarch.actor import Actor, endpoint + +class CustomEnv(Actor): + def __init__(self): + # Initialize your environment + self.state = self.reset() + + @endpoint + async def reset(self): + """Reset environment to initial state.""" + return initial_state + + @endpoint + async def step(self, action): + """Execute action and return (observation, reward, done).""" + # Your environment logic here + return observation, reward, done + +# Deploy as a service +env = CustomEnv.options( + procs=1, + num_replicas=16 +).as_service() + +# Use in training +obs = await env.reset.route() +while not done: + action = await policy.act.route(obs) + obs, reward, done = await env.step.route(action) +``` + +## Common Patterns + +### Warmup Phase + +Start training after collecting initial experience: + +```python +async def warmup_then_train(warmup_episodes: int): + # Collect initial experience + for _ in range(warmup_episodes): + await generate_episode(dataloader, policy, reward, replay_buffer) + + # Now start training + await continuous_training() +``` + +### Evaluation Episodes + +Interleave evaluation with training: + +```python +async def train_with_eval(eval_interval: int): + training_step = 0 + + while True: + # Training phase + for _ in range(eval_interval): + await generate_episode(dataloader, policy, reward, replay_buffer) + batch = await replay_buffer.sample.route() + await trainer.train_step.route(batch) + training_step += 1 + + # Evaluation phase + eval_rewards = [] + for _ in range(100): + episode = await generate_episode( + eval_dataloader, policy, reward, None # Don't store in buffer + ) + eval_rewards.append(episode.reward) + + print(f"Step {training_step}: Eval reward = {np.mean(eval_rewards)}") +``` + +### Curriculum Learning + +Gradually increase task difficulty: + +```python +async def curriculum_training(): + difficulty = 0 + + while difficulty < max_difficulty: + # Train on current difficulty + for _ in 
range(episodes_per_difficulty): + prompt = await dataloader.sample.route(difficulty=difficulty) + await generate_episode_with_prompt(prompt, policy, reward, replay_buffer) + + # Evaluate performance + success_rate = await evaluate(policy, difficulty) + + # Move to next difficulty if threshold met + if success_rate > 0.8: + difficulty += 1 + print(f"Advanced to difficulty {difficulty}") +``` + +### Multi-Task Training + +Train on multiple tasks simultaneously: + +```python +async def multi_task_training(tasks: List[str]): + # Create separate dataloaders for each task + dataloaders = {task: create_dataloader(task) for task in tasks} + + while True: + # Sample task uniformly (or with custom distribution) + task = random.choice(tasks) + dataloader = dataloaders[task] + + # Generate episode for this task + await generate_episode(dataloader, policy, reward, replay_buffer) + + # Train uses mixed experience from all tasks + batch = await replay_buffer.sample.route() + await trainer.train_step.route(batch) +``` + +## Debugging Tips + +### Start Small + +Begin with a minimal setup to validate your logic: + +```python +# Single GPU, single replica, synchronous +policy = PolicyActor.options(procs=1, with_gpus=True).as_service() +reward = RewardActor.options(procs=1, with_gpus=True).as_service() + +# Run a few episodes +for _ in range(10): + await generate_episode(dataloader, policy, reward, replay_buffer) +``` + +Once this works, scale up to multi-GPU and async training. + +### Add Logging + +Insert logging at key points: + +```python +async def generate_episode(dataloader, policy, reward, replay_buffer): + start = time.time() + + prompt, target = await dataloader.sample.route() + print(f"Sampled prompt in {time.time() - start:.2f}s") + + gen_start = time.time() + response = await policy.generate.route(prompt) + print(f"Generated response in {time.time() - gen_start:.2f}s") + + reward_start = time.time() + reward_value = await reward.evaluate_response.route(prompt, response.text, target) + print(f"Computed reward in {time.time() - reward_start:.2f}s") + + await replay_buffer.add.route(Episode(...)) + print(f"Total episode time: {time.time() - start:.2f}s") +``` + +### Monitor Metrics + +Track key metrics: + +```python +from collections import deque + +recent_rewards = deque(maxlen=100) +recent_kls = deque(maxlen=100) + +async def continuous_training(): + training_step = 0 + + while True: + batch = await replay_buffer.sample.route() + if batch: + loss, kl = await trainer.train_step.route(batch) + recent_kls.append(kl) + + if training_step % 100 == 0: + print(f"Step {training_step}") + print(f" Avg reward: {np.mean(recent_rewards):.3f}") + print(f" Avg KL: {np.mean(recent_kls):.3f}") + print(f" Loss: {loss:.3f}") + + training_step += 1 +``` + +## See Also + +- {doc}`concepts` - Core philosophy and abstractions +- {doc}`architecture` - How Services and TorchStore enable these patterns +- {doc}`technology_stack` - Understanding the underlying components +- {doc}`usage` - Configuration and practical examples +- {doc}`tutorials` - Step-by-step guides +- {doc}`api` - Complete API reference diff --git a/docs/source/technology_stack.md b/docs/source/technology_stack.md new file mode 100644 index 000000000..1651f26c7 --- /dev/null +++ b/docs/source/technology_stack.md @@ -0,0 +1,120 @@ +# Technology Stack + +TorchForge is built on a carefully curated stack of battle-tested components, each solving specific challenges in distributed RL. 
Understanding this stack helps you troubleshoot issues, optimize performance, and customize your setup. + +## Monarch: The Distributed Foundation + +**What it is:** Monarch is a PyTorch-native distributed programming framework that brings single-controller orchestration to entire clusters. It's implemented with a Python frontend and Rust backend for performance and robustness. + +**Why TorchForge needs it:** +- **Single-Controller Model**: Write code that looks like a single Python program but scales to thousands of GPUs +- **Actor Meshes**: Organize processes and actors into scalable, multi-dimensional arrays +- **Fault Tolerance**: Progressive fault handling with fast failure detection and recovery +- **RDMA Support**: Direct GPU-to-GPU memory transfers for efficient data movement + +**What it solves:** Traditional SPMD (Single Program, Multiple Data) approaches require complex coordination logic in your code. Monarch abstracts this away, letting you write RL algorithms naturally while it handles distributed complexity underneath. + +**Technical capabilities:** +- Scalable messaging with multicast trees +- Multipart messaging for zero-copy data transfers +- Integration with PyTorch's distributed primitives +- Separation of control plane (messaging) and data plane (bulk transfers) + +**Where you see it:** Every service creation, actor spawn, and distributed operation in TorchForge runs on Monarch primitives. It's the invisible orchestration layer that makes distributed RL feel simple. + +## vLLM: High-Performance Inference + +**What it is:** A fast and memory-efficient inference engine optimized for large language models, version 0.10.0 or higher required. + +**Why TorchForge needs it:** +- **PagedAttention**: Memory-efficient attention mechanism that reduces memory fragmentation +- **Continuous Batching**: Dynamic batching that maximizes GPU utilization +- **High Throughput**: Handles generation for multiple concurrent rollouts efficiently +- **Production-Ready**: Battle-tested at scale with proven performance + +**What it solves:** In RL for LLMs, policy generation is often the bottleneck. Autoregressive generation is costly, and blocking training on it kills throughput. vLLM enables fast, efficient inference that doesn't bottleneck your training loop. + +**Integration depth:** TorchForge integrates directly with vLLM's engine, giving you access to customize generation strategies, memory management, and inference logic as your research demands. You control the vLLM configuration while TorchForge handles distributed orchestration. + +**Where you see it:** Every policy generation call in your RL code runs through vLLM, whether you're doing synchronous PPO-style rollouts or fully asynchronous off-policy training. + +## TorchTitan: Production Training Infrastructure + +**What it is:** Meta's production-grade LLM training platform with advanced parallelism support, used to train Llama models on thousands of GPUs. + +**Why TorchForge needs it:** +- **FSDP (Fully Sharded Data Parallel)**: Shard parameters, gradients, and optimizer states across GPUs +- **Pipeline Parallelism**: Split model layers across devices with efficient micro-batching +- **Tensor Parallelism**: Split individual tensors across devices for very large models +- **Proven at Scale**: Battle-tested optimizations from production training runs + +**What it solves:** Modern models are too large to fit on single GPUs. 
TorchTitan provides the sophisticated sharding and parallelism strategies needed to train them efficiently, with optimizations proven in production environments. + +**Integration depth:** TorchForge provides direct access to TorchTitan's training step logic and sharding strategies, enabling experimentation without framework constraints. You can customize the training loop while leveraging TorchTitan's proven infrastructure. + +**Where you see it:** Policy training, whether supervised fine-tuning or RL policy updates, runs through TorchTitan's training infrastructure with your choice of parallelism strategies. + +## TorchStore: Distributed Weight Storage + +**What it is:** A distributed, in-memory key-value store for PyTorch tensors, built on Monarch primitives, designed specifically for weight synchronization in distributed RL. + +**Why TorchForge needs it:** +- **Automatic Resharding**: Handles complex weight transfer between different sharding strategies +- **DTensor Support**: Native support for distributed tensors with automatic topology conversion +- **RDMA Transfers**: High-bandwidth weight movement without synchronous GPU transfers +- **Asynchronous Updates**: Training and inference can read/write weights independently + +**What it solves:** In async RL, new policy weights must propagate to all inference replicas. For a 70B parameter model across 16 replicas, this means moving hundreds of gigabytes. Traditional approaches either require synchronous GPU-to-GPU transfers (blocking training), use slow network filesystems (minutes per update), or demand complex manual resharding logic (error-prone). TorchStore solves all of these. + +**Technical capabilities:** +- Simple key-value interface with complex optimizations underneath +- Stays distributed across the cluster until requested +- Flexible storage: co-located with trainers, on storage tier, sharded or replicated + +**Where you see it:** Weight synchronization between training and inference, allowing training to continue while inference replicas asynchronously fetch updated weights without blocking either process. + +## PyTorch Nightly: Cutting-Edge Features + +**Why Nightly is required:** TorchForge leverages the latest PyTorch features that aren't yet in stable releases: +- **Native DTensor Support**: Distributed tensors that span multiple devices with automatic sharding +- **Compiled Mode Optimizations**: Performance improvements through torch.compile +- **Advanced Memory Management**: Latest FSDP and memory optimization features +- **Bug Fixes**: Continuous improvements to distributed training primitives + +**Where you see it:** Every tensor operation, distributed primitive, and training optimization builds on PyTorch nightly's latest capabilities. + +## The Integration Philosophy + +TorchForge made a conscious decision not to reinvent the wheel. Instead, we integrate battle-tested components and add the coordination layer that makes them work together seamlessly. + +**What you get:** +- **Direct component access**: Customize deeply when your research demands it +- **Proven performance**: Battle-tested at massive scale in production environments +- **Flexible composition**: Mix and match components or replace them with custom implementations +- **Simplified orchestration**: TorchForge coordinates these components so you write algorithms, not infrastructure + +**TorchForge's role:** Coordination. 
We make these powerful but complex components work together seamlessly, exposing simple APIs for distributed RL while preserving deep customization capabilities when you need them. + +## Installation + +All these components are installed automatically through TorchForge's installation script: + +```bash +git clone https://github.com/meta-pytorch/forge.git +cd forge +conda create -n forge python=3.10 +conda activate forge +./scripts/install.sh +``` + +The script provides pre-built wheels for PyTorch nightly, Monarch, vLLM, and TorchTitan, ensuring compatibility and reducing installation time. + +See {doc}`getting_started` for detailed installation instructions and troubleshooting. + +## See Also + +- {doc}`concepts` - Core philosophy and key abstractions +- {doc}`architecture` - How Monarch, Services, and TorchStore work together +- {doc}`rl_workflows` - Using these components to write RL algorithms +- {doc}`getting_started` - Installation and setup guide +- {doc}`usage` - Practical configuration examples