
Conversation

Contributor

@Darktex Darktex commented Nov 8, 2025

Summary

I have been thinking about revising our abstractions to really go all-in on MCP and reduce the friction between Prod and Training. If accepted, our RFCs (and code)
will change meaningfully.

We are not at this stage yet. This PR introduces a single consolidated RFC (env_tools_rfc.md) that proposes significant architectural changes. It's meant to
spark discussion and alignment before we commit to implementation.

Key Changes Proposed

  • Environment = MCP Servers + Simulation Layer: Environment wraps and manages MCP servers directly, with a simulation layer (HTTP) that disappears in production
  • Dual protocol architecture: HTTP for simulation control (reset/step), MCP for agent-tool interaction (present in both training and prod)
  • Tool registry with sim/prod mapping: Explicit registry maps production tools to simulation equivalents, with HF Hub integration for community-contributed
    mappings
  • Git-based checkpointing: Automatic transactional state management with async sidecar for performance
  • Event queue as first-class citizen: Empty queue = static env, populated = dynamic env
  • Data + evals included in Environment: Clear simulation boundary, everything needed for training lives in Environment abstraction
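To make the sim/prod tool registry bullet concrete, here is a minimal sketch of what an explicit mapping could look like. All names here (`ToolMapping`, `TOOL_REGISTRY`, `resolve_tool`, the tool names, the Hub repo id) are illustrative assumptions, not from the RFC itself:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch of a registry mapping production MCP tools to
# simulation equivalents, with an optional HF Hub repo for
# community-contributed mappings.

@dataclass(frozen=True)
class ToolMapping:
    prod_tool: str                  # name of the production MCP tool
    sim_tool: str                   # name of the simulation equivalent
    hub_repo: Optional[str] = None  # optional HF Hub repo (illustrative)

TOOL_REGISTRY: dict[str, ToolMapping] = {
    "send_email": ToolMapping("send_email", "sim_send_email"),
    "book_flight": ToolMapping("book_flight", "sim_book_flight",
                               hub_repo="community/sim-travel-tools"),
}

def resolve_tool(name: str, mode: str = "sim") -> str:
    """Return the tool name to call for the current mode ('sim' or 'prod')."""
    mapping = TOOL_REGISTRY[name]
    return mapping.sim_tool if mode == "sim" else mapping.prod_tool
```

The point of the explicit mapping is that the agent-facing tool name stays stable while the backing implementation swaps between simulation (training) and production.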

See rfcs/env_tools_rfc.md for full details, rationale, and design decisions.

Progression Plan

  • Phase 1 (Current): Initial proposal as single RFC explaining the delta
  • Phase 2: Update existing RFCs (000-004) to reflect new design decisions
  • Phase 3: Revise project timelines and release plan to accommodate changes

Discussion Points

  1. Tool registry approach and HF Hub integration strategy
  2. Whether to host sim tools directly on HF Hub (currently marked as "under discussion")
  3. Checkpointing transactionality limitations and annotations
  4. Migration path for existing environments

Feedback welcome on all aspects; this is the right time to course-correct before implementation begins.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Nov 8, 2025
Contributor

init27 commented Nov 10, 2025

@Darktex this is really detailed and thorough! Thank you so much.

Instead of questions, I will detail my thoughts and POV below, and how they were answered as I read through this:

Open Qs that I was brainstorming before reading:

  • How to standardize the registry: I was already brainstorming about the Expedia-like example, and wondering whether there is a simple way to create a registry when dealing with such examples
  • Same question as above: reducing friction for env builders by having a reusable skeleton, which is also addressed by the registry above
  • Real time vs. simulation time: I was interested in the "employee" example quoted as well; how do we tackle real time vs. sim time?

All of these have been well addressed by the RFC.

Some more Qs I have now:

  • Assuming we are dealing with small-model RL, where the model might not be good at tool calling (and might not need it), are we limiting these to sim envs? (Not a pushback, just clarifying.)

It would be really cool to whiteboard an environment + model training as we start Phase 1.

Contributor Author

Darktex commented Nov 10, 2025

Thanks for the kind words!

  • Assuming we are dealing with small-model RL, where the model might not be good at tool calling (and might not need it), are we limiting these to sim envs? (Not a pushback, just clarifying.)

I don't think so. At the end of the day, all you are doing in practice is using fastmcp and a @tool decorator on top of Python functions, so it's super lightweight to migrate Python functions to MCP tools. Whether or not the model has been primed for MCP specifically, it will always need a way of listing all the (valid) actions that it can take, and a standardized way of executing them, providing params etc. So in practice it will always need some MCP-like thing.

The only case where this is not true is if your policy is not an LLM but is instead a "normal" RL policy whose output layer is fixed to exactly N outputs, where N is the number of legal moves. Something like AlphaGo. But for models like that, OpenEnvs is not going to be very useful, because you cannot take AlphaGo and start playing Pokemon anyway. So I think that's a reasonable tradeoff to take: assume we build for LLMs.

It would be really cool to whiteboard an environment + model training as we start Phase 1
ACK. Coming up!!!

@jspisak jspisak added enhancement New feature or request RFC labels Nov 11, 2025
Contributor

zkwentz commented Nov 11, 2025

Ack'ing this for now. This led to a bigger convo on work chat.

