Commit 4e752b2

add a decision to RFC about reward computation
1 parent 0908733 commit 4e752b2

1 file changed: +31 -2 lines changed

rfcs/001-openenv-spec.md

Lines changed: 31 additions & 2 deletions
@@ -156,7 +156,36 @@ These three APIs establish the minimum viable interface for environment interact
 
 **Scope**: This RFC focuses exclusively on these baseline APIs. Additional APIs (e.g., `render()`, `seed()`, `close()`, `tools()` and environment-specific utilities) will be explored in follow-up RFCs.
 
-#### Decision 2: HTTP-Based Communication
+#### Decision 2: Environment-Computed Rewards
+
+**Chosen Approach**: Rewards are computed inside the environment and returned as part of the observation.
+
+**Rationale**:
+- **Encapsulation**: Reward logic stays with the environment where domain knowledge resides
+- **Consistency**: Ensures reward computation is deterministic and reproducible across different client implementations
+- **Flexibility**: Environments can use internal state and context not visible to clients for reward computation
+- **Standard Pattern**: Aligns with Gymnasium/Gym conventions where rewards are returned from `step()`
+
+The `Observation` base class includes a `reward` field that environments populate:
+
+```python
+@dataclass(kw_only=True)
+class Observation:
+    """Base class for all environment observations."""
+    done: bool = False
+    reward: Union[bool, int, float, None] = None
+    metadata: Dict[str, Any] = field(default_factory=dict)
+```
+
+This design enables environments to compute rewards based on:
+- Action outcomes (e.g., exit codes, success/failure)
+- Internal state transitions
+- Multi-step trajectories
+- Domain-specific metrics
+
+Clients receive fully-formed observations with rewards already computed, simplifying the client-side RL loop.
+
+#### Decision 3: HTTP-Based Communication
 
 **Chosen Approach**: Use HTTP/REST for client-server communication
 
@@ -166,7 +195,7 @@ These three APIs establish the minimum viable interface for environment interact
 - Supports language-agnostic clients
 - FastAPI provides excellent developer experience
 
-#### Decision 3: Docker-Based runtime isolation and packaging
+#### Decision 4: Docker-Based runtime isolation and packaging
 
 **Chosen Approach**: Each environment runs in its own Docker container
 