@moonli moonli commented Nov 13, 2025

Summary:
Right now, when anything goes wrong with the mesh agent (e.g. the connection goes down or the proc stops), we print a supervision error message like:

```
monarch._rust_bindings.monarch_hyperactor.supervision.SupervisionError: Actor metatls:twshared12411.02.gtn2.facebook.com:36037,anon_0_15HL6RLpNvRw,agent[0] exited because of the following reason: <PyActorSupervisionEvent: metatls:twshared12411.02.gtn2.facebook.com:36037,anon_0_15HL6RLpNvRw,agent[0]: stopped at 2025-11-11 17:31:58.907237439 -08:00>
```

This message names the actor mesh agent, which is a Monarch-internal actor and should not be exposed to customers. It can also mislead users into thinking the problem always lies in Monarch internals.

This diff changes the message so that it tells the user explicitly what the next step of the investigation is. The new log for agent-related errors will look like:

```
twshared234234.gtn3:23232 is not reacheable, check the log on the host for details
```
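To illustrate the kind of rewrite described above, here is a minimal Python sketch. It is not the code in this diff. The ID layout (`<transport>:<host>:<port>,<proc name>,<actor name>[<index>]`) is an assumption inferred only from the sample error above, and `user_facing_message` and `_ACTOR_ID` are hypothetical names. The message wording is modeled on the sample log, with the spelling of "reachable" normalized.

```python
import re

# Assumed ID shape, inferred only from the sample error in this summary:
#   <transport>:<host>:<port>,<proc name>,<actor name>[<index>]
# This is an illustration, not Monarch's real ID grammar (e.g. IPv6 hosts
# would not match this pattern).
_ACTOR_ID = re.compile(
    r"^(?P<transport>[^:,]+):(?P<host>[^:,]+):(?P<port>\d+),"
    r"(?P<proc>[^,]+),(?P<actor>[^\[\],]+)\[(?P<index>\d+)\]$"
)


def user_facing_message(actor_id: str, default: str) -> str:
    """Hypothetical helper: hide the internal agent actor behind a host-level hint.

    If the failing actor is the mesh agent, return a message that points the
    user at the host. Otherwise return `default` unchanged.
    """
    m = _ACTOR_ID.match(actor_id)
    if m is None or m.group("actor") != "agent":
        return default
    return (
        f"{m['host']}:{m['port']} is not reachable, "
        "check the log on the host for details"
    )
```

For the actor ID in the sample error, this sketch yields a message that starts with `twshared12411.02.gtn2.facebook.com:36037`. Errors from any other actor pass through unchanged, so user actor failures keep their original detail.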

A follow-up diff will include a Scuba link in the message; the Scuba view will show the Monarch error log and the stderr error logs.

Differential Revision: D86984496
meta-cla bot added the CLA Signed label on Nov 13, 2025
meta-codesync bot commented Nov 13, 2025

@moonli has exported this pull request. If you are a Meta employee, you can view the originating Diff in D86984496.
