You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
[monarch_hyperactor] propagate proc status in supervision events for proc failures
Pull Request resolved: #1877
Currently, we report the agent to be Stopped. This is accurate, but confusing, and could be better attributed.
Here, we synthesize an actor failure by:
1) attributing the fault to the corresponding actor in the monitored actor mesh;
2) elevating the proc_status (which contains mode of failure, exit code, etc) into the actor failure, making it clear it is a process failure
In the future:
1) We will have a more general "Failure" struct, explicitly capturing host, proc, actor, etc., failures.
2) We will attribute the actor failure (it is the most proximate), but move the proc failure to the "cause" (i.e., proc failure caused actor to fail), which is the most correct and clear.
ghstack-source-id: 323362899
@exported-using-ghexport
Differential Revision: [D86993889](https://our.internmc.facebook.com/intern/diff/D86993889/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D86993889/)!
0 commit comments