zdevito commented Nov 13, 2025

zdevito added a commit that referenced this pull request Nov 13, 2025
Differential Revision: [D86925582](https://our.internmc.facebook.com/intern/diff/D86925582/)

ghstack-source-id: 323132637
Pull Request resolved: #1881
meta-cla bot added the CLA Signed label Nov 13, 2025
zdevito added a commit that referenced this pull request Nov 14, 2025
Pull Request resolved: #1881

This changes the ActorSupervisionEvent structure so that we preserve enough information to give a good error message when an actor fails.

The major changes are:
* Removing the jargon prefix `processing error: supervision: `.
* Adding user-understandable actor names.
* Identifying the actual actor that failed, and summarizing the default chain handling so that there are almost no wrappers around the error.
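
To illustrate the third point, here is a minimal sketch of how a chain of supervision events could be collapsed into a single summary. It is not the code in this PR: `SupervisionEvent`, its fields, `root_cause`, and `render` are hypothetical names chosen for the example, and the shortened actor names are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SupervisionEvent:
    """Hypothetical stand-in for a supervision event; not Monarch's real type."""
    actor_name: str  # user-facing name, e.g. "<root>.<Nest actor>"
    error: str       # error text or formatted traceback
    caused_by: Optional["SupervisionEvent"] = None  # event from a descendant, if any


def root_cause(event: SupervisionEvent) -> SupervisionEvent:
    """Follow caused_by links to the event for the actor that actually failed."""
    while event.caused_by is not None:
        event = event.caused_by
    return event


def render(event: SupervisionEvent) -> str:
    """Render one summary for the whole chain instead of one wrapper per hop."""
    cause = root_cause(event)
    if cause is event:
        reason = "This occurred because the actor itself failed."
    else:
        reason = f"This occurred because the actor {cause.actor_name} failed."
    return (
        f"The actor {event.actor_name} and all its descendants have failed.\n"
        f"{reason}\n"
        f"The error was:\n {cause.error}"
    )


# Illustrative chain: a nested actor fails and its parent reports the failure.
nested = SupervisionEvent("<root>.<Nest actor>.<Lambda nested>", "ValueError: Error.")
top = SupervisionEvent("<root>.<Nest actor>", "child failed", caused_by=nested)
print(render(top))
```

Only the outermost actor and the root cause are named, so the length of the message does not grow with the depth of the hierarchy.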

Here are some examples of what it looks like now:

When an actor directly errors:
```
    I AM ABOUT TO ERROR!!!!
    Unhandled monarch error on the top-level client: The actor <root>.<tests.test_supervision_hierarchy.Lambda actor> and all its descendants have failed.
    This occurred because the actor itself failed.
    The error was:
     Traceback (most recent call last):
       File "/data/users/zdevito/fbsource/fbcode/monarch/python/monarch/_src/actor/actor_mesh.py", line 1068, in handle
         response_port.exception(ActorError(e))
       File "/data/users/zdevito/fbsource/fbcode/monarch/python/monarch/_src/actor/actor_mesh.py", line 828, in exception
         raise obj from None
       File "/data/users/zdevito/fbsource/fbcode/monarch/python/monarch/_src/actor/actor_mesh.py", line 1062, in handle
         result = the_method(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
       File "/data/users/zdevito/fbsource/fbcode/monarch/python/tests/test_supervision_hierarchy.py", line 19, in run
         return l()
                ^^^
       File "/data/users/zdevito/fbsource/fbcode/monarch/python/tests/test_supervision_hierarchy.py", line 46, in error
         raise ValueError("Error.")
     ValueError: Error.
```

When a nested actor errors:
```
python/tests/test_supervision_hierarchy.py::test_nested_mesh_kills_actor_actor_error Monarch internal logs are being written to /tmp/zdevito/monarch_log.log
ERRORED THE ACTOR
I AM ABOUT TO ERROR!!!!
Nest still alive 0
Nest still alive 1
Nest still alive 2
Unhandled monarch error on the top-level client: The actor <root>.<tests.test_supervision_hierarchy.Nest actor> and all its descendants have failed.
This occurred because the actor <root>.<tests.test_supervision_hierarchy.Nest actor>.<tests.test_supervision_hierarchy.Lambda nested> failed.
The error was:
 Traceback (most recent call last):
   File "/data/users/zdevito/fbsource/fbcode/monarch/python/monarch/_src/actor/actor_mesh.py", line 1068, in handle
     response_port.exception(ActorError(e))
   File "/data/users/zdevito/fbsource/fbcode/monarch/python/monarch/_src/actor/actor_mesh.py", line 828, in exception
     raise obj from None
   File "/data/users/zdevito/fbsource/fbcode/monarch/python/monarch/_src/actor/actor_mesh.py", line 1062, in handle
     result = the_method(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/users/zdevito/fbsource/fbcode/monarch/python/tests/test_supervision_hierarchy.py", line 19, in run
     return l()
            ^^^
   File "/data/users/zdevito/fbsource/fbcode/monarch/python/tests/test_supervision_hierarchy.py", line 46, in error
     raise ValueError("Error.")
 ValueError: Error.
```

When a proc errors:
```
Unhandled monarch error on the top-level client: The actor <root>.<tests.test_supervision_hierarchy.Nest actor> and all its descendants have failed.
This occurred because the actor unix:@eRu5gzLrP1kdciNpAErvY1Q9,anon_0_16inPfUmdpwZ,agent[0] failed.
The error was:
 The process unix:@eRu5gzLrP1kdciNpAErvY1Q9 owned by this actor became unresponsive and is assumed dead, check the log on the host for details
```

The proc error includes the changes added in D86984496 to make agent failures cleaner. We should eventually improve this further by generating a supervision event specific to process failure as noticed by the host agent. That event should include a friendly name for the process (the process's name given during spawn, and its owning actor).
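
As a rough sketch of what such a process-specific event might carry, assuming nothing about the eventual design: `ProcFailure`, its fields, and the spawn name `"workers"` are all hypothetical, and only the address and the message wording are taken from the example output above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcFailure:
    """Hypothetical process-failure event; field names are illustrative only."""
    proc_name: str    # name given to the process at spawn time
    owner_actor: str  # user-facing name of the actor that owns the process
    address: str      # transport address, kept so host logs can be looked up

    def message(self) -> str:
        # Lead with the friendly name; keep the raw address in parentheses.
        return (
            f"The process {self.proc_name!r} ({self.address}) owned by "
            f"{self.owner_actor} became unresponsive and is assumed dead, "
            "check the log on the host for details"
        )


event = ProcFailure(
    proc_name="workers",  # made-up spawn name for illustration
    owner_actor="<root>.<tests.test_supervision_hierarchy.Nest actor>",
    address="unix:@eRu5gzLrP1kdciNpAErvY1Q9",
)
print(event.message())
```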

ghstack-source-id: 323410911

Differential Revision: [D86925582](https://our.internmc.facebook.com/intern/diff/D86925582/)
zdevito added a commit that referenced this pull request Nov 14, 2025
ghstack-source-id: 323426577
zdevito added a commit that referenced this pull request Nov 14, 2025
ghstack-source-id: 323448669
zdevito added a commit that referenced this pull request Nov 15, 2025
ghstack-source-id: 323464467
zdevito added a commit that referenced this pull request Nov 15, 2025
ghstack-source-id: 323494574
meta-codesync bot closed this in a1e5e4c Nov 15, 2025
meta-codesync bot commented Nov 15, 2025

This pull request has been merged in a1e5e4c.
