Commit edea1bf
Reformat ActorSupervisionEvent
Pull Request resolved: #1881
This changes the ActorSupervisionEvent structure so that we preserve enough information to give a good error message when an actor fails.
The major changes are:
* Removing the jargon prefix `processing error: supervision: `.
* Adding user-understandable actor names.
* Identifying the actual actor that failed, and summarizing how the event propagated through the default handlers up the chain, so that the original error is reported with almost no wrappers around it.
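To make "identify the actual actor that failed" concrete, here is a minimal Python sketch under stated assumptions. `SupervisionEvent`, `segment`, and `root_cause` are hypothetical names, not Monarch's API; the real `ActorSupervisionEvent` is a Rust struct whose fields are not shown in this description. The `<root>.<qualified.Class spawn_name>` naming layout is inferred from the example messages in this description.

```python
from dataclasses import dataclass
from typing import List, Optional


def segment(qualified_class: str, spawn_name: str) -> str:
    """One path component, formatted like '<tests.test_supervision_hierarchy.Nest actor>'."""
    return f"<{qualified_class} {spawn_name}>"


@dataclass
class SupervisionEvent:
    """Hypothetical stand-in for ActorSupervisionEvent (field names are illustrative)."""

    actor_path: List[str]                          # path segments from the root down to this actor
    error: Optional[str] = None                    # set only on the event where the failure originated
    caused_by: Optional["SupervisionEvent"] = None  # event forwarded up from a child, if any

    @property
    def display_name(self) -> str:
        # '<root>' followed by each segment, joined with '.'
        return ".".join(["<root>", *self.actor_path])


def root_cause(event: SupervisionEvent) -> SupervisionEvent:
    """Follow caused_by links down to the actor that actually failed, skipping wrappers."""
    while event.caused_by is not None:
        event = event.caused_by
    return event
```

Because `root_cause` walks past every intermediate event, the error that gets reported is the leaf's own error rather than a copy wrapped once per level of the hierarchy.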
Here are some examples of what it looks like now:
When an actor directly errors:
```
I AM ABOUT TO ERROR!!!!
Unhandled monarch error on the top-level client: The actor <root>.<tests.test_supervision_hierarchy.Lambda actor> and all its descendants have failed.
This occurred because the actor itself failed.
The error was:
Traceback (most recent call last):
  File "/data/users/zdevito/fbsource/fbcode/monarch/python/monarch/_src/actor/actor_mesh.py", line 1068, in handle
    response_port.exception(ActorError(e))
  File "/data/users/zdevito/fbsource/fbcode/monarch/python/monarch/_src/actor/actor_mesh.py", line 828, in exception
    raise obj from None
  File "/data/users/zdevito/fbsource/fbcode/monarch/python/monarch/_src/actor/actor_mesh.py", line 1062, in handle
    result = the_method(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/zdevito/fbsource/fbcode/monarch/python/tests/test_supervision_hierarchy.py", line 19, in run
    return l()
           ^^^
  File "/data/users/zdevito/fbsource/fbcode/monarch/python/tests/test_supervision_hierarchy.py", line 46, in error
    raise ValueError("Error.")
ValueError: Error.
```
When a nested actor errors:
```
python/tests/test_supervision_hierarchy.py::test_nested_mesh_kills_actor_actor_error Monarch internal logs are being written to /tmp/zdevito/monarch_log.log
ERRORED THE ACTOR
I AM ABOUT TO ERROR!!!!
Nest still alive 0
Nest still alive 1
Nest still alive 2
Unhandled monarch error on the top-level client: The actor <root>.<tests.test_supervision_hierarchy.Nest actor> and all its descendants have failed.
This occurred because the actor <root>.<tests.test_supervision_hierarchy.Nest actor>.<tests.test_supervision_hierarchy.Lambda nested> failed.
The error was:
Traceback (most recent call last):
  File "/data/users/zdevito/fbsource/fbcode/monarch/python/monarch/_src/actor/actor_mesh.py", line 1068, in handle
    response_port.exception(ActorError(e))
  File "/data/users/zdevito/fbsource/fbcode/monarch/python/monarch/_src/actor/actor_mesh.py", line 828, in exception
    raise obj from None
  File "/data/users/zdevito/fbsource/fbcode/monarch/python/monarch/_src/actor/actor_mesh.py", line 1062, in handle
    result = the_method(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/zdevito/fbsource/fbcode/monarch/python/tests/test_supervision_hierarchy.py", line 19, in run
    return l()
           ^^^
  File "/data/users/zdevito/fbsource/fbcode/monarch/python/tests/test_supervision_hierarchy.py", line 46, in error
    raise ValueError("Error.")
ValueError: Error.
```
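The message layout in the two examples above can be reproduced by a small formatter. This is a hypothetical sketch: `render_supervision_error` is an illustrative name, not a Monarch function, and it assumes the display name of the failed subtree, the display name of the root-cause actor (or `None` when it is the same actor), and the unwrapped error text are already available.

```python
from typing import Optional


def render_supervision_error(failed: str, cause: Optional[str], error: str) -> str:
    """Hypothetical renderer mirroring the message layout shown in the examples.

    failed: display name of the actor whose subtree is reported as failed.
    cause:  display name of the actor that actually raised, or None if it is `failed` itself.
    error:  the original error text (e.g. a Python traceback), passed through unwrapped.
    """
    if cause is None or cause == failed:
        reason = "This occurred because the actor itself failed."
    else:
        reason = f"This occurred because the actor {cause} failed."
    return "\n".join([
        f"The actor {failed} and all its descendants have failed.",
        reason,
        "The error was:",
        error,
    ])
```

In the examples, the top-level client prints this text after the prefix `Unhandled monarch error on the top-level client: `.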
When a proc errors:
```
Unhandled monarch error on the top-level client: The actor <root>.<tests.test_supervision_hierarchy.Nest actor> and all its descendants have failed.
This occurred because the actor unix:@eRu5gzLrP1kdciNpAErvY1Q9,anon_0_16inPfUmdpwZ,agent[0] failed.
The error was:
The process unix:@eRu5gzLrP1kdciNpAErvY1Q9 owned by this actor became unresponsive and is assumed dead, check the log on the host for details
```
The proc error includes the changes added in D86984496 to make agent failures cleaner. We should eventually improve this further by generating a supervision event specific to process failure, as detected by the host agent. That event should include a friendly name for the process (the name it was given at spawn time, and its owning actor).
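As a rough illustration of the proposed follow-up, a process-failure event emitted by the host agent might carry the friendly name and owner alongside the raw address. Everything in this sketch is hypothetical: `ProcessFailureEvent`, its fields, and the `trainers` spawn name used in testing are illustrative only and do not exist in Monarch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessFailureEvent:
    """Hypothetical event a host agent could emit when a process it manages stops responding.

    None of these fields exist in Monarch; they illustrate the information
    the description says such an event should carry.
    """

    proc_addr: str   # raw address, e.g. "unix:@eRu5gzLrP1kdciNpAErvY1Q9"
    spawn_name: str  # name the process was given when it was spawned
    owner: str       # display name of the actor that owns the process

    def describe(self) -> str:
        # Lead with the friendly name; keep the raw address for log lookups.
        return (
            f"The process {self.spawn_name} ({self.proc_addr}) owned by the actor "
            f"{self.owner} became unresponsive and is assumed dead; "
            "check the log on the host for details."
        )
```

Keeping the raw address next to the friendly name would let users find the host log while still recognizing which process failed.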
ghstack-source-id: 323410911
Differential Revision: [D86925582](https://our.internmc.facebook.com/intern/diff/D86925582/)
Parent: 0077e9c
File tree: 17 files changed (+302, -188 lines)
- hyperactor_mesh/src
  - v1
- hyperactor_multiprocess/src
- hyperactor/src
  - mailbox
- monarch_hyperactor/src
  - v1
- python
  - monarch
    - _rust_bindings/monarch_hyperactor
    - _src/actor
  - tests