Skip to content

Commit edea1bf

Browse files
committed
Reformat ActorSupervisionEvent
Pull Request resolved: #1881 This changes the ActorSupervisionEvent structure so that we preserve enough information to give a good error message when an actor fails. The major changes are * removing jargon `processing error: superivision: `. * Adding user-understandable actor names. * identifying the actual actor that failed, and summarizing the default chain handling so that there are almost no wrappers around the error. Here are some examples of what it looks like now: When an actor directly errors: ``` I AM ABOUT TO ERROR!!!! Unhandled monarch error on the top-level client: The actor <root>.<tests.test_supervision_hierarchy.Lambda actor> and all its descendants have failed. This occurred because the actor itself failed. The error was: Traceback (most recent call last): File "/data/users/zdevito/fbsource/fbcode/monarch/python/monarch/_src/actor/actor_mesh.py", line 1068, in handle response_port.exception(ActorError(e)) File "/data/users/zdevito/fbsource/fbcode/monarch/python/monarch/_src/actor/actor_mesh.py", line 828, in exception raise obj from None File "/data/users/zdevito/fbsource/fbcode/monarch/python/monarch/_src/actor/actor_mesh.py", line 1062, in handle result = the_method(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/zdevito/fbsource/fbcode/monarch/python/tests/test_supervision_hierarchy.py", line 19, in run return l() ^^^ File "/data/users/zdevito/fbsource/fbcode/monarch/python/tests/test_supervision_hierarchy.py", line 46, in error raise ValueError("Error.") ValueError: Error. ``` When a nested actor errors: ``` python/tests/test_supervision_hierarchy.py::test_nested_mesh_kills_actor_actor_error Monarch internal logs are being written to /tmp/zdevito/monarch_log.log ERRORED THE ACTOR I AM ABOUT TO ERROR!!!! Nest still alive 0 Nest still alive 1 Nest still alive 2 Unhandled monarch error on the top-level client: The actor <root>.<tests.test_supervision_hierarchy.Nest actor> and all its descendants have failed. This occurred because the actor <root>.<tests.test_supervision_hierarchy.Nest actor>.<tests.test_supervision_hierarchy.Lambda nested> failed. The error was: Traceback (most recent call last): File "/data/users/zdevito/fbsource/fbcode/monarch/python/monarch/_src/actor/actor_mesh.py", line 1068, in handle response_port.exception(ActorError(e)) File "/data/users/zdevito/fbsource/fbcode/monarch/python/monarch/_src/actor/actor_mesh.py", line 828, in exception raise obj from None File "/data/users/zdevito/fbsource/fbcode/monarch/python/monarch/_src/actor/actor_mesh.py", line 1062, in handle result = the_method(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/zdevito/fbsource/fbcode/monarch/python/tests/test_supervision_hierarchy.py", line 19, in run return l() ^^^ File "/data/users/zdevito/fbsource/fbcode/monarch/python/tests/test_supervision_hierarchy.py", line 46, in error raise ValueError("Error.") ValueError: Error. ``` When a proc errors: ``` Unhandled monarch error on the top-level client: The actor <root>.<tests.test_supervision_hierarchy.Nest actor> and all its descendants have failed. This occurred because the actor unix:@eRu5gzLrP1kdciNpAErvY1Q9,anon_0_16inPfUmdpwZ,agent[0] failed. The error was: The process unix:@eRu5gzLrP1kdciNpAErvY1Q9 owned by this actor became unresponsive and is assumed dead, check the log on the host for details ``` The proc error includes the changes added in D86984496 to make agent failures more clean. We should eventually further improve this by making sure we generate a supervision event specific to process failure as noticed by the host agent. That should include a friendly name for the process (the processes name given during spawn, and its owning actor). . ghstack-source-id: 323410911 Differential Revision: [D86925582](https://our.internmc.facebook.com/intern/diff/D86925582/)
1 parent 0077e9c commit edea1bf

File tree

17 files changed

+302
-188
lines changed

17 files changed

+302
-188
lines changed

hyperactor/src/actor.rs

Lines changed: 18 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -150,6 +150,13 @@ pub trait Actor: Sized + Send + Debug + 'static {
150150
) -> Result<(), anyhow::Error> {
151151
handle_undeliverable_message(cx, envelope)
152152
}
153+
154+
/// If overridden, we will use this name in place of the
155+
/// ActorId for talking about this actor in supervision error
156+
/// messages.
157+
fn display_name(&self) -> Option<String> {
158+
None
159+
}
153160
}
154161

155162
/// Default implementation of [`Actor::handle_undeliverable_message`]. Defined
@@ -340,16 +347,24 @@ pub enum ActorErrorKind {
340347
#[error("{0}")]
341348
Generic(String),
342349

343-
/// An actor supervision event was not handled.
344-
#[error("supervision: {0}")]
350+
/// An error that occurred while trying to handle a supervision event.
351+
#[error("{0} while handling {1}")]
352+
ErrorDuringHandlingSupervision(String, Box<ActorSupervisionEvent>),
353+
/// The actor did not attempt to handle
354+
#[error("{0}")]
345355
UnhandledSupervisionEvent(Box<ActorSupervisionEvent>),
346356
}
347357

348358
impl ActorErrorKind {
349359
/// Error while processing actor, i.e., returned by the actor's
350360
/// processing method.
351361
pub fn processing(err: anyhow::Error) -> Self {
352-
Self::Generic(format!("processing error: {}", err))
362+
// Unbox err from the anyhow err. Check if it is an ActorErrorKind object.
363+
// If it is directly use it as the new ActorError's ActorErrorKind.
364+
// This lets us directly pass the ActorErrorKind::UnhandledSupervisionEvent
365+
// up the handling infrastructure.
366+
err.downcast::<ActorErrorKind>()
367+
.unwrap_or_else(|err| Self::Generic(err.to_string()))
353368
}
354369

355370
/// Unwound stracktrace of a panic.
@@ -434,15 +449,6 @@ impl From<MailboxSenderError> for ActorError {
434449
}
435450
}
436451

437-
impl From<ActorSupervisionEvent> for ActorError {
438-
fn from(inner: ActorSupervisionEvent) -> Self {
439-
Self {
440-
actor_id: Box::new(inner.actor_id.clone()),
441-
kind: Box::new(ActorErrorKind::UnhandledSupervisionEvent(Box::new(inner))),
442-
}
443-
}
444-
}
445-
446452
/// A collection of signals to control the behavior of the actor.
447453
/// Signals are internal runtime control plane messages and should not be
448454
/// sent outside of the runtime.

hyperactor/src/mailbox.rs

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -3143,7 +3143,7 @@ mod tests {
31433143
unreachable!()
31443144
};
31453145
assert!(msg.to_string().contains(
3146-
"processing error: a message from \
3146+
"a message from \
31473147
quux[0].foo[0] to corge[0].bar[0][9999] was undeliverable and returned"
31483148
));
31493149

hyperactor/src/mailbox/undeliverable.rs

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -165,9 +165,9 @@ pub fn supervise_undeliverable_messages_with<R, F>(
165165

166166
if let Err(e) = sink.send(ActorSupervisionEvent::new(
167167
actor_id,
168+
None,
168169
ActorStatus::generic_failure(format!("message not delivered: {}", env)),
169170
Some(headers),
170-
None,
171171
)) {
172172
tracing::warn!(
173173
%e,

hyperactor/src/proc.rs

Lines changed: 12 additions & 121 deletions
Original file line numberDiff line numberDiff line change
@@ -1118,7 +1118,6 @@ impl<A: Actor> Instance<A> {
11181118
let result = self
11191119
.run_actor_tree(&mut actor, actor_loop_receivers, &mut work_rx)
11201120
.await;
1121-
11221121
let (actor_status, event) = match result {
11231122
Ok(_) => (ActorStatus::Stopped, None),
11241123
Err(ActorError {
@@ -1131,9 +1130,9 @@ impl<A: Actor> Instance<A> {
11311130
ActorStatus::Failed(error_kind.clone()),
11321131
Some(ActorSupervisionEvent::new(
11331132
self.cell.actor_id().clone(),
1133+
actor.display_name(),
11341134
ActorStatus::Failed(error_kind),
11351135
None,
1136-
None,
11371136
)),
11381137
)
11391138
}
@@ -1321,7 +1320,11 @@ impl<A: Actor> Instance<A> {
13211320
for supervision_event in supervision_event_receiver.drain() {
13221321
self.handle_supervision_event(actor, supervision_event).await?;
13231322
}
1324-
return Err(ActorError::new(self.self_id(), ActorErrorKind::processing(err)));
1323+
let kind = ActorErrorKind::processing(err);
1324+
return Err(ActorError {
1325+
actor_id: Box::new(self.self_id().clone()),
1326+
kind: Box::new(kind),
1327+
});
13251328
}
13261329
}
13271330
signal = signal_receiver.recv() => {
@@ -1381,28 +1384,17 @@ impl<A: Actor> Instance<A> {
13811384
Ok(())
13821385
}
13831386
Ok(false) => {
1384-
// The supervision event wasn't handled by this actor, chain it and bubble it up.
1385-
let supervision_event = ActorSupervisionEvent::new(
1386-
self.self_id().clone(),
1387-
ActorStatus::generic_failure("did not handle supervision event"),
1388-
None,
1389-
Some(Box::new(supervision_event)),
1390-
);
1391-
Err(supervision_event.into())
1387+
let kind = ActorErrorKind::UnhandledSupervisionEvent(Box::new(supervision_event));
1388+
Err(ActorError::new(self.self_id(), kind))
13921389
}
13931390
Err(err) => {
13941391
// The actor failed to handle the supervision event, it should die.
13951392
// Create a new supervision event for this failure and propagate it.
1396-
let supervision_event = ActorSupervisionEvent::new(
1397-
self.self_id().clone(),
1398-
ActorStatus::generic_failure(format!(
1399-
"failed to handle supervision event: {}",
1400-
err
1401-
)),
1402-
None,
1403-
Some(Box::new(supervision_event)),
1393+
let kind = ActorErrorKind::ErrorDuringHandlingSupervision(
1394+
err.to_string(),
1395+
Box::new(supervision_event),
14041396
);
1405-
Err(supervision_event.into())
1397+
Err(ActorError::new(self.self_id(), kind))
14061398
}
14071399
}
14081400
}
@@ -2892,107 +2884,6 @@ mod tests {
28922884
);
28932885
}
28942886

2895-
#[tokio::test]
2896-
async fn test_supervision_event_handler_propagates() {
2897-
#[derive(Debug)]
2898-
struct FailingSupervisionActor;
2899-
2900-
#[async_trait]
2901-
impl Actor for FailingSupervisionActor {
2902-
type Params = ();
2903-
2904-
async fn new(_: ()) -> Result<Self, anyhow::Error> {
2905-
Ok(Self)
2906-
}
2907-
2908-
async fn handle_supervision_event(
2909-
&mut self,
2910-
_this: &Instance<Self>,
2911-
_event: &ActorSupervisionEvent,
2912-
) -> Result<bool, anyhow::Error> {
2913-
anyhow::bail!("failed to handle supervision event!")
2914-
}
2915-
}
2916-
2917-
#[async_trait]
2918-
impl Handler<String> for FailingSupervisionActor {
2919-
async fn handle(
2920-
&mut self,
2921-
_cx: &crate::Context<Self>,
2922-
message: String,
2923-
) -> anyhow::Result<()> {
2924-
Err(anyhow::anyhow!(message))
2925-
}
2926-
}
2927-
2928-
#[derive(Debug)]
2929-
struct ParentActor(tokio::sync::mpsc::UnboundedSender<ActorSupervisionEvent>);
2930-
2931-
#[async_trait]
2932-
impl Actor for ParentActor {
2933-
type Params = tokio::sync::mpsc::UnboundedSender<ActorSupervisionEvent>;
2934-
2935-
async fn new(
2936-
supervision_events: tokio::sync::mpsc::UnboundedSender<ActorSupervisionEvent>,
2937-
) -> Result<Self, anyhow::Error> {
2938-
Ok(Self(supervision_events))
2939-
}
2940-
2941-
async fn handle_supervision_event(
2942-
&mut self,
2943-
_this: &Instance<Self>,
2944-
event: &ActorSupervisionEvent,
2945-
) -> Result<bool, anyhow::Error> {
2946-
self.0.send(event.clone()).unwrap();
2947-
Ok(true)
2948-
}
2949-
}
2950-
2951-
let proc = Proc::local();
2952-
2953-
let (event_tx, mut event_rx) = tokio::sync::mpsc::unbounded_channel();
2954-
2955-
let parent = proc.spawn::<ParentActor>("parent", event_tx).await.unwrap();
2956-
let child = proc
2957-
.spawn_child::<FailingSupervisionActor>(parent.cell().clone(), ())
2958-
.await
2959-
.unwrap();
2960-
let grandchild = proc
2961-
.spawn_child::<FailingSupervisionActor>(child.cell().clone(), ())
2962-
.await
2963-
.unwrap();
2964-
2965-
let child_actor_id = child.actor_id().clone();
2966-
let grandchild_actor_id = grandchild.actor_id().clone();
2967-
2968-
// Grandchild fails, triggering failure up the tree, finally receiving
2969-
// the event at the root.
2970-
grandchild.send("trigger failure".to_string()).unwrap();
2971-
2972-
assert!(grandchild.await.is_failed());
2973-
assert!(child.await.is_failed());
2974-
2975-
assert_eq!(
2976-
event_rx.recv().await.unwrap(),
2977-
// The time field is ignored for Eq and PartialEq.
2978-
ActorSupervisionEvent::new(
2979-
child_actor_id,
2980-
ActorStatus::generic_failure(
2981-
"failed to handle supervision event: failed to handle supervision event!"
2982-
),
2983-
None,
2984-
Some(Box::new(ActorSupervisionEvent::new(
2985-
grandchild_actor_id,
2986-
ActorStatus::generic_failure("processing error: trigger failure"),
2987-
None,
2988-
None,
2989-
))),
2990-
)
2991-
);
2992-
2993-
assert!(event_rx.try_recv().is_err());
2994-
}
2995-
29962887
#[tokio::test]
29972888
async fn test_instance() {
29982889
#[derive(Debug, Default, Actor)]

0 commit comments

Comments
 (0)