Member

@pgrayy pgrayy commented Nov 7, 2025

Description

Make the handoff node the current node only after the current node finishes executing. Today, the switch happens in the middle of the current node's execution. This needs fixing for two reasons:

  1. We emit AfterNodeCallEvent with the current node's id while state.current_node already points to the handoff node. This mismatch is likely to confuse customers.
  2. If the current node runs a tool that is interrupted concurrently with the handoff tool, the swarm state becomes invalid. The swarm state needs a reference to the real current node so that users can properly respond to its interrupts and resume execution.
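
The intended ordering can be sketched with a toy model. This is a minimal illustration under stated assumptions, not the code in src/strands/multiagent/swarm.py: `Node`, `SwarmState`, `request_handoff`, `execute`, and the `handoff_node` field are hypothetical names chosen for the example. The handoff tool only records a pending handoff, and the loop applies it after the node has finished and the after-node callback has fired.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    node_id: str
    run: Callable[["SwarmState"], None]  # hypothetical node body


@dataclass
class SwarmState:
    current_node: Node
    handoff_node: Optional[Node] = None  # pending handoff, applied after the node finishes
    node_history: List[str] = field(default_factory=list)


def request_handoff(state: SwarmState, target: Node) -> None:
    """Stand-in for the handoff tool: record the request, do not switch nodes."""
    state.handoff_node = target


def execute(state: SwarmState, after_node_call: Callable[[str, SwarmState], None]) -> None:
    while True:
        current = state.current_node
        current.run(state)  # may call request_handoff() part-way through
        state.node_history.append(current.node_id)
        # state.current_node still refers to the node that just ran.
        after_node_call(current.node_id, state)
        if state.handoff_node is None:
            break
        # Only now does the handoff node become the current node.
        state.current_node, state.handoff_node = state.handoff_node, None


writer = Node("writer", lambda state: None)
researcher = Node("researcher", lambda state: request_handoff(state, writer))

events: List[Tuple[str, str]] = []
state = SwarmState(current_node=researcher)
execute(state, lambda node_id, s: events.append((node_id, s.current_node.node_id)))
print(events)  # [('researcher', 'researcher'), ('writer', 'writer')]
```

In each recorded pair, the node id passed to the callback matches the id of `state.current_node`, which is the property reason 1 asks for.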

Related Issues

#204

Documentation PR

Implementation detail

Type of Change

Bug fix

Testing

How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli

  • I ran hatch run prepare: Relying on existing unit tests
  • I ran hatch test tests_integ/test_multiagent_swarm.py

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.


codecov bot commented Nov 7, 2025

Codecov Report

❌ Patch coverage is 90.90909% with 1 line in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
src/strands/multiagent/swarm.py | 90.00% | 1 Missing ⚠️


logger.debug("reason=<%s> | stopping execution", reason)
break

# Get current node
Member Author


I removed a few inline comments because I felt the code was already self-explanatory.

self.state.node_history.append(current_node)

# After self.state add current node, swarm state finish updating, we persist here
self.hooks.invoke_callbacks(AfterNodeCallEvent(self, current_node.node_id, invocation_state))
Member Author


To reiterate, setting self.state.current_node = handoff_node in the handoff tool means that AfterNodeCallEvent is emitted with a node_id that does not match self.state.current_node.node_id.

Member Author


Also, to support interrupts, self.state.current_node must not be updated to the handoff node if the current node is interrupted.
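
A rough sketch of the interrupt case, assuming a pending-handoff field like the one described above. `State`, `finish_node`, and the `interrupted` flag are hypothetical names for illustration only, and keeping the handoff pending while interrupted is an assumption rather than confirmed behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class State:
    current_node: str
    handoff_node: Optional[str] = None  # hypothetical pending-handoff field


def finish_node(state: State, interrupted: bool) -> str:
    """Return the id of the node to run next (or to resume, if interrupted)."""
    if interrupted:
        # Keep current_node so the caller can answer its interrupts and resume it.
        # Assumption: the requested handoff stays pending until the node completes.
        return state.current_node
    if state.handoff_node is not None:
        state.current_node, state.handoff_node = state.handoff_node, None
    return state.current_node


state = State(current_node="researcher", handoff_node="writer")
print(finish_node(state, interrupted=True))   # researcher
print(finish_node(state, interrupted=False))  # writer
```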


registry.add_callback(MultiAgentInitializedEvent, lambda event: self.initialize_multi_agent(event.source))
registry.add_callback(AfterNodeCallEvent, lambda event: self.sync_multi_agent(event.source))
registry.add_callback(BeforeNodeCallEvent, lambda event: self.sync_multi_agent(event.source))
Member Author


Let's say we have successfully executed one node and are now executing the handoff node. If we crash on the handoff node, we would be left in different states depending on which event we persist on:

  • AfterNodeCallEvent: Current node is not set to the handoff node in session because the handoff node hasn't yet emitted its AfterNodeCallEvent. This means if we resume after crashing on the handoff node, we will be starting again from the first node.
  • BeforeNodeCallEvent: Current node is set to the handoff node in session because the handoff node already emitted its BeforeNodeCallEvent. This means if we resume after crashing on the handoff node, we will be starting again from the handoff node.

In short, persisting on AfterNodeCallEvent only made sense when the current node was set to the handoff node inside the handoff tool.
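
The two outcomes above can be reproduced with a small simulation. This is a sketch under simplifying assumptions: `run_until_crash` is a hypothetical helper, the "session" is a single variable holding the persisted current node id, and the handoff is applied only after the previous node finishes.

```python
from typing import List, Optional


def run_until_crash(persist_on: str, nodes: List[str], crash_on: str) -> Optional[str]:
    """Return the node id stored in the session at the moment `crash_on` fails.

    persist_on is "before" (sync on the before-node event) or
    "after" (sync on the after-node event).
    """
    persisted: Optional[str] = None
    for node in nodes:
        if persist_on == "before":
            persisted = node  # before-node event: the session sees this node as current
        if node == crash_on:
            return persisted  # crash mid-node: no after-node event is emitted
        if persist_on == "after":
            persisted = node  # after-node event: current is still the node that just ran
    return persisted


print(run_until_crash("after", ["first", "handoff"], crash_on="handoff"))   # first
print(run_until_crash("before", ["first", "handoff"], crash_on="handoff"))  # handoff
```

Resuming from the persisted node id restarts the first node in the "after" case and the handoff node in the "before" case.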

@pgrayy pgrayy marked this pull request as ready for review November 7, 2025 04:15