TQ: Support proxying Nexus-related API requests #9403
Conversation
h.send(req).await;
ProxyConnState::Connected
In a previous PR I noted that the send() method ignores all failures and returns success (well, it doesn't return a Result) even if the connection broke. For the trust quorum protocol that seemed fine, as the protocol is resilient to messages not being delivered.
The proxy APIs defined in another file expect either a response from the server or a Disconnected error before returning: if they receive nothing they will block forever. After spending probably way too much time thinking about failure cases related to this [1], if the message is sent to the established connection actor things will eventually be fine: when we return, the caller will add the request to the tracker, and a disconnection detected by the established connection actor will eventually be relayed to the tracker.
The failure case I still see is when the channel is busy (with 10 pending requests) and the send method silently discards the message. In that case, we return ProxyConnState::Connected and the request is added to the tracker, but we will never get a response unless the connection breaks due to another unrelated request failing (since this request never got to the actor).
The code as is could be fine if the caller of any proxy method takes care of adding timeouts everywhere, but this feels like a problem waiting to happen. I'd feel way more comfortable if this returned ProxyConnState::Busy when sending a message to the channel failed.
Footnotes
[1] I lost count of how many times I rewrote this comment with other failure cases that it turned out couldn't happen.
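To illustrate the suggestion, here is a minimal sketch (not the actual omicron code; the handle type, channel, and the Busy variant are assumptions) of a send path that surfaces a full channel to the caller instead of silently dropping the request:

```rust
// Hypothetical sketch of the Busy idea; types and names are illustrative only.
use tokio::sync::mpsc;

enum ProxyConnState {
    Connected,    // request handed off to the connection task
    Busy,         // channel full: caller should not track the request
    Disconnected, // connection task is gone
}

struct EstablishedTaskHandle<Req> {
    tx: mpsc::Sender<Req>,
}

impl<Req> EstablishedTaskHandle<Req> {
    fn send(&self, req: Req) -> ProxyConnState {
        match self.tx.try_send(req) {
            Ok(()) => ProxyConnState::Connected,
            // Surface backpressure instead of dropping the message on the floor.
            Err(mpsc::error::TrySendError::Full(_)) => ProxyConnState::Busy,
            // The receiver was dropped, i.e. the connection task exited.
            Err(mpsc::error::TrySendError::Closed(_)) => ProxyConnState::Disconnected,
        }
    }
}
```

As the discussion below notes, the real channel blocks on send rather than erroring when full, which is why this turned out not to be a problem in practice.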
This is a great analysis @pietroalbini. Thank you so much for taking the time to go through this process and understand the code and design. I actually think that if this were a problem, it would also be somewhat of a problem for the normal trust quorum protocol, as some messages from peers are never resent on a timer.
For example: if a coordinator is sending a Prepare msg to a peer and the message got dropped before making it to the connection task, it would not be resent unless the channel got disconnected and reconnected for some other reason. Now, that being said, the volume of messages is small and this should not happen. And as you point out, there is some built-in resilience. If the commit occurred, Nexus would PrepareAndCommit this node or it would get a CommitAdvance message on the next restart (perhaps after an update). But it could still end up as a problem if too many nodes did this and the coordinator couldn't complete the prepare phase. Nexus would try a new configuration at perhaps a different coordinator after some time without prepare completing, but the system may still be overloaded.
With all that being said, I don't actually think what you pointed out is entirely true, and therefore this isn't actually a problem here. However, this analysis is also non-trivial. It almost makes me question whether using connection state for retries instead of timers is the right move. So far, I think it continues to work and has the benefit of not re-sending messages already sent over a reliable stream. Ok, so back to the problem.
The failure case I still see is when the channel is busy (with 10 pending requests) and the send method silently discards the message.
The channel being used in send is an mpsc::bounded channel and blocks on send. The error that is discarded is when the channel itself gets disconnected, presumably because the task exited. In that case the Disconnect callback will eventually fire and all is well.
To help ensure that the disconnect callback occurs when buffers start filling up, there is also a MSG_WRITE_QUEUE_CAPACITY for each established connection that will disconnect if too many messages are pulled off the channel and serialized before they can be sent. Somewhat importantly, this channel is sized smaller than the queue, so if the queue is full it means that the TCP connection (or serialization) is too slow to move things along. We get backpressure, and eventually a disconnection that should allow things to clear up on a reconnect. I should clean up the TODO suggestion there, as we actually can't drop messages or we will break things, as you point out.
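A rough sketch of that backpressure shape, with a made-up capacity value and hypothetical types (not the real EstablishedConnection), might look like:

```rust
use std::collections::VecDeque;

// Illustrative value only; the real constant lives in the trust quorum code.
// The inbound channel is sized smaller than this queue.
const MSG_WRITE_QUEUE_CAPACITY: usize = 16;

struct EstablishedConnection {
    // Serialized messages waiting to be written to the TCP socket.
    write_queue: VecDeque<Vec<u8>>,
}

enum QueueOutcome {
    Queued,
    // Too far behind: tear down the connection so the disconnect callback
    // fires and state gets cleaned up on reconnect, rather than silently
    // dropping individual messages.
    Disconnect,
}

impl EstablishedConnection {
    fn enqueue(&mut self, serialized: Vec<u8>) -> QueueOutcome {
        if self.write_queue.len() >= MSG_WRITE_QUEUE_CAPACITY {
            return QueueOutcome::Disconnect;
        }
        self.write_queue.push_back(serialized);
        QueueOutcome::Queued
    }
}
```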
It appears there is actually one place where the ConnToMainMsgInner::Disconnected message doesn't get sent to the NodeTask task when an EstablishedConnection closes. That is when the EstablishedConnection task itself panics. I believe that we use panic = abort in production, and so this isn't actually a problem in practice.
However, this has me wondering if instead I should signal to NodeTask from the ConnectionManager::step method when an EstablishedConnection exits rather than sending a Disconnected message from the task itself. That would cover both the successful exit and panic cases for the EstablishedConnection task.
Unfortunately, I also realized that there is another race condition that may get worse if I do this. If a new connection is accepted for the same peer it will trigger an abort of the old connection. In this case the old disconnect will occur after the new connection is established. That could cause problems for the protocol, and I should gate the call to Node::on_disconnect by checking whether the task_id matches the current established task ID, as is done for the connection manager's on_disconnected callback.
Another option to solve the latter problem is to always reject a new accepted connection for the same peer if one is already established. Eventually the old one will go away, and the remote peer will retry.
I need to think about this a bit more, but will likely add a few small cleanup patches.
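If it helps to visualize the second option above, here is a minimal sketch (hypothetical types; the real connection manager tracks much more state) of rejecting a newly accepted connection while one is already established for the same peer:

```rust
use std::collections::BTreeMap;
use std::net::TcpStream;

// Stand-in for the real peer identifier type.
type PeerId = u64;

struct ConnectionManager {
    established: BTreeMap<PeerId, TcpStream>,
}

impl ConnectionManager {
    fn on_accepted(&mut self, peer: PeerId, conn: TcpStream) {
        if self.established.contains_key(&peer) {
            // An established connection already exists for this peer: drop the
            // new one. The remote peer will retry once the old one goes away.
            drop(conn);
            return;
        }
        self.established.insert(peer, conn);
    }
}
```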
IIUC, part of Emily's concern is that the request tracker inserted into proxy_tracker here https://github.com/oxidecomputer/omicron/pull/9403/files#r2527260678 is leaked in the case where the channel is disconnected but this method returns ProxyConnState::Connected because the disconnect callback has not fired yet. I think we need to either remove the entry from proxy_tracker in that case or have send indicate whether the task is still there so that we could return ProxyConnState::Disconnected and not insert into proxy_tracker?
IIUC, part of Emily's concern is that the request tracker inserted into proxy_tracker here https://github.com/oxidecomputer/omicron/pull/9403/files#r2527260678 is leaked in the case where the channel is disconnected but this method returns ProxyConnState::Connected because the disconnect callback has not fired yet. I think we need to either remove the entry from proxy_tracker in that case or have send indicate whether the task is still there so that we could return ProxyConnState::Disconnected and not insert into proxy_tracker?
That behavior is intentional. There is an inherent TOCTTOU where the message can be put on the channel and then the socket can disconnect. In this case we return Connected and then get a Disconnected callback sometime later to clear the state. This is also exactly what would happen if the message was pulled off the channel, serialized, sent over the socket, and then the channel disconnected. The key invariant to uphold is: if at any time a message is lost, the disconnect callback must fire a short time after. No further messages should be able to be sent over the socket.
What makes this work is that the disconnect callback always fires after the tracked socket is recorded. We know it hasn't fired yet because the EstablishedTaskHandle is still in the main map, which is in the same task that is doing the send. Therefore any disconnect will come immediately after the send if the task is gone. If there is no handle in the map then we return Disconnected. Note that it's also possible that the connection has already happened but the main task hasn't learned yet, so we can discard even if already connected. TOCTTOUs all around.
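To restate that invariant as a sketch (hypothetical names, not the actual code): because the map of established handles lives in the same task that performs the send, finding a handle there proves the disconnect callback for that connection hasn't fired yet, so any tracker entry added now will be cleaned up by a callback that arrives strictly later:

```rust
use std::collections::BTreeMap;
use tokio::sync::mpsc;

type PeerId = u64;

enum ProxyConnState {
    Connected,    // request handed to the connection task; tracker entry is valid
    Disconnected, // no established connection; caller should not track the request
}

struct NodeTask<Req> {
    established: BTreeMap<PeerId, mpsc::Sender<Req>>,
}

impl<Req> NodeTask<Req> {
    async fn proxy_send(&self, peer: PeerId, req: Req) -> ProxyConnState {
        match self.established.get(&peer) {
            // Handle still present: the disconnect callback for this connection
            // hasn't fired yet. If the socket breaks during or after this send,
            // the callback will fire afterwards and clear the tracker entry.
            Some(tx) => {
                let _ = tx.send(req).await;
                ProxyConnState::Connected
            }
            // No handle: report Disconnected and don't track the request.
            None => ProxyConnState::Disconnected,
        }
    }
}
```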
Thanks for the discussion @pietroalbini and @hawkw. I changed up how disconnections are managed in e8dedbd and I think it makes things much clearer and more reliable now.
I really appreciate this type of review, as it leads to clarity and robustness. Please let me know if I missed anything.
The channel being used in send is an mpsc::bounded channel and blocks on send. The error that is discarded is when the channel itself gets disconnected, presumably because the task exited. In that case the Disconnect callback will eventually fire and all is well.
I think the big misunderstanding on my end is that I somehow convinced myself that the send method returned an error when the queue was full instead of blocking. Now things make way more sense.
I really appreciate the change though: it was fairly hard to reason about failure cases with the way the code was written earlier, and any change to make the code easier to reason about is a win in my book :D
) -> Result<V, ProxyError<E>>
where
    E: std::error::Error,
    F: FnOnce(Result<WireValue, WireError>) -> Result<V, ProxyError<E>>,
The destructuring function adds a lot of duplication. The only way I can see to remove a lot of this duplication would be to keep WireValue and WireError in a dynamic representation similar to serde_json::Value until this function, and then require V: Deserialize, E: Deserialize, but that adds a lot of other problems :(
Well, another option might be to have a separate channel for each kind of request, and select over all of them; that may not work if we are relying on the channel for ordering messages. But, on the other hand, it does mean that we could use a biased select to prioritize which message types are handled first (which may not be desirable either), and we wouldn't have to do the destructuring thing.
TLDR: I spent a good chunk of time on this implementation and deliberately chose this structure to allow for an ergonomic, strongly typed client API, with generic proxied messages sent over the wire. I remain unconvinced there is a much better way to do that than what is here.
The destructuring function adds a lot of duplication.
I agree somewhat, but I don't really see a better way to do this. The closure is there precisely to avoid even more duplication by abstracting out all the interesting work to the send method. Each method with strongly typed replies has to have some way to take a more generic response and filter out what it is expecting, or punt this matching to the caller. This strategy allows the caller to work with concrete types similar to local calls while ignoring the larger sum types required for sending over the network.
Furthermore, only the exact types need to be looked at in the matches, and the rest of the enum variants ignored with an error. This is really the easiest way I can see to provide such an ergonomic API for the caller at the cost of some overhead when adding a new proxy operation, which I don't actually expect to do much, if at all. I think a macro could probably shorten this duplication, but I also don't think it's worth the effort right now.
Well, another option might be to have a separate channel for each kind of request, and select over all of them; that may not work if we are relying on the channel for ordering messages. But, on the other hand, it does mean that we could use a biased select to prioritize which message types are handled first (which may not be desirable either), and we wouldn't have to do the destructuring thing.
There aren't any ordering requirements, but I'm also not sure how well this would work. I think that there will always need to be some place where we take a more generic type and strip it down to the concrete type if we want to provide an ergonomic, strongly typed API. With your suggestion, I think that place would move into the NodeTask, which largely doesn't have to worry about the structure of proxied replies at all right now.
The Proxy is ultimately sending over a channel to the NodeTask, which is then sending the message out over the network. The NodeTask can indeed select over multiple channels, obviating the need for the destructuring to live here. However, the node task would then have to take the response off the wire containing enums and do the destructuring so it could send back concrete types over distinct response channels. It would also have to have enough knowledge to know when a specific response doesn't correspond to a specific request type. That knowledge is currently encapsulated inside each Proxy method, making the reasoning fully local.
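To make the shape concrete, here is a self-contained sketch of the pattern (the WireValue/WireError variants and the WrongResponseType name are invented for illustration, and the send body is stubbed out): the generic send path only knows about the wire-level sum types, and each strongly typed proxy method supplies a closure that extracts exactly the variant it expects.

```rust
use thiserror::Error;

#[derive(Debug)]
enum WireValue {
    NodeStatus(String),
    CommitAck,
}

#[derive(Debug, Error)]
#[error("wire error: {0}")]
struct WireError(String);

#[derive(Debug, Error)]
enum ProxyError<E: std::error::Error> {
    #[error("{0}")]
    Inner(E),
    #[error("disconnected")]
    Disconnected,
    // Variant name invented for this sketch.
    #[error("response for different type received: {0}")]
    WrongResponseType(String),
}

// Generic send path: everything request-specific lives in the closure.
async fn send<V, E, F>(destructure: F) -> Result<V, ProxyError<E>>
where
    E: std::error::Error,
    F: FnOnce(Result<WireValue, WireError>) -> Result<V, ProxyError<E>>,
{
    // ...enqueue the request, await the oneshot reply from the NodeTask...
    let wire_response = Ok(WireValue::CommitAck); // placeholder reply
    destructure(wire_response)
}

// A strongly typed proxy method matches only the variant it cares about and
// maps everything else to an error, keeping the reasoning fully local.
async fn node_status() -> Result<String, ProxyError<WireError>> {
    send(|resp| match resp {
        Ok(WireValue::NodeStatus(s)) => Ok(s),
        Ok(other) => Err(ProxyError::WrongResponseType(format!("{other:?}"))),
        Err(e) => Err(ProxyError::Inner(e)),
    })
    .await
}
```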
Inner(#[from] T),
#[error("disconnected")]
Disconnected,
#[error("response for different type received: {0}")]
should this perhaps indicate what the expected response type was, or do we only expect to log this in a place where it's obvious what we were expecting?
It's likely obvious because the caller is using strong types and is getting a typed response back. They know what T is. But we don't really have any good way to log that here AFAIK.
When committing configurations, the trust quorum protocol relies on Nexus to be up and for Nexus to interact with each trust quorum node. This allows Nexus to continuously try to commit nodes in an RPW and record in the database which nodes have acked. This mirrors many of our existing designs where Nexus observes and records information about successful operations and retries via RPW. Specifically it matches how we do things in the `Reconfigurator` and the `TUF Repo Depot`.

Unfortunately, Nexus cannot communicate with new sleds that are not yet running a sled-agent but are still stuck in the bootstrap-agent. This is because the bootstrap agents (and trust quorum protocol) only communicate over the bootstrap network, which Nexus does not have access to. Nodes must already be part of an existing configuration, running sled-agent, and on the underlay network to talk to Nexus. In this common case, Nexus sends trust quorum related messages to the sled-agent, which then calls the API of its local trust quorum `NodeTask`. This is not possible for newly added sleds. While the trust quorum coordinator node will tell new nodes to `Prepare` a configuration over the bootstrap network, these new nodes do not have any mechanism to receive commits from Nexus. Therefore we must proxy these commit related operations to an existing member of the trust quorum when adding a new node. We also added the ability to proxy `NodeStatus` requests to aid in debugging.

This PR therefore adds the ability to proxy certain requests from one node to another so that we can commit nodes to the latest trust quorum configuration, set up their encrypted storage, and boot their sled-agent.

It's worth noting that this is not the only way we could have solved this problem. There are a few possibilities in the design space.

1. We could have had the coordinator always send commit operations and collect acknowledgements as during the `Prepare` phase. Unfortunately, if the coordinator dies before all nodes ack then Nexus would not be able to ensure commit at all nodes. To make this reliable, Nexus would still need to be able to reach out to uncommitted nodes and tell them to commit. Since we already have to do the latter there is no reason to do the former.

2. We could commit at the coordinator (or a few nodes), and then have them gossip around information about commit. This is actually a promising design, and is essentially what we do for the early network config. Nexus could then wait for the sled-agent to start for those nodes and ask them directly if they committed. This would still require talking to all nodes and it adds some extra complexity, but it still seems somewhat reasonable.

The rationale for our current choice of proxying was largely one of fitting our existing patterns. It's also very useful for Nexus to be able to directly ask a trust quorum node on another sled about its status to diagnose issues. So we went with the proxy mechanism as implemented here.

Well, why did we introduce another level of messages at the `Task` layer instead of re-using the `CommitAdvance` functionality or adding new variants to the `PeerMsg` in the `trust_quorum_protocol` crate? The rationale here is largely that the trust quorum protocol as written in RFD 238 and specified in TLA+ doesn't include this behavior. It expects commits from the `Node` "API", meaning from `Nexus`. I didn't want to change that behavior unnecessarily given the urgency and an existing solid design.

It was also easier to build proxy operations this way, since tracking operations in async code with oneshot channels is easier than trying to insert similar tracking into the `sans-io` code. In short, we left the `trust-quorum-protocol` crate alone and added some async helpers to the `trust_quorum` crate.

One additional change was made in this PR. While adding the `tq_proxy` test I noticed that we were unnecessarily using `wait_for_condition` on initial commits, after we knew about successful prepares. These commits should always complete immediately, so I simplified this code in a few existing tests.
Rather than send a message when a peer disconnects and then handle that message asynchronously, we now wait for the task itself to exit and then return a `DisconnectedPeer` from `ConnectionManager::step`. We only return the `PeerId` for the `DisconnectedPeer` if it is still the existing `established connection` for the given peer and it hasn't been replaced by a newer connection. This prevents calling `Node::on_disconnect` for the stale connection when it might have already received an `on_connect` call for the new connection.
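A minimal sketch of that shape (types here are stand-ins, not the actual ConnectionManager): step() reaps exited connection tasks and only reports a disconnect for a task that is still the established connection for its peer.

```rust
use std::collections::BTreeMap;

type PeerId = u64;
type TaskId = u64;

struct DisconnectedPeer(PeerId);

struct ConnectionManager {
    // The connection task we currently consider established for each peer.
    established: BTreeMap<PeerId, TaskId>,
}

impl ConnectionManager {
    // `exited` is the set of connection tasks observed to have finished
    // (normally or via panic) since the last call.
    fn step(&mut self, exited: Vec<(PeerId, TaskId)>) -> Vec<DisconnectedPeer> {
        let mut disconnected = Vec::new();
        for (peer, task_id) in exited {
            // Only report the disconnect if this task is still the established
            // connection; a stale exit from a replaced connection is ignored so
            // it cannot clobber state for the newer connection.
            if self.established.get(&peer) == Some(&task_id) {
                self.established.remove(&peer);
                disconnected.push(DisconnectedPeer(peer));
            }
        }
        disconnected
    }
}
```

Reaping exited tasks from step() covers both the normal-exit and panic cases for the EstablishedConnection task, which is the motivation described in the review discussion above.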