Commit d563f8d
TQ: Support proxying Nexus-related API requests
When committing configurations, the trust quorum protocol relies on Nexus being up and interacting with each trust quorum node. This allows Nexus to continuously try to commit nodes in an RPW and record in the database which nodes have acked. This mirrors many of our existing designs where Nexus observes and records information about successful operations and retries via RPW; specifically, it matches how we do things in the `Reconfigurator` and the `TUF Repo Depot`.

Unfortunately, Nexus cannot communicate with new sleds that are not yet running a sled-agent and are still stuck in the bootstrap-agent. This is because the bootstrap agents (and the trust quorum protocol) only communicate over the bootstrap network, which Nexus does not have access to. Nodes must already be part of an existing configuration, running sled-agent, and on the underlay network to talk to Nexus. In that common case, Nexus sends trust quorum related messages to the sled-agent, which then calls the API of its local trust quorum `NodeTask`. This is not possible for newly added sleds. While the trust quorum coordinator node will tell new nodes to `Prepare` a configuration over the bootstrap network, these new nodes have no mechanism to receive commits from Nexus. Therefore we must proxy these commit related operations to an existing member of the trust quorum when adding a new node. We also added the ability to proxy `NodeStatus` requests to aid in debugging.

This PR therefore adds the ability to proxy certain requests from one node to another so that we can commit nodes to the latest trust quorum configuration, set up their encrypted storage, and boot their sled-agent.

It's worth noting that this is not the only way we could have solved this problem. There are a few possibilities in the design space.

1. We could have had the coordinator always send commit operations and collect acknowledgements, as during the `Prepare` phase. Unfortunately, if the coordinator dies before all nodes ack, then Nexus would not be able to ensure commit at all nodes. To make this reliable, Nexus would still need to be able to reach out to uncommitted nodes and tell them to commit. Since we already have to do the latter, there is no reason to do the former.

2. We could commit at the coordinator (or a few nodes), and then have them gossip around information about the commit. This is actually a promising design, and is essentially what we do for the early network config. Nexus could then wait for the sled-agent to start on those nodes and ask them directly if they committed. This would still require talking to all nodes and it adds some extra complexity, but it still seems somewhat reasonable.

The rationale for our current choice of proxying was largely one of fitting our existing patterns. It's also very useful for Nexus to be able to directly ask a trust quorum node on another sled about its status to diagnose issues. So we went with the proxy mechanism as implemented here.

Why did we introduce another level of messages at the `Task` layer instead of re-using the `CommitAdvance` functionality or adding new variants to `PeerMsg` in the `trust_quorum_protocol` crate? The rationale here is largely that the trust quorum protocol as written in RFD 238 and specified in TLA+ doesn't include this behavior. It expects commits to arrive via the `Node` "API", meaning from Nexus. Given the urgency and an existing solid design, I didn't want to change that behavior unnecessarily.

It was also easier to build proxy operations this way, since tracking operations in async code with oneshot channels is easier than trying to insert similar tracking into the sans-io code. In short, we left the `trust_quorum_protocol` crate alone and added some async helpers to the `trust_quorum` crate, as sketched below.

One additional change was made in this PR. While adding the `tq_proxy` test, I noticed that we were unnecessarily using `wait_for_condition` on initial commits after we knew about successful prepares. These commits should always complete immediately, so I simplified this code in a few existing tests.
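As a rough illustration of that oneshot-channel tracking, here is a minimal sketch of the pattern; the `Tracker` type, its methods, and the request-id scheme are hypothetical simplifications for illustration, not the actual types in the `trust_quorum` crate:

    use std::collections::HashMap;
    use tokio::sync::oneshot;

    // Hypothetical stand-in for the data carried by a proxy response.
    #[derive(Debug)]
    pub enum ProxyResponse {
        Commit(Result<(), String>),
        Status(String),
    }

    /// Tracks in-flight proxied requests by id. When a response arrives from
    /// the remote node, the waiting caller is woken via its oneshot channel.
    #[derive(Default)]
    pub struct Tracker {
        next_id: u64,
        in_flight: HashMap<u64, oneshot::Sender<ProxyResponse>>,
    }

    impl Tracker {
        /// Register a new outstanding request, returning its id and the
        /// receiver that the caller awaits on.
        pub fn register(&mut self) -> (u64, oneshot::Receiver<ProxyResponse>) {
            let id = self.next_id;
            self.next_id += 1;
            let (tx, rx) = oneshot::channel();
            self.in_flight.insert(id, tx);
            (id, rx)
        }

        /// Complete an outstanding request when its response comes back.
        pub fn complete(&mut self, id: u64, rsp: ProxyResponse) {
            if let Some(tx) = self.in_flight.remove(&id) {
                // The caller may have given up; ignore a dropped receiver.
                let _ = tx.send(rsp);
            }
        }

        /// Drop all outstanding requests (e.g., on disconnect). Dropping the
        /// senders wakes each receiver with an error, so callers can retry.
        pub fn fail_all(&mut self) {
            self.in_flight.clear();
        }
    }

Keeping this bookkeeping in the async task layer means the sans-io protocol code never has to know that a request is being tracked at all.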
1 parent 18058fc

File tree

10 files changed: +1076 -137 lines changed

Cargo.lock

Lines changed: 1 addition & 0 deletions
Some generated files are not rendered by default.

trust-quorum/Cargo.toml

Lines changed: 1 addition & 0 deletions
@@ -16,6 +16,7 @@ camino.workspace = true
 chacha20poly1305.workspace = true
 ciborium.workspace = true
 daft.workspace = true
+debug-ignore.workspace = true
 derive_more.workspace = true
 futures.workspace = true
 gfss.workspace = true

trust-quorum/protocol/src/node.rs

Lines changed: 22 additions & 2 deletions
@@ -32,7 +32,9 @@ use crate::{
 use daft::{Diffable, Leaf};
 use gfss::shamir::Share;
 use omicron_uuid_kinds::RackUuid;
+use serde::{Deserialize, Serialize};
 use slog::{Logger, error, info, o, warn};
+use slog_error_chain::SlogInlineError;

 /// An entity capable of participating in trust quorum
 ///
@@ -1063,7 +1065,16 @@ impl Node {
     }
 }

-#[derive(Debug, Clone, thiserror::Error, PartialEq, Eq)]
+#[derive(
+    Debug,
+    Clone,
+    thiserror::Error,
+    PartialEq,
+    Eq,
+    SlogInlineError,
+    Serialize,
+    Deserialize,
+)]
 pub enum CommitError {
     #[error("invalid rack id")]
     InvalidRackId(
@@ -1077,7 +1088,16 @@
     Expunged { epoch: Epoch, from: BaseboardId },
 }

-#[derive(Debug, Clone, thiserror::Error, PartialEq, Eq)]
+#[derive(
+    Debug,
+    Clone,
+    thiserror::Error,
+    PartialEq,
+    Eq,
+    SlogInlineError,
+    Serialize,
+    Deserialize,
+)]
 pub enum PrepareAndCommitError {
     #[error("invalid rack id")]
     InvalidRackId(
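These derive additions matter because a proxied commit can fail on the remote node, and the typed error now has to travel back across the wire to the requesting node. A minimal sketch of the round-trip this enables, encoding with `ciborium` (already a dependency of the crate); the `CommitError` below is a simplified stand-in, not the real enum:

    use serde::{Deserialize, Serialize};

    // Simplified stand-in: the point is that Serialize/Deserialize let a
    // proxy response carry the typed error back over the wire.
    #[derive(Debug, PartialEq, Serialize, Deserialize)]
    enum CommitError {
        InvalidRackId,
        Expunged { epoch: u64 },
    }

    fn main() {
        let err = CommitError::Expunged { epoch: 3 };

        // Encode as CBOR, as a node might when answering a proxied commit.
        let mut buf = Vec::new();
        ciborium::into_writer(&err, &mut buf).unwrap();

        // Decode on the requesting node and hand the typed error back.
        let decoded: CommitError =
            ciborium::from_reader(buf.as_slice()).unwrap();
        assert_eq!(err, decoded);
    }

The `SlogInlineError` derive similarly lets these errors be logged inline as `slog` values without manual plumbing.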

trust-quorum/protocol/src/validators.rs

Lines changed: 4 additions & 1 deletion
@@ -12,6 +12,7 @@ use crate::{
 };
 use daft::{BTreeSetDiff, Diffable, Leaf};
 use omicron_uuid_kinds::RackUuid;
+use serde::{Deserialize, Serialize};
 use slog::{Logger, error, info, warn};
 use std::collections::BTreeSet;

@@ -57,7 +58,9 @@ pub struct SledExpungedError {
     last_prepared_epoch: Option<Epoch>,
 }

-#[derive(Debug, Clone, thiserror::Error, PartialEq, Eq)]
+#[derive(
+    Debug, Clone, thiserror::Error, PartialEq, Eq, Serialize, Deserialize,
+)]
 #[error("mismatched rack id: expected {expected:?}, got {got:?}")]
 pub struct MismatchedRackIdError {
     pub expected: RackUuid,

trust-quorum/src/connection_manager.rs

Lines changed: 63 additions & 13 deletions
@@ -5,13 +5,15 @@
 //! A mechanism for maintaining a full mesh of trust quorum node connections

 use crate::established_conn::EstablishedConn;
+use crate::proxy;
 use trust_quorum_protocol::{BaseboardId, Envelope, PeerMsg};

 // TODO: Move to this crate
 // https://github.com/oxidecomputer/omicron/issues/9311
 use bootstore::schemes::v0::NetworkConfig;

 use camino::Utf8PathBuf;
+use derive_more::From;
 use iddqd::{
     BiHashItem, BiHashMap, TriHashItem, TriHashMap, bi_upcast, tri_upcast,
 };
@@ -60,7 +62,7 @@ pub enum MainToConnMsg {
 ///
 /// All `WireMsg`s sent between nodes is prefixed with a 4 byte size header used
 /// for framing.
-#[derive(Debug, Serialize, Deserialize)]
+#[derive(Debug, Serialize, Deserialize, From)]
 pub enum WireMsg {
     /// Used for connection keep alive
     Ping,
@@ -79,6 +81,12 @@
     /// of tiny information layered on top of trust quorum. You can still think
     /// of it as a bootstore, although, we no longer use that name.
     NetworkConfig(NetworkConfig),
+
+    /// Requests proxied to other nodes
+    ProxyRequest(proxy::WireRequest),
+
+    /// Responses to proxy requests
+    ProxyResponse(proxy::WireResponse),
 }

 /// Messages sent from connection managing tasks to the main peer task
@@ -99,6 +107,8 @@ pub enum ConnToMainMsgInner {
     Received { from: BaseboardId, msg: PeerMsg },
     ReceivedNetworkConfig { from: BaseboardId, config: NetworkConfig },
     Disconnected { peer_id: BaseboardId },
+    ProxyRequestReceived { from: BaseboardId, req: proxy::WireRequest },
+    ProxyResponseReceived { from: BaseboardId, rsp: proxy::WireResponse },
 }

 pub struct TaskHandle {
@@ -120,15 +130,11 @@
         self.abort_handle.abort()
     }

-    pub async fn send(&self, msg: PeerMsg) {
-        let _ = self.tx.send(MainToConnMsg::Msg(WireMsg::Tq(msg))).await;
-    }
-
-    pub async fn send_network_config(&self, config: NetworkConfig) {
-        let _ = self
-            .tx
-            .send(MainToConnMsg::Msg(WireMsg::NetworkConfig(config)))
-            .await;
+    pub async fn send<T>(&self, msg: T)
+    where
+        T: Into<WireMsg>,
+    {
+        let _ = self.tx.send(MainToConnMsg::Msg(msg.into())).await;
     }
 }

@@ -172,7 +178,10 @@ impl EstablishedTaskHandle {
         self.task_handle.abort();
     }

-    pub async fn send(&self, msg: PeerMsg) {
+    pub async fn send<T>(&self, msg: T)
+    where
+        T: Into<WireMsg>,
+    {
         let _ = self.task_handle.send(msg).await;
     }
 }
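The `From` derive on `WireMsg` is what makes the new generic `send` ergonomic: `derive_more` generates a `From` impl for each payload-carrying variant, so callers hand over the payload and let `Into<WireMsg>` do the wrapping. A standalone sketch of the pattern with simplified types:

    use derive_more::From;

    #[derive(Debug)]
    struct PeerMsg;
    #[derive(Debug)]
    struct NetworkConfig;

    // Each single-field variant gets a generated `impl From<...> for WireMsg`.
    #[derive(Debug, From)]
    enum WireMsg {
        Tq(PeerMsg),
        NetworkConfig(NetworkConfig),
    }

    // One generic method replaces the old per-variant send methods.
    fn send<T: Into<WireMsg>>(msg: T) {
        let wire: WireMsg = msg.into();
        println!("framing and writing {wire:?}");
    }

    fn main() {
        send(PeerMsg); // previously `send(msg: PeerMsg)`
        send(NetworkConfig); // previously `send_network_config(config)`
    }

This only works because every payload type appears in exactly one variant; two variants with the same field type would produce conflicting `From` impls.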
@@ -235,6 +244,12 @@ pub struct ConnMgrStatus {
     pub total_tasks_spawned: u64,
 }

+/// The state of a proxy connection
+pub enum ProxyConnState {
+    Connected,
+    Disconnected,
+}
+
 /// A structure to manage all sprockets connections to peer nodes
 ///
 /// Each sprockets connection runs in its own task which communicates with the
@@ -399,7 +414,7 @@ impl ConnMgr {
                "peer_id" => %h.baseboard_id,
                "generation" => network_config.generation
            );
-           h.task_handle.send_network_config(network_config.clone()).await;
+           h.send(network_config.clone()).await;
        }
    }

@@ -415,7 +430,42 @@
                "peer_id" => %h.baseboard_id,
                "generation" => network_config.generation
            );
-           h.task_handle.send_network_config(network_config.clone()).await;
+           h.send(network_config.clone()).await;
+       }
+   }
+
+   /// Forward an API request to another node
+   ///
+   /// Return the state of the connection at this point in time so that the
+   /// [`proxy::Tracker`] can manage the outstanding request on behalf of the
+   /// user.
+   pub async fn proxy_request(
+       &mut self,
+       destination: &BaseboardId,
+       req: proxy::WireRequest,
+   ) -> ProxyConnState {
+       if let Some(h) = self.established.get1(destination) {
+           info!(self.log, "Sending {req:?}"; "peer_id" => %destination);
+           h.send(req).await;
+           ProxyConnState::Connected
+       } else {
+           ProxyConnState::Disconnected
+       }
+   }
+
+   /// Return a response to a proxied request to another node
+   ///
+   /// There is no need to track whether this succeeds or fails. If the
+   /// connection goes away the client on the other side will notice it and
+   /// retry if needed.
+   pub async fn proxy_response(
+       &mut self,
+       destination: &BaseboardId,
+       rsp: proxy::WireResponse,
+   ) {
+       if let Some(h) = self.established.get1(destination) {
+           info!(self.log, "Sending {rsp:?}"; "peer_id" => %destination);
+           h.send(rsp).await;
        }
    }

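For context on how `ProxyConnState` is meant to be consumed: the caller registers the request with the tracker, forwards it, and on `Disconnected` fails fast so the requester can retry once the mesh reconnects. A self-contained sketch under those assumptions; this `ConnMgr` is a toy stand-in and the failure handling is hypothetical, not the actual `NodeTask` logic:

    use std::collections::HashSet;
    use tokio::sync::oneshot;

    #[derive(Clone, PartialEq, Eq, Hash, Debug)]
    struct BaseboardId(String);
    #[derive(Debug)]
    struct WireRequest;

    enum ProxyConnState {
        Connected,
        Disconnected,
    }

    struct ConnMgr {
        established: HashSet<BaseboardId>,
    }

    impl ConnMgr {
        // Mirrors the diff: report connection state rather than returning an
        // error, so the caller decides how to handle the outstanding request.
        async fn proxy_request(
            &mut self,
            dest: &BaseboardId,
            req: WireRequest,
        ) -> ProxyConnState {
            if self.established.contains(dest) {
                println!("forwarding {req:?} to {dest:?}");
                ProxyConnState::Connected
            } else {
                ProxyConnState::Disconnected
            }
        }
    }

    #[tokio::main]
    async fn main() {
        let mut mgr = ConnMgr { established: HashSet::new() };
        let (tx, rx) = oneshot::channel::<Result<(), &'static str>>();

        match mgr.proxy_request(&BaseboardId("sled-a".into()), WireRequest).await
        {
            // Connected: a real tracker would hold `tx` until the matching
            // ProxyResponse arrives from the connection task.
            ProxyConnState::Connected => drop(tx),
            // Disconnected: fail fast so the requester can retry later.
            ProxyConnState::Disconnected => {
                let _ = tx.send(Err("not connected to destination"));
            }
        }

        // The requester observes either the response or the fast failure.
        match rx.await {
            Ok(result) => println!("request completed: {result:?}"),
            Err(_) => println!("request dropped without a response"),
        }
    }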

trust-quorum/src/established_conn.rs

Lines changed: 30 additions & 0 deletions
@@ -233,6 +233,36 @@ impl EstablishedConn {
                    panic!("Connection to main task channnel full");
                }
            }
+           WireMsg::ProxyRequest(req) => {
+               if let Err(_) = self.main_tx.try_send(ConnToMainMsg {
+                   task_id: self.task_id,
+                   msg: ConnToMainMsgInner::ProxyRequestReceived {
+                       from: self.peer_id.clone(),
+                       req,
+                   },
+               }) {
+                   error!(
+                       self.log,
+                       "Failed to send received proxy msg to the main task"
+                   );
+                   panic!("Connection to main task channel full");
+               }
+           }
+           WireMsg::ProxyResponse(rsp) => {
+               if let Err(_) = self.main_tx.try_send(ConnToMainMsg {
+                   task_id: self.task_id,
+                   msg: ConnToMainMsgInner::ProxyResponseReceived {
+                       from: self.peer_id.clone(),
+                       rsp,
+                   },
+               }) {
+                   error!(
+                       self.log,
+                       "Failed to send received proxy msg to the main task"
+                   );
+                   panic!("Connection to main task channel full");
+               }
+           }
        }
    }
}

trust-quorum/src/ledgers.rs

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ use slog::{Logger, info};
 use trust_quorum_protocol::PersistentState;

 /// A wrapper type around [`PersistentState`] for use as a [`Ledger`]
-#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
+#[derive(Debug, Clone, Serialize, Deserialize)]
 pub struct PersistentStateLedger {
     pub generation: u64,
     pub state: PersistentState,

trust-quorum/src/lib.rs

Lines changed: 2 additions & 1 deletion
@@ -7,9 +7,10 @@
 mod connection_manager;
 pub(crate) mod established_conn;
 mod ledgers;
+mod proxy;
 mod task;

 pub(crate) use connection_manager::{
     ConnToMainMsg, ConnToMainMsgInner, MainToConnMsg, WireMsg,
 };
-pub use task::NodeTask;
+pub use task::{CommitStatus, Config, NodeApiError, NodeTask, NodeTaskHandle};
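The widened export list reflects the new public surface: a `NodeTask` is built from a `Config` and driven in its own task, while callers hold a `NodeTaskHandle` whose API methods (including the proxied commit and status operations) return results such as `CommitStatus` or `NodeApiError`. A rough sketch of that task/handle split; the message and method names below are entirely hypothetical, not the real API in `trust-quorum/src/task.rs`:

    use tokio::sync::{mpsc, oneshot};

    // Hypothetical API message: a call paired with a oneshot for the reply.
    enum ApiRequest {
        Commit { reply: oneshot::Sender<Result<(), &'static str>> },
    }

    // Handle side: cloneable, used by sled-agent code to call into the task.
    #[derive(Clone)]
    struct NodeTaskHandle {
        tx: mpsc::Sender<ApiRequest>,
    }

    impl NodeTaskHandle {
        async fn commit(&self) -> Result<(), &'static str> {
            let (reply, rx) = oneshot::channel();
            self.tx
                .send(ApiRequest::Commit { reply })
                .await
                .map_err(|_| "node task exited")?;
            rx.await.map_err(|_| "node task exited")?
        }
    }

    // Task side: owns the protocol state and serializes all API calls.
    async fn run_node_task(mut rx: mpsc::Receiver<ApiRequest>) {
        while let Some(req) = rx.recv().await {
            match req {
                ApiRequest::Commit { reply } => {
                    // A real task would drive the sans-io protocol here.
                    let _ = reply.send(Ok(()));
                }
            }
        }
    }

    #[tokio::main]
    async fn main() {
        let (tx, rx) = mpsc::channel(8);
        let handle = NodeTaskHandle { tx };
        tokio::spawn(run_node_task(rx));
        println!("commit result: {:?}", handle.commit().await);
    }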
