From 55841caff236f58d9680d9f9b6ada24256b34b85 Mon Sep 17 00:00:00 2001 From: Sean Hefty Date: Sat, 11 Oct 2025 20:20:07 -0700 Subject: [PATCH 01/16] libibverbs: Document verbs semantic model A user of libibverbs must rely heavily on external documentation, specifically the IBTA vol. 1 specification, to understand how the API is used. However, the API itself has evolved beyond support for only Infiniband. This leaves both users and potential vendors trying to plug into the API struggling, as the names used by the library reflect Infiniband naming, but the concepts have broader use. To provide better guidance on what the current verbs semantic model describes, provide documentation on how major verbs constructs are used. This includes referencing the historical meaning of verbs objects, as well as their evolved use. The proposed descriptions are directly intended to help new transports, such as Ultra Ethernet, understand how to adopt verbs for best results and where potential changes may be needed. Signed-off-by: Sean Hefty --- Documentation/libibverbs.md | 285 +++++++++++++++++++++++++++++++++++- 1 file changed, 280 insertions(+), 5 deletions(-) diff --git a/Documentation/libibverbs.md b/Documentation/libibverbs.md index 980f354a3..0f7984382 100644 --- a/Documentation/libibverbs.md +++ b/Documentation/libibverbs.md @@ -1,10 +1,9 @@ # Introduction -libibverbs is a library that allows programs to use RDMA "verbs" for -direct access to RDMA (currently InfiniBand and iWARP) hardware from -userspace. For more information on RDMA verbs, see the InfiniBand -Architecture Specification vol. 1, especially chapter 11, and the RDMA -Consortium's RDMA Protocol Verbs Specification. +libibverbs is a library that allows userspace programs direct +access to high-performance network hardware. See the Verbs +Semantics section at the end of this document for details +on RDMA and verbs constructs. # Using libibverbs @@ -74,3 +73,279 @@ The following table describes the expected behavior when VERBS_LOG_LEVEL is set: |-----------------|---------------------------------|------------------------------------------------| | Regular prints | Output to VERBS_LOG_FILE if set | Output to VERBS_LOG_FILE, or stderr if not set | | Datapath prints | Compiled out, no output | Output to VERBS_LOG_FILE, or stderr if not set | + + +# Verbs Semantics + +Verbs is defined by the InfiniBand Architecture Specification +(vol. 1, chapter 11) as an abstract definition of the functionality +provided by an Infiniband NIC. libibverbs was designed as a formal +software API aligned with that abstraction. As a result, API names, +including the library name, are closely aligned with those defined +for Infiniband. + +However, the library and API have evolved to support additional +high-performance transports and NICs. libibverbs constructs have +expanded beyond their traditional roles and definitions, except that +the original Infiniband naming has been kept for backwards +compatibility purposes. + +Today, verbs can be viewed as defining software primitives for +network hardware supporting one or more of the following: + +- Network queues are directly accessible from user space. +- Network hardware can directly access application memory buffers. +- The transport supports RDMA operations. + +The following sections describe select libibverbs constructs in terms +of their current semantics and, where appropriate, historical context. +Items are ordered conceptually. 
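The basic usage pattern these constructs combine into can be sketched as follows (a minimal example using only core libibverbs calls; most error handling omitted for brevity):

```c
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
	struct ibv_device **dev_list = ibv_get_device_list(NULL);
	struct ibv_context *ctx;
	struct ibv_pd *pd;
	struct ibv_cq *cq;
	struct ibv_mr *mr;
	void *buf = malloc(4096);

	if (!dev_list || !dev_list[0] || !buf)
		return 1;

	/* Open a device context; all other resources hang off of it. */
	ctx = ibv_open_device(dev_list[0]);
	if (!ctx)
		return 1;

	/* Protection domain: isolates this process' queues and regions. */
	pd = ibv_alloc_pd(ctx);
	/* Completion queue: where data transfer status is reported. */
	cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);
	/* Memory region: makes an application buffer accessible to the NIC. */
	mr = ibv_reg_mr(pd, buf, 4096, IBV_ACCESS_LOCAL_WRITE);

	printf("%s: registered buffer, lkey 0x%x\n",
	       ibv_get_device_name(dev_list[0]), mr->lkey);

	ibv_dereg_mr(mr);
	ibv_destroy_cq(cq);
	ibv_dealloc_pd(pd);
	ibv_close_device(ctx);
	ibv_free_device_list(dev_list);
	free(buf);
	return 0;
}
```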
+ +*RDMA* +: RDMA takes on several different meanings based on context, + which are further described below. RDMA stands for remote direct memory + access. Historically, RDMA referred to network operations which could + directly read or write application data buffers at the target. + The use of the term RDMA has since evolved to encompass not just + network operations, but also the key features of such devices: + + - Zero-copy: no intermediate buffering + - Low CPU utilization: transport offload + - High bandwidth and low latency + +*RDMA Verbs* +: RDMA verbs is the more generic name given to the libibverbs API, + as it implies support for other transports beyond Infiniband. + A device which supports RDMA verbs is accessible through this library. + + A common, but restricted, industry use of the term RDMA verbs frequently + implies the subset of libibverbs APIs and semantics focused on reliable- + connected communication. This document will use the term RDMA verbs as + a synonym for the libibverbs API as a whole. + +*RDMA-Core* +: The rdma-core is a set of libraries for interfacing with the Linux + kernel RDMA subsystem. Two key rdma-core libraries are this one, + libibverbs, and the librdmacm, which is used to establish connections. + + The rdma-core is considered an essential component of Linux RDMA. + It is used to ensure that the kernel ABI is stable and implements the + user space portion of the kernel RDMA IOCTL API. + +*RDMA Device / Verbs Device / NIC* +: An RDMA or verbs device is one which is accessible through the Linux + RDMA subsystem, and as a result, plugs into the libibverbs and rdma-core + framework. NICs plug into the RDMA subsystem to expose hardware + primitives supported by verbs (described above) or RDMA-like features. + + NICs do not necessarily need to support RDMA operations or transports + in order to leverage the rdma-core infrastructure. It is sufficient for + a NIC to expose similar features found in RDMA devices. + +*RDMA Operation* +: RDMA operations refer to network transport functions that read or write + data buffers at the target without host CPU intervention. RDMA reads + copy data from a remote memory region to the network and return the data + to the initiator of the request. RDMA writes copy data from a local + memory region to the network and place it directly into a memory region + at the target. + +*RDMA Transport* +: An RDMA transport can be considered any transport that supports RDMA + operations. Common RDMA transports include Infiniband, + RoCE (RDMA over Converged Ethernet), RoCE version 2, and iWarp. RoCE + and RoCEv2 are Infiniband transports over the Ethernet link layer, with + differences only in their lower-level addressing. + However, the term Infiniband usually refers to the Infiniband transport + over the Infiniband link layer. RoCE is used when explicitly + referring to Ethernet based solutions. RoCE version 2 is often included + or implied by references to RoCE. + +*Device Node* +: The original intent of device node type was to identify if an Infiniband + device was a NIC, switch, or router. Infiniband NICs were labeled as + channel adapters (CA). Node type was extended to identify the transport + being manipulated by verb primitives. Devices which implemented other + transports were assigned new node types. As a result, applications which + targeted a specific transport, such as Infiniband or RoCE, relied on node + type to indirectly identify the transport. 
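For example, an application can inspect the node and transport type reported for each device without opening it (a short sketch using existing libibverbs calls):

```c
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
	struct ibv_device **list = ibv_get_device_list(NULL);
	int i;

	for (i = 0; list && list[i]; i++) {
		/* node_type reflects the historical CA/switch/router labels;
		 * transport_type identifies the transport the verbs map to. */
		printf("%s: node type %s, transport %d\n",
		       ibv_get_device_name(list[i]),
		       ibv_node_type_str(list[i]->node_type),
		       list[i]->transport_type);
	}

	if (list)
		ibv_free_device_list(list);
	return 0;
}
```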
+ +*Protection Domain (PD)* +: A protection domain provides process-level isolation of resources and is + considered a fundamental security construct for Linux RDMA devices. + A PD defines a boundary between memory regions and queue pairs. A + network data transfer is associated with a single queue pair. That queue + pair may only access a memory region that shares the same protection + domain as itself. This prevents a user space process from accessing + memory buffers outside of its address space. + + Protection domains provide security for regions accessed + by both local and remote operations. Local access includes work requests + posted to HW command queues which reference memory regions. Remote + access includes RDMA operations which read or write memory regions. + + A queue pair is associated with a single PD. The PD verifies that hardware + access to a given lkey or rkey is valid for the specified QP and the + initiating or targeted process has permission to the lkey or rkey. Vendors + may implement a PD using a variety of mechanisms, but are required to meet + the defined security isolation. + +*Memory Region (MR)* +: A memory region identifies a virtual address range known to the NIC. + MRs are registered address ranges accessible by the NIC for local and + remote operations. The process of creating a MR associates the given + virtual address range with a protection domain, in order to ensure + process-level isolation. + + Once allocated, data transfers reference the MR using a key value (lkey + and/or rkey). When accessing a MR as part of a data transfer, an offset + into the memory region is specified. The offset is relative to the start + of the region and may either be 0-based or based on the region’s starting + virtual address. + +*lkey* +: The lkey is designed as a hardware identifier for a locally accessed data + buffer. Because work requests are formatted by user space software and + may be written directly to hardware queues, hardware must validate + that the memory buffers being referenced are accessible to the application. + + NIC hardware may not have access to the operating system's + virtual address translation table. Instead, hardware can use the lkey to + identify the registered memory region, which in turn identifies a protection + domain, which finally identifies the calling process. The protection domain + the processing queue pair must match that of the accessed memory region. + This prevents an application from sending data from buffers outside of its + virtual address space. + +*rkey* +: The rkey is designed as a transport identifier for remotely accessed data + buffers. It's conceptually like an lkey, but the value is + shared across the network. An rkey is associated with transport + permissions. + +*Completion Queue (CQ)* +: A completion queue is designed to represent a hardware queue where the + status of asynchronous operations is reported. Each asynchronous + operation (i.e. data transfer) is expected to write a single entry + into the completion queue. + +*Queue Pair (QP)* +: A queue pair was originally defined as a transport addressable set of + hardware queues, with a QP consisting of send and receive queues (defined + below). The evolved definition of a QP refers only to the transport + addressability of an endpoint. A QP's address is identified as a + queue pair number (QPN), which is conceptually like a transport + port number. In networking stack models, a QP is considered a transport + layer object. 
+ + The internal structure of the QP is not constrained to a pair of queues. + The number of hardware queues and their purpose may vary based on how + the QP is configured. A QP may have 0 or more command queues used for + posting data transfer requests (send queues) and 0 or more command queues + for posting data buffers used to receive incoming messages (receive queues). + +*Receive Queue (RQ)* +: Receive queues are command queues belonging to queue pairs. Receive + commands post application buffers to receive incoming data. + + Receive queues are configured as part of queue pair setup. A RQ is + accessed indirectly through the QP when submitting receive work requests. + +*Shared Receive Queue (SRQ)* +: A shared receive queue is a single hardware command queue for posting + buffers to receive incoming data. This command queue may be shared + among multiple QPs, such that data that arrives on any associated QP + may retrieve a previously posted buffer from the SRQ. QPs that share + the same SRQ coordinate their access to posted buffers such that a + single posted operation is matched with a single incoming message. + + Unlike receive queues, SRQs are accessed directly by applications to + submit receive work requests. + +*Send Queue (SQ)* +: More generically, a send queue is a transmit queue. It + represents a command queue for operations that initiate a network operation. + A send queue may also be used to submit commands that update hardware + resources, such as updating memory regions. Network operations submitted + through the send queue include message sends, RDMA reads, RDMA writes, and + atomic operations, among others. + + Send queues are configured as part of queue pair setup. A SQ is + accessed indirectly through the QP when submitting send work requests. + +*Send Message* +: A send message refers to a specific type of transport data transfer. + A send message operation copies data from a local buffer to the network + and transfers the data as a single transport unit. The receiving NIC + copies the data from the network into a user posted receive message + buffer(s). + + Like the term RDMA, the meaning of send is context dependent. Send could + refer to the transmit command queue, any operation posted to the transmit + (send) queue, or a send message operation. + +*Work Request (WR)* +: A work request is a command submitted to a queue pair, work queue, or + shared receive queue. Work requests define the type of network operation + to perform, including references to any memory regions the operation will + access. + + A send work request is a transmit operation that is directed to the send + queue of a queue pair. A receive work request is an operation posted + to either a shared receive queue or a QP's receive queue. + +*Address Handle (AH)* +: An address handle identifies the link and/or network layer addressing to + a network port or multicast group. + + With legacy Infiniband, an address handle is a link layer object. For other + transports, including RoCE, the address handle is a network layer object. + +*Global Identifier (GID)* +: Infiniband defines a GID as an optional network-layer or multicast address. + Because GIDs are large enough to store an IPv6 address, their use has evolved + to support other transports. A GID identifies a network port, with the most + well-known GIDs being IPv4 and IPv6 addresses. + +*GID Type* +: The GID type determines the specific type of GID address being referenced. 
+ Additionally, it identifies the set of addressing headers underneath the + transport header. + + An RDMA transport protocol may be layered over different networking stacks. + An RDMA transport may layer directly over a link layer (like Infiniband or + Ethernet), over the network layer (such as IP), or another transport + layer (such as TCP or UDP). The GID type conveys how the RDMA transport + stack is constructed, as well as how the GID address is interpreted. +
+*GID Index* +: RDMA addresses are securely managed to ensure that unprivileged + applications do not inject arbitrary source addresses into the network. + Transport addresses are injected by the queue pair. Network addresses + are selected from a set of addresses stored in a source addressing table. + + The source addressing table is referred to as a GID table. The GID index + identifies an entry in that table. The GID table exposed to a user + space process contains only those addresses usable by that process. + Queue pairs are frequently assigned a specific GID index to use for their + source network address when initially configured. +
+*Device Context* +: Identifies an instance of an opened RDMA device. +
+*command fd - cmd_fd* +: File descriptor used to communicate with the kernel device driver. + Associated with the device context and opened by the library. + The cmd_fd communicates with the kernel via ioctls and is used + to allocate, configure, and release device resources. + + Applications interact with the cmd_fd indirectly through libibverbs + function calls. +
+*async_fd* +: File descriptor used to report asynchronous events. + Associated with the device context and opened by the library. + + Applications may interact directly with the async_fd, such as waiting + on the fd via select/poll, to receive notifications when an async event + has been reported.
From a8a8b6659b16ddd3b8eac259709c6b14accabebe Mon Sep 17 00:00:00 2001 From: Sean Hefty Date: Mon, 9 Dec 2024 16:52:29 -0800 Subject: [PATCH 02/16] libibverbs: Introduce ultra ethernet transport support
Ultra Ethernet (UET) is a new connectionless transport that targets HPC and AI applications running at extreme scale. Introduce new node and transport types for devices that only support the new Ultra Ethernet transport. UET may be layered over UDP/IP using a well-known UDP port (similar to RoCEv2), or may be layered directly over IP. Define new GID types to allow users to select UET plus the underlying protocol layering (similar to how RoCEv1 and RoCEv2 are handled).
Signed-off-by: Sean Hefty --- libibverbs/verbs.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/libibverbs/verbs.h b/libibverbs/verbs.h index 821341242..a15403976 100644 --- a/libibverbs/verbs.h +++ b/libibverbs/verbs.h @@ -74,6 +74,8 @@ enum ibv_gid_type { IBV_GID_TYPE_IB, IBV_GID_TYPE_ROCE_V1, IBV_GID_TYPE_ROCE_V2, + IBV_GID_TYPE_UET_UDP, + IBV_GID_TYPE_UET_IP, }; struct ibv_gid_entry {
From f03837f26b2349fdd20450108c1d6f33b05e8442 Mon Sep 17 00:00:00 2001 From: Sean Hefty Date: Mon, 9 Dec 2024 18:22:23 -0800 Subject: [PATCH 03/16] libibverbs: Add support for UET QPs
UET is designed around connectionless communication. To expose UET through verbs, we introduce a new reliable-unconnected QP type (named to align with existing QP types). Infiniband defines several states that a QP may be in. Many of the states are unsuitable for unconnected QPs in general and may be irrelevant depending on HW implementations. For UET, we define only 2 states for a UET QP: RTS and error.
A UET QP is created in the ready-to-send state. To create a UET QP directly into the RTS state, the full set of QP attributes are needed at creation time. Struct ibv_qp_init_attr_ex is extended to include struct ibv_qp_attr for this purpose. Signed-off-by: Sean Hefty --- libibverbs/verbs.h | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/libibverbs/verbs.h b/libibverbs/verbs.h index a15403976..c5d1e3003 100644 --- a/libibverbs/verbs.h +++ b/libibverbs/verbs.h @@ -932,6 +932,7 @@ enum ibv_qp_type { IBV_QPT_RAW_PACKET = 8, IBV_QPT_XRC_SEND = 9, IBV_QPT_XRC_RECV, + IBV_QPT_RU, IBV_QPT_DRIVER = 0xff, }; @@ -961,6 +962,7 @@ enum ibv_qp_init_attr_mask { IBV_QP_INIT_ATTR_IND_TABLE = 1 << 4, IBV_QP_INIT_ATTR_RX_HASH = 1 << 5, IBV_QP_INIT_ATTR_SEND_OPS_FLAGS = 1 << 6, + IBV_QP_INIT_ATTR_QP_ATTR = 1 << 7, }; enum ibv_qp_create_flags { @@ -1015,6 +1017,9 @@ struct ibv_qp_init_attr_ex { uint32_t source_qpn; /* See enum ibv_qp_create_send_ops_flags */ uint64_t send_ops_flags; + + struct ibv_qp_attr *qp_attr; + int qp_attr_mask; }; enum ibv_qp_open_attr_mask { From d8f141adae5e690d673eb9a22dedc8b9d12409d7 Mon Sep 17 00:00:00 2001 From: Sean Hefty Date: Mon, 14 Apr 2025 12:32:35 -0700 Subject: [PATCH 04/16] libibverbs: Add job id support Job IDs are used to identify a distributed application. The concept is widely used in HPC and AI applications, to identify a set of distributed processes as belonging to a single application. Job IDs are integral to ultra ethernet. A job ID is carried in every transport message and is part of a UET QP address. UEC defines that job IDs must be managed by a privileged entity. The association of a job ID to a specific QP is a protected operation. A simple view of the job security model is shown as this object model: device <--- job ID ^ ^ | | PD <--- job key ^ ^ ^ | \___ | (optional) QP --- MR This patch focuses on the job ID. Job keys are discussed in a following patch. We define new verb calls to allocate a job object. Each job object is assigned a unique ID. The assignment of ID values to job objects it outside the scope of the API, and would usually be handled through a job launcher or process manager. The ibv_alloc_job() call is use to create and configure a job object. It is expected that the kernel will enforce that callers have the proper privileges to create job objects on devices. (Similar to opening QP 0 or 1). Once a job object has been created, it may be shared with local processes using a shared fd mechanism. The creating process obtains a sharable fd using ibv_export_job() and exchanges the fd with the processes of the job (e.g. via sockets). On receiving the fd, the processes use ibv_import_job() to setup local job resources. A job is associated with addressing information, which includes protocol stack data, as well as an ID. The number of bytes of the ID which are valid is dependent on the associated protocol. For UET, it is 3-bytes. A job object performs an additional function beyond linking a QP with a job ID. It defines a mechanism by which local processes can share addressing information of peers. This can reduce the amount of memory used to store addresses locally and enables future optimizations, such as applying job level encryption. The feature will also map well to HPC and AI applications that identify peers using a rank. Conceptually, a virtual address array may be stored with a job object. Addresses are inserted or removed from the array at a given index location. The intent is that the index can map directly to the process' rank. 
When sending to a peer, the peer can be identified by the job plus the index. Note that the implementation for the job's addressing array is not defined. A vendor may implement this in a variety of ways. Addresses may be pre-inserted by the job launcher, and the transport addresses may be generated using an algorithm. Signed-off-by: Sean Hefty --- libibverbs/verbs.h | 36 ++++++++++++++++++++++++++++++++++++ 1 file changed, 36 insertions(+) diff --git a/libibverbs/verbs.h b/libibverbs/verbs.h index c5d1e3003..d418bd4fb 100644 --- a/libibverbs/verbs.h +++ b/libibverbs/verbs.h @@ -363,6 +363,8 @@ struct ibv_device_attr_ex { struct ibv_pci_atomic_caps pci_atomic_caps; uint32_t xrc_odp_caps; uint32_t phys_port_cnt_ex; + uint32_t max_job_ids; + uint32_t max_addr_entries; }; enum ibv_mtu { @@ -2000,6 +2002,40 @@ struct ibv_flow_action_esp_attr { uint32_t esn; }; +struct ibv_job { + struct ibv_context *context; + void *user_context; + uint32_t handle; +}; + +struct ibv_job_attr { + uint32_t comp_mask; + unsigned int flags; + uint64_t id; + uint32_t max_addr_entries; + enum ibv_qp_type qp_type; + struct ibv_ah_attr ah_attr; +}; + +struct ibv_job * +ibv_alloc_job(struct ibv_context *context, struct ibv_job_attr *attr, + void *user_context); +int ibv_close_job(struct ibv_job *job); + +int ibv_insert_addr(struct ibv_job *job, uint32_t qpn, + struct ibv_ah_attr ah_attr, + unsigned int addr_idx, unsigned int flags); +int ibv_remove_addr(struct ibv_job *job, unsigned int addr_idx, + unsigned int flags); +int ibv_query_addr(struct ibv_job *job, unsigned int addr_idx, + uint32_t *qpn, struct ibv_ah_attr *ah_attr, + unsigned int flags); + +int ibv_export_job(struct ibv_job *job, int *fd); +int ibv_import_job(struct ibv_context *context, int fd, struct ibv_job **job); + +int ibv_query_job(struct ibv_job *job, struct ibv_job_attr *attr); + struct ibv_device; struct ibv_context; From 8cea5c15babfddeae4a2adb38234551d0637134a Mon Sep 17 00:00:00 2001 From: Sean Hefty Date: Tue, 15 Apr 2025 10:20:42 -0700 Subject: [PATCH 05/16] libibverbs: Add job key support The job object model can be viewed as: device <--- job ID ^ ^ | | PD <--- job key ^ ^ ^ | \___ | (optional) QP --- MR This patch introduces the job key object. The relationship between a job key and a job ID is similar to an lkey to a MR. A job object maps to a job ID value. Job objects are device level objects. A job key associates the job ID with a protection domain to provide process level protections. Job keys are associated with a 32-bit jkey value. The jkey will be used when posting a WR to associate a transfer with a specific job. That is, the jkey is what mirrors the lkey concept. The NIC converts the jkey to the job ID when transmitting packets on the wire, applying appropriate checks that the QP has access to the target job ID. E.g. the job key and QP belong to the same PD. UET allows a registered MR to optionally be accessible only to members of a specific job. The job key will also be used as an optional attribute when creating a MR. Details on associating a MR with a job key are defined in a later patch. 
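A rough usage sketch, combining the call proposed here with those from the previous patch (ctx and pd are an already opened device context and PD; error paths abbreviated):

    static struct ibv_job_key *setup_job(struct ibv_context *ctx,
                                         struct ibv_pd *pd, uint64_t job_id)
    {
        struct ibv_job_attr attr = {
            .id = job_id,              /* e.g. 3-byte UET job ID from the launcher */
            .max_addr_entries = 1024,  /* one entry per peer/rank */
            .qp_type = IBV_QPT_RU,
        };
        struct ibv_job *job;
        struct ibv_job_key *jkey;

        /* A privileged entity (e.g. the job launcher) creates and configures
         * the job; it can share it with ranks via ibv_export_job() /
         * ibv_import_job(). */
        job = ibv_alloc_job(ctx, &attr, NULL);
        if (!job)
            return NULL;

        /* Bind the job to this process' PD; jkey->jkey is the value later
         * referenced by work requests and memory registrations. */
        jkey = ibv_create_jkey(pd, job, 0);
        if (!jkey)
            ibv_close_job(job);
        return jkey;
    }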
Signed-off-by: Sean Hefty --- libibverbs/verbs.h | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/libibverbs/verbs.h b/libibverbs/verbs.h index d418bd4fb..fe88f9948 100644 --- a/libibverbs/verbs.h +++ b/libibverbs/verbs.h @@ -365,6 +365,7 @@ struct ibv_device_attr_ex { uint32_t phys_port_cnt_ex; uint32_t max_job_ids; uint32_t max_addr_entries; + uint32_t max_jkeys_per_pd; }; enum ibv_mtu { @@ -2036,6 +2037,17 @@ int ibv_import_job(struct ibv_context *context, int fd, struct ibv_job **job); int ibv_query_job(struct ibv_job *job, struct ibv_job_attr *attr); +struct ibv_job_key { + struct ibv_pd *pd; + uint32_t handle; + uint32_t jkey; +}; + +struct ibv_job_key * +ibv_create_jkey(struct ibv_pd *pd, struct ibv_job *job, unsigned int flags); +int ibv_destroy_jkey(struct ibv_job_key *job_key); + + struct ibv_device; struct ibv_context; From 7a77e43d8334628f1c6b380e4705f0251f3b6359 Mon Sep 17 00:00:00 2001 From: Sean Hefty Date: Wed, 11 Dec 2024 10:50:29 -0800 Subject: [PATCH 06/16] libibverbs: Allow posting WRs for RU QPs Add new extended QP functions to set necessary input fields related to supporting RU QPs and UE transport. The UE transport supports 64-bits of immediate data and 64-bit rkeys. Provide expanded APIs to support both. Also include APIs to set full UET destination address data. UET QPs have an additional address component beyond the QP or endpoint address. They have a concept defined as a resource index. A resource index can be viewed as additional receive queues attached to the QP, which are directly addressable by a sender. One intended use of resource indices is to allow a single UET QP to separate traffic from different services. For example, HPC traffic may use one subset of indices, AI traffic a different subset, and storage a third. The number of resource indices supported by a QP is vendor specific, and how they are used by applications it outside the scope of the verbs API. The resource index concept reuses the verbs work queue concept A new send WR flag is also added, delivery complete. When requested and supported by the provider, this flag indicates that a completion for the send operation indicates that the data is globally observable at the target. This is an optional feature of the UE transport. 
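As a posting sketch (assuming qpx is an extended QP of type IBV_QPT_RU, and that the AH, remote QPN, jkey, and 64-bit rkey for the target are already known):

    static int post_uet_write(struct ibv_qp_ex *qpx, struct ibv_ah *ah,
                              uint32_t remote_qpn, uint32_t jkey,
                              uint64_t rkey64, uint64_t remote_addr,
                              uint32_t lkey, void *buf, uint32_t len)
    {
        ibv_wr_start(qpx);

        qpx->wr_id = 1;
        /* Request a completion only once the data is globally observable. */
        qpx->wr_flags = IBV_SEND_SIGNALED | IBV_SEND_DELIVERY_COMPLETE;

        ibv_wr_rdma_write64(qpx, rkey64, remote_addr);
        ibv_wr_set_ru_addr(qpx, ah, remote_qpn, jkey);
        ibv_wr_set_sge(qpx, lkey, (uintptr_t)buf, len);

        return ibv_wr_complete(qpx);
    }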
Signed-off-by: Sean Hefty --- libibverbs/verbs.h | 58 +++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 57 insertions(+), 1 deletion(-) diff --git a/libibverbs/verbs.h b/libibverbs/verbs.h index fe88f9948..97e6ba498 100644 --- a/libibverbs/verbs.h +++ b/libibverbs/verbs.h @@ -1160,7 +1160,8 @@ enum ibv_send_flags { IBV_SEND_SIGNALED = 1 << 1, IBV_SEND_SOLICITED = 1 << 2, IBV_SEND_INLINE = 1 << 3, - IBV_SEND_IP_CSUM = 1 << 4 + IBV_SEND_IP_CSUM = 1 << 4, + IBV_SEND_DELIVERY_COMPLETE = 1 << 5, }; enum ibv_placement_type { @@ -1390,6 +1391,19 @@ struct ibv_qp_ex { void (*wr_flush)(struct ibv_qp_ex *qp, uint32_t rkey, uint64_t remote_addr, size_t len, uint8_t type, uint8_t level); + + void (*wr_send_imm64)(struct ibv_qp_ex *qp, __be64 imm_data); + void (*wr_rdma_read64)(struct ibv_qp_ex *qp, uint64_t rkey, + uint64_t remote_addr); + void (*wr_rdma_write64)(struct ibv_qp_ex *qp, uint64_t rkey, + uint64_t remote_addr); + void (*wr_rdma_write64_imm)(struct ibv_qp_ex *qp, uint64_t rkey, + uint64_t remote_addr, __be64 imm_data); + void (*wr_set_ru_addr)(struct ibv_qp_ex *qp, struct ibv_ah *ah, + uint32_t remote_qpn, uint32_t jkey); + void (*wr_set_job_addr)(struct ibv_qp_ex *qp, unsigned int addr_idx, + uint32_t jkey); + void (*wr_set_wq_num)(struct ibv_qp_ex *qp, uint32_t wq_num); }; struct ibv_qp_ex *ibv_qp_to_qp_ex(struct ibv_qp *qp); @@ -1426,12 +1440,24 @@ static inline void ibv_wr_rdma_read(struct ibv_qp_ex *qp, uint32_t rkey, qp->wr_rdma_read(qp, rkey, remote_addr); } +static inline void ibv_wr_rdma_read64(struct ibv_qp_ex *qp, uint64_t rkey, + uint64_t remote_addr) +{ + qp->wr_rdma_read64(qp, rkey, remote_addr); +} + static inline void ibv_wr_rdma_write(struct ibv_qp_ex *qp, uint32_t rkey, uint64_t remote_addr) { qp->wr_rdma_write(qp, rkey, remote_addr); } +static inline void ibv_wr_rdma_write64(struct ibv_qp_ex *qp, uint64_t rkey, + uint64_t remote_addr) +{ + qp->wr_rdma_write64(qp, rkey, remote_addr); +} + static inline void ibv_wr_flush(struct ibv_qp_ex *qp, uint32_t rkey, uint64_t remote_addr, size_t len, uint8_t type, uint8_t level) @@ -1445,6 +1471,12 @@ static inline void ibv_wr_rdma_write_imm(struct ibv_qp_ex *qp, uint32_t rkey, qp->wr_rdma_write_imm(qp, rkey, remote_addr, imm_data); } +static inline void ibv_wr_rdma_write64_imm(struct ibv_qp_ex *qp, uint64_t rkey, + uint64_t remote_addr, __be64 imm_data) +{ + qp->wr_rdma_write64_imm(qp, rkey, remote_addr, imm_data); +} + static inline void ibv_wr_send(struct ibv_qp_ex *qp) { qp->wr_send(qp); @@ -1455,6 +1487,11 @@ static inline void ibv_wr_send_imm(struct ibv_qp_ex *qp, __be32 imm_data) qp->wr_send_imm(qp, imm_data); } +static inline void ibv_wr_send_imm64(struct ibv_qp_ex *qp, __be64 imm_data) +{ + qp->wr_send_imm64(qp, imm_data); +} + static inline void ibv_wr_send_inv(struct ibv_qp_ex *qp, uint32_t invalidate_rkey) { @@ -1473,6 +1510,25 @@ static inline void ibv_wr_set_ud_addr(struct ibv_qp_ex *qp, struct ibv_ah *ah, qp->wr_set_ud_addr(qp, ah, remote_qpn, remote_qkey); } +static inline void ibv_wr_set_ru_addr(struct ibv_qp_ex *qp, struct ibv_ah *ah, + uint32_t remote_qpn, uint32_t jkey) +{ + qp->wr_set_ru_addr(qp, ah, remote_qpn, jkey); +} + +static inline void ibv_wr_set_job_addr(struct ibv_qp_ex *qp, + unsigned int addr_idx, + uint32_t jkey) +{ + qp->wr_set_job_addr(qp, addr_idx, jkey); +} + +static inline void ibv_wr_set_wq_num(struct ibv_qp_ex *qp, + uint32_t wq_num) +{ + qp->wr_set_wq_num(qp, wq_num); +} + static inline void ibv_wr_set_xrc_srqn(struct ibv_qp_ex *qp, uint32_t remote_srqn) { From 
387ed8e16b52a88b30f9f2658204f84986a190c1 Mon Sep 17 00:00:00 2001 From: Sean Hefty Date: Wed, 11 Dec 2024 11:24:12 -0800 Subject: [PATCH 07/16] libibverbs: Report UET transport details in completions Allow UET specific information to be reported as part of work completions. This includes the larger immediate data size, the job ID carried in the transport header, and a peer ID, also carried in the transport header. Included with completion data is a UET transport field, called the initiator in UEC terminology. This is a user configurable value intended to map to the rank number for a parallel application. The initiator field only has meaning within a specific job ID. As a result, when the value is valid in a completion, so is the job ID. (For UET, the initiator value is part of the UET address.) The verbs naming of this field is the slightly more generic term, src_id, to align with src_qpn (in ibv_wc). Signed-off-by: Sean Hefty --- libibverbs/verbs.h | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/libibverbs/verbs.h b/libibverbs/verbs.h index 97e6ba498..a5762e6ba 100644 --- a/libibverbs/verbs.h +++ b/libibverbs/verbs.h @@ -562,6 +562,8 @@ enum ibv_create_cq_wc_flags { IBV_WC_EX_WITH_FLOW_TAG = 1 << 9, IBV_WC_EX_WITH_TM_INFO = 1 << 10, IBV_WC_EX_WITH_COMPLETION_TIMESTAMP_WALLCLOCK = 1 << 11, + IBV_WC_EX_WITH_IMM64 = 1 << 12, + IBV_WC_EX_WITH_SRC_ID = 1 << 13, /* implies job id */ }; enum { @@ -1659,6 +1661,9 @@ struct ibv_cq_ex { void (*read_tm_info)(struct ibv_cq_ex *current, struct ibv_wc_tm_info *tm_info); uint64_t (*read_completion_wallclock_ns)(struct ibv_cq_ex *current); + __be64 (*read_imm64_data)(struct ibv_cq_ex *current); + uint64_t (*read_job_id)(struct ibv_cq_ex *current); + uint32_t (*read_src_id)(struct ibv_cq_ex *current); }; static inline struct ibv_cq *ibv_cq_ex_to_cq(struct ibv_cq_ex *cq) @@ -1717,6 +1722,11 @@ static inline __be32 ibv_wc_read_imm_data(struct ibv_cq_ex *cq) return cq->read_imm_data(cq); } +static inline __be64 ibv_wc_read_imm64_data(struct ibv_cq_ex *cq) +{ + return cq->read_imm64_data(cq); +} + static inline uint32_t ibv_wc_read_invalidated_rkey(struct ibv_cq_ex *cq) { #ifdef __CHECKER__ @@ -1736,6 +1746,16 @@ static inline uint32_t ibv_wc_read_src_qp(struct ibv_cq_ex *cq) return cq->read_src_qp(cq); } +static inline uint64_t ibv_wc_read_job_id(struct ibv_cq_ex *cq) +{ + return cq->read_job_id(cq); +} + +static inline uint32_t ibv_wc_read_src_id(struct ibv_cq_ex *cq) +{ + return cq->read_src_id(cq); +} + static inline unsigned int ibv_wc_read_wc_flags(struct ibv_cq_ex *cq) { return cq->read_wc_flags(cq); From 7279497b7410a2b222d110f83404e7d0048c0243 Mon Sep 17 00:00:00 2001 From: Sean Hefty Date: Wed, 15 Oct 2025 11:00:58 -0700 Subject: [PATCH 08/16] libibverbs: Support memory registrations for UET The UET protocol and devices support advanced features for memory regions. From the viewpoint of the protocol, an rkey is 64-bits, with specific meaning applied to several of the bits. Struct ibv_mr is extended to report a 64-bit rkey. Providers are expected to set the 32-bit rkey and/or rkey64 field in struct ibv_mr correctly based on the transports supported by the device. A second protocol feature is that a MR may be restricted to being accessible by a specific job. Since a UET QP may be used to communicate with multiple jobs simultaneously, the memory registration call is expanded to allow associating a job key with a MR. 
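A possible registration sketch. This assumes an ibv_reg_mr_ex() style entry point taking struct ibv_mr_init_attr (matching the reg_mr_ex provider operation referenced elsewhere in this series); field names other than jkey and the rkey64 output are illustrative:

    static struct ibv_mr *reg_job_mr(struct ibv_pd *pd, void *buf, size_t len,
                                     struct ibv_job_key *jkey)
    {
        struct ibv_mr_init_attr attr = {
            .comp_mask = IBV_REG_MR_MASK_JKEY,
            .addr = buf,
            .length = len,
            .access = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE,
            .jkey = jkey,          /* restrict remote access to this job */
        };
        struct ibv_mr *mr;

        mr = ibv_reg_mr_ex(pd, &attr);
        /* On success, UET peers reference the region through mr->rkey64
         * rather than the legacy 32-bit mr->rkey. */
        return mr;
    }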
Signed-off-by: Sean Hefty --- libibverbs/verbs.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/libibverbs/verbs.h b/libibverbs/verbs.h index a5762e6ba..3fd45a9a1 100644 --- a/libibverbs/verbs.h +++ b/libibverbs/verbs.h @@ -686,6 +686,7 @@ struct ibv_mr { uint32_t handle; uint32_t lkey; uint32_t rkey; + uint64_t rkey64; }; enum ibv_mr_init_attr_mask { @@ -694,6 +695,7 @@ IBV_REG_MR_MASK_FD = 1 << 2, IBV_REG_MR_MASK_FD_OFFSET = 1 << 3, IBV_REG_MR_MASK_DMAH = 1 << 4, + IBV_REG_MR_MASK_JKEY = 1 << 5, }; struct ibv_mr_init_attr { @@ -705,6 +707,7 @@ int fd; uint64_t fd_offset; struct ibv_dmah *dmah; + struct ibv_job_key *jkey; }; enum ibv_mw_type {
From 7794cae7963198fd2a5e355d313ea248e85e5949 Mon Sep 17 00:00:00 2001 From: Sean Hefty Date: Wed, 30 Jul 2025 14:24:01 -0700 Subject: [PATCH 09/16] libibverbs: Support adjustable QP msg and data semantics
UET defines multiple packet delivery modes:
ROD - reliable, ordered delivery
RUD - reliable, unordered delivery
RUDI - reliable, unordered delivery for idempotent transfers
UUD - unreliable, unordered delivery
The packet delivery modes impact how out of order packets are handled at the receiver, retry mechanisms, multi-pathing support, and congestion control algorithms, among other behavior. A single UET QP may use multiple packet delivery modes simultaneously based on the application data transfer being performed. Even traditional RDMA protocols are evolving to allow greater flexibility in how message and data ordering are delivered at the receiver.
This patch introduces a new QP attribute structure called QP semantics. This structure defines the message and data ordering requirements that a QP must implement. If a QP cannot meet the requested semantics, QP creation should fail, but a vendor can always provide stronger guarantees than those requested by the user.
QP semantics indicate whether the QP must provide message and data ordering guarantees, such as write-after-write, read-after-write, send-after-write, etc. Traditionally, these ordering guarantees were defined by the relevant RDMA specifications, and users of the libibverbs API needed to know to reference those specs in order to use a QP correctly (such as when to fence data transfers). As an alternative, a new device level query call is added, which can return the supported ordering guarantees for a given QP type over a specific transport.
The QP semantics may optionally be passed into the create QP operation. After querying for supported semantics, applications can remove unneeded ordering guarantees in order to leverage available network features (such as multipath support). This allows vendors to adjust transport behavior accordingly. For example, UET can leverage ROD when sending messages, but use RUD or RUDI for RDMA transfers.
Data ordering between messages is further defined to indicate the maximum transfer size for which ordering holds. For example, RDMA write-after-read ordering may be restricted to single MTU transfers.
Finally, as a 'fix' for MTU sizes being forced to a power of 2, a max_pdu attribute is introduced. The max PDU reports the maximum size of *user* data that can be carried in a single transport packet. The max PDU is relative to the port MTU, minus protocol headers.
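A rough sketch of the intended query-then-create flow (error handling trimmed):

    static int relax_ordering(struct ibv_context *ctx,
                              struct ibv_ah_attr *ah_attr,
                              struct ibv_qp_init_attr_ex *init_attr,
                              struct ibv_qp_semantics *sem)
    {
        int ret;

        memset(sem, 0, sizeof(*sem));
        ret = ibv_query_qp_semantics(ctx, IBV_QPT_RU, ah_attr,
                                     sem, sizeof(*sem));
        if (ret)
            return ret;

        /* Keep send ordering (e.g. for tag matching), but drop RDMA
         * write-after-write/write-after-read so the provider is free to
         * use RUD/RUDI and spread packets across multiple paths. */
        sem->msg_order &= ~(IBV_ORDER_RDMA_WAW | IBV_ORDER_RDMA_WAR);

        init_attr->comp_mask |= IBV_QP_INIT_ATTR_QP_SEMANTICS;
        init_attr->qp_semantics = sem;
        return 0;
    }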
Signed-off-by: Sean Hefty --- libibverbs/verbs.h | 55 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 55 insertions(+) diff --git a/libibverbs/verbs.h b/libibverbs/verbs.h index 3fd45a9a1..e4cff1cdd 100644 --- a/libibverbs/verbs.h +++ b/libibverbs/verbs.h @@ -114,6 +114,39 @@ enum ibv_transport_type { IBV_TRANSPORT_UNSPECIFIED, }; +enum ibv_qp_msg_order { + /* Atomic-Atomic Rd/Wr ordering */ + IBV_ORDER_ATOMIC_RAR = (1 << 0), + IBV_ORDER_ATOMIC_RAW = (1 << 1), + IBV_ORDER_ATOMIC_WAR = (1 << 2), + IBV_ORDER_ATOMIC_WAW = (1 << 3), + /* RDMA-RDMA Rd/Wr ordering */ + IBV_ORDER_RDMA_RAR = (1 << 4), + IBV_ORDER_RDMA_RAW = (1 << 5), + IBV_ORDER_RDMA_WAR = (1 << 6), + IBV_ORDER_RDMA_WAW = (1 << 7), + /* Send ordering wrt Atomic and RDMA Rd/Wr */ + IBV_ORDER_RAS = (1 << 8), + IBV_ORDER_SAR = (1 << 9), + IBV_ORDER_SAS = (1 << 10), + IBV_ORDER_SAW = (1 << 11), + IBV_ORDER_WAS = (1 << 12), + /* Atomic and RDMA Rd/Wr ordering */ + IBV_ORDER_RAR = (1 << 13), + IBV_ORDER_RAW = (1 << 14), + IBV_ORDER_WAR = (1 << 15), + IBV_ORDER_WAW = (1 << 16), +}; + +struct ibv_qp_semantics { + uint32_t comp_mask; + uint32_t msg_order; + uint32_t max_rdma_raw_size; + uint32_t max_rdma_war_size; + uint32_t max_rdma_waw_size; + uint32_t max_pdu; +}; + enum ibv_device_cap_flags { IBV_DEVICE_RESIZE_MAX_WR = 1, IBV_DEVICE_BAD_PKEY_CNTR = 1 << 1, @@ -971,6 +1004,7 @@ enum ibv_qp_init_attr_mask { IBV_QP_INIT_ATTR_RX_HASH = 1 << 5, IBV_QP_INIT_ATTR_SEND_OPS_FLAGS = 1 << 6, IBV_QP_INIT_ATTR_QP_ATTR = 1 << 7, + IBV_QP_INIT_ATTR_QP_SEMANTICS = 1 << 8, }; enum ibv_qp_create_flags { @@ -1028,6 +1062,7 @@ struct ibv_qp_init_attr_ex { struct ibv_qp_attr *qp_attr; int qp_attr_mask; + struct ibv_qp_semantics *qp_semantics; }; enum ibv_qp_open_attr_mask { @@ -2316,6 +2351,11 @@ struct ibv_values_ex { struct verbs_context { /* "grows up" - new fields go here */ + int (*query_qp_semantics)(struct ibv_context *context, + enum ibv_qp_type qp_type, + struct ibv_ah_attr *ah_attr, + struct ibv_qp_semantics *qp_semantics, + size_t qp_semantic_len); struct ibv_mr *(*reg_mr_ex)(struct ibv_pd *pd, struct ibv_mr_init_attr *mr_init_attr); int (*dealloc_dmah)(struct ibv_dmah *dmah); @@ -2658,6 +2698,21 @@ int ibv_query_pkey(struct ibv_context *context, uint8_t port_num, int ibv_get_pkey_index(struct ibv_context *context, uint8_t port_num, __be16 pkey); +static inline int ibv_query_qp_semantics(struct ibv_context *context, + enum ibv_qp_type qp_type, + struct ibv_ah_attr *ah_attr, + struct ibv_qp_semantics *qp_semantics, + size_t qp_semantic_len) +{ + struct verbs_context *vctx = verbs_get_ctx_op(context, query_qp_semantics); + + if (!vctx) + return EOPNOTSUPP; + + return vctx->query_qp_semantics(context, qp_type, ah_attr, + qp_semantics, qp_semantic_len); +} + /** * ibv_alloc_pd - Allocate a protection domain */ From 388704eda586b925332f60e0da31a565f41ea23a Mon Sep 17 00:00:00 2001 From: Sean Hefty Date: Tue, 29 Jul 2025 14:49:24 -0700 Subject: [PATCH 10/16] libibverbs: Allow provider to describe immediate data limits Legacy RDMA transports are restricted to 32-bits of immediate data, while UET supports 64-bits. Additionally, UET does not require that RDMA writes with immediate consume a posted receive buffer at the target. The spec even goes so far as to mandate that RDMA traffic be treated separately at the target than send operations; however, such a mandate is not visible in the transport and places restrictions on the NIC implementation. NICs that support multiple protocols, including UET, may be optimized for legacy RDMA support. 
For example, CQ entries may only be able to store 32-bits of immediate data. To handle different implementations and transports, we extend the QP semantics structure to report the immediate data size, as well as implementation constraints, such as the need to consume a posted receive buffer. This change has the added advantage that it is now possible for a user to indicate that immediate data will not be used by setting the size to 0 when creating the QP. For devices which support a smaller immediate data size than that carried by the transport, truncated immediate data is extended with 0s when writing to the wire, and completions report the lowest valid bits.
The QP semantics structure is extended with a new usage_flags field. These flags allow providers to communicate HW usage constraints to applications, allowing greater flexibility in implementations. When set, IBV_QP_USAGE_IMM_DATA_RQ indicates that RDMA writes with immediate data will consume a posted receive buffer on the QP. This is standard behavior for legacy RDMA transports, but not for UET. By setting this flag, a provider can indicate this as their default requirement even when using UET QPs.
Signed-off-by: Sean Hefty --- libibverbs/verbs.h | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/libibverbs/verbs.h b/libibverbs/verbs.h index e4cff1cdd..5d852ac85 100644 --- a/libibverbs/verbs.h +++ b/libibverbs/verbs.h @@ -138,6 +138,10 @@ enum ibv_qp_msg_order { IBV_ORDER_WAW = (1 << 16), }; +enum ibv_qp_use_flags { + IBV_QP_USAGE_IMM_DATA_RQ = (1 << 0), +}; + struct ibv_qp_semantics { uint32_t comp_mask; uint32_t msg_order; @@ -145,6 +149,8 @@ struct ibv_qp_semantics { uint32_t max_rdma_war_size; uint32_t max_rdma_waw_size; uint32_t max_pdu; + uint8_t imm_data_size; + unsigned int usage_flags; }; enum ibv_device_cap_flags {
From a3223c2dd674f365ad3f870d52b3174f3d425e29 Mon Sep 17 00:00:00 2001 From: Sean Hefty Date: Wed, 15 Oct 2025 16:03:14 -0700 Subject: [PATCH 11/16] libibverbs: Define attaching a MR to a QP
Legacy RDMA devices immediately expose a new MR as soon as the memory registration process completes. That is, even before reg_mr() returns to the caller, the region is accessible to any QP sharing the same PD.
UET allows for greater control over access to a MR. Even once a MR has been created, exposure of the MR is treated as a separate operation. This further allows access to a MR to be revoked without the MR being destroyed, which enables a MR to be used once. E.g. the MR may be the target of a single RDMA operation, with access controlled by the owner of the MR. This behavior differs from the remote invalidate operation.
To support this additional level of control, we introduce new QP operations: attach MR and detach MR. A provider indicates that MRs must be explicitly attached to a QP through a new QP usage flag, as this behavior may be specific to a given transport protocol + QP type. E.g. UET + RU QPs may support MR attachment, but UET + UD QPs may not (since the feature is not required). Both support for attaching a MR to a QP and the requirement to do so are indicated by the IBV_QP_USAGE_ATTACH_MR usage flag.
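A use-once flow might look as follows (sketch only; sem holds the QP semantics previously returned by ibv_query_qp_semantics() for this QP's type and path):

    static int expose_mr_once(struct ibv_qp *qp, struct ibv_mr *mr,
                              const struct ibv_qp_semantics *sem)
    {
        int ret = 0;

        /* Only attach when the provider requires explicit attachment. */
        if (sem->usage_flags & IBV_QP_USAGE_ATTACH_MR) {
            ret = ibv_attach_mr(qp, mr);
            if (ret)
                return ret;
        }

        /* ... advertise the rkey to the peer and wait for its single RDMA
         * operation to complete ... */

        /* Revoke access without destroying the registration. */
        if (sem->usage_flags & IBV_QP_USAGE_ATTACH_MR)
            ret = ibv_detach_mr(qp, mr);
        return ret;
    }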
Signed-off-by: Sean Hefty --- libibverbs/verbs.h | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/libibverbs/verbs.h b/libibverbs/verbs.h index 5d852ac85..692494abe 100644 --- a/libibverbs/verbs.h +++ b/libibverbs/verbs.h @@ -140,6 +140,7 @@ enum ibv_qp_msg_order { enum ibv_qp_use_flags { IBV_QP_USAGE_IMM_DATA_RQ = (1 << 0), + IBV_QP_USAGE_ATTACH_MR = (1 << 1), }; struct ibv_qp_semantics { @@ -2357,6 +2358,8 @@ struct ibv_values_ex { struct verbs_context { /* "grows up" - new fields go here */ + int (*attach_mr)(struct ibv_qp *qp, struct ibv_mr *mr); + int (*detach_mr)(struct ibv_qp *qp, struct ibv_mr *mr); int (*query_qp_semantics)(struct ibv_context *context, enum ibv_qp_type qp_type, struct ibv_ah_attr *ah_attr, struct ibv_qp_semantics *qp_semantics, size_t qp_semantic_len); struct ibv_mr *(*reg_mr_ex)(struct ibv_pd *pd, struct ibv_mr_init_attr *mr_init_attr); int (*dealloc_dmah)(struct ibv_dmah *dmah); @@ -2925,6 +2928,22 @@ static inline int ibv_dealloc_mw(struct ibv_mw *mw) return mw->context->ops.dealloc_mw(mw); } +static inline int ibv_attach_mr(struct ibv_qp *qp, struct ibv_mr *mr) +{ + struct verbs_context *vctx = verbs_get_ctx_op(qp->context, attach_mr); + + if (!vctx) + return EOPNOTSUPP; + + return vctx->attach_mr(qp, mr); +} + +static inline int ibv_detach_mr(struct ibv_qp *qp, struct ibv_mr *mr) +{ + struct verbs_context *vctx = verbs_get_ctx_op(qp->context, detach_mr); + + if (!vctx) + return EOPNOTSUPP; + + return vctx->detach_mr(qp, mr); +} + /** * ibv_inc_rkey - Increase the 8 lsb in the given rkey */
From 0451f913c1b55da3a1073f1204de1c92835df348 Mon Sep 17 00:00:00 2001 From: Sean Hefty Date: Wed, 15 Oct 2025 14:41:39 -0700 Subject: [PATCH 12/16] libibverbs: Add support for user to select the rkey
UET allows for user selected rkey values to improve scalability. Expose support via a device capability flag and update memory registration accordingly.
Signed-off-by: Sean Hefty --- libibverbs/verbs.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/libibverbs/verbs.h b/libibverbs/verbs.h index 692494abe..7fe8d7945 100644 --- a/libibverbs/verbs.h +++ b/libibverbs/verbs.h @@ -193,6 +193,7 @@ enum ibv_fork_status { */ #define IBV_DEVICE_RAW_SCATTER_FCS (1ULL << 34) #define IBV_DEVICE_PCI_WRITE_END_PADDING (1ULL << 36) +#define IBV_DEVICE_USER_RKEY (1ULL << 37) enum ibv_atomic_cap { IBV_ATOMIC_NONE, @@ -736,6 +737,7 @@ enum ibv_mr_init_attr_mask { IBV_REG_MR_MASK_FD = 1 << 2, IBV_REG_MR_MASK_FD_OFFSET = 1 << 3, IBV_REG_MR_MASK_DMAH = 1 << 4, IBV_REG_MR_MASK_JKEY = 1 << 5, + IBV_REG_MR_MASK_RKEY = 1 << 6, }; struct ibv_mr_init_attr { @@ -748,6 +750,7 @@ struct ibv_mr_init_attr { uint64_t fd_offset; struct ibv_dmah *dmah; struct ibv_job_key *jkey; + uint64_t rkey; }; enum ibv_mw_type {
From 2d3fca9842e1e7814af2e3256bae652a806ec98c Mon Sep 17 00:00:00 2001 From: Sean Hefty Date: Wed, 15 Oct 2025 17:49:51 -0700 Subject: [PATCH 13/16] libibverbs: Add support for 'derived' MRs
Introduce a concept called derived memory regions. Derived MRs are similar to legacy RDMA memory windows, but are set up through the memory registration API rather than through a post send operation. Derived MRs are new MRs that are wholly contained within an existing MR (to share page mappings, for example), but have different access rights or other attributes.
For UET, a derived MR allows a MR to be associated with different jobs, with the access for each job being different, while still being able to share the underlying HW page mappings.
Applications must assume that a derived MR holds a reference on the original MR. The original MR may not be destroyed until all derived MRs have been closed.
When a MR is created, a derive_cnt field may be provided to indicate the number of expected derived MRs that an application intends to create.
This field is considered an optimization and may be ignored by the provider. Providers that do not support derived MRs may simply create a new MR without sharing resources with the original MR. A derived MR is subject to reported provider restrictions, such as IBV_QP_USAGE_ATTACH_MR. Signed-off-by: Sean Hefty --- libibverbs/verbs.h | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/libibverbs/verbs.h b/libibverbs/verbs.h index 7fe8d7945..ac6a32a80 100644 --- a/libibverbs/verbs.h +++ b/libibverbs/verbs.h @@ -738,6 +738,8 @@ enum ibv_mr_init_attr_mask { IBV_REG_MR_MASK_DMAH = 1 << 4, IBV_REG_MR_MASK_JKEY = 1 << 5, IBV_REG_MR_MASK_RKEY = 1 << 6, + IBV_REG_MR_MASK_CUR_MR = 1 << 7, + IBV_REG_MR_MASK_DERIVE_CNT = 1 << 8, }; struct ibv_mr_init_attr { @@ -751,6 +753,8 @@ struct ibv_mr_init_attr { struct ibv_dmah *dmah; struct ibv_job_key *jkey; uint64_t rkey; + struct ibv_mr *cur_mr; + uint32_t derive_cnt; }; enum ibv_mw_type { From bba2936255d2b1e15f48f17e6e7b4778d2d40ecc Mon Sep 17 00:00:00 2001 From: Sean Hefty Date: Thu, 16 Oct 2025 12:54:39 -0700 Subject: [PATCH 14/16] libibverbs: Add UET initiator setting The UET initiator is equivalent to an MPI rank or CCL communicator ID. It is a user settable value used for tag matching purposes. UET carries the initiator field directly in the transport header. Extend the initiator QP attributes to allow user to set the value. We use the more generic term, src_id, instead of the UET specific term. The naming is aligned with src_qpn in ibv_wc. Signed-off-by: Sean Hefty --- libibverbs/verbs.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/libibverbs/verbs.h b/libibverbs/verbs.h index ac6a32a80..4d870e17a 100644 --- a/libibverbs/verbs.h +++ b/libibverbs/verbs.h @@ -1019,6 +1019,7 @@ enum ibv_qp_init_attr_mask { IBV_QP_INIT_ATTR_SEND_OPS_FLAGS = 1 << 6, IBV_QP_INIT_ATTR_QP_ATTR = 1 << 7, IBV_QP_INIT_ATTR_QP_SEMANTICS = 1 << 8, + IBV_QP_INIT_ATTR_SRC_ID = 1 << 9, }; enum ibv_qp_create_flags { @@ -1077,6 +1078,7 @@ struct ibv_qp_init_attr_ex { struct ibv_qp_attr *qp_attr; int qp_attr_mask; struct ibv_qp_semantics *qp_semantics; + uint32_t src_id; }; enum ibv_qp_open_attr_mask { From 85cc0e79eba4725f4551fffe218e93ed913fe29b Mon Sep 17 00:00:00 2001 From: Sean Hefty Date: Thu, 16 Oct 2025 14:50:53 -0700 Subject: [PATCH 15/16] libibverbs: Extend ibv_wq to support UET resource index UET associates multiple receive queues with a single queue pair. In UET terms, a QP maps to a PIDonFEP, and the receive queues are known as resource indices. Resource indices allow for receive side resources to be separated, such that they may be dedicated to separate services (e.g. MPI, CCL, storage). To support separate resources, we reuse the verbs work queue objects (ibv_wq). The API is extended slightly for UET. First, we add an extended device attribute, max_rqw_per_qp, to limit the number of WQs which may be associated with a QP. Secondly, we extend the WQ attributes to allow the user to select the wq_num (i.e. UET resource index) associated with a WQ. It is the responsibility of higher-level SW to allocate, configure, and associate WQs with QPs, so that the QP is assigned the correct number of WQs with the necessary addresses. 
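A creation sketch (wq_num as proposed here; on the send side, a peer targets the index with ibv_wr_set_wq_num()):

    static struct ibv_wq *create_resource_index(struct ibv_context *ctx,
                                                struct ibv_pd *pd,
                                                struct ibv_cq *cq,
                                                uint32_t resource_index)
    {
        struct ibv_wq_init_attr attr = {
            .wq_type = IBV_WQT_RQ,
            .max_wr = 256,
            .max_sge = 1,
            .pd = pd,
            .cq = cq,
            .comp_mask = IBV_WQ_INIT_ATTR_WQ_NUM,
            /* e.g. one index per service: MPI, CCL, storage, ... */
            .wq_num = resource_index,
        };

        /* Associating the returned WQ with the RU QP is handled by
         * higher-level software and the provider, as described above. */
        return ibv_create_wq(ctx, &attr);
    }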
Signed-off-by: Sean Hefty --- libibverbs/verbs.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/libibverbs/verbs.h b/libibverbs/verbs.h index 4d870e17a..bcf0f3ab7 100644 --- a/libibverbs/verbs.h +++ b/libibverbs/verbs.h @@ -407,6 +407,7 @@ struct ibv_device_attr_ex { uint32_t max_job_ids; uint32_t max_addr_entries; uint32_t max_jkeys_per_pd; + uint16_t max_rwq_per_qp; }; enum ibv_mtu { @@ -906,6 +907,7 @@ enum ibv_wq_type { enum ibv_wq_init_attr_mask { IBV_WQ_INIT_ATTR_FLAGS = 1 << 0, IBV_WQ_INIT_ATTR_RESERVED = 1 << 1, + IBV_WQ_INIT_ATTR_WQ_NUM = 1 << 2, }; enum ibv_wq_flags { @@ -925,6 +927,7 @@ struct ibv_wq_init_attr { struct ibv_cq *cq; uint32_t comp_mask; /* Use ibv_wq_init_attr_mask */ uint32_t create_flags; /* use ibv_wq_flags */ + uint32_t wq_num; }; enum ibv_wq_state {
From 942bee0f60468350876f1a14444f5cbc513b01fd Mon Sep 17 00:00:00 2001 From: Sean Hefty Date: Fri, 24 Oct 2025 14:33:41 -0700 Subject: [PATCH 16/16] libibverbs: Update API documentation with UET job concepts
Include descriptions of the new objects introduced for UET: job, jkey, and address table, alongside the existing verbs semantic construct definitions.
Signed-off-by: Sean Hefty --- Documentation/libibverbs.md | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+) diff --git a/Documentation/libibverbs.md b/Documentation/libibverbs.md index 0f7984382..902b2d45d 100644 --- a/Documentation/libibverbs.md +++ b/Documentation/libibverbs.md @@ -349,3 +349,29 @@ Items are ordered conceptually. Applications may interact directly with the async_fd, such as waiting on the fd via select/poll, to receive notifications when an async event has been reported. +
+*Job ID* +: A job ID identifies a single distributed application. The job object + is a device-level object that maps to a job ID and may be shared between + processes. The configuration of a job object, such as assigning its + job ID value, is considered a privileged operation. + + Multiple job objects, each assigned the same job ID value, may be needed + to represent a single, higher-level logical job running on the network. + This may be necessary for jobs that span multiple RDMA devices, for + example, where each job object may be configured for different source + addressing. +
+*Job Key* +: A job key associates a job object with a specific protection domain. This + provides secure access to the actual job ID value stored with the job + object, while restricting which memory regions data transfers to / from + that job may access. +
+*Address Table* +: An address table is a virtual address array associated with a job object. + The address table allows local processes that belong to the same job to + share addressing and scalable encryption information for peer QPs. + + The address table is an optional but integrated component of a job + object.
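
A minimal sketch of rank-based addressing, using the job and work request calls proposed earlier in this series (the job, jkey, qpx, mr, and peer variables are assumed to already exist):

```c
/* The launcher (or any permitted job member) stores each peer's address at
 * its rank index; senders then name a peer by job + index instead of caching
 * per-peer address handles. */
int ret = ibv_insert_addr(job, peer_qpn, peer_ah_attr, peer_rank, 0);

if (!ret) {
	ibv_wr_start(qpx);
	ibv_wr_send(qpx);
	ibv_wr_set_job_addr(qpx, peer_rank, jkey->jkey); /* peer = job + index */
	ibv_wr_set_sge(qpx, mr->lkey, (uintptr_t)buf, len);
	ret = ibv_wr_complete(qpx);
}
```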