Commit 4be4f6a

Merge: mlx5: drivers update up to Linux v6.12
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5392

Hi all,

This MR includes updates for the mlx5 drivers.
It has backports of changes up to kernel v6.12.

JIRA: https://issues.redhat.com/browse/RHEL-52869
JIRA: https://issues.redhat.com/browse/RHEL-52874
JIRA: https://issues.redhat.com/browse/RHEL-52876
JIRA: https://issues.redhat.com/browse/RHEL-69658
JIRA: https://issues.redhat.com/browse/RHEL-69680
CVE: CVE-2024-53120
CVE: CVE-2024-53121

Omitted-fix: de88df0 ("net/smc: Fix lookup of netdev by using ib_device_get_netdev()")
Commit de88df0 is a fix for commit 5490357 which is not backported in this branch.

All patches are accepted upstream in Linus' tree.
Each patch commit message describes its origin.

This patch set passed incremental build testing to verify that it is bisectable.

Sanity tests ran over mlx5 drivers on x86_64 systems (using ConnectX-4/5/6), including the following:

Ethernet:
-- IPv4 traffic (ICMP, TCP, UDP).
-- IPv6 traffic (ICMP, TCP, UDP).

VLAN:
-- IPv4 traffic (ICMP, TCP, UDP).
-- IPv6 traffic (ICMP, TCP, UDP).

RoCE:
-- RDMA (ibv_*_pingpong).
-- RDMACM (examples that comes with librdmacm packages).

Infiniband:
-- RDMA (ibv_*_pingpong).
-- RDMACM (examples that comes with librdmacm packages).

IPoIB:
-- IPv4 traffic (ICMP, TCP, UDP).
-- IPv6 traffic (ICMP, TCP, UDP).

PKey:
-- IPv4 traffic (ICMP, TCP, UDP).
-- IPv6 traffic (ICMP, TCP, UDP).

ASAP2/OVS:
-- Various sanity tests covering OVS offloads.

NFSoRDMA:
-- Discover, mount and write.

iSER:
-- Discover, login and mount.

SRP:
-- Verify srp_daemon service is up and system can discover SRP targets.

Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
Approved-by: Kamal Heib <kheib@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>
Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2 parents e9985da + 61a1915 commit 4be4f6a

188 files changed

+25489
-2612
lines changed

Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst

Lines changed: 42 additions & 9 deletions
@@ -189,29 +189,51 @@ the software port.
189189

190190
* - `rx[i]_gro_packets`
191191
- Number of received packets processed using hardware-accelerated GRO. The
192-
number of hardware GRO offloaded packets received on ring i.
192+
number of hardware GRO offloaded packets received on ring i. Only true GRO
193+
packets are counted: only packets that are in an SKB with a GRO count > 1.
193194
- Acceleration
194195

195196
* - `rx[i]_gro_bytes`
196197
- Number of received bytes processed using hardware-accelerated GRO. The
197-
number of hardware GRO offloaded bytes received on ring i.
198+
number of hardware GRO offloaded bytes received on ring i. Only true GRO
199+
packets are counted: only packets that are in an SKB with a GRO count > 1.
198200
- Acceleration
199201

200202
* - `rx[i]_gro_skbs`
201-
- The number of receive SKBs constructed while performing
202-
hardware-accelerated GRO.
203-
- Informative
204-
205-
* - `rx[i]_gro_match_packets`
206-
- Number of received packets processed using hardware-accelerated GRO that
207-
met the flow table match criteria.
203+
- The number of GRO SKBs constructed from hardware-accelerated GRO. Only SKBs
204+
with a GRO count > 1 are counted.
208205
- Informative
209206

210207
* - `rx[i]_gro_large_hds`
211208
- Number of receive packets using hardware-accelerated GRO that have large
212209
headers that require additional memory to be allocated.
213210
- Informative
214211

212+
* - `rx[i]_hds_nodata_packets`
213+
- Number of header only packets in header/data split mode [#accel]_.
214+
- Informative
215+
216+
* - `rx[i]_hds_nodata_bytes`
217+
- Number of bytes for header only packets in header/data split mode
218+
[#accel]_.
219+
- Informative
220+
221+
* - `rx[i]_hds_nosplit_packets`
222+
- Number of packets that were not split in header/data split mode. A
223+
packet will not get split when the hardware does not support its
224+
protocol splitting. An example such a protocol is ICMPv4/v6. Currently
225+
TCP and UDP with IPv4/IPv6 are supported for header/data split
226+
[#accel]_.
227+
- Informative
228+
229+
* - `rx[i]_hds_nosplit_bytes`
230+
- Number of bytes for packets that were not split in header/data split
231+
mode. A packet will not get split when the hardware does not support its
232+
protocol splitting. An example such a protocol is ICMPv4/v6. Currently
233+
TCP and UDP with IPv4/IPv6 are supported for header/data split
234+
[#accel]_.
235+
- Informative
236+
215237
* - `rx[i]_lro_packets`
216238
- The number of LRO packets received on ring i [#accel]_.
217239
- Acceleration
@@ -300,6 +322,11 @@ the software port.
300322
in the beginning of the queue. This is a normal condition.
301323
- Informative
302324

325+
* - `tx[i]_timestamps`
326+
- Transmitted packets that were hardware timestamped at the device's DMA
327+
layer.
328+
- Informative
329+
303330
* - `tx[i]_added_vlan_packets`
304331
- The number of packets sent where vlan tag insertion was offloaded to the
305332
hardware.
@@ -702,6 +729,12 @@ the software port.
702729
the device typically ensures not posting the CQE.
703730
- Error
704731

732+
* - `ptp_cq[i]_lost_cqe`
733+
- Number of times a CQE is expected to not be delivered on the PTP
734+
timestamping CQE by the device due to a time delta elapsing. If such a
735+
CQE is somehow delivered, `ptp_cq[i]_late_cqe` is incremented.
736+
- Error
737+
705738
.. [#ring_global] The corresponding ring and global counters do not share the
706739
same name (i.e. do not follow the common naming scheme).
707740

Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst

Lines changed: 3 additions & 0 deletions
@@ -130,6 +130,9 @@ Enabling the driver and kconfig options
130130

131131
| Build support for software-managed steering in the NIC.
132132
133+
**CONFIG_MLX5_HW_STEERING=(y/n)**
134+
135+
| Build support for hardware-managed steering in the NIC.
133136
134137
**CONFIG_MLX5_TC_CT=(y/n)**
135138

Documentation/networking/devlink/mlx5.rst

Lines changed: 4 additions & 0 deletions
@@ -97,6 +97,10 @@ parameters.
9797

9898
When metadata is disabled, the above use cases will fail to initialize if
9999
users try to enable them.
100+
101+
Note: Setting this parameter does not take effect immediately. Setting
102+
must happen in legacy mode and eswitch port metadata takes effect after
103+
enabling switchdev mode.
100104
* - ``hairpin_num_queues``
101105
- u32
102106
- driverinit

Documentation/networking/index.rst

Lines changed: 1 addition & 0 deletions
@@ -73,6 +73,7 @@ Contents:
7373
mpls-sysctl
7474
mptcp-sysctl
7575
multiqueue
76+
multi-pf-netdev
7677
net_cachelines/index
7778
netconsole
7879
netdev-features
Lines changed: 174 additions & 0 deletions
@@ -0,0 +1,174 @@
1+
.. SPDX-License-Identifier: GPL-2.0
2+
.. include:: <isonum.txt>
3+
4+
===============
5+
Multi-PF Netdev
6+
===============
7+
8+
Contents
9+
========
10+
11+
- `Background`_
12+
- `Overview`_
13+
- `mlx5 implementation`_
14+
- `Channels distribution`_
15+
- `Observability`_
16+
- `Steering`_
17+
- `Mutually exclusive features`_
18+
19+
Background
20+
==========
21+
22+
The Multi-PF NIC technology enables several CPUs within a multi-socket server to connect directly to
23+
the network, each through its own dedicated PCIe interface. Through either a connection harness that
24+
splits the PCIe lanes between two cards or by bifurcating a PCIe slot for a single card. This
25+
results in eliminating the network traffic traversing over the internal bus between the sockets,
26+
significantly reducing overhead and latency, in addition to reducing CPU utilization and increasing
27+
network throughput.
28+
29+
Overview
30+
========
31+
32+
The feature adds support for combining multiple PFs of the same port in a Multi-PF environment under
33+
one netdev instance. It is implemented in the netdev layer. Lower-layer instances like pci func,
34+
sysfs entry, and devlink are kept separate.
35+
Passing traffic through different devices belonging to different NUMA sockets saves cross-NUMA
36+
traffic and allows apps running on the same netdev from different NUMAs to still feel a sense of
37+
proximity to the device and achieve improved performance.
38+
39+
mlx5 implementation
40+
===================
41+
42+
Multi-PF or Socket-direct in mlx5 is achieved by grouping PFs together which belong to the same
43+
NIC and has the socket-direct property enabled, once all PFs are probed, we create a single netdev
44+
to represent all of them, symmetrically, we destroy the netdev whenever any of the PFs is removed.
45+
46+
The netdev network channels are distributed between all devices, a proper configuration would utilize
47+
the correct close NUMA node when working on a certain app/CPU.
48+
49+
We pick one PF to be a primary (leader), and it fills a special role. The other devices
50+
(secondaries) are disconnected from the network at the chip level (set to silent mode). In silent
51+
mode, no south <-> north traffic flowing directly through a secondary PF. It needs the assistance of
52+
the leader PF (east <-> west traffic) to function. All Rx/Tx traffic is steered through the primary
53+
to/from the secondaries.
54+
55+
Currently, we limit the support to PFs only, and up to two PFs (sockets).
56+
57+
Channels distribution
58+
=====================
59+
60+
We distribute the channels between the different PFs to achieve local NUMA node performance
61+
on multiple NUMA nodes.
62+
63+
Each combined channel works against one specific PF, creating all its datapath queues against it. We
64+
distribute channels to PFs in a round-robin policy.
65+
66+
::
67+
68+
Example for 2 PFs and 5 channels:
69+
+--------+--------+
70+
| ch idx | PF idx |
71+
+--------+--------+
72+
| 0 | 0 |
73+
| 1 | 1 |
74+
| 2 | 0 |
75+
| 3 | 1 |
76+
| 4 | 0 |
77+
+--------+--------+
78+
79+
80+
The reason we prefer round-robin is, it is less influenced by changes in the number of channels. The
81+
mapping between a channel index and a PF is fixed, no matter how many channels the user configures.
82+
As the channel stats are persistent across channel's closure, changing the mapping every single time
83+
would turn the accumulative stats less representing of the channel's history.
84+
85+
This is achieved by using the correct core device instance (mdev) in each channel, instead of them
86+
all using the same instance under "priv->mdev".
87+
88+
Observability
89+
=============
90+
The relation between PF, irq, napi, and queue can be observed via netlink spec::
91+
92+
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump queue-get --json='{"ifindex": 13}'
93+
[{'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'rx'},
94+
{'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'rx'},
95+
{'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'rx'},
96+
{'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'rx'},
97+
{'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'rx'},
98+
{'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'tx'},
99+
{'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'tx'},
100+
{'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'tx'},
101+
{'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'tx'},
102+
{'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'tx'}]
103+
104+
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump napi-get --json='{"ifindex": 13}'
105+
[{'id': 543, 'ifindex': 13, 'irq': 42},
106+
{'id': 542, 'ifindex': 13, 'irq': 41},
107+
{'id': 541, 'ifindex': 13, 'irq': 40},
108+
{'id': 540, 'ifindex': 13, 'irq': 39},
109+
{'id': 539, 'ifindex': 13, 'irq': 36}]
110+
111+
Here you can clearly observe our channels distribution policy::
112+
113+
$ ls /proc/irq/{36,39,40,41,42}/mlx5* -d -1
114+
/proc/irq/36/mlx5_comp1@pci:0000:08:00.0
115+
/proc/irq/39/mlx5_comp1@pci:0000:09:00.0
116+
/proc/irq/40/mlx5_comp2@pci:0000:08:00.0
117+
/proc/irq/41/mlx5_comp2@pci:0000:09:00.0
118+
/proc/irq/42/mlx5_comp3@pci:0000:08:00.0
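
Reviewer sketch (not part of the commit): the three listings above can be joined by hand, or with a small script such as the one below. The data is copied from the example output. The function name `queue_to_pci` is hypothetical, and splitting the IRQ directory name on `@pci:` is an assumption based only on the names shown in this example.

```python
# Data transcribed from the queue-get, napi-get and /proc/irq listings above.
queues = [
    {"id": i, "ifindex": 13, "napi-id": 539 + i, "type": qtype}
    for qtype in ("rx", "tx")
    for i in range(5)
]
napis = [
    {"id": 543, "ifindex": 13, "irq": 42},
    {"id": 542, "ifindex": 13, "irq": 41},
    {"id": 541, "ifindex": 13, "irq": 40},
    {"id": 540, "ifindex": 13, "irq": 39},
    {"id": 539, "ifindex": 13, "irq": 36},
]
irq_names = {
    36: "mlx5_comp1@pci:0000:08:00.0",
    39: "mlx5_comp1@pci:0000:09:00.0",
    40: "mlx5_comp2@pci:0000:08:00.0",
    41: "mlx5_comp2@pci:0000:09:00.0",
    42: "mlx5_comp3@pci:0000:08:00.0",
}


def queue_to_pci(queues, napis, irq_names, qtype="rx"):
    """Map each queue id of the given type to the PCI address behind its IRQ."""
    napi_irq = {n["id"]: n["irq"] for n in napis}
    result = {}
    for q in queues:
        if q["type"] != qtype:
            continue
        irq = napi_irq[q["napi-id"]]
        # Assumed name layout: "<irq name>@pci:<PCI address>"
        result[q["id"]] = irq_names[irq].split("@pci:", 1)[1]
    return result


if __name__ == "__main__":
    for qid, pci in queue_to_pci(queues, napis, irq_names).items():
        print(f"rx queue {qid} -> {pci}")
```

With the example data, even-numbered queues resolve to `0000:08:00.0` and odd-numbered queues to `0000:09:00.0`, which is the alternating pattern the round-robin policy predicts.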
119+
120+
Steering
121+
========
122+
Secondary PFs are set to "silent" mode, meaning they are disconnected from the network.
123+
124+
In Rx, the steering tables belong to the primary PF only, and it is its role to distribute incoming
125+
traffic to other PFs, via cross-vhca steering capabilities. Still maintain a single default RSS table,
126+
that is capable of pointing to the receive queues of a different PF.
127+
128+
In Tx, the primary PF creates a new Tx flow table, which is aliased by the secondaries, so they can
129+
go out to the network through it.
130+
131+
In addition, we set default XPS configuration that, based on the CPU, selects an SQ belonging to the
132+
PF on the same node as the CPU.
133+
134+
XPS default config example:
135+
136+
NUMA node(s): 2
137+
NUMA node0 CPU(s): 0-11
138+
NUMA node1 CPU(s): 12-23
139+
140+
PF0 on node0, PF1 on node1.
141+
142+
- /sys/class/net/eth2/queues/tx-0/xps_cpus:000001
143+
- /sys/class/net/eth2/queues/tx-1/xps_cpus:001000
144+
- /sys/class/net/eth2/queues/tx-2/xps_cpus:000002
145+
- /sys/class/net/eth2/queues/tx-3/xps_cpus:002000
146+
- /sys/class/net/eth2/queues/tx-4/xps_cpus:000004
147+
- /sys/class/net/eth2/queues/tx-5/xps_cpus:004000
148+
- /sys/class/net/eth2/queues/tx-6/xps_cpus:000008
149+
- /sys/class/net/eth2/queues/tx-7/xps_cpus:008000
150+
- /sys/class/net/eth2/queues/tx-8/xps_cpus:000010
151+
- /sys/class/net/eth2/queues/tx-9/xps_cpus:010000
152+
- /sys/class/net/eth2/queues/tx-10/xps_cpus:000020
153+
- /sys/class/net/eth2/queues/tx-11/xps_cpus:020000
154+
- /sys/class/net/eth2/queues/tx-12/xps_cpus:000040
155+
- /sys/class/net/eth2/queues/tx-13/xps_cpus:040000
156+
- /sys/class/net/eth2/queues/tx-14/xps_cpus:000080
157+
- /sys/class/net/eth2/queues/tx-15/xps_cpus:080000
158+
- /sys/class/net/eth2/queues/tx-16/xps_cpus:000100
159+
- /sys/class/net/eth2/queues/tx-17/xps_cpus:100000
160+
- /sys/class/net/eth2/queues/tx-18/xps_cpus:000200
161+
- /sys/class/net/eth2/queues/tx-19/xps_cpus:200000
162+
- /sys/class/net/eth2/queues/tx-20/xps_cpus:000400
163+
- /sys/class/net/eth2/queues/tx-21/xps_cpus:400000
164+
- /sys/class/net/eth2/queues/tx-22/xps_cpus:000800
165+
- /sys/class/net/eth2/queues/tx-23/xps_cpus:800000
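
Reviewer sketch (not part of the commit): the masks listed above can be reproduced with the following Python model. It is not the driver's code. The function name `default_xps_masks` is hypothetical, and the sketch assumes that queue `q` belongs to PF `q % num_pfs`, that PF `k` sits on NUMA node `k`, and that successive queues of one PF walk that node's CPUs in order, wrapping around if there are more queues than CPUs. The plain zero-padded hex formatting is chosen to match the example only.

```python
def default_xps_masks(node_cpus: list[list[int]], num_queues: int) -> list[str]:
    """Return one hex CPU mask per Tx queue, modeled on the example above.

    node_cpus[k] lists the CPUs of the NUMA node that PF k is attached to.
    """
    num_pfs = len(node_cpus)
    total_cpus = sum(len(cpus) for cpus in node_cpus)
    width = (total_cpus + 3) // 4  # hex digits needed to cover every CPU
    masks = []
    for q in range(num_queues):
        cpus = node_cpus[q % num_pfs]           # round-robin PF selection
        cpu = cpus[(q // num_pfs) % len(cpus)]  # next CPU on that PF's node
        masks.append(f"{1 << cpu:0{width}x}")
    return masks


if __name__ == "__main__":
    nodes = [list(range(0, 12)), list(range(12, 24))]
    for q, mask in enumerate(default_xps_masks(nodes, 24)):
        print(f"tx-{q}/xps_cpus:{mask}")
```

Each mask has exactly one bit set, so every Tx queue is tied to a single CPU on the same node as the PF that owns the queue.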
166+
167+
Mutually exclusive features
168+
===========================
169+
170+
The nature of Multi-PF, where different channels work with different PFs, conflicts with
171+
stateful features where the state is maintained in one of the PFs.
172+
For example, in the TLS device-offload feature, special context objects are created per connection
173+
and maintained in the PF. Transitioning between different RQs/SQs would break the feature. Hence,
174+
we disable this combination for now.

drivers/infiniband/core/device.c

Lines changed: 4 additions & 0 deletions
@@ -2271,6 +2271,9 @@ struct net_device *ib_device_get_netdev(struct ib_device *ib_dev,
22712271
if (!rdma_is_port_valid(ib_dev, port))
22722272
return NULL;
22732273

2274+
if (!ib_dev->port_data)
2275+
return NULL;
2276+
22742277
pdata = &ib_dev->port_data[port];
22752278

22762279
/*
@@ -2289,6 +2292,7 @@ struct net_device *ib_device_get_netdev(struct ib_device *ib_dev,
22892292

22902293
return res;
22912294
}
2295+
EXPORT_SYMBOL(ib_device_get_netdev);
22922296

22932297
/**
22942298
* ib_device_get_by_netdev - Find an IB device associated with a netdev

drivers/infiniband/core/roce_gid_mgmt.c

Lines changed: 26 additions & 4 deletions
@@ -515,6 +515,27 @@ void rdma_roce_rescan_device(struct ib_device *ib_dev)
515515
}
516516
EXPORT_SYMBOL(rdma_roce_rescan_device);
517517

518+
/**
519+
* rdma_roce_rescan_port - Rescan all of the network devices in the system
520+
* and add their gids if relevant to the port of the RoCE device.
521+
*
522+
* @ib_dev: IB device
523+
* @port: Port number
524+
*/
525+
void rdma_roce_rescan_port(struct ib_device *ib_dev, u32 port)
526+
{
527+
struct net_device *ndev = NULL;
528+
529+
if (rdma_protocol_roce(ib_dev, port)) {
530+
ndev = ib_device_get_netdev(ib_dev, port);
531+
if (!ndev)
532+
return;
533+
enum_all_gids_of_dev_cb(ib_dev, port, ndev, ndev);
534+
dev_put(ndev);
535+
}
536+
}
537+
EXPORT_SYMBOL(rdma_roce_rescan_port);
538+
518539
static void callback_for_addr_gid_device_scan(struct ib_device *device,
519540
u32 port,
520541
struct net_device *rdma_ndev,
@@ -575,16 +596,17 @@ static void handle_netdev_upper(struct ib_device *ib_dev, u32 port,
575596
}
576597
}
577598

578-
static void _roce_del_all_netdev_gids(struct ib_device *ib_dev, u32 port,
579-
struct net_device *event_ndev)
599+
void roce_del_all_netdev_gids(struct ib_device *ib_dev,
600+
u32 port, struct net_device *ndev)
580601
{
581-
ib_cache_gid_del_all_netdev_gids(ib_dev, port, event_ndev);
602+
ib_cache_gid_del_all_netdev_gids(ib_dev, port, ndev);
582603
}
604+
EXPORT_SYMBOL(roce_del_all_netdev_gids);
583605

584606
static void del_netdev_upper_ips(struct ib_device *ib_dev, u32 port,
585607
struct net_device *rdma_ndev, void *cookie)
586608
{
587-
handle_netdev_upper(ib_dev, port, cookie, _roce_del_all_netdev_gids);
609+
handle_netdev_upper(ib_dev, port, cookie, roce_del_all_netdev_gids);
588610
}
589611

590612
static void add_netdev_upper_ips(struct ib_device *ib_dev, u32 port,

drivers/infiniband/hw/mlx5/Makefile

Lines changed: 1 addition & 0 deletions
@@ -6,6 +6,7 @@ mlx5_ib-y := ah.o \
66
cong.o \
77
counters.o \
88
cq.o \
9+
data_direct.o \
910
dm.o \
1011
doorbell.o \
1112
gsi.o \
