Perf: Add fused CUDA kernels to accelerate sparse PDHG updates #33
Description
This PR introduces a significant performance optimization for PDLPx, particularly for problems with highly sparse constraint matrices.
When a row of the constraint matrix (or its transpose) is highly sparse (i.e., has very few non-zeros), launching a full CUSPARSE SpMV kernel for the primal or dual update can be inefficient due to kernel launch overhead and low computational density.
This change introduces two new "fused" CUDA kernels:

- `fused_compute_next_pdhg_primal_solution_kernel`
- `fused_compute_next_pdhg_dual_solution_kernel`

These kernels perform the sparse matrix-vector multiplication (SpMV) with a simple for-loop, which is more efficient for highly sparse rows, and fuse it with the subsequent PDHG update logic (e.g., projection onto bounds, reflection). This approach avoids the overhead of separate kernel launches and improves data locality; see the sketch after the Implementation Details list below.

Implementation Details
- `fused_compute_next_pdhg_primal_solution_kernel` computes the SpMV (`A^T @ dual_solution`) and fuses it with the primal variable update, projection (against `var_lb`, `var_ub`), and reflection.
- `fused_compute_next_pdhg_dual_solution_kernel` computes the SpMV (`A @ primal_solution`) and fuses it with the dual variable update, projection (against `const_lb`, `const_ub`), and reflection.

Performance Improvements
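The PR description doesn't include the kernel source, so the following is a minimal sketch of what the fused primal update could look like. It assumes a reflected-PDHG step of the form `x_new = proj_[lb,ub](x - tau * (c - A^T y))` followed by `x_refl = 2*x_new - x_old`, with `A^T` stored in CSR and one thread per primal variable; all parameter names here are hypothetical, not the PR's actual signatures.

```cuda
// Hypothetical sketch of the fused primal update, not the PR's exact code.
__global__ void fused_compute_next_pdhg_primal_solution_kernel(
    const int* __restrict__ at_row_ptr,    // CSR row pointers of A^T
    const int* __restrict__ at_col_ind,    // CSR column indices of A^T
    const double* __restrict__ at_vals,    // CSR values of A^T
    const double* __restrict__ dual_solution,
    const double* __restrict__ primal_solution,
    const double* __restrict__ objective,  // c
    const double* __restrict__ var_lb,
    const double* __restrict__ var_ub,
    double tau,                            // primal step size
    int num_vars,
    double* __restrict__ next_primal,
    double* __restrict__ reflected_primal) {
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  if (j >= num_vars) return;

  // Simple per-thread for-loop SpMV: (A^T y)_j. For rows with very few
  // non-zeros, this avoids the launch overhead of a general SpMV kernel.
  double aty = 0.0;
  for (int k = at_row_ptr[j]; k < at_row_ptr[j + 1]; ++k) {
    aty += at_vals[k] * dual_solution[at_col_ind[k]];
  }

  // Fused PDHG update: gradient step, projection onto [lb, ub], reflection.
  double x_old = primal_solution[j];
  double x_new = x_old - tau * (objective[j] - aty);
  x_new = fmin(fmax(x_new, var_lb[j]), var_ub[j]);
  next_primal[j] = x_new;
  reflected_primal[j] = 2.0 * x_new - x_old;  // reflection step
}
```

The dual kernel is symmetric: per constraint row it computes `A @ primal_solution` with the same for-loop pattern, then applies the dual step, projection against `const_lb`/`const_ub`, and reflection.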
This fusion results in substantial performance gains, as demonstrated on Hans' Benchmark and the MIPLIB dataset.
Hans' Benchmark Examples
MIPLIB Dataset Summary
The results across the MIPLIB dataset are excellent; both the fused and the baseline (CUSPARSE SpMV) code paths were run for the same number of iterations.
According to the auto-selection logic, 169 instances used the fused update.
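The description doesn't spell out the auto-selection rule, so here is a hypothetical host-side dispatch based on average non-zeros per row; the function name and threshold value are assumptions for illustration only.

```cuda
// Hypothetical dispatch heuristic, not the PR's actual auto-selection rule.
// When the matrix is highly sparse on average, the simple per-thread
// for-loop SpMV in the fused kernels tends to beat a CUSPARSE SpMV launch.
bool use_fused_update(int nnz, int num_rows) {
  const double kAvgNnzThreshold = 8.0;  // assumed cutoff, tune per GPU
  return static_cast<double>(nnz) / num_rows < kAvgNnzThreshold;
}
```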