Perf: Add fused CUDA kernels to accelerate sparse PDHG updates #33
Description
This PR introduces a significant performance optimization for PDLPx, particularly for problems with highly sparse constraint matrices.
When a row of the constraint matrix (or its transpose) is highly sparse (i.e., has very few non-zeros), launching a full CUSPARSE SpMV kernel for the primal or dual update can be inefficient due to kernel launch overhead and low computational density.
This change introduces two new "fused" CUDA kernels:

- `fused_compute_next_pdhg_primal_solution_kernel`
- `fused_compute_next_pdhg_dual_solution_kernel`

These kernels perform the sparse matrix-vector multiplication (SpMV) with a simple for-loop, which is more efficient for highly sparse rows, and fuse it with the subsequent PDHG update logic (e.g., projection onto bounds, reflection). This approach avoids the overhead of separate kernel launches and improves data locality; see the sketch after the Implementation Details list below.

Implementation Details
- `fused_compute_next_pdhg_primal_solution_kernel` computes the SpMV (`A^T @ dual_solution`) and fuses it with the primal variable update, projection (against `var_lb`, `var_ub`), and reflection.
- `fused_compute_next_pdhg_dual_solution_kernel` computes the SpMV (`A @ primal_solution`) and fuses it with the dual variable update, projection (against `const_lb`, `const_ub`), and reflection.

Performance Improvements
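The PR description doesn't include the kernel source, so the following is a minimal sketch of what the fused primal update could look like. It assumes a reflected-PDHG step of the form `x_new = proj_[lb,ub](x - tau * (c - A^T y))` followed by `x_refl = 2*x_new - x_old`, with `A^T` stored in CSR and one thread per primal variable; all parameter names here are hypothetical, not the PR's actual signatures.

```cuda
// Hypothetical sketch of the fused primal update, not the PR's exact code.
__global__ void fused_compute_next_pdhg_primal_solution_kernel(
    const int* __restrict__ at_row_ptr,    // CSR row pointers of A^T
    const int* __restrict__ at_col_ind,    // CSR column indices of A^T
    const double* __restrict__ at_vals,    // CSR values of A^T
    const double* __restrict__ dual_solution,
    const double* __restrict__ primal_solution,
    const double* __restrict__ objective,  // c
    const double* __restrict__ var_lb,
    const double* __restrict__ var_ub,
    double tau,                            // primal step size
    int num_vars,
    double* __restrict__ next_primal,
    double* __restrict__ reflected_primal) {
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  if (j >= num_vars) return;

  // Simple per-thread for-loop SpMV: (A^T y)_j. For rows with very few
  // non-zeros, this avoids the launch overhead of a general SpMV kernel.
  double aty = 0.0;
  for (int k = at_row_ptr[j]; k < at_row_ptr[j + 1]; ++k) {
    aty += at_vals[k] * dual_solution[at_col_ind[k]];
  }

  // Fused PDHG update: gradient step, projection onto [lb, ub], reflection.
  double x_old = primal_solution[j];
  double x_new = x_old - tau * (objective[j] - aty);
  x_new = fmin(fmax(x_new, var_lb[j]), var_ub[j]);
  next_primal[j] = x_new;
  reflected_primal[j] = 2.0 * x_new - x_old;  // reflection step
}
```

The dual kernel is symmetric: per constraint row it computes `A @ primal_solution` with the same for-loop pattern, then applies the dual step, projection against `const_lb`/`const_ub`, and reflection.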
This fusion results in substantial performance gains, as demonstrated on Hans' Benchmark and the MIPLIB dataset.
Hans' Benchmark Examples
MIPLIB Dataset Summary
The results across the MIPLIB dataset are excellent; both the fused and the baseline (CUSPARSE SpMV) code paths were run for the same number of iterations.
According to the auto-selection logic, 169 instances used the fused update.
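The description doesn't spell out the auto-selection rule, so here is a hypothetical host-side dispatch based on average non-zeros per row; the function name and threshold value are assumptions for illustration only.

```cuda
// Hypothetical dispatch heuristic, not the PR's actual auto-selection rule.
// When the matrix is highly sparse on average, the simple per-thread
// for-loop SpMV in the fused kernels tends to beat a CUSPARSE SpMV launch.
bool use_fused_update(int nnz, int num_rows) {
  const double kAvgNnzThreshold = 8.0;  // assumed cutoff, tune per GPU
  return static_cast<double>(nnz) / num_rows < kAvgNnzThreshold;
}
```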