Add GPU implementation of preconditioner - v2 #31
base: main
Conversation
This update fixes two issues in the GPU preconditioner:
Hi @ZedongPeng and @jinwen-yang, I have now completed the experiments on MILPLIB and the Mittelmann LP benchmark. Below is an example figure comparing CPU vs GPU preconditioning time across all tested instances. For details of the results and plots, please see the following notebook: Read_Summary.ipynb

Overall, the results show that GPU preconditioning is consistently faster than CPU preconditioning, often by several orders of magnitude across the benchmarks. The few cases in which the CPU appears marginally faster occur only when the preconditioning time on the CPU is already very small (< 0.1 sec). Consequently, because the overall solve time is typically dominated by the iterative optimization phase rather than the preconditioning step, the choice between CPU and GPU preconditioning has only a minor impact on total solve time.
Solving timing observations
During the full MILPLIB and Mittelmann benchmark runs, I observed some minor numerical mismatches between the CPU and GPU runs. These differences occur only in a small subset of instances. Below is the full list of mismatches detected:

In most mismatched cases, the differences are at the level of floating-point noise, and both implementations return the same status and a very similar number of iterations. However, a few instances stand out with noticeably larger discrepancies between the CPU and GPU runs:
Preconditioner timing observations
I also looked at the GPU preconditioner timing. Almost all instances have very cheap preconditioning on the GPU (Ruiz + Pock–Chambolle + bound-objective are usually on the order of 1e-3 seconds in total). The only clear outliers are again neos-4332810-sesia and neos-4332801-seret. For these two instances, the GPU preconditioner times are:
So these are currently the only instances where GPU preconditioning exceeds ~2 seconds; all other instances are around the 1e-3 second level for the same stages. Most of the extra time comes specifically from the Ruiz scaling stage, which is significantly more expensive in these instances compared to the rest of the benchmark set. For completeness, here are the CPU preconditioner times for the same two problematic instances:
As expected, CPU preconditioning is consistently slower on these instances. However, the GPU times for these two cases are still unusually large.
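For context on why the Ruiz stage can dominate on a few instances, here is a minimal sketch of a GPU Ruiz loop over a CSR matrix. This is not the code in this PR; every kernel and variable name below is invented, and it assumes the caller provides device arrays with `E_row`/`E_col` initialized to 1.0. The point it illustrates is that each Ruiz iteration recomputes row and column infinity norms and rescales every nonzero, so the cost grows with the number of iterations times nnz.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// One thread per row: inverse square root of the row infinity norm.
__global__ void row_inf_norm_rsqrt(int m, const int *row_ptr, const double *vals,
                                   double *d_row) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= m) return;
    double nrm = 0.0;
    for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
        nrm = fmax(nrm, fabs(vals[k]));
    d_row[i] = (nrm > 0.0) ? rsqrt(nrm) : 1.0;
}

// Column infinity norms via atomicMax on the raw bits: for non-negative
// doubles the unsigned bit pattern preserves ordering, so this is a max.
__global__ void col_inf_norm_atomic(int m, const int *row_ptr, const int *col_ind,
                                    const double *vals, unsigned long long *col_bits) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= m) return;
    for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
        atomicMax(&col_bits[col_ind[k]],
                  (unsigned long long)__double_as_longlong(fabs(vals[k])));
}

// Convert the column norms to scaling factors and accumulate them.
__global__ void col_rsqrt_accumulate(int n, const unsigned long long *col_bits,
                                     double *d_col, double *E_col) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= n) return;
    double nrm = __longlong_as_double((long long)col_bits[j]);
    d_col[j] = (nrm > 0.0) ? rsqrt(nrm) : 1.0;
    E_col[j] *= d_col[j];
}

// Rescale every nonzero in place and accumulate the row scaling.
__global__ void rescale_csr(int m, const int *row_ptr, const int *col_ind, double *vals,
                            const double *d_row, const double *d_col, double *E_row) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= m) return;
    E_row[i] *= d_row[i];
    for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
        vals[k] *= d_row[i] * d_col[col_ind[k]];
}

// Host loop (all pointers are device memory; E_row/E_col pre-set to 1.0).
// Every iteration is a full pass over all nnz entries of A.
void ruiz_scale_sketch(int m, int n, const int *row_ptr, const int *col_ind, double *vals,
                       double *d_row, double *d_col, unsigned long long *col_bits,
                       double *E_row, double *E_col, int num_iters) {
    const int tb = 256, gbm = (m + tb - 1) / tb, gbn = (n + tb - 1) / tb;
    for (int it = 0; it < num_iters; ++it) {
        cudaMemset(col_bits, 0, n * sizeof(unsigned long long));  // bits of +0.0
        row_inf_norm_rsqrt<<<gbm, tb>>>(m, row_ptr, vals, d_row);
        col_inf_norm_atomic<<<gbm, tb>>>(m, row_ptr, col_ind, vals, col_bits);
        col_rsqrt_accumulate<<<gbn, tb>>>(n, col_bits, d_col, E_col);
        rescale_csr<<<gbm, tb>>>(m, row_ptr, col_ind, vals, d_row, d_col, E_row);
    }
    cudaDeviceSynchronize();
}
```

Real implementations often drive the column pass from a CSC copy instead of atomics and stop on a norm-convergence test rather than a fixed iteration count, but either way each Ruiz iteration streams the full nonzero array, which is why this stage can dominate on large or slowly equilibrating instances.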
pyproject.toml
Outdated
```diff
 [project]
 name = "cupdlpx"
-version = "0.1.3"
+version = "0.1.4"
```
I updated the versioning setup in the project, so the C version no longer depends on pyproject.toml.
I’m also considering moving the Python interface to a separate repository for better maintenance and organization. What do you think?
I think that’s a great idea. We need to do that later.
Summary
This PR introduces a CUDA-based implementation of the preconditioner module, including Ruiz, Pock–Chambolle, and objective-bound scaling.
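For readers less familiar with these scalings, here is a hedged sketch of what the Ruiz and Pock–Chambolle steps typically compute in PDLP-style solvers; the exact formulas used in this PR, and the bound-objective step, are defined in preconditioner.cu and may differ.

```latex
% One Ruiz iteration (infinity-norm equilibration) on the current A:
\[
  (d_r)_i = \|A_{i,:}\|_\infty^{-1/2}, \qquad
  (d_c)_j = \|A_{:,j}\|_\infty^{-1/2}, \qquad
  A \leftarrow \mathrm{diag}(d_r)\, A\, \mathrm{diag}(d_c),
\]
% with the per-iteration factors accumulated into the overall scalings D_r, D_c.
% Pock--Chambolle scaling with alpha = 1 (a single pass using 1-norms):
\[
  (d_r)_i = \|A_{i,:}\|_1^{-1/2}, \qquad
  (d_c)_j = \|A_{:,j}\|_1^{-1/2}.
\]
```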
Main Changes
- Added `preconditioner.cu`, a GPU counterpart of `preconditioner.c`
- Updated `initialize_solver_state` in `solver.cu` for GPU preconditioner integration

Implementation Details
Scaling is applied in place (`A[i,j] *= E[i]`) without extra lookups.
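The same in-place pattern extends to the problem vectors once the cumulative row factors E and column factors D are known. Below is a minimal sketch; the kernel name and the convention A_hat = diag(E) · A · diag(D) with rescaled variables x_hat = diag(D)⁻¹ · x are my assumptions, not necessarily what preconditioner.cu does.

```cuda
#include <cuda_runtime.h>

// In-place rescaling of the problem vectors under the assumed convention
// A_hat = diag(E) * A * diag(D), x_hat = diag(D)^{-1} * x:
// the constraint side picks up E, the objective picks up D, and the
// variable bounds are divided by D (D > 0, so infinities pass through).
__global__ void scale_vectors_inplace(int m, int n, double *b, double *c,
                                      double *lb, double *ub,
                                      const double *E, const double *D) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < m) b[t] *= E[t];      // right-hand side / constraint bounds
    if (t < n) {
        c[t]  *= D[t];            // objective coefficients
        lb[t] /= D[t];            // variable lower bounds
        ub[t] /= D[t];            // variable upper bounds
    }
}
```

A single launch covering max(m, n) threads is enough, e.g. `scale_vectors_inplace<<<(max(m, n) + 255) / 256, 256>>>(m, n, b, c, lb, ub, E, D);`.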
Next Step
Note
`Reduce_bound_norm_sq_atomic` currently relies on `atomicAdd(double*)` for the bound-norm reduction, which requires CMAKE_CUDA_ARCHITECTURES ≥ 60. Would it be preferable to:
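For concreteness, here is a minimal sketch of the kind of atomicAdd-based reduction the note describes. The kernel name is invented, and it assumes "bound norm" means the Euclidean norm over the finite bound entries, which may not match the exact definition used in the PR. Double-precision `atomicAdd` is a native instruction only on compute capability 6.0 and newer, which is where the architecture floor comes from.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Accumulate the squared norm of the finite bound entries into *norm_sq
// (which must be zero-initialized before the launch).
__global__ void reduce_bound_norm_sq_atomic_sketch(int n, const double *lb,
                                                   const double *ub,
                                                   double *norm_sq) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= n) return;
    double s = 0.0;
    if (isfinite(lb[j])) s += lb[j] * lb[j];   // ignore -inf lower bounds
    if (isfinite(ub[j])) s += ub[j] * ub[j];   // ignore +inf upper bounds
    if (s != 0.0) atomicAdd(norm_sq, s);       // double atomicAdd: sm_60+ only
}
```

If supporting older architectures matters, the usual alternatives are the atomicCAS-based double-add emulation from the CUDA C Programming Guide or a library reduction such as cub::DeviceReduce; a per-block shared-memory reduction with one atomic per block mainly reduces contention rather than lifting the architecture requirement.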