Skip to content

Conversation

@LucasBoTang
Copy link
Collaborator

Summary

This PR introduces a CUDA-based implementation of the preconditioner module, including both Ruiz, Pock–Chambolle, and objective-bound scaling.

Main Changes

  • Replaced preconditioner.cu with preconditioner.c
  • Modified initialize_solver_state in solver.cu for GPU preconditioner integration

Implementation Details

  • The matrix is stored in CSR format, with an additional row ID array to enable efficient row-wise scaling (A[i,j] *= E[i]) without extra lookups.
  • Added an auxiliary array recording the mapping of each A element to its corresponding position in Aᵀ, enabling synchronized scaling of A and Aᵀ without atomics or additional CSR/CSC conversions.

Next Step

  • Benchmark GPU vs CPU preconditioner runtime before merging

Note

Reduce_bound_norm_sq_atomic currently relies on atomicAdd(double*) for the bound-norm reduction, which requires CMAKE_CUDA_ARCHITECTURES ≥ 60.

Would it be preferable to:

  • Switch to a portable single-block shared-memory reduction (no atomics), or
  • Redesign the reduction kernel
  • Keep the current implementation and require sm_60+?

@LucasBoTang
Copy link
Collaborator Author

LucasBoTang commented Nov 16, 2025

This update fixes two issues in the GPU preconditioner:

  1. Corrected objective/bound rescaling: The previous GPU code applied the wrong scaling to the bounds and objective. This caused very long PDHG iterations. Now, all scaling is applied correctly on the GPU.

  2. Improved reduce_bound_norm_sq_kernel: Replaced atomic accumulation with a shared-memory block reduction. This removes the atomic overhead and makes the result consistent and fast.

@LucasBoTang
Copy link
Collaborator Author

LucasBoTang commented Nov 25, 2025

Hi @ZedongPeng and @jinwen-yang,

I have now completed the experiments on MILPLIB and the Mittelmann LP benchmark.

Below is an example figure comparing CPU vs GPU preconditioning time across all tested instances:

For details of the result and plot, please see the following notebook: Read_Summary.ipynb

Overall, the results show that GPU preconditioning is consistently faster than CPU preconditioning, often by several orders of magnitude across the benchmarks. The few cases in which the CPU appears marginally faster occur only when the preconditioning time on the CPU is already very small (< 0.1 sec).

Consequently, because the overall solve time is typically dominated by the iterative optimization phase rather than the preconditioning step, the choice between CPU and GPU preconditioning has only a minor impact on the solve time.

@LucasBoTang
Copy link
Collaborator Author

LucasBoTang commented Nov 25, 2025

Solving timing observations

During the full MILPLIB and Mittelmann benchmark runs, I observed some minor numerical mismatches between the CPU and GPU runs. These differences occur only in a small subset of instances.

Below is the full list of mismatches detected:

[WARNING] mismatch in Primal obj for instance ns1687037:
         Status Iterations   Primal obj      Dual obj
CPU  TIME_LIMIT  247379000  4.673210786  -250.2518284
GPU  TIME_LIMIT  250602000   4.64700347  -249.3609757
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance physiciansched3-3:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL   52986800  2432689.742  2432616.795
GPU  OPTIMAL   54340200   2432694.06  2432526.403
------------------------------------------------------------
[WARNING] mismatch in Dual obj for instance bdry2:
      Status Iterations      Primal obj        Dual obj
CPU  OPTIMAL   33554200  0.004177643673  0.004222602356
GPU  OPTIMAL   33558800  0.004177664638  0.004227880325
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance piperout-d20:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL    1381600  25533.84286  25533.84195
GPU  OPTIMAL    1560400  25533.94809  25533.84247
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance sct1:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL      49000   -218.089755  -218.1330731
GPU  OPTIMAL      49000  -218.0897577  -218.1330731
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance proteindesign121hz512p9:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL    1898200   1423.90403  1423.962777
GPU  OPTIMAL    2684200  1423.904585  1423.944118
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance sct5:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL       5700  -228.1952931  -228.1610876
GPU  OPTIMAL       5700  -228.1952954  -228.1610876
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance gmut-76-50:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL     772600  -14173883.52  -14174418.73
GPU  OPTIMAL     612600  -14173884.23  -14175314.28
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance gmut-75-50:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL     549000  -14182312.02  -14182139.66
GPU  OPTIMAL     549000  -14182312.01  -14182139.61
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance hgms62:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL   20860200  -53405.66375  -53416.34387
GPU  OPTIMAL   20860200  -53405.66394  -53416.34369
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance neos-4413714-turia:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL   25849200  43.47503004  43.47695371
GPU  OPTIMAL   26097200  43.47871915   43.4715827
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance triptim7:
      Status Iterations   Primal obj    Dual obj
CPU  OPTIMAL     456200  2337.078721  2336.73184
GPU  OPTIMAL     456200   2337.07871  2336.73184
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance neos-3208254-reiu:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL       9000  -38604.86166  -38606.26268
GPU  OPTIMAL       9000  -38604.86167  -38606.26268
------------------------------------------------------------
[WARNING] mismatch in Dual obj for instance shs1042:
      Status Iterations  Primal obj     Dual obj
CPU  OPTIMAL    3663200  9442.47166  9444.253513
GPU  OPTIMAL    3663200  9442.47166  9444.253512
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance physiciansched6-1:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL     301200  7499.998066     7501.475
GPU  OPTIMAL     301200  7499.997697  7501.480112
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance physiciansched3-4:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL   72542200  1124642.272  1124806.832
GPU  OPTIMAL   52752600  1124688.273  1124585.589
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance triptim4:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL     929800  9.205938105  9.206941924
GPU  OPTIMAL     929800  9.205929986  9.206942036
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance triptim2:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL    1300800  10.86447831  10.86236456
GPU  OPTIMAL    1284600  10.86393657  10.86326915
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance neos-3229051-yass:
              Status Iterations        Primal obj     Dual obj
CPU  DUAL_INFEASIBLE        200  -1.992184297e+12  1441960.542
GPU  DUAL_INFEASIBLE        200  -1.992184298e+12  1441960.541
------------------------------------------------------------
[WARNING] mismatch in Dual obj for instance satellites2-25:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL       5100  -30.00002622  -29.99759189
GPU  OPTIMAL       5100  -30.00002622  -29.99759469
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance ds-big:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL     933200  86.83684649  86.81978274
GPU  OPTIMAL     934000  86.83691952  86.81978541
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance gmut-76-40:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL     191800  -14171085.83  -14171901.21
GPU  OPTIMAL     191800  -14171085.81   -14171901.2
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance germanrr:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL     345400  45980142.71  45980135.42
GPU  OPTIMAL     347000  45980143.85  45980135.42
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance neos-4332810-sesia:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL     740600  -33931.28309   -33930.7782
GPU  OPTIMAL    2763400  -33929.56003  -33926.81054
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance proteindesign122trx11p8:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL    1416200  1720.461027  1720.694719
GPU  OPTIMAL    1416200  1720.461026  1720.694715
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance neos-4332801-seret:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL    3552400  222338.3266  222313.5325
GPU  OPTIMAL    1400800  222342.9597  222320.9406
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance neos-4535459-waipa:
         Status Iterations       Primal obj          Dual obj
CPU  TIME_LIMIT    7867000  3.059144606e+15  -4.360493003e+10
GPU  TIME_LIMIT    7886000   3.05706559e+15  -4.368560407e+10
------------------------------------------------------------
[WARNING] mismatch in Dual obj for instance satellites3-25:
      Status Iterations   Primal obj      Dual obj
CPU  OPTIMAL       4900  -39.0003573  -38.99905447
GPU  OPTIMAL       4900  -39.0003573  -38.99905805
------------------------------------------------------------
[WARNING] mismatch in Dual obj for instance satellites4-25:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL       5000  -41.00318391  -41.01144414
GPU  OPTIMAL       5000  -41.00318391  -41.01144562
------------------------------------------------------------
[WARNING] mismatch in Dual obj for instance satellites2-40:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL       5300  -29.99997468  -30.00582128
GPU  OPTIMAL       5300  -29.99997468  -30.00581932
------------------------------------------------------------
Total mismatches: 30 / 414 instances

In most mismatched cases, the differences are at the level of floating-point noise, and both implementations return the same status and a very similar number of iterations.

However, a few instances stand out with noticeably larger discrepancies between the CPU and GPU runs:

  • neos-4332810-sesia

    • CPU: OPTIMAL, 740,600 iterations
      • Primal obj: -33931.28309
      • Dual obj: -33930.77820
    • GPU: OPTIMAL, 2,763,400 iterations
      • Primal obj: -33929.56003
      • Dual obj: -33926.81054
  • neos-4332801-seret

    • CPU: OPTIMAL, 3,552,400 iterations
      • Primal obj: 222338.3266
      • Dual obj: 222313.5325
    • GPU: OPTIMAL, 1,400,800 iterations
      • Primal obj: 222342.9597
      • Dual obj: 222320.9406

Preconditioner timing observations

I also looked at the GPU preconditioner timing. Almost all instances have very cheap preconditioning on the GPU (Ruiz + Pock–Chambolle + bound-objective are usually on the order of 1e-3 seconds in total).

The only clear outliers are again neos-4332810-sesia and neos-4332801-seret. For these two instances, the GPU preconditioner times are:

  • neos-4332810-sesia

    • Ruiz scaling (10 iterations): 2.155691 sec
    • Pock-Chambolle scaling (alpha=1.0): 0.613833 sec
    • Bound-objective scaling: 0.020169 sec
  • neos-4332801-seret

    • Ruiz scaling (10 iterations): 2.158601 sec
    • Pock-Chambolle scaling (alpha=1.0): 0.607071 sec
    • Bound-objective scaling: 0.020089 sec

So these are currently the only instances where GPU preconditioning exceeds ~2 seconds; all other instances are around the 1e-3 second level for the same stages. Most of the extra time comes specifically from the Ruiz scaling stage, which is significantly more expensive in these instances compared to the rest of the benchmark set.

For completeness, here are the CPU preconditioner times for the same two problematic instances:

  • neos-4332810-sesia

    • Ruiz scaling (10 iterations): 3.481281 sec
    • Pock-Chambolle scaling (alpha=1.0): 0.946331 sec
    • Bound-objective scaling: 0.100667 sec
  • neos-4332801-seret

    • Ruiz scaling (10 iterations): 3.377352 sec
    • Pock-Chambolle scaling (alpha=1.0): 0.930314 sec
    • Bound-objective scaling: 0.091834 sec

As expected, CPU preconditioning is consistently slower on these instances. However, the GPU times for these two cases are still unusually large.

pyproject.toml Outdated
[project]
name = "cupdlpx"
version = "0.1.3"
version = "0.1.4"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the versioning setup in the project, so the C version no longer depends on pyproject.toml.
I’m also considering moving the Python interface to a separate repository for better maintenance and organization. What do you think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that’s a great idea. We need to do that later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants