Add GPU implementation of preconditioner - v2 #31
base: main
Conversation
This update fixes two issues in the GPU preconditioner:
Hi @ZedongPeng and @jinwen-yang, I have now completed the experiments on MILPLIB and the Mittelmann LP benchmark. Below is an example figure comparing CPU vs GPU preconditioning time across all tested instances. For details of the results and plots, please see the following notebook: Read_Summary.ipynb

Overall, the results show that GPU preconditioning is consistently faster than CPU preconditioning, often by several orders of magnitude across the benchmarks. The few cases in which the CPU appears marginally faster occur only when the preconditioning time on the CPU is already very small (< 0.1 sec). Consequently, because the overall solve time is typically dominated by the iterative optimization phase rather than the preconditioning step, the choice between CPU and GPU preconditioning has only a minor impact on total solve time.
Solving timing observations
During the full MILPLIB and Mittelmann benchmark runs, I observed some minor numerical mismatches between the CPU and GPU runs. These differences occur only in a small subset of instances. Below is the full list of mismatches detected:

In most mismatched cases, the differences are at the level of floating-point noise, and both implementations return the same status and a very similar number of iterations. However, a few instances stand out with noticeably larger discrepancies between the CPU and GPU runs:
Preconditioner timing observations
I also looked at the GPU preconditioner timing. Almost all instances have very cheap preconditioning on the GPU (Ruiz + Pock–Chambolle + bound-objective are usually on the order of 1e-3 seconds in total). The only clear outliers are again neos-4332810-sesia and neos-4332801-seret. For these two instances, the GPU preconditioner times are:
So these are currently the only instances where GPU preconditioning exceeds ~2 seconds; all other instances are around the 1e-3 second level for the same stages. Most of the extra time comes specifically from the Ruiz scaling stage, which is significantly more expensive in these instances compared to the rest of the benchmark set. For completeness, here are the CPU preconditioner times for the same two problematic instances:
As expected, CPU preconditioning is consistently slower on these instances. However, the GPU times for these two cases are still unusually large.
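For context on why the Ruiz stage can dominate on a few instances, here is a minimal sketch of a GPU Ruiz loop over a CSR matrix. This is not the code in this PR; every kernel and variable name below is invented, and it assumes the caller provides device arrays with `E_row`/`E_col` initialized to 1.0. The point it illustrates is that each Ruiz iteration recomputes row and column infinity norms and rescales every nonzero, so the cost grows with the number of iterations times nnz.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// One thread per row: inverse square root of the row infinity norm.
__global__ void row_inf_norm_rsqrt(int m, const int *row_ptr, const double *vals,
                                   double *d_row) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= m) return;
    double nrm = 0.0;
    for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
        nrm = fmax(nrm, fabs(vals[k]));
    d_row[i] = (nrm > 0.0) ? rsqrt(nrm) : 1.0;
}

// Column infinity norms via atomicMax on the raw bits: for non-negative
// doubles the unsigned bit pattern preserves ordering, so this is a max.
__global__ void col_inf_norm_atomic(int m, const int *row_ptr, const int *col_ind,
                                    const double *vals, unsigned long long *col_bits) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= m) return;
    for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
        atomicMax(&col_bits[col_ind[k]],
                  (unsigned long long)__double_as_longlong(fabs(vals[k])));
}

// Convert the column norms to scaling factors and accumulate them.
__global__ void col_rsqrt_accumulate(int n, const unsigned long long *col_bits,
                                     double *d_col, double *E_col) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= n) return;
    double nrm = __longlong_as_double((long long)col_bits[j]);
    d_col[j] = (nrm > 0.0) ? rsqrt(nrm) : 1.0;
    E_col[j] *= d_col[j];
}

// Rescale every nonzero in place and accumulate the row scaling.
__global__ void rescale_csr(int m, const int *row_ptr, const int *col_ind, double *vals,
                            const double *d_row, const double *d_col, double *E_row) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= m) return;
    E_row[i] *= d_row[i];
    for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
        vals[k] *= d_row[i] * d_col[col_ind[k]];
}

// Host loop (all pointers are device memory; E_row/E_col pre-set to 1.0).
// Every iteration is a full pass over all nnz entries of A.
void ruiz_scale_sketch(int m, int n, const int *row_ptr, const int *col_ind, double *vals,
                       double *d_row, double *d_col, unsigned long long *col_bits,
                       double *E_row, double *E_col, int num_iters) {
    const int tb = 256, gbm = (m + tb - 1) / tb, gbn = (n + tb - 1) / tb;
    for (int it = 0; it < num_iters; ++it) {
        cudaMemset(col_bits, 0, n * sizeof(unsigned long long));  // bits of +0.0
        row_inf_norm_rsqrt<<<gbm, tb>>>(m, row_ptr, vals, d_row);
        col_inf_norm_atomic<<<gbm, tb>>>(m, row_ptr, col_ind, vals, col_bits);
        col_rsqrt_accumulate<<<gbn, tb>>>(n, col_bits, d_col, E_col);
        rescale_csr<<<gbm, tb>>>(m, row_ptr, col_ind, vals, d_row, d_col, E_row);
    }
    cudaDeviceSynchronize();
}
```

Real implementations often drive the column pass from a CSC copy instead of atomics and stop on a norm-convergence test rather than a fixed iteration count, but either way each Ruiz iteration streams the full nonzero array, which is why this stage can dominate on large or slowly equilibrating instances.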
pyproject.toml
Outdated
```diff
 [project]
 name = "cupdlpx"
-version = "0.1.3"
+version = "0.1.4"
```
I updated the versioning setup in the project, so the C version no longer depends on pyproject.toml.
I’m also considering moving the Python interface to a separate repository for better maintenance and organization. What do you think?
I think that’s a great idea. We need to do that later.
Summary
This PR introduces a CUDA-based implementation of the preconditioner module, including Ruiz, Pock–Chambolle, and objective-bound scaling.
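For readers less familiar with these scalings, here is a hedged sketch of what the Ruiz and Pock–Chambolle steps typically compute in PDLP-style solvers; the exact formulas used in this PR, and the bound-objective step, are defined in preconditioner.cu and may differ.

```latex
% One Ruiz iteration (infinity-norm equilibration) on the current A:
\[
  (d_r)_i = \|A_{i,:}\|_\infty^{-1/2}, \qquad
  (d_c)_j = \|A_{:,j}\|_\infty^{-1/2}, \qquad
  A \leftarrow \mathrm{diag}(d_r)\, A\, \mathrm{diag}(d_c),
\]
% with the per-iteration factors accumulated into the overall scalings D_r, D_c.
% Pock--Chambolle scaling with alpha = 1 (a single pass using 1-norms):
\[
  (d_r)_i = \|A_{i,:}\|_1^{-1/2}, \qquad
  (d_c)_j = \|A_{:,j}\|_1^{-1/2}.
\]
```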
Main Changes
- Added `preconditioner.cu`, a GPU counterpart of `preconditioner.c`
- Updated `initialize_solver_state` in `solver.cu` for GPU preconditioner integration

Implementation Details
Scaling is applied in place (`A[i,j] *= E[i]`) without extra lookups.
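The same in-place pattern extends to the problem vectors once the cumulative row factors E and column factors D are known. Below is a minimal sketch; the kernel name and the convention A_hat = diag(E) · A · diag(D) with rescaled variables x_hat = diag(D)⁻¹ · x are my assumptions, not necessarily what preconditioner.cu does.

```cuda
#include <cuda_runtime.h>

// In-place rescaling of the problem vectors under the assumed convention
// A_hat = diag(E) * A * diag(D), x_hat = diag(D)^{-1} * x:
// the constraint side picks up E, the objective picks up D, and the
// variable bounds are divided by D (D > 0, so infinities pass through).
__global__ void scale_vectors_inplace(int m, int n, double *b, double *c,
                                      double *lb, double *ub,
                                      const double *E, const double *D) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < m) b[t] *= E[t];      // right-hand side / constraint bounds
    if (t < n) {
        c[t]  *= D[t];            // objective coefficients
        lb[t] /= D[t];            // variable lower bounds
        ub[t] /= D[t];            // variable upper bounds
    }
}
```

A single launch covering max(m, n) threads is enough, e.g. `scale_vectors_inplace<<<(max(m, n) + 255) / 256, 256>>>(m, n, b, c, lb, ub, E, D);`.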
Next Step
Note
`Reduce_bound_norm_sq_atomic` currently relies on `atomicAdd(double*)` for the bound-norm reduction, which requires CMAKE_CUDA_ARCHITECTURES ≥ 60. Would it be preferable to:
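For concreteness, here is a minimal sketch of the kind of atomicAdd-based reduction the note describes. The kernel name is invented, and it assumes "bound norm" means the Euclidean norm over the finite bound entries, which may not match the exact definition used in the PR. Double-precision `atomicAdd` is a native instruction only on compute capability 6.0 and newer, which is where the architecture floor comes from.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Accumulate the squared norm of the finite bound entries into *norm_sq
// (which must be zero-initialized before the launch).
__global__ void reduce_bound_norm_sq_atomic_sketch(int n, const double *lb,
                                                   const double *ub,
                                                   double *norm_sq) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= n) return;
    double s = 0.0;
    if (isfinite(lb[j])) s += lb[j] * lb[j];   // ignore -inf lower bounds
    if (isfinite(ub[j])) s += ub[j] * ub[j];   // ignore +inf upper bounds
    if (s != 0.0) atomicAdd(norm_sq, s);       // double atomicAdd: sm_60+ only
}
```

If supporting older architectures matters, the usual alternatives are the atomicCAS-based double-add emulation from the CUDA C Programming Guide or a library reduction such as cub::DeviceReduce; a per-block shared-memory reduction with one atomic per block mainly reduces contention rather than lifting the architecture requirement.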