I made several major performance improvements over the last couple of weeks, so it makes sense to open an issue describing where we stand and what I have found out. This issue can also be used for performance tracking in the future. My estimate is that I squeezed out about a factor of 3-5, and we are now on the same level as the fastest NFFT package on the CPU (FINUFFT).
- The most important improvement came from blocking the domain in which the equidistant sampling nodes lie. It was clear that this would enable multi-threading of the adjoint, but it turns out that it also speeds up the serial NFFT. The reason is that it avoids a lot of cache misses.
- Blocking does increase precomputation time, but precomputation is still quite fast and in particular much faster than the transforms themselves.
- Within a block it is important not to touch the non-equidistant sampling nodes, since that would again lead to cache misses. This can be avoided by precalculating the required offsets for each block separately. The cache misses are still there, but they are moved to precomputation time.
- I removed quite a few type instabilities, and the Julia profiler helped me a lot in finding the bottlenecks.
- The full precomputation is now much slower than using the LUT. This is not too surprising, since our B matrix has quite some structure that cannot easily be exploited when treating it as just a sparse matrix. However, we keep it as a backend since it enables GPU computing.
- I needed to switch Julia multi-threading libraries. Polyester.jl turned out to not play well with FFTW and I now use FLoops.jl.
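To make the blocking idea from the first bullets concrete, here is a minimal 1D sketch. It is not the code used in the package: the function name `precompute_blocks`, the node range `[-0.5, 0.5)` and the data layout are assumptions chosen for illustration. It groups the non-equidistant nodes by the grid block they fall into and stores each node's offset within its block, so that a later per-block loop would not need to read the node positions again.

```julia
# Hypothetical helper for illustration only, not part of the package.
# x:         non-equidistant nodes, assumed to lie in [-0.5, 0.5)
# n:         number of equidistant grid points
# blocksize: grid points per block (64 mirrors the hardcoded value mentioned above)
function precompute_blocks(x::AbstractVector{<:Real}, n::Integer, blocksize::Integer = 64)
    nblocks = cld(n, blocksize)
    nodes   = [Int[] for _ in 1:nblocks]   # node indices per block
    offsets = [Int[] for _ in 1:nblocks]   # 0-based grid offset of each node inside its block
    for (k, xk) in enumerate(x)
        # 0-based index of the grid point just left of the node, clamped to the grid
        g = min(floor(Int, (xk + 0.5) * n), n - 1)
        b = g ÷ blocksize + 1              # 1-based block index
        push!(nodes[b], k)
        push!(offsets[b], g - (b - 1) * blocksize)
    end
    return nodes, offsets
end

x = [-0.5, -0.49, 0.0, 0.25, 0.49]
nodes, offsets = precompute_blocks(x, 256, 64)
```

All scattered reads of `x` happen once in this precomputation step. A transform that iterates block by block can then work on a small, contiguous part of the grid, and in the adjoint each block can be handled by a different thread without write conflicts, provided the window contributions that cross block borders are treated separately (not shown here).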
There are certainly some smaller improvements possible. Here are some ideas:
- precomputation is not optimized yet
- In trafo/adjoint it might be possible to exploit SIMD instructions. Simply using the `@simd` macro did not help. Probably we already use SIMD?
- Right now the block size is hardcoded to 64 in each direction, which is certainly not the best value in all possible situations. Probably we need some NFFT.MEASURE at some point.
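A measuring step for the block size could look roughly like the sketch below. Nothing like this exists yet: `pick_blocksize`, its arguments and the toy workload are all hypothetical, and a real version would time the actual trafo/adjoint on the user's problem size.

```julia
# Hypothetical sketch of a MEASURE-style block size selection.
# work:       any callable taking a block size and running the workload once
# candidates: block sizes to try
function pick_blocksize(work, candidates = (16, 32, 64, 128))
    best, tbest = first(candidates), Inf
    for bs in candidates
        work(bs)                  # warm-up run so that compilation time is not measured
        t = @elapsed work(bs)     # wall-clock time of a single run
        if t < tbest
            best, tbest = bs, t
        end
    end
    return best
end

# Toy workload standing in for a blocked transform (assumption for illustration).
toy_work(bs) = sum(abs2, rand(bs * 1000))
bs = pick_blocksize(toy_work)
```

A single timed run is noisy, so a practical implementation would probably repeat each measurement a few times and take the minimum.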