Fast Log Density Function #1113
Conversation
Benchmark Report for Commit 8715446 (Computer Information / Benchmark Results)
Codecov Report

❌ Patch coverage is [...]

@@ Coverage Diff @@
## main #1113 +/- ##
==========================================
+ Coverage 81.48% 81.72% +0.24%
==========================================
Files 40 41 +1
Lines 3845 3956 +111
==========================================
+ Hits 3133 3233 +100
- Misses 712 723 +11
DynamicPPL.jl documentation for PR #1113 is available at: [...]
OMG. PRotY?
This is super exciting! Is there any reason for [...]
I was confused as to why this made such a huge difference, given that the logic of what eventually gets executed seems quite similar to how Metadata works: there's a dict with VarNames as keys that gives you the range and the link status (in Metadata with an extra level of indirection because of looking up Vector elements, but that should be pretty cheap), and we use those to index into the vector of values and get the transform. The only obvious difference I could see was the use of [...]

Now the question is, why does our current Metadata still do 4 allocations even when using [...]

```julia
# This is benchmarking done wrong.
julia> module MiniBench
       using Chairmarks
       v = randn(8)
       f(v) = (v,)
       println(@b f(v))
       end
Sample(evals=831, time=2.261371841155235e-8, allocs=1, bytes=16)
```

This makes me want to rethink VarInfo again from the ground up, try to see what's the minimal structure we can get away with if we use this sort of approach where we separate holding the values of variables from holding information about how individual variable values are read. Need to think through any edge cases, but I'm prrrretty excited for this.
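For reference, the Metadata lookup path being described is roughly the following (a simplified sketch; the field names follow DynamicPPL's `Metadata`, and linking / transform details are omitted):

```julia
using DynamicPPL

# Sketch of how a variable's vectorised value is fetched via Metadata:
# a Dict lookup gives an integer index, the index gives a range, and the
# range slices the flat values vector (the "extra level of indirection").
function lookup_value(md::DynamicPPL.Metadata, vn::VarName)
    idx = md.idcs[vn]      # VarName -> integer index
    rng = md.ranges[idx]   # integer index -> range into md.vals
    return md.vals[rng]    # the vectorised value of vn
end
```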
@mhauru I think in your minibenchmark, the allocation is due to accessing the non-constant global variable `v`.
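As an illustration of that point (a sketch only, not re-run as part of this thread): declaring the global `const` makes the benchmarked call type-stable, so the allocation attributed to the untyped global should disappear.

```julia
# Same mini-benchmark as above, but with `v` declared `const`.
module MiniBenchConst
using Chairmarks
const v = randn(8)
f(v) = (v,)
println(@b f(v))
end
```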
```julia
    accs::Accs
end
DynamicPPL.getaccs(vi::OnlyAccsVarInfo) = vi.accs
DynamicPPL.maybe_invlink_before_eval!!(vi::OnlyAccsVarInfo, ::Model) = vi
```
It turns out that a substantial amount of the performance gains were in this innocuous line 😄 Just optimising the implementation of this function for a normal VarInfo, plus sticking some @views in appropriate places, gets the old LogDensityFunction to within ~50% of FastLDF's performance. See #1115.
In some ways the simplification here is a result of the current approach: because the linked status is no longer stored in a VarInfo object but rather in a separate struct, there is no longer any need to care about linking / invlinking a VarInfo.
But overall I think this means that the performance gains from the current approach are much more comprehensible.
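As a side note on the `@views` point above (a generic illustration, unrelated to DynamicPPL internals): slicing an array with a range copies by default, whereas a view reuses the parent array's memory, which is where the avoided allocations come from.

```julia
x = randn(1_000)

sum(x[1:10])         # the slice allocates a new 10-element array
sum(@view x[1:10])   # a view: no copy
@views sum(x[1:10])  # the macro form rewrites the slice into a view
```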
Not really. There's one model above where ForwardDiff is a bit slower, but my suspicion is that it could be optimised away in a similar way to declaring zero derivatives for Mooncake/Enzyme. I think the goal should definitely be to replace LDF. (At the PR stage though, having both structs does make benchmarking much easier 😉)
There's a lot to be said about this PR. However, my most prominent lesson so far is that [...]
Point 3 is very much worth underlining. My first thought was that this is cool for small models, but no, this matters for all models. Here's Penny's benchmarking run on a slightly bigger model:

```julia
@model function f()
    dim = 1_000
    x ~ MvNormal(zeros(dim), I)
    y ~ product_distribution([Exponential() for _ in 1:dim])
    0.0 ~ Normal(sum(x), sum(y))
    return nothing
end

benchmark_ldfs(f())
```

```
LogDensityFunction: eval ---- 10.042 μs (25 allocs: 56.766 KiB)
FastLDF: eval ---- 6.531 μs (10 allocs: 24.234 KiB)
LogDensityFunction: grad (FD) ---- 17.710 ms (4184 allocs: 107.243 MiB, 34.66% gc time)
FastLDF: grad (FD) ---- 8.143 ms (1674 allocs: 23.538 MiB, 33.01% gc time)
LogDensityFunction: grad (RD) ---- 1.377 ms (23146 allocs: 985.000 KiB)
FastLDF: grad (RD) ---- 2.313 ms (38082 allocs: 1.430 MiB)
```

The above numbers scale linearly in the dimension of the model (at least for eval; I didn't look at grads).

As I mentioned above, I was confused why the changes in this PR made such a difference, when the logic in the pipeline of what is done to the input vector ends up being quite similar. To understand that, @penelopeysm and I made a PR that gets allocations down to 0 for small models using trusty old Metadata: #1115. Comparing the code in that PR vs FastLDF, they are on par for large models:

```julia
@model function f()
    dim = 1_000
    x ~ MvNormal(zeros(dim), I)
    y ~ product_distribution([Exponential() for _ in 1:dim])
    0.0 ~ Normal(sum(x), sum(y))
    return nothing
end

benchmark_ldfs(f())
```

```
LogDensityFunction: eval ---- 6.709 μs (10 allocs: 24.234 KiB)
FastLDF: eval ---- 6.458 μs (10 allocs: 24.234 KiB)
LogDensityFunction: grad (FD) ---- 7.024 ms (2347 allocs: 23.806 MiB, 13.95% gc time)
FastLDF: grad (FD) ---- 5.223 ms (1674 allocs: 23.538 MiB)
LogDensityFunction: grad (RD) ---- 2.323 ms (38097 allocs: 1.448 MiB)
FastLDF: grad (RD) ---- 2.290 ms (38082 allocs: 1.430 MiB)
```

while FastLDF still has an edge, but a smaller edge, for small models:

```julia
@model function f()
    x ~ Normal()
    0.0 ~ Normal(x)
end

benchmark_ldfs(f())
```

```
LogDensityFunction: eval ---- 40.790 ns
FastLDF: eval ---- 16.743 ns
LogDensityFunction: grad (FD) ---- 239.837 ns (7 allocs: 272 bytes)
FastLDF: grad (FD) ---- 61.845 ns (3 allocs: 96 bytes)
LogDensityFunction: grad (RD) ---- 6.625 μs (107 allocs: 4.000 KiB)
FastLDF: grad (RD) ---- 5.775 μs (81 allocs: 2.781 KiB)
```

That's the sort of difference and scaling I can understand, given the changes to internal data structures. Now that we know we can do 0-allocation evals, unless there's something deeply conceptually problematic about the use of [...]
```julia
struct FastLDFContext{N<:NamedTuple,T<:AbstractVector{<:Real}} <: AbstractContext
    # The ranges of identity VarNames are stored in a NamedTuple for improved performance
    # (it's around 1.5x faster).
    iden_varname_ranges::N
```
When talking with @penelopeysm, I did a hasty check of whether this mechanism makes a substantial difference to performance, as opposed to having them all in `varname_ranges` as a Dict. My conclusion was no, hers was yes. It could be that I botched my check, but I was also running 1.12.1 and she's on 1.11.7, so we should check whether that's significant.
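For context, the mechanism being compared is roughly the following (an illustrative sketch; plain Symbols stand in for identity VarNames, and the names here are not the PR's actual code):

```julia
# NamedTuple: the key is part of the type, so `ranges.x` is resolved at
# compile time. Dict: the key is hashed and looked up at runtime.
const nt_ranges   = (x = 1:1, y = 2:4)
const dict_ranges = Dict{Symbol,UnitRange{Int}}(:x => 1:1, :y => 2:4)

lookup_nt(ranges)   = ranges.x
lookup_dict(ranges) = ranges[:x]

lookup_nt(nt_ranges)      # static field access
lookup_dict(dict_ranges)  # runtime hashing + lookup
```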
Another upshot of this is that, because FastLDF no longer relies on a particular kind of VarInfo (it only uses one at the beginning to construct the ranges), AD doesn't need to handle every kind of SimpleVarInfo, etc. It only needs to handle [...]
OK, I think this is ready for proper review. I've elected to keep it named as FastLDF, under the DynamicPPL.Experimental module, and therefore to use a patch release. The reasons for this are:

- We can move fast and break things in patch releases, without worrying about contaminating the `breaking` branch with something that isn't yet ready for release.
- With both structs side by side, we can not only benchmark them against each other, but also use them for correctness tests. I think that's quite worth having, especially in this nascent period where things look really good but we're unsure if we messed up some weird edge case.
- Although it's fairly close, it's IMO not production ready yet. There are a few things I'd still like to sort out before we replace LDF, which I'll list in #1120.
```julia
function get_ranges_and_linked(varinfo::VarInfo{<:NamedTuple{syms}}) where {syms}
    all_iden_ranges = NamedTuple()
    all_ranges = Dict{VarName,RangeAndLinked}()
    offset = 1
    for sym in syms
        md = varinfo.metadata[sym]
        this_md_iden, this_md_others, offset = get_ranges_and_linked_metadata(md, offset)
        all_iden_ranges = merge(all_iden_ranges, this_md_iden)
        all_ranges = merge(all_ranges, this_md_others)
    end
    return all_iden_ranges, all_ranges
end
```
This function, and the ranges it extracts, is written in a way to be 'compatible' with what `unflatten` / `vi[:]` currently does. However, I would like to believe that in the future there will not even be any reason to use this. If you think about it, the ranges are completely meaningless:

```julia
Dict(@varname(a) => 1:1, @varname(b) => 2:2, @varname(c) => 3:5)
```

really, at the end of the day, describes the same thing as

```julia
Dict(@varname(b) => 1:1, @varname(c) => 2:4, @varname(a) => 5:5)
```

It just amounts to a permutation of the parameter vector, and the exact permutation doesn't matter! Of course, what really matters is that it's consistent between different bits of the code. However, I would actually like to propose that we (one day) get rid of this dependence on the internal Metadata structure, and instead do something like this:
- Get a sorted list of all VarNames.
- Get their lengths.
- The parameter vector is just those VarNames in ascending order.
This would greatly simplify the code here, and probably also simplify the code for varinfo[:]. It WOULD make unflatten a pain. But that's why I say 'one day': I am really hoping that this PR is the one that sounds the death knell for unflatten.
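A rough sketch of what that proposed scheme could look like (a hypothetical helper; plain strings stand in for VarNames, and each variable's vectorised length is assumed to be known):

```julia
# Build parameter-vector ranges by sorting the variable names and laying
# their values out in ascending order.
function ranges_from_lengths(lengths::AbstractDict)
    names = sort!(collect(keys(lengths)); by=string)
    ranges = Dict{eltype(names),UnitRange{Int}}()
    offset = 1
    for name in names
        ranges[name] = offset:(offset + lengths[name] - 1)
        offset += lengths[name]
    end
    return ranges
end

ranges_from_lengths(Dict("a" => 1, "b" => 1, "c" => 3))
# Dict("a" => 1:1, "b" => 2:2, "c" => 3:5)
```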
Just a note that came to mind: It might make a difference for performance to have the elements of the vals vector be roughly in the order in which they are read, to make use of caching and such. If e.g. x[1] and x[2] are at different ends, then a loop over x[i] might cause very inefficient memory reads. This isn't incompatible with the above, just a thing to keep in mind when redesigning variable order.
I need to read the PR properly, but based on skimming, I'm still a bit confused by the performance gains here.
Just a note to say that I tried with FixedSizeArrays as inputs and things were sometimes marginally faster but mostly marginally slower (especially AD; not sure if there might be some potential for optimisation there).
* Make InitContext work with OnlyAccsVarInfo
* Do not convert NamedTuple to Dict
* remove logging
* Enable InitFromPrior and InitFromUniform too
* Fix `infer_nested_eltype` invocation
This PR introduces `FastLDF`, which is a much faster version of `LogDensityFunction`. How much faster, you ask? Read on!

Motivation

Currently, LDF does two things to calculate the log-density: it writes the flat parameter vector into its VarInfo using `unflatten`, and then re-evaluates the model, reading the values back out of that VarInfo. Note that reading values from the VarInfo is relatively slow, because you have to index into NamedTuples, Dicts, and so on.
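Schematically, the current flow looks something like this (a sketch under assumptions: exact DynamicPPL signatures and accessor names vary between versions):

```julia
using DynamicPPL

# Hypothetical illustration of the unflatten-then-evaluate pattern.
function logdensity_via_unflatten(model::Model, varinfo::AbstractVarInfo, params::AbstractVector{<:Real})
    vi = DynamicPPL.unflatten(varinfo, params)  # 1. write the flat vector into the VarInfo
    _, vi = DynamicPPL.evaluate!!(model, vi)    # 2. re-evaluate, reading values out of the VarInfo
    return DynamicPPL.getlogjoint(vi)           # 3. read off the accumulated log density (accessor name assumed)
end
```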
Now, there is really no reason why LDF should EVER need to write to the VarInfo. In my opinion it should be an error (or undefined behaviour) to access `ldf.varinfo` once the LDF has been constructed. It should be completely opaque: you feed in a vector of parameters, you get a float out. How it does this, whether it mutates `ldf.varinfo`, etc., should be none of the user's business.

So, FastLDF removes all this internal business and instead calculates the log-density by extracting, at construction time, each VarName's range in the parameter vector (and its link status) from the VarInfo, and then, at evaluation time, reading each variable's value directly out of the input vector using those ranges. That means that FastLDF doesn't need to carry around Metadata at all; it can just carry accumulators around.
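To make that concrete, here is a minimal sketch of the range-based read (not the actual implementation; `RangeAndLinked` appears in the diff, but its fields are assumed here, a hypothetical name is used instead, and plain Symbols stand in for VarNames):

```julia
# Each variable gets a precomputed range into the flat parameter vector,
# plus a flag for whether it is stored in linked (transformed) space.
struct MyRangeAndLinked
    range::UnitRange{Int}
    linked::Bool
end

# Reading a variable's value is then just a view into the parameter vector:
# no VarInfo lookup, no unflatten, no mutation.
read_value(params::AbstractVector{<:Real}, info::MyRangeAndLinked) = view(params, info.range)

ranges = (x = MyRangeAndLinked(1:1, false), y = MyRangeAndLinked(2:4, false))
params = [0.3, 1.0, 2.0, 3.0]
read_value(params, ranges.y)   # view of params[2:4]
```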
Benchmarks
And when I say faster, I mean A LOT faster. Note that all of these benchmarks are run on 1.11.7, with 1 thread (using more than 1 thread kills performance):
I expected it to be faster, but I am frankly amazed myself at just how much faster it is. For these models, this probably brings Turing much closer to Stan (Julia TTFX is still an issue of course).
MCMC sampling
Here is MCMC sampling on the eight schools model, using NUTS, which effectively uses nothing but `LogDensityFunction` in a tight loop. Given the above benchmarks we should expect a 3x or 4x speedup, and that is indeed what we get, with the only code change being `LogDensityFunction` -> `FastLDF` in `src/mcmc/hmc.jl`.

Notice that the DPPL benchmarks above only used unlinked parameters, but this also confirms that linked parameters have similar speedups:
Following up from the Stan claim in the previous section: it's still not quite the same. For eight-schools, with Stan I managed to sample 20,000 iterations in 0.50 seconds, and the fastest I could squeeze out of Turing/EnzymeReverse was 1.49 seconds with `FastLDF`. Still, that's WAY better than with `LogDensityFunction`, which takes 8.8 seconds.

(For what it's worth, the non-centred eight-schools model runs in 0.5 seconds with Enzyme/Turing/FastLDF, but Stan gets down to 0.26 seconds... Zeno's paradox, argh!)
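For reference, a sketch of the centred eight-schools model being discussed (standard data; the exact model, priors, and settings used for the timings above are not shown here, so treat this as illustrative only):

```julia
using Turing, LinearAlgebra

# Standard eight-schools data (treatment effects and standard errors).
y     = [28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0]
sigma = [15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]

# Centred parameterisation: the per-school effects theta are drawn directly
# around the population mean mu.
@model function eight_schools_centred(y, sigma)
    mu ~ Normal(0, 5)
    tau ~ truncated(Cauchy(0, 5); lower=0)
    theta ~ MvNormal(fill(mu, length(y)), tau^2 * I)
    y ~ MvNormal(theta, Diagonal(sigma .^ 2))
end

chain = sample(eight_schools_centred(y, sigma), NUTS(), 1_000)
```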
Remove `unflatten`?!

I have yet to investigate this fully, but I strongly believe that we can actually completely get rid of `unflatten` now. `unflatten` only ever served one purpose, which was to obey the LogDensityProblems interface. But FastLDF does that without using `unflatten`.

Note that if we want to convert a vector of parameters into a Dict of parameters (say, after we finish sampling), we can reuse the `FastLogDensityAt` code and just tack on a `ValuesAsInModelAccumulator`. No need to use `unflatten` + `evaluate!!`.

Next steps
I'm happy with where this PR is. See #1120 for a checklist of things to investigate before this becomes the 'official' LDF.