Fast Log Density Function #1113
Conversation
Benchmark Report for Commit 8715446 (Computer Information / Benchmark Results)
Codecov Report

❌ Patch coverage is [...]

@@ Coverage Diff @@
## main #1113 +/- ##
==========================================
+ Coverage 81.48% 81.72% +0.24%
==========================================
Files 40 41 +1
Lines 3845 3956 +111
==========================================
+ Hits 3133 3233 +100
- Misses 712 723 +11
DynamicPPL.jl documentation for PR #1113 is available at: [...]
OMG. PRotY?
This is super exciting! Is there any reason for [...]
I was confused as to why this made such a huge difference, given that the logic of what eventually gets executed seems quite similar to how Metadata works: there's a dict with VarNames as keys that gives you the range and the link status (in Metadata with an extra level of indirection because of looking up Vector elements, but that should be pretty cheap), and we use those to index into the vector of values and get the transform. The only obvious difference I could see was the use of [...]

Now the question is, why does our current Metadata still do 4 allocations even when using [...]

```julia
# This is benchmarking done wrong.
julia> module MiniBench
       using Chairmarks
       v = randn(8)
       f(v) = (v,)
       println(@b f(v))
       end
Sample(evals=831, time=2.261371841155235e-8, allocs=1, bytes=16)
```

This makes me want to rethink VarInfo again from the ground up, try to see what's the minimal structure we can get away with if we use this sort of approach where we separate holding the values of variables from holding information about how individual variable values are read. Need to think through any edge cases, but I'm prrrretty excited for this.
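For reference, the Metadata lookup path being described is roughly the following (a simplified sketch; the field names follow DynamicPPL's `Metadata`, and linking / transform details are omitted):

```julia
using DynamicPPL

# Sketch of how a variable's vectorised value is fetched via Metadata:
# a Dict lookup gives an integer index, the index gives a range, and the
# range slices the flat values vector (the "extra level of indirection").
function lookup_value(md::DynamicPPL.Metadata, vn::VarName)
    idx = md.idcs[vn]      # VarName -> integer index
    rng = md.ranges[idx]   # integer index -> range into md.vals
    return md.vals[rng]    # the vectorised value of vn
end
```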
@mhauru I think in your minibenchmark, the allocation is due to accessing the non-constant global variable `v`.
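As an illustration of that point (a sketch only, not re-run as part of this thread): declaring the global `const` makes the benchmarked call type-stable, so the allocation attributed to the untyped global should disappear.

```julia
# Same mini-benchmark as above, but with `v` declared `const`.
module MiniBenchConst
using Chairmarks
const v = randn(8)
f(v) = (v,)
println(@b f(v))
end
```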
```julia
    accs::Accs
end
DynamicPPL.getaccs(vi::OnlyAccsVarInfo) = vi.accs
DynamicPPL.maybe_invlink_before_eval!!(vi::OnlyAccsVarInfo, ::Model) = vi
```
It turns out that a substantial amount of the performance gains were in this innocuous line 😄 Just optimising the implementation of this function for a normal VarInfo, plus sticking some @views in appropriate places, gets the old LogDensityFunction to within ~50% of FastLDF's performance. See #1115.
In some ways the simplification here is a result of the current approach: because the linked status is no longer stored in a VarInfo object but rather in a separate struct, there is no longer any need to care about linking / invlinking a VarInfo.
But overall I think this means that the performance gains from the current approach are much more comprehensible.
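As a side note on the `@views` point above (a generic illustration, unrelated to DynamicPPL internals): slicing an array with a range copies by default, whereas a view reuses the parent array's memory, which is where the avoided allocations come from.

```julia
x = randn(1_000)

sum(x[1:10])         # the slice allocates a new 10-element array
sum(@view x[1:10])   # a view: no copy
@views sum(x[1:10])  # the macro form rewrites the slice into a view
```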
Not really. There's one model above where ForwardDiff is a bit slower, but my suspicion is that it could be optimised away in a similar way to declaring zero derivatives for Mooncake/Enzyme. I think the goal should definitely be to replace LDF. (At the PR stage though, having both structs does make benchmarking much easier 😉)
There's a lot to be said about this PR. However, my most prominent lesson so far is that [...]
Point 3 is very much worth underlining. My first thought was that this is cool for small models, but no, this matters for all models. Here's Penny's benchmarking run on a slightly bigger model:

```julia
@model function f()
    dim = 1_000
    x ~ MvNormal(zeros(dim), I)
    y ~ product_distribution([Exponential() for _ in 1:dim])
    0.0 ~ Normal(sum(x), sum(y))
    return nothing
end

benchmark_ldfs(f())
```

```
LogDensityFunction: eval ---- 10.042 μs (25 allocs: 56.766 KiB)
FastLDF: eval ---- 6.531 μs (10 allocs: 24.234 KiB)
LogDensityFunction: grad (FD) ---- 17.710 ms (4184 allocs: 107.243 MiB, 34.66% gc time)
FastLDF: grad (FD) ---- 8.143 ms (1674 allocs: 23.538 MiB, 33.01% gc time)
LogDensityFunction: grad (RD) ---- 1.377 ms (23146 allocs: 985.000 KiB)
FastLDF: grad (RD) ---- 2.313 ms (38082 allocs: 1.430 MiB)
```

The above numbers scale linearly in the dimension of the model (at least for eval; I didn't look at grads).

As I mentioned above, I was confused why the changes in this PR made such a difference, when the logic in the pipeline of what is done to the input vector ends up being quite similar. To understand that, @penelopeysm and I made a PR that gets allocations down to 0 for small models using trusty old Metadata: #1115. Comparing the code in that PR vs FastLDF, they are on par for large models:

```julia
@model function f()
    dim = 1_000
    x ~ MvNormal(zeros(dim), I)
    y ~ product_distribution([Exponential() for _ in 1:dim])
    0.0 ~ Normal(sum(x), sum(y))
    return nothing
end

benchmark_ldfs(f())
```

```
LogDensityFunction: eval ---- 6.709 μs (10 allocs: 24.234 KiB)
FastLDF: eval ---- 6.458 μs (10 allocs: 24.234 KiB)
LogDensityFunction: grad (FD) ---- 7.024 ms (2347 allocs: 23.806 MiB, 13.95% gc time)
FastLDF: grad (FD) ---- 5.223 ms (1674 allocs: 23.538 MiB)
LogDensityFunction: grad (RD) ---- 2.323 ms (38097 allocs: 1.448 MiB)
FastLDF: grad (RD) ---- 2.290 ms (38082 allocs: 1.430 MiB)
```

while FastLDF still has an edge, but a smaller edge, for small models:

```julia
@model function f()
    x ~ Normal()
    0.0 ~ Normal(x)
end

benchmark_ldfs(f())
```

```
LogDensityFunction: eval ---- 40.790 ns
FastLDF: eval ---- 16.743 ns
LogDensityFunction: grad (FD) ---- 239.837 ns (7 allocs: 272 bytes)
FastLDF: grad (FD) ---- 61.845 ns (3 allocs: 96 bytes)
LogDensityFunction: grad (RD) ---- 6.625 μs (107 allocs: 4.000 KiB)
FastLDF: grad (RD) ---- 5.775 μs (81 allocs: 2.781 KiB)
```

That's the sort of difference and scaling I can understand, given the changes to internal data structures. Now that we know we can do 0-allocation evals, unless there's something deeply conceptually problematic about the use of [...]
```julia
struct FastLDFContext{N<:NamedTuple,T<:AbstractVector{<:Real}} <: AbstractContext
    # The ranges of identity VarNames are stored in a NamedTuple for improved performance
    # (it's around 1.5x faster).
    iden_varname_ranges::N
```
When talking with @penelopeysm, I did a hasty check of whether this mechanism makes a substantial difference to performance, as opposed to having them all in `varname_ranges` as a Dict. My conclusion was no, hers was yes. It could be that I botched my check, but I was also running 1.12.1 and she's on 1.11.7, so we should check whether that's significant.
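For context, the mechanism being compared is roughly the following (an illustrative sketch; plain Symbols stand in for identity VarNames, and the names here are not the PR's actual code):

```julia
# NamedTuple: the key is part of the type, so `ranges.x` is resolved at
# compile time. Dict: the key is hashed and looked up at runtime.
const nt_ranges   = (x = 1:1, y = 2:4)
const dict_ranges = Dict{Symbol,UnitRange{Int}}(:x => 1:1, :y => 2:4)

lookup_nt(ranges)   = ranges.x
lookup_dict(ranges) = ranges[:x]

lookup_nt(nt_ranges)      # static field access
lookup_dict(dict_ranges)  # runtime hashing + lookup
```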
Another upshot of this is that, because FastLDF no longer relies on a particular kind of VarInfo (it only uses one at the beginning to construct the ranges), AD doesn't need to handle every kind of SimpleVarInfo, etc. It only needs to handle [...]
OK, I think this is ready for proper review. I've elected to keep it named as FastLDF, under the DynamicPPL.Experimental module, and therefore to use a patch release. The reasons for this are:

- We can move fast and break things in patch releases, without worrying about contaminating the `breaking` branch with something that isn't yet ready for release.
- With both structs side by side, we can not only benchmark them against each other, but also use them for correctness tests. I think that's quite worth having, especially in this nascent period where things look really good but we're unsure if we messed up some weird edge case.
- Although it's fairly close, it's IMO not production ready yet. There are a few things I'd still like to sort out before we replace LDF, which I'll list in #1120.
```julia
function get_ranges_and_linked(varinfo::VarInfo{<:NamedTuple{syms}}) where {syms}
    all_iden_ranges = NamedTuple()
    all_ranges = Dict{VarName,RangeAndLinked}()
    offset = 1
    for sym in syms
        md = varinfo.metadata[sym]
        this_md_iden, this_md_others, offset = get_ranges_and_linked_metadata(md, offset)
        all_iden_ranges = merge(all_iden_ranges, this_md_iden)
        all_ranges = merge(all_ranges, this_md_others)
    end
    return all_iden_ranges, all_ranges
end
```
This function, and the ranges it extracts, is written in a way to be 'compatible' with what `unflatten` / `vi[:]` currently does. However, I would like to believe that in the future there will not even be any reason to use this. If you think about it, the ranges are completely meaningless:

```julia
Dict(@varname(a) => 1:1, @varname(b) => 2:2, @varname(c) => 3:5)
```

really, at the end of the day, describes the same thing as

```julia
Dict(@varname(b) => 1:1, @varname(c) => 2:4, @varname(a) => 5:5)
```

It just amounts to a permutation of the parameter vector, and the exact permutation doesn't matter! Of course, what really matters is that it's consistent between different bits of the code. However, I would actually like to propose that we (one day) get rid of this dependence on the internal Metadata structure, and instead do something like this:
- Get a sorted list of all VarNames.
- Get their lengths.
- The parameter vector is just those VarNames in ascending order.
This would greatly simplify the code here, and probably also simplify the code for varinfo[:]. It WOULD make unflatten a pain. But that's why I say 'one day': I am really hoping that this PR is the one that sounds the death knell for unflatten.
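A rough sketch of what that proposed scheme could look like (a hypothetical helper; plain strings stand in for VarNames, and each variable's vectorised length is assumed to be known):

```julia
# Build parameter-vector ranges by sorting the variable names and laying
# their values out in ascending order.
function ranges_from_lengths(lengths::AbstractDict)
    names = sort!(collect(keys(lengths)); by=string)
    ranges = Dict{eltype(names),UnitRange{Int}}()
    offset = 1
    for name in names
        ranges[name] = offset:(offset + lengths[name] - 1)
        offset += lengths[name]
    end
    return ranges
end

ranges_from_lengths(Dict("a" => 1, "b" => 1, "c" => 3))
# Dict("a" => 1:1, "b" => 2:2, "c" => 3:5)
```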
Just a note that came to mind: It might make a difference for performance to have the elements of the vals vector be roughly in the order in which they are read, to make use of caching and such. If e.g. x[1] and x[2] are at different ends, then a loop over x[i] might cause very inefficient memory reads. This isn't incompatible with the above, just a thing to keep in mind when redesigning variable order.
I need to read the PR properly, but based on skimming, I'm still a bit confused by the performance gains here.
Just a note to say that I tried with FixedSizeArrays as inputs and things were sometimes marginally faster but mostly marginally slower (especially AD; not sure if there might be some potential for optimisation there).
* Make InitContext work with OnlyAccsVarInfo
* Do not convert NamedTuple to Dict
* remove logging
* Enable InitFromPrior and InitFromUniform too
* Fix `infer_nested_eltype` invocation
This PR introduces `FastLDF`, which is a much faster version of `LogDensityFunction`. How much faster, you ask? Read on!

Motivation

Currently, LDF does two things to calculate the log-density: it writes the flat parameter vector into its VarInfo using `unflatten`, and then re-evaluates the model, reading the values back out of that VarInfo. Note that reading values from the VarInfo is relatively slow, because you have to index into NamedTuples, Dicts, and so on.
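Schematically, the current flow looks something like this (a sketch under assumptions: exact DynamicPPL signatures and accessor names vary between versions):

```julia
using DynamicPPL

# Hypothetical illustration of the unflatten-then-evaluate pattern.
function logdensity_via_unflatten(model::Model, varinfo::AbstractVarInfo, params::AbstractVector{<:Real})
    vi = DynamicPPL.unflatten(varinfo, params)  # 1. write the flat vector into the VarInfo
    _, vi = DynamicPPL.evaluate!!(model, vi)    # 2. re-evaluate, reading values out of the VarInfo
    return DynamicPPL.getlogjoint(vi)           # 3. read off the accumulated log density (accessor name assumed)
end
```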
Now, there is really no reason why LDF should EVER need to write to the VarInfo. In my opinion it should be an error (or undefined behaviour) to access `ldf.varinfo` once the LDF has been constructed. It should be completely opaque: you feed in a vector of parameters, you get a float out. How it does this, whether it mutates `ldf.varinfo`, etc., should be none of the user's business.

So, FastLDF removes all this internal business and instead calculates the log-density by extracting, at construction time, each VarName's range in the parameter vector (and its link status) from the VarInfo, and then, at evaluation time, reading each variable's value directly out of the input vector using those ranges. That means that FastLDF doesn't need to carry around Metadata at all; it can just carry accumulators around.
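To make that concrete, here is a minimal sketch of the range-based read (not the actual implementation; `RangeAndLinked` appears in the diff, but its fields are assumed here, a hypothetical name is used instead, and plain Symbols stand in for VarNames):

```julia
# Each variable gets a precomputed range into the flat parameter vector,
# plus a flag for whether it is stored in linked (transformed) space.
struct MyRangeAndLinked
    range::UnitRange{Int}
    linked::Bool
end

# Reading a variable's value is then just a view into the parameter vector:
# no VarInfo lookup, no unflatten, no mutation.
read_value(params::AbstractVector{<:Real}, info::MyRangeAndLinked) = view(params, info.range)

ranges = (x = MyRangeAndLinked(1:1, false), y = MyRangeAndLinked(2:4, false))
params = [0.3, 1.0, 2.0, 3.0]
read_value(params, ranges.y)   # view of params[2:4]
```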
Benchmarks
And when I say faster, I mean A LOT faster. Note that all of these benchmarks are run on 1.11.7, with 1 thread (using more than 1 thread kills performance):
I expected it to be faster, but I am frankly amazed myself at just how much faster it is. For these models, this probably brings Turing much closer to Stan (Julia TTFX is still an issue of course).
MCMC sampling
Here is MCMC sampling on the eight schools model, using NUTS, which effectively uses nothing but `LogDensityFunction` in a tight loop. Given the above benchmarks we should expect a 3x or 4x speedup, and that is indeed what we get, with the only code change being `LogDensityFunction` -> `FastLDF` in `src/mcmc/hmc.jl`.

Notice that the DPPL benchmarks above only used unlinked parameters, but this also confirms that linked parameters have similar speedups:
Following up from the Stan claim in the previous section: it's still not quite the same. For eight-schools, with Stan I managed to sample 20,000 iterations in 0.50 seconds, and the fastest I could squeeze out of Turing/EnzymeReverse was 1.49 seconds with `FastLDF`. Still, that's WAY better than with `LogDensityFunction`, which takes 8.8 seconds.

(For what it's worth, the non-centred eight-schools model runs in 0.5 seconds with Enzyme/Turing/FastLDF, but Stan gets down to 0.26 seconds... Zeno's paradox, argh!)
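For reference, a sketch of the centred eight-schools model being discussed (standard data; the exact model, priors, and settings used for the timings above are not shown here, so treat this as illustrative only):

```julia
using Turing, LinearAlgebra

# Standard eight-schools data (treatment effects and standard errors).
y     = [28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0]
sigma = [15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]

# Centred parameterisation: the per-school effects theta are drawn directly
# around the population mean mu.
@model function eight_schools_centred(y, sigma)
    mu ~ Normal(0, 5)
    tau ~ truncated(Cauchy(0, 5); lower=0)
    theta ~ MvNormal(fill(mu, length(y)), tau^2 * I)
    y ~ MvNormal(theta, Diagonal(sigma .^ 2))
end

chain = sample(eight_schools_centred(y, sigma), NUTS(), 1_000)
```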
Remove `unflatten`?!

I have yet to investigate this fully, but I strongly believe that we can actually completely get rid of `unflatten` now. `unflatten` only ever served one purpose, which was to obey the LogDensityProblems interface. But FastLDF does that without using `unflatten`.

Note that if we want to convert a vector of parameters into a Dict of parameters (say, after we finish sampling), we can reuse the `FastLogDensityAt` code and just tack on a `ValuesAsInModelAccumulator`. No need to use `unflatten` + `evaluate!!`.

Next steps
I'm happy with where this PR is. See #1120 for a checklist of things to investigate before this becomes the 'official' LDF.