
Conversation

@penelopeysm (Member) commented Nov 6, 2025

This PR introduces FastLDF, which is a much faster version of LogDensityFunction. How much faster, you ask? Read on!

Motivation

Currently, LDF does two things to calculate the log-density:

  1. It sticks values into the VarInfo using unflatten.
  2. It then evaluates the model using the VarInfo, which means reading values from the VarInfo again.

Note that reading values from the VarInfo is relatively slow, because you have to index into NamedTuples, Dicts, and so on.

Now, there is really no reason why LDF should EVER need to write to the VarInfo. In my opinion it should be an error (or undefined behaviour) to access ldf.varinfo once the LDF has been constructed. It should be completely opaque: you feed in a vector of parameters, you get a float out. How it does this, whether it mutates ldf.varinfo, etc., should be none of the user's business.

So, FastLDF removes all this internal business and instead calculates the log-density by:

  1. Untangling the internal structure of the VarInfo to find out which parts of the input vector correspond to which VarNames;
  2. Simply reading from the provided vector.

That means that FastLDF doesn't need to carry around Metadata at all; it can just carry accumulators around.
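
To make this concrete, here is a toy sketch of the idea (illustrative only; the layout and names are made up, and this is not the actual FastLDF code): once you know which slice of the flat parameter vector belongs to which variable, the log-density can be computed by reading those slices directly, without ever writing into a VarInfo.

using Distributions

# Hypothetical layout: variable => range into the flat parameter vector.
toy_layout = Dict(:mu => 1:1, :theta => 2:9)

function toy_logdensity(layout, x::AbstractVector{<:Real})
    mu = x[only(layout[:mu])]
    theta = view(x, layout[:theta])    # a view avoids copying the slice
    return logpdf(Normal(0, 5), mu) + sum(logpdf.(Normal(mu, 1), theta))
end

toy_logdensity(toy_layout, randn(9))   # vector in, float out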

Benchmarks

And when I say faster, I mean A LOT faster. Note that all of these benchmarks were run on Julia 1.11.7 with 1 thread (using more than 1 thread kills performance):

using DynamicPPL, Distributions, LogDensityProblems, Chairmarks, LinearAlgebra
using ADTypes, ForwardDiff, Mooncake, ReverseDiff, Enzyme

const adtypes = [
    ("FD", AutoForwardDiff()),
    ("RD", AutoReverseDiff()),
    ("MC", AutoMooncake()),
    ("EN" => AutoEnzyme(; mode=set_runtime_activity(Reverse), function_annotation=Const))
]

function benchmark_ldfs(model)
    vi = VarInfo(model)
    x = vi[:]
    ldf_no = DynamicPPL.LogDensityFunction(model, getlogjoint, vi)
    fldf_no = DynamicPPL.FastLDF(model, getlogjoint, vi)
    @assert LogDensityProblems.logdensity(ldf_no, x) ≈ LogDensityProblems.logdensity(fldf_no, x)
    print("LogDensityFunction: eval      ----  ")
    display(median(@be LogDensityProblems.logdensity(ldf_no, x)))
    print("           FastLDF: eval      ----  ")
    display(median(@be LogDensityProblems.logdensity(fldf_no, x)))
    for name_adtype in adtypes
        name, adtype = name_adtype
        ldf = DynamicPPL.LogDensityFunction(model, getlogjoint, vi; adtype=adtype)
        fldf = DynamicPPL.FastLDF(model, getlogjoint, vi; adtype=adtype)
        @assert LogDensityProblems.logdensity_and_gradient(ldf, x)[2] ≈ LogDensityProblems.logdensity_and_gradient(fldf, x)[2]
        print("LogDensityFunction: grad ($name) ----  ")
        display(median(@be LogDensityProblems.logdensity_and_gradient(ldf, x)))
        print("           FastLDF: grad ($name) ----  ")
        display(median(@be LogDensityProblems.logdensity_and_gradient(fldf, x)))
    end
end

# ------------------------------------------------------------

# Trivial model.
# The eagle-eyed will notice that compared to previous stages of this PR, the benchmarks are a tiny bit
# worse (~ 10%). That's because I had to include features to handle things like thread safety and dual types.
@model f() = x ~ Normal()
benchmark_ldfs(f())
#=
LogDensityFunction: eval      ----  172.653 ns (6 allocs: 192 bytes)
           FastLDF: eval      ----  10.925 ns
LogDensityFunction: grad (FD) ----  308.464 ns (13 allocs: 496 bytes)
           FastLDF: grad (FD) ----  55.607 ns (3 allocs: 96 bytes)
LogDensityFunction: grad (RD) ----  4.315 μs (82 allocs: 3.062 KiB)
           FastLDF: grad (RD) ----  3.144 μs (46 allocs: 1.562 KiB)
LogDensityFunction: grad (MC) ----  1.099 μs (25 allocs: 1.219 KiB)
           FastLDF: grad (MC) ----  337.402 ns (4 allocs: 192 bytes)
LogDensityFunction: grad (EN) ----  435.262 ns (16 allocs: 560 bytes)
           FastLDF: grad (EN) ----  127.580 ns (2 allocs: 64 bytes)
=#

# ------------------------------------------------------------

# More realistic (albeit still quite small) model.
y = [28, 8, -3, 7, -1, 1, 18, 12]
sigma = [15, 10, 16, 11, 9, 11, 10, 18]
@model function eight_schools(y, sigma)
    mu ~ Normal(0, 5)
    tau ~ truncated(Cauchy(0, 5); lower=0)
    theta ~ MvNormal(fill(mu, length(y)), tau^2 * I)
    for i in eachindex(y)
        y[i] ~ Normal(theta[i], sigma[i])
    end
    return (mu=mu, tau=tau)
end
benchmark_ldfs(eight_schools(y, sigma))
#=
LogDensityFunction: eval      ----  840.676 ns (21 allocs: 1.344 KiB)
           FastLDF: eval      ----  210.714 ns (4 allocs: 256 bytes)
LogDensityFunction: grad (FD) ----  1.544 μs (28 allocs: 5.484 KiB)
           FastLDF: grad (FD) ----  664.634 ns (11 allocs: 2.594 KiB)
LogDensityFunction: grad (RD) ----  41.042 μs (614 allocs: 25.562 KiB)
           FastLDF: grad (RD) ----  39.291 μs (562 allocs: 20.562 KiB)
LogDensityFunction: grad (MC) ----  4.321 μs (64 allocs: 4.016 KiB)
           FastLDF: grad (MC) ----  1.198 μs (12 allocs: 784 bytes)
LogDensityFunction: grad (EN) ----  1.797 μs (44 allocs: 2.609 KiB)
           FastLDF: grad (EN) ----  715.113 ns (13 allocs: 832 bytes)
=#

# Just for comparison. I don't doubt that there are overheads calling Stan from Julia, etc.
using PosteriorDB, LogDensityProblems, Chairmarks
using StanLogDensityProblems: StanProblem
pdb = PosteriorDB.database()
post = PosteriorDB.posterior(pdb, "eight_schools-eight_schools_centered")
prob = StanProblem(post, ".")
x = rand(10)
@be LogDensityProblems.logdensity($prob, $x)
#=
julia> @be LogDensityProblems.logdensity($prob, $x)
Benchmark: 3226 samples with 34 evaluations
 min    821.059 ns (1 allocs: 16 bytes)
 median 848.029 ns (1 allocs: 16 bytes)
 mean   853.905 ns (1 allocs: 16 bytes)
 max    1.395 μs (1 allocs: 16 bytes)
=#

# ------------------------------------------------------------

# One with lots of non-identity VarNames.
@model function badvarnames()
    N = 20
    x = Vector{Float64}(undef, N)
    for i in 1:N
        x[i] ~ Normal()
    end
end
benchmark_ldfs(badvarnames())
#=
LogDensityFunction: eval      ----  1.465 μs (46 allocs: 1.906 KiB)
           FastLDF: eval      ----  358.337 ns (2 allocs: 224 bytes)
LogDensityFunction: grad (FD) ----  4.458 μs (103 allocs: 14.266 KiB)
           FastLDF: grad (FD) ----  2.640 μs (11 allocs: 4.281 KiB)
LogDensityFunction: grad (RD) ----  63.208 μs (1076 allocs: 38.828 KiB)
           FastLDF: grad (RD) ----  51.792 μs (773 allocs: 27.438 KiB)
LogDensityFunction: grad (MC) ----  6.802 μs (160 allocs: 7.000 KiB)
           FastLDF: grad (MC) ----  2.074 μs (28 allocs: 1.094 KiB)
LogDensityFunction: grad (EN) ----  3.181 μs (64 allocs: 6.000 KiB)
           FastLDF: grad (EN) ----  1.693 μs (5 allocs: 2.047 KiB)
=#

# ------------------------------------------------------------

# Submodels too.
@model function inner()
    m ~ Normal(0, 1)
    s ~ Exponential()
    return (m=m, s=s)
end
@model function withsubmodel()
    params ~ to_submodel(inner())
    y ~ Normal(params.m, params.s)
    1.0 ~ Normal(y)
end
benchmark_ldfs(withsubmodel())
#=
LogDensityFunction: eval      ----  921.875 ns (20 allocs: 1.234 KiB)
           FastLDF: eval      ----  105.248 ns
LogDensityFunction: grad (FD) ----  1.191 μs (27 allocs: 2.219 KiB)
           FastLDF: grad (FD) ----  173.485 ns (3 allocs: 112 bytes)
LogDensityFunction: grad (RD) ----  14.500 μs (221 allocs: 9.266 KiB)
           FastLDF: grad (RD) ----  11.188 μs (148 allocs: 5.188 KiB)
LogDensityFunction: grad (MC) ----  5.708 μs (72 allocs: 3.312 KiB)
           FastLDF: grad (MC) ----  575.000 ns (6 allocs: 240 bytes)
LogDensityFunction: grad (EN) ----  2.451 μs (52 allocs: 2.500 KiB)
           FastLDF: grad (EN) ----  328.591 ns (2 allocs: 80 bytes)
=#

I expected it to be faster, but I am frankly amazed myself at just how much faster it is. For these models, this probably brings Turing much closer to Stan (Julia TTFX is still an issue of course).

MCMC sampling

Here is MCMC sampling on the eight schools model, using NUTS, which effectively uses nothing but LogDensityFunction in a tight loop. Given the above benchmarks we should expect a 3x or 4x speedup, and that is indeed what we get, with the only code change being LogDensityFunction -> FastLDF in src/mcmc/hmc.jl.

Notice that the DPPL benchmarks above only used unlinked parameters, but this also confirms that linked parameters have similar speedups:

using Turing, Random, Mooncake

# Run once with released Turing (which uses LogDensityFunction in src/mcmc/hmc.jl) and once
# with that swapped to FastLDF; the labelled timings below collate both runs.
for adtype in [AutoForwardDiff(), AutoMooncake()]
    @time chain = sample(Xoshiro(468), eight_schools(y, sigma), NUTS(; adtype=adtype), 1000; progress=false)
end

#=
LogDensityFunction ForwardDiff - 0.556647 seconds (8.00 M allocations: 774.359 MiB, 9.03% gc time)
           FastLDF ForwardDiff - 0.190099 seconds (3.47 M allocations: 393.619 MiB, 12.40% gc time)
LogDensityFunction Mooncake    - 0.794813 seconds (10.64 M allocations: 708.145 MiB, 5.49% gc time)
           FastLDF Mooncake    - 0.216566 seconds (3.59 M allocations: 212.214 MiB, 5.68% gc time)
=#

Following up from the Stan claim in the previous section: It's still not quite the same. For eight-schools, with Stan I managed to sample 20000 iterations in 0.50 seconds, and the fastest I could squeeze out of Turing/EnzymeReverse was 1.49 seconds with FastLDF. Still, that's WAY better than with LogDensityFunction, which takes 8.8 seconds.

(For what it's worth, the non-centred eight-schools model runs in 0.5 seconds with Enzyme/Turing/FastLDF, but Stan gets down to 0.26 seconds... Zeno's paradox, argh!)

Remove unflatten?!

I have yet to investigate this fully, but I strongly believe that we can actually completely get rid of unflatten now. unflatten only ever served one purpose, which was to obey the LogDensityProblems interface. But FastLDF does that without using unflatten.

Note that if we want to convert a vector of parameters into a Dict of parameters (say, after we finish sampling), we can reuse the FastLogDensityAt code and just tack on ValuesAsInModelAccumulator. No need to use unflatten + evaluate!!.

Next steps

I'm happy with where this PR is. See #1120 for a checklist of things to investigate before this becomes the 'official' LDF.

github-actions bot (Contributor) commented Nov 6, 2025

Benchmark Report for Commit 8715446

Computer Information

Julia Version 1.11.7
Commit f2b3dbda30a (2025-09-08 12:10 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)

Benchmark Results

┌───────────────────────┬───────┬─────────────┬───────────────────┬────────┬────────────────┬─────────────────┐
│                 Model │   Dim │  AD Backend │           VarInfo │ Linked │ t(eval)/t(ref) │ t(grad)/t(eval) │
├───────────────────────┼───────┼─────────────┼───────────────────┼────────┼────────────────┼─────────────────┤
│ Simple assume observe │     1 │ forwarddiff │             typed │  false │            6.6 │             1.8 │
│           Smorgasbord │   201 │ forwarddiff │             typed │  false │          740.3 │            43.3 │
│           Smorgasbord │   201 │ forwarddiff │ simple_namedtuple │   true │          415.3 │            56.3 │
│           Smorgasbord │   201 │ forwarddiff │           untyped │   true │          807.3 │            38.5 │
│           Smorgasbord │   201 │ forwarddiff │       simple_dict │   true │         6576.3 │            27.2 │
│           Smorgasbord │   201 │ forwarddiff │      typed_vector │   true │          807.5 │            40.0 │
│           Smorgasbord │   201 │ forwarddiff │    untyped_vector │   true │          809.7 │            37.0 │
│           Smorgasbord │   201 │ reversediff │             typed │   true │          910.8 │            46.2 │
│           Smorgasbord │   201 │    mooncake │             typed │   true │          733.3 │             5.9 │
│           Smorgasbord │   201 │      enzyme │             typed │   true │          919.9 │             4.0 │
│    Loop univariate 1k │  1000 │    mooncake │             typed │   true │         3990.7 │             5.8 │
│       Multivariate 1k │  1000 │    mooncake │             typed │   true │         1037.2 │             8.9 │
│   Loop univariate 10k │ 10000 │    mooncake │             typed │   true │        43927.4 │             5.3 │
│      Multivariate 10k │ 10000 │    mooncake │             typed │   true │         8782.4 │             9.9 │
│               Dynamic │    10 │    mooncake │             typed │   true │          123.9 │            11.4 │
│              Submodel │     1 │    mooncake │             typed │   true │            8.8 │             6.6 │
│                   LDA │    12 │ reversediff │             typed │   true │         1001.9 │             2.1 │
└───────────────────────┴───────┴─────────────┴───────────────────┴────────┴────────────────┴─────────────────┘

codecov bot commented Nov 6, 2025

Codecov Report

❌ Patch coverage is 89.76378% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.72%. Comparing base (08fffa2) to head (8715446).

Files with missing lines Patch % Lines
src/fasteval.jl 90.74% 10 Missing ⚠️
src/compiler.jl 85.71% 2 Missing ⚠️
ext/DynamicPPLEnzymeCoreExt.jl 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1113      +/-   ##
==========================================
+ Coverage   81.48%   81.72%   +0.24%     
==========================================
  Files          40       41       +1     
  Lines        3845     3956     +111     
==========================================
+ Hits         3133     3233     +100     
- Misses        712      723      +11     


github-actions bot (Contributor) commented Nov 6, 2025

DynamicPPL.jl documentation for PR #1113 is available at:
https://TuringLang.github.io/DynamicPPL.jl/previews/PR1113/

@penelopeysm penelopeysm marked this pull request as draft November 6, 2025 02:34
@penelopeysm penelopeysm marked this pull request as ready for review November 6, 2025 03:27
@mhauru (Member) commented Nov 6, 2025

OMG. PRotY?

@sethaxen (Member) commented Nov 6, 2025

This is super exciting! Is there any reason for FastLDF to be a separate struct from LogDensityFunction instead of replacing it?

@mhauru (Member) commented Nov 6, 2025

I was confused as to why this made such a huge difference given that the logic of what gets eventually executed seems quite similar to how Metadata works: There's a dict with VarNames as keys that gives you the range and the link status (in Metadata with an extra level of indirection because of looking up Vector elements, but that should be pretty cheap), and we use those to index into the vector of values and get the transform.

The only obvious difference I could see was the use of @view when indexing vals. I stuck a similar @view into getindex(_internal) of Metadata and that cut the allocations it made for the trivial test model from 6 to 4, and the runtime by about a third as well. So that makes sense in the light of FastLDF having zero allocations: Most of the time is spent doing that alloc.
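
For illustration, the generic Julia effect at play here (nothing DynamicPPL-specific; vals and r are made-up stand-ins for the metadata values vector and a variable's range):

using Chairmarks

vals = randn(100)
r = 3:7
@b sum(getindex($vals, $r))   # plain indexing copies the slice before summing, so it allocates
@b sum(view($vals, $r))       # a view reads through to the parent vector, no copy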

Now the question is, why does our current Metadata still do 4 allocations even when using @view? EDIT: The following is wrong, thanks @nsiccha. I had not understood this before, but it turns out that sticking the Vector of values given as input to the LDF into any sort of struct causes an allocation, even when you don't copy the vector. Even this allocates:

# This is benchmarking done wrong.
julia> module MiniBench
       using Chairmarks
       v = randn(8)
       f(v) = (v,)
       println(@b f(v))
       end
Sample(evals=831, time=2.261371841155235e-8, allocs=1, bytes=16)

This makes me want to rethink VarInfo again from the ground up, to see what's the minimal structure we can get away with if we use this sort of approach, where we separate holding the values of variables from holding information about how individual variable values are read. Need to think through any edge cases, but I'm prrrretty excited for this.

@nsiccha commented Nov 6, 2025

@mhauru I think in your minibenchmark, the allocation is due to accessing the non-constant global variable v. Marking v as const, interpolating it into the benchmark, or using @b randn(8) f gets rid of the allocation for me.
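
For example, a corrected version of the mini-benchmark along those lines (module name arbitrary), which per the above should no longer report the spurious allocation:

# Benchmarking with the non-constant global removed.
julia> module MiniBench2
       using Chairmarks
       const v = randn(8)
       f(v) = (v,)
       println(@b f($v))          # or: println(@b randn(8) f)
       end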

accs::Accs
end
DynamicPPL.getaccs(vi::OnlyAccsVarInfo) = vi.accs
DynamicPPL.maybe_invlink_before_eval!!(vi::OnlyAccsVarInfo, ::Model) = vi
@penelopeysm (Member, Author) commented Nov 6, 2025

It turns out that a substantial amount of the performance gains were hiding in this innocuous line 😄 Just optimising the implementation of this function for normal VarInfo, plus sticking some @views in appropriate places, gets the old LogDensityFunction to within ~ 50% of FastLDF's performance. See #1115.

In some ways the simplification here is a result of the current approach: because the linked status is no longer stored in a VarInfo object but rather in a separate struct, there is no longer any need to care about linking / invlinking a VarInfo.

But overall I think this means that the performance gains from the current approach are much more comprehensible.

@penelopeysm (Member, Author) commented
> Is there any reason for FastLDF to be a separate struct from LogDensityFunction instead of replacing it?

Not really. There's one model above where ForwardDiff is a bit slower, but my suspicion is that that could be optimised away in a similar way to declaring zero derivatives for Mooncake/Enzyme. I think the goal should definitely be to replace LDF.

(At the PR stage though, having both structs does make benchmarking much easier 😉)

@mhauru (Member) commented Nov 6, 2025

There's a lot to be said about this PR. However, my most prominent lessons so far are that:

  1. We can squeeze the allocations caused by our unflatten and evaluate!! calls all the way down to zero.
  2. Doing so makes a very big difference to the runtime of small models.
  3. Doing so makes a surprisingly big difference to the runtime of even large models.

Point 3 is very much worth underlining. My first thought was that this is cool for small models, but no, this matters for all models. Here's Penny's benchmarking run on a slightly bigger model:

@model function f()
    dim = 1_000
    x ~ MvNormal(zeros(dim), I)
    y ~ product_distribution([Exponential() for _ in 1:dim])
    0.0 ~ Normal(sum(x), sum(y))
    return nothing
end
benchmark_ldfs(f())

LogDensityFunction: eval      ----  10.042 μs (25 allocs: 56.766 KiB)
           FastLDF: eval      ----  6.531 μs (10 allocs: 24.234 KiB)
LogDensityFunction: grad (FD) ----  17.710 ms (4184 allocs: 107.243 MiB, 34.66% gc time)
           FastLDF: grad (FD) ----  8.143 ms (1674 allocs: 23.538 MiB, 33.01% gc time)
LogDensityFunction: grad (RD) ----  1.377 ms (23146 allocs: 985.000 KiB)
           FastLDF: grad (RD) ----  2.313 ms (38082 allocs: 1.430 MiB)

The above numbers scale linearly in the dimension of the model (at least eval, I didn't look at grads).

As I mentioned above, I was confused why the changes in this PR made such a difference, when the logic in the pipeline of what is done to the input vector ends up being quite similar. To understand that, @penelopeysm and I made a PR that gets allocations down to 0 for small models using trusty old Metadata: #1115. Comparing the code in that PR vs FastLDF, they are on par for large models:

@model function f()
    dim = 1_000
    x ~ MvNormal(zeros(dim), I)
    y ~ product_distribution([Exponential() for _ in 1:dim])
    0.0 ~ Normal(sum(x), sum(y))
    return nothing
end
benchmark_ldfs(f())

LogDensityFunction: eval      ----  6.709 μs (10 allocs: 24.234 KiB)
           FastLDF: eval      ----  6.458 μs (10 allocs: 24.234 KiB)
LogDensityFunction: grad (FD) ----  7.024 ms (2347 allocs: 23.806 MiB, 13.95% gc time)
           FastLDF: grad (FD) ----  5.223 ms (1674 allocs: 23.538 MiB)
LogDensityFunction: grad (RD) ----  2.323 ms (38097 allocs: 1.448 MiB)
           FastLDF: grad (RD) ----  2.290 ms (38082 allocs: 1.430 MiB)

while FastLDF still has an edge, but a smaller edge, for small models:

@model function f()
    x ~ Normal()
    0.0 ~ Normal(x)
end
benchmark_ldfs(f())

LogDensityFunction: eval      ----  40.790 ns
           FastLDF: eval      ----  16.743 ns
LogDensityFunction: grad (FD) ----  239.837 ns (7 allocs: 272 bytes)
           FastLDF: grad (FD) ----  61.845 ns (3 allocs: 96 bytes)
LogDensityFunction: grad (RD) ----  6.625 μs (107 allocs: 4.000 KiB)
           FastLDF: grad (RD) ----  5.775 μs (81 allocs: 2.781 KiB)

That's the sort of difference and scaling I can understand, given the changes to internal data structures.

Now that we know we can do 0 allocation evals, unless there's something deeply conceptually problematic about the use of @view, I think nothing else will do anymore. We can finalise this PR or #1115 or some other version of this, but I expect to soon stick in our tests a line that says @test @allocations(LogDensityProblems.logdensity(ldf_for_trivial_model, x)) == 0 and nothing will be merged unless that passes.

struct FastLDFContext{N<:NamedTuple,T<:AbstractVector{<:Real}} <: AbstractContext
    # The ranges of identity VarNames are stored in a NamedTuple for improved performance
    # (it's around 1.5x faster).
    iden_varname_ranges::N
Review comment (Member):

When talking with @penelopeysm, I did a hasty check of whether this mechanism makes a substantial difference to performance, as opposed to having them all in varname_ranges as a Dict. My conclusion was no, hers was yes. It could be that I botched my check, but also I was running 1.12.1 and she's on 1.11.7, so we should check if that's significant.
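
One way to sanity-check that in isolation (a generic micro-benchmark sketch, not the actual FastLDF code path) is to compare a constant-symbol NamedTuple lookup against a Dict lookup:

using Chairmarks

nt = (a = 1:1, b = 2:2, c = 3:5)
d  = Dict(:a => 1:1, :b => 2:2, :c => 3:5)
@b getproperty($nt, :b)   # field access with a literal symbol, resolvable at compile time
@b getindex($d, :b)       # Dict lookup: runtime hashing and probing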

@penelopeysm (Member, Author) commented Nov 6, 2025

Another upshot of this is that, because FastLDF no longer relies on a particular kind of VarInfo (it only uses one at the beginning to construct the ranges), AD doesn't need to handle every kind of SimpleVarInfo, etc.; it only needs to handle OnlyAccsVarInfo, i.e. an accumulator tuple.

@penelopeysm (Member, Author) left a comment

OK, I think this is ready for proper review. I've elected to keep it named as FastLDF, under the DynamicPPL.Experimental module, and to therefore use a patch release. The reasons for this are:

  1. We can move fast and break things in patch releases, without worrying about contaminating the breaking branch with something that isn't yet ready for release.

  2. With both structs side by side, we can not only benchmark them against each other, but also use them for correctness tests. I think that's quite worth having especially in this nascent period where things look really good but we're unsure if we messed up some weird edge case.

  3. Although it's fairly close, it's IMO not production ready yet. There are a few things I'd still like to sort out before we replace LDF, which I'll list in #1120.

Comment on lines +429 to +440
function get_ranges_and_linked(varinfo::VarInfo{<:NamedTuple{syms}}) where {syms}
    all_iden_ranges = NamedTuple()
    all_ranges = Dict{VarName,RangeAndLinked}()
    offset = 1
    for sym in syms
        md = varinfo.metadata[sym]
        this_md_iden, this_md_others, offset = get_ranges_and_linked_metadata(md, offset)
        all_iden_ranges = merge(all_iden_ranges, this_md_iden)
        all_ranges = merge(all_ranges, this_md_others)
    end
    return all_iden_ranges, all_ranges
end
@penelopeysm (Member, Author) commented

This function, and the ranges it extracts, is written in a way that is 'compatible' with what unflatten / vi[:] currently does. However, I would like to believe that in the future there will not be any reason to even use this. If you think about it, the ranges are completely meaningless:

Dict(@varname(a) => 1:1, @varname(b) => 2:2, @varname(c) => 3:5)

really, at the end of the day, describes the same thing as

Dict(@varname(b) => 1:1, @varname(c) => 2:4, @varname(a) => 5:5)

It just amounts to a permutation of the parameter vector, and the exact permutation doesn't matter! Of course, what really matters is that it's consistent between different bits of the code. However, I would actually like to propose that we (one day) get rid of this dependence on the internal Metadata structure, and instead do something like this:

  1. Get a sorted list of all VarNames.
  2. Get their lengths.
  3. The parameter vector is just those VarNames in ascending order.

This would greatly simplify the code here, and probably also simplify the code for varinfo[:]. It WOULD make unflatten a pain. But that's why I say 'one day': I am really hoping that this PR is the one that sounds the death knell for unflatten.
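
A minimal sketch of what that could look like (hypothetical helper, not part of this PR; sorting by the string form is just one arbitrary-but-consistent ordering choice):

using DynamicPPL: @varname, VarName

# Hypothetical: given each VarName's flattened length, fix an ordering and hand out
# contiguous ranges in that order.
function ranges_from_lengths(lengths::AbstractDict)
    ranges = Dict{VarName,UnitRange{Int}}()
    offset = 1
    for vn in sort(collect(keys(lengths)); by=string)
        ranges[vn] = offset:(offset + lengths[vn] - 1)
        offset += lengths[vn]
    end
    return ranges
end

# ranges_from_lengths(Dict(@varname(a) => 1, @varname(b) => 1, @varname(c) => 3))
# always yields a => 1:1, b => 2:2, c => 3:5, regardless of insertion order.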

Review comment (Member):

Just a note that came to mind: It might make a difference for performance to have the elements of the vals vector be roughly in the order in which they are read, to make use of caching and such. If e.g. x[1] and x[2] are at different ends, then a loop over x[i] might cause very inefficient memory reads. This isn't incompatible with the above, just a thing to keep in mind when redesigning variable order.
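
A generic illustration of that cache-locality point (nothing DynamicPPL-specific; contiguous reads vs. reads through a shuffled index vector):

using Chairmarks, Random

x = randn(10_000)
perm = shuffle(1:10_000)
@b sum(view($x, 1:10_000))   # contiguous memory access
@b sum(view($x, $perm))      # scattered access through a permutation, typically slower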

@penelopeysm penelopeysm requested a review from mhauru November 6, 2025 22:11
@mhauru (Member) commented Nov 10, 2025

I need to read the PR properly, but based on skimming, I'm still a bit confused by the performance gains here. FastEvalVectorContext is very close to a simplified Metadata object: it stores a reference to the parameter vector and pre-existing ranges and link statuses, and indexes the parameter vector in tilde_assume!! based on those ranges. On this level of detail, that description applies equally well to calling unflatten on an untyped Metadata. The one thing I can think of that is quite different is that a typed Metadata uses @view to split the input vector into subranges by the lead symbol straight away in the unflatten call, whereas FastEvalVectorContext keeps it all in a single vector until what is effectively a getindex_internal call in its tilde_assume!! method. Is that what makes the big difference? Or am I missing some other significant difference?

@penelopeysm (Member, Author) commented

Just a note to say that I tried with FixedSizeArrays as inputs and things were sometimes marginally faster but mostly marginally slower (especially AD, not sure if there might be some potential for optimisation there).

* Make InitContext work with OnlyAccsVarInfo

* Do not convert NamedTuple to Dict

* remove logging

* Enable InitFromPrior and InitFromUniform too

* Fix `infer_nested_eltype` invocation