Skip to content

Adding Improved Sheather-Jones bandwidth and reflection KDE #125

@sethaxen

Description

@sethaxen

I propose adding a new function implementing the "Improved Sheather-Jones" bandwidth selector, which works well for multimodal distributions, and a new function kde_reflected, which works well for bounded distributions.

Background

Silverman's bandwidth (used here as the default) is a slightly undersmoothed variant of the optimal bandwidth if the true density is normal. It's known to be quite a bad choice for multimodal distributions with well-separated modes. The "improved Sheather-Jones" bandwidth, implemented e.g. in KDEpy and arviz, works well for a wide variety of distributions (see Table 1 of the paper), including those with well-separated modes and that depart from normality. It matches or outperforms the Sheather-Jones bandwidth selector, which is recommended by R's density and used as the default in ggdist. PosteriorStats.jl's main branch implements this as isj_bandwidth.

Also, the standard KDE as implemented here estimates the true density poorly near the bounds of bounded densities with large amounts of density near the bounds (e.g. a half-normal). While many KDE variants exist for boundary-correction, many of them use specialized location-dependent kernels. A much simpler solution, which is used e.g. by KDEpy, arviz, is boundary correction via reflection/mirroring, where the density outside of the bounds is reflected at the bounds and added to the density near the bounds. Given a standard KDE f and lower and upper bounds l and u, the reflected KDE is f_reflected(x) = f(x) + f(2l - x) + f(2u - x). This approach performs quite well and is simple to implement. On PosteriorStats.jl's main branch, this is implemented as kde_reflected.

Proposed changes

I propose adding both of these to this package. Currently default_bandwidth is not part of the API. Perhaps we could rename default_bandwidth to bandwidth_silverman and then have default_bandwidth call this function? Then bandwidth_silverman and bandwidth_isj could be added to the API. It would also be convenient to be able to provide just a bandwidth function (with signature f(::AbstractVector{<:Real})::Real) to bandwidth, which is then called by the KDE function.

Currently this package already includes kde and kde_lscv instead of trying to have a single kde function with many options. Perhaps it makes sense to include a kde_reflected method as well, which would be more-or-less as implemented in PosteriorStats.jl. This would default to using the bounds of the data as the bounds of the reflected density but allows the user to specify natural bounds.

Examples

ISJ bandwidth vs. Silverman bandwidth

This example shows that for normal-like distributions, the two bandwidths perform similarly, but for distributions with heavier tails or well-separated modes, the ISJ bandwidth is better.

using Random, Distributions, CairoMakie, KernelDensity, PosteriorStats #main
Random.seed!(42)

dists = [
    "Normal" => Normal(),
    "t(df=3)" => TDist(3),
    "mixture(Normal(0, 1), Normal(300, 1))" => MixtureModel([Normal(0, 1), Normal(300, 1)]),
    "logNormal" => LogNormal()
]

fig = Figure(size=(800, 600));
ga_plots = fig[1, 1] = GridLayout()
axs = [Axis(ga_plots[i, j]) for i in 1:2 for j in 1:2]

for (i, (name, dist)) in enumerate(dists)
    ax = axs[i]
    ax.xlabel = "x"
    ax.ylabel = "density"
    ax.title = name
    samples = rand(dist, 1_000)
    bw_silverman = KernelDensity.default_bandwidth(samples)
    bw_isj = PosteriorStats.isj_bandwidth(samples)
    kde_silverman = kde(samples; bandwidth=bw_silverman)
    kde_isj = kde(samples; bandwidth=bw_isj)
    plot!(ax, dist; color=:black, label="True")
    plot!(ax, kde_silverman; color=:blue, label="Silverman", alpha=0.7)
    plot!(ax, kde_isj; color=:orange, label="ISJ", alpha=0.7)
end
Legend(fig[2, 1], axs[1]; tellheight=true, tellwidth=false, orientation=:horizontal)
fig

bw

Reflected KDE vs. standard KDE

This example demonstrates that near the bounds of bounded densities, the reflected KDE performs better than the standard KDE, while for unbounded densities the two perform similarly. It will tend to overestimate the density near a bound if the density decays quickly to 0 near the bound.

using Random, Distributions, CairoMakie, KernelDensity, PosteriorStats #main
Random.seed!(42)

dists = [
    "Normal" => Normal(),
    "HalfNormal" => truncated(Normal(); lower=0),
    "Exponential" => Exponential(),
    "Uniform" => Uniform(0, 1),
    "Beta(2, 5)" => Beta(2, 5),
    "Beta(2, 2)" => Beta(2, 2),
]

fig = Figure(size=(1200, 600));
ga_plots = fig[1, 1] = GridLayout()
axs = [Axis(ga_plots[i, j]) for i in 1:2 for j in 1:3]

for (i, (name, dist)) in enumerate(dists)
    ax = axs[i]
    ax.xlabel = "x"
    ax.ylabel = "density"
    ax.title = name
    samples = rand(dist, 10_000)
    bw_silverman = KernelDensity.default_bandwidth(samples)
    bw_isj = PosteriorStats.isj_bandwidth(samples)
    kde_standard = kde(samples; bandwidth=bw_silverman)
    kde_reflected = PosteriorStats.kde_reflected(samples; bandwidth=bw_silverman)
    plot!(ax, dist; color=:black, label="True")
    plot!(ax, kde_standard; color=:blue, label="Standard", alpha=0.7)
    plot!(ax, kde_reflected; color=:orange, label="Reflected", alpha=0.7)
end
Legend(fig[2, 1], axs[1]; tellheight=true, tellwidth=false, orientation=:horizontal)
fig

reflect

Comparison of bandwidth selector runtimes

This benchmark shows that ISJ is slower than Silverman for smallish samples but is no slower for large samples, I'm guessing because it is O(N), while the quantile sort required to compute IQR computed in default_bandwidth is I think O(N * log(N)).

using Random, BenchmarkTools, KernelDensity, PosteriorStats #main
Random.seed!(42)

# smallish sample
x = randn(1_000)
@btime $(KernelDensity.default_bandwidth)($x)
@btime $(PosteriorStats.isj_bandwidth)($x)

# large sample
x = randn(100_000)
@btime $(KernelDensity.default_bandwidth)($x)
@btime $(PosteriorStats.isj_bandwidth)($x)
  8.251 μs (7 allocations: 8.03 KiB)
  1.436 ms (364 allocations: 280.78 KiB)
  4.079 ms (7 allocations: 781.47 KiB)
  3.551 ms (364 allocations: 280.78 KiB)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions