-
Notifications
You must be signed in to change notification settings - Fork 43
Description
I propose adding a new function implementing the "Improved Sheather-Jones" bandwidth selector, which works well for multimodal distributions, and a new function kde_reflected, which works well for bounded distributions.
Background
Silverman's bandwidth (used here as the default) is a slightly undersmoothed variant of the optimal bandwidth if the true density is normal. It's known to be quite a bad choice for multimodal distributions with well-separated modes. The "improved Sheather-Jones" bandwidth, implemented e.g. in KDEpy and arviz, works well for a wide variety of distributions (see Table 1 of the paper), including those with well-separated modes and that depart from normality. It matches or outperforms the Sheather-Jones bandwidth selector, which is recommended by R's density and used as the default in ggdist. PosteriorStats.jl's main branch implements this as isj_bandwidth.
Also, the standard KDE as implemented here estimates the true density poorly near the bounds of bounded densities with large amounts of density near the bounds (e.g. a half-normal). While many KDE variants exist for boundary-correction, many of them use specialized location-dependent kernels. A much simpler solution, which is used e.g. by KDEpy, arviz, is boundary correction via reflection/mirroring, where the density outside of the bounds is reflected at the bounds and added to the density near the bounds. Given a standard KDE f and lower and upper bounds l and u, the reflected KDE is f_reflected(x) = f(x) + f(2l - x) + f(2u - x). This approach performs quite well and is simple to implement. On PosteriorStats.jl's main branch, this is implemented as kde_reflected.
Proposed changes
I propose adding both of these to this package. Currently default_bandwidth is not part of the API. Perhaps we could rename default_bandwidth to bandwidth_silverman and then have default_bandwidth call this function? Then bandwidth_silverman and bandwidth_isj could be added to the API. It would also be convenient to be able to provide just a bandwidth function (with signature f(::AbstractVector{<:Real})::Real) to bandwidth, which is then called by the KDE function.
Currently this package already includes kde and kde_lscv instead of trying to have a single kde function with many options. Perhaps it makes sense to include a kde_reflected method as well, which would be more-or-less as implemented in PosteriorStats.jl. This would default to using the bounds of the data as the bounds of the reflected density but allows the user to specify natural bounds.
Examples
ISJ bandwidth vs. Silverman bandwidth
This example shows that for normal-like distributions, the two bandwidths perform similarly, but for distributions with heavier tails or well-separated modes, the ISJ bandwidth is better.
using Random, Distributions, CairoMakie, KernelDensity, PosteriorStats #main
Random.seed!(42)
dists = [
"Normal" => Normal(),
"t(df=3)" => TDist(3),
"mixture(Normal(0, 1), Normal(300, 1))" => MixtureModel([Normal(0, 1), Normal(300, 1)]),
"logNormal" => LogNormal()
]
fig = Figure(size=(800, 600));
ga_plots = fig[1, 1] = GridLayout()
axs = [Axis(ga_plots[i, j]) for i in 1:2 for j in 1:2]
for (i, (name, dist)) in enumerate(dists)
ax = axs[i]
ax.xlabel = "x"
ax.ylabel = "density"
ax.title = name
samples = rand(dist, 1_000)
bw_silverman = KernelDensity.default_bandwidth(samples)
bw_isj = PosteriorStats.isj_bandwidth(samples)
kde_silverman = kde(samples; bandwidth=bw_silverman)
kde_isj = kde(samples; bandwidth=bw_isj)
plot!(ax, dist; color=:black, label="True")
plot!(ax, kde_silverman; color=:blue, label="Silverman", alpha=0.7)
plot!(ax, kde_isj; color=:orange, label="ISJ", alpha=0.7)
end
Legend(fig[2, 1], axs[1]; tellheight=true, tellwidth=false, orientation=:horizontal)
figReflected KDE vs. standard KDE
This example demonstrates that near the bounds of bounded densities, the reflected KDE performs better than the standard KDE, while for unbounded densities the two perform similarly. It will tend to overestimate the density near a bound if the density decays quickly to 0 near the bound.
using Random, Distributions, CairoMakie, KernelDensity, PosteriorStats #main
Random.seed!(42)
dists = [
"Normal" => Normal(),
"HalfNormal" => truncated(Normal(); lower=0),
"Exponential" => Exponential(),
"Uniform" => Uniform(0, 1),
"Beta(2, 5)" => Beta(2, 5),
"Beta(2, 2)" => Beta(2, 2),
]
fig = Figure(size=(1200, 600));
ga_plots = fig[1, 1] = GridLayout()
axs = [Axis(ga_plots[i, j]) for i in 1:2 for j in 1:3]
for (i, (name, dist)) in enumerate(dists)
ax = axs[i]
ax.xlabel = "x"
ax.ylabel = "density"
ax.title = name
samples = rand(dist, 10_000)
bw_silverman = KernelDensity.default_bandwidth(samples)
bw_isj = PosteriorStats.isj_bandwidth(samples)
kde_standard = kde(samples; bandwidth=bw_silverman)
kde_reflected = PosteriorStats.kde_reflected(samples; bandwidth=bw_silverman)
plot!(ax, dist; color=:black, label="True")
plot!(ax, kde_standard; color=:blue, label="Standard", alpha=0.7)
plot!(ax, kde_reflected; color=:orange, label="Reflected", alpha=0.7)
end
Legend(fig[2, 1], axs[1]; tellheight=true, tellwidth=false, orientation=:horizontal)
figComparison of bandwidth selector runtimes
This benchmark shows that ISJ is slower than Silverman for smallish samples but is no slower for large samples, I'm guessing because it is O(N), while the quantile sort required to compute IQR computed in default_bandwidth is I think O(N * log(N)).
using Random, BenchmarkTools, KernelDensity, PosteriorStats #main
Random.seed!(42)
# smallish sample
x = randn(1_000)
@btime $(KernelDensity.default_bandwidth)($x)
@btime $(PosteriorStats.isj_bandwidth)($x)
# large sample
x = randn(100_000)
@btime $(KernelDensity.default_bandwidth)($x)
@btime $(PosteriorStats.isj_bandwidth)($x) 8.251 μs (7 allocations: 8.03 KiB)
1.436 ms (364 allocations: 280.78 KiB)
4.079 ms (7 allocations: 781.47 KiB)
3.551 ms (364 allocations: 280.78 KiB)