
Commit 5c21572

Authored by Chris Fonnesbeck (fonnesbeck) and Oriol Abril (OriolAbril)

Update LGCP notebook to v4 (#354)

* Draft update of BNN notebook
* Pre-commit fixes
* Address reviewer comments
* Draft LGCP notebook update
* Re-run notebook
* try fixing pre-commit

Co-authored-by: Chris Fonnesbeck <cfonnesbeck@phillies.com>
Co-authored-by: Oriol (ZBook) <oriol.abril.pla@gmail.com>
1 parent e332822 commit 5c21572

File tree

3 files changed: +886 -1132 lines changed


examples/case_studies/log-gaussian-cox-process.ipynb

Lines changed: 0 additions & 1052 deletions
This file was deleted.

examples/gaussian_processes/log-gaussian-cox-process.ipynb

Lines changed: 854 additions & 0 deletions
Large diffs are not rendered by default.

myst_nbs/case_studies/log-gaussian-cox-process.myst.md renamed to myst_nbs/gaussian_processes/log-gaussian-cox-process.myst.md

Lines changed: 32 additions & 80 deletions
@@ -6,13 +6,20 @@ jupytext:
     format_version: 0.13
     jupytext_version: 1.13.7
 kernelspec:
-  display_name: gbi_env_py38
+  display_name: Python 3 (ipykernel)
   language: python
-  name: gbi_env_py38
+  name: python3
 ---
 
+(log-gaussian-cox-process)=
 # Modeling spatial point patterns with a marked log-Gaussian Cox process
 
+:::{post} May 31, 2022
+:tags: cox process, latent gaussian process, nonparametric, spatial, count data
+:category: intermediate
+:author: Christopher Krapu, Chris Fonnesbeck
+:::
+
 +++
 
 ## Introduction
@@ -30,7 +37,7 @@ where $GP(\mu(s), K(s,s'))$ denotes a Gaussian process with mean function $\mu(s
 * What would randomly sampled patterns with the same statistical properties look like?
 * Is there a statistical correlation between the *frequency* and *magnitude* of point events?
 
-In this notebook, we'll use a grid-based approximation to the full LGCP with PyMC3 to fit a model and analyze its posterior summaries. We will also explore the usage of a marked Poisson process, an extension of this model to account for the distribution of *marks* associated with each data point.
+In this notebook, we'll use a grid-based approximation to the full LGCP with PyMC to fit a model and analyze its posterior summaries. We will also explore the usage of a marked Poisson process, an extension of this model to account for the distribution of *marks* associated with each data point.
 
 +++
 
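For readers skimming the diff, a minimal sketch of what the grid-based approximation means here: the observation window is divided into equal-area cells, the number of points in each cell becomes the Poisson response, and the latent GP is evaluated at the cell centroids. This is illustrative only; the cell size, window bounds, and variable names below are assumptions, not values taken from this commit.

```python
import numpy as np
from itertools import product

# Hypothetical gridding step; a 280-unit square window and 20-unit cells
# are assumptions for illustration, not values from the notebook.
rng = np.random.default_rng(0)
points = rng.uniform(0, 280, size=(231, 2))  # stand-in for the anemone (x, y) data

cell_size = 20
edges = np.arange(0, 280 + cell_size, cell_size)

# Per-cell event counts Y_1, ..., Y_M for the Poisson likelihood
counts, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=[edges, edges])
cell_counts = counts.ravel()

# Cell centroids, where the latent log-intensity GP is evaluated
centers = (edges[:-1] + edges[1:]) / 2
centroids = np.array(list(product(centers, centers)))
```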
@@ -41,15 +48,20 @@ In this notebook, we'll use a grid-based approximation to the full LGCP with PyM
 Our observational data concerns 231 sea anemones whose sizes and locations on the French coast were recorded. This data was taken from the [`spatstat` spatial modeling package in R](https://github.com/spatstat/spatstat) which is designed to address models like the LGCP and its subsequent refinements. The original source of this data is the textbook *Spatial data analysis by example* by Upton and Fingleton (1985) and a longer description of the data can be found there.
 
 ```{code-cell} ipython3
+import warnings
+
 from itertools import product
 
 import arviz as az
 import matplotlib.pyplot as plt
 import numpy as np
 import pandas as pd
-import pymc3 as pm
+import pymc as pm
 
+from matplotlib import MatplotlibDeprecationWarning
 from numpy.random import default_rng
+
+warnings.filterwarnings(action="ignore", category=MatplotlibDeprecationWarning)
 ```
 
 ```{code-cell} ipython3
@@ -71,8 +83,9 @@ data.head(3)
 Let's take a look at this data in 2D space:
 
 ```{code-cell} ipython3
-plt.scatter(data["x"], data["y"], c=data["marks"]),
-plt.colorbar(label="Anemone size"), plt.axis("equal");
+plt.scatter(data["x"], data["y"], c=data["marks"])
+plt.colorbar(label="Anemone size")
+plt.axis("equal");
 ```
 
 The 'marks' column indicates the size of each anemone. If we were to model both the marks and the spatial distribution of points, we would be modeling a *marked Poisson point process*. Extending the basic point pattern model to include this feature is the second portion of this notebook.
@@ -131,7 +144,7 @@ for i, row in enumerate(centroids):
 plt.title("Anemone counts per grid cell"), plt.colorbar(label="Anemone size");
 ```
 
-We can see that all of the counts are fairly low and range from zero to five. With all of our data prepared, we can go ahead and start writing out our probabilistic model in PyMC3. We are going to treat each of the per-cell counts $Y_1,...Y_M$ above as a Poisson random variable.
+We can see that all of the counts are fairly low and range from zero to five. With all of our data prepared, we can go ahead and start writing out our probabilistic model in PyMC. We are going to treat each of the per-cell counts $Y_1,...,Y_M$ above as a Poisson random variable.
 
 +++
 
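The model block itself falls outside this hunk, but its shape can be read off the unchanged copy that appears verbatim in the deleted marks section further down: a constant-mean Matérn-5/2 latent GP gives the log-intensity, and exponentiating and scaling by cell area gives the Poisson rate per cell. A sketch, assuming `centroids`, `area_per_cell`, and `cell_counts` are defined as in the notebook (the count-only model may differ in small details, e.g. a covariance scale term):

```python
with pm.Model() as lgcp_model:
    mu = pm.Normal("mu", sigma=3)
    rho = pm.Uniform("rho", lower=25, upper=200)

    # Matern 5/2 covariance over 2D space with a constant mean function
    cov_func = pm.gp.cov.Matern52(2, ls=rho)
    mean_func = pm.gp.mean.Constant(mu)
    intensity_gp = pm.gp.Latent(mean_func=mean_func, cov_func=cov_func)

    # Latent log-intensity field evaluated at the grid-cell centroids
    log_intensity = intensity_gp.prior("log_intensity", X=centroids)
    intensity = pm.math.exp(log_intensity)

    # Expected count per cell = intensity * cell area, with a Poisson likelihood
    rates = intensity * area_per_cell
    counts = pm.Poisson("counts", mu=rates, observed=cell_counts)
```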
@@ -167,7 +180,7 @@ With the model fully specified, we can start sampling from the posterior using t
 
 ```{code-cell} ipython3
 with lgcp_model:
-    trace = pm.sample(target_accept=0.95, chains=4, return_inferencedata=True)
+    trace = pm.sample(1000, tune=2000, target_accept=0.95)
 ```
 
 # Interpreting the results
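The dropped keywords are not an oversight: in PyMC v4, `pm.sample` returns an `arviz.InferenceData` by default, so `return_inferencedata=True` is redundant, and the default chain count is typically four on multi-core machines. A usage sketch (not part of the commit) of working with the returned object:

```python
# In v4, `trace` is an arviz.InferenceData, so posterior groups
# can be accessed directly after sampling.
with lgcp_model:
    trace = pm.sample(1000, tune=2000, target_accept=0.95)

print(trace.posterior["rho"].shape)  # (chains, draws)
az.plot_trace(trace, var_names=["mu", "rho"]);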
@@ -180,7 +193,7 @@ Posterior inference on the length_scale parameter is useful for understanding wh
 az.summary(trace, var_names=["mu", "rho"])
 ```
 
-We are also interested in looking at the value of the intensity field at a large number of new points in space. We can accommodate this within our model by including a new random variable for the latent Gaussian process evaluated at a denser set of points. Using `sample_posterior_predictive`, we generate posterior predictions on new data points contained in the variable `intensity_new`.
+We are also interested in looking at the value of the intensity field at a large number of new points in space. We can accommodate this within our model by including a new random variable for the latent Gaussian process evaluated at a denser set of points. Using `sample_posterior_predictive`, we generate posterior predictions on new data points contained in the variable `intensity_new`.
 
 ```{code-cell} ipython3
 x_new = np.linspace(5, 275, 20)
@@ -194,10 +207,9 @@ with lgcp_model:
     spp_trace = pm.sample_posterior_predictive(
         trace, var_names=["log_intensity_new"], keep_size=True
     )
-    trace.extend(
-        az.from_dict(posterior_predictive=spp_trace, dims={"log_intensity_new": ["sample"]})
-    )
-intensity_samples = np.exp(trace.posterior_predictive["log_intensity_new"])
+
+trace.extend(spp_trace)
+intensity_samples = np.exp(trace.posterior_predictive["log_intensity_new"])
 ```
 
 Let's take a look at a few realizations of $\lambda(s)$. Since the samples are on the log scale, we'll need to exponentiate them to obtain the spatial intensity field of our 2D Poisson process. In the plot below, the observed point pattern is overlaid.
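To make the "exponentiate, then plot" step concrete, a sketch of drawing a few posterior realizations. This is illustrative only: `X_new`, assumed to hold the dense prediction coordinates built from `x_new` and `y_new` via `itertools.product`, and the sample indices are not taken from the commit.

```python
# Illustrative sketch: plot three posterior draws of the intensity field,
# with the observed anemones overlaid.
samples = intensity_samples.stack(sample=("chain", "draw"))

fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True)
for ax, idx in zip(axes, (0, 500, 1000)):
    field = samples.isel(sample=idx).values
    ax.scatter(X_new[:, 0], X_new[:, 1], c=field, marker="s", s=120)
    ax.scatter(data["x"], data["y"], edgecolor="k", facecolor="none", s=8)
    ax.set_aspect("equal")
```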
@@ -267,77 +279,17 @@ The posterior variance is lowest in the middle of the domain and largest in the
 
 +++
 
-# Extending the model to include marks
+## Authors
 
-+++
-
-A shortcoming of the model from the previous section is that it is only modeling the number of points within a spatial domain and has no ability to represent the *mark* associated with each point. Here, the mark refers to the size of each anemone. We'll augment the model from the previous section to see whether the intensity field $\lambda(s)$ also has an impact on the anemone size. To do this, we will model the mark size as a linear function of the intensity field with a normal likelihood. This is nearly identical to the model described in [this paper](https://hal.archives-ouvertes.fr/hal-00622140/document) by Ho and Stoyan (2008).
+- This notebook was written by [Christopher Krapu](https://github.com/ckrapu) on September 6, 2020 and updated on April 1, 2021.
+- Updated by Chris Fonnesbeck on May 31, 2022 for v4 compatibility.
 
-+++
-
-The first part of the extended model is exactly as before save for the fact that we must evaluate the GP at not just the centroids of the grid cells but also at the location of each anemone. `augmented_coordinates` includes both of these sets of points in a single array.
-
-```{code-cell} ipython3
-augmented_coordinates = np.vstack([centroids, data[["x", "y"]].values])
-
-n_centroids = centroids.shape[0]
-
-with pm.Model() as mark_model:
-    mu = pm.Normal("mu", sigma=3)
-    rho = pm.Uniform("rho", lower=25, upper=200)
-
-    cov_scale = pm.InverseGamma("scale", alpha=1, beta=1)
-
-    cov_func = cov_scale * pm.gp.cov.Matern52(2, ls=rho)
-    mean_func = pm.gp.mean.Constant(mu)
-    intensity_gp = pm.gp.Latent(mean_func=mean_func, cov_func=cov_func)
-
-    log_intensity = intensity_gp.prior("log_intensity", X=augmented_coordinates)
-    intensity = pm.math.exp(log_intensity)
-
-    rates = intensity[0:n_centroids] * area_per_cell
-    counts = pm.Poisson("counts", mu=rates, observed=cell_counts)
-```
-
-We can now add on an extension of the model for the marks. Letting the marks be denoted by $z_i$, we use the following formula to represent $z_i$:
-
-$$z_i = \alpha + \beta \lambda_i + \epsilon_i$$
-
-Equivalently, $$z_i \sim N(\alpha + \beta \lambda_i, \sigma_\epsilon^2)$$
-where $\sigma_\epsilon^2 = Var(\epsilon_i)$.
-
-This equation states that the distribution of the marks is a linear function of the intensity field since we assume a normal likelihood for $\epsilon$. It's essentially a simple linear regression of the marks on the intensity field; $\alpha$ is the intercept and $\beta$ is the slope. Then, standard priors are used for $\epsilon, \alpha, \beta$. The point of this model is to figure out whether or not the growth of the anemones is correlated with their occurrence. If we find that $\beta$ is negative, then that might hint that locations with more numerous anemones happen to also have smaller anemones and that competition for food may keep close neighbors small.
-
-```{code-cell} ipython3
-with mark_model:
-    alpha = pm.Normal("alpha", sigma=10.0)
-    beta = pm.Normal("beta", sigma=5)
-    eps_sd = pm.HalfCauchy("eps_sd", beta=1.0)
-
-    marks = pm.Normal(
-        "marks",
-        mu=alpha + beta * intensity[n_centroids::],
-        sigma=eps_sd,
-        shape=n,
-        observed=data["marks"].values,
-    )
-```
-
-It is important to note that this model takes much longer to run because using MCMC with Gaussian processes as implemented here requires a covariance matrix inversion with cubic time complexity in the number of spatial points. Since this GP is fit over 357 spatial points instead of ~120 points as in the previous model, it could take roughly $3^3=27$ times as long. Inference could be sped up dramatically using a sparse Gaussian process if faster sampling is required.
-
-```{code-cell} ipython3
-with mark_model:
-    trace = pm.sample(target_accept=0.95, return_inferencedata=True)
-```
-
-The posterior summaries indicate that most of the probability mass for $\beta$ is on negative values. This gives us a high posterior probability that the intensity field (i.e. the number of anemones) is anti-correlated with the size of each anemone!
-
-```{code-cell} ipython3
-az.summary(trace, var_names=["alpha", "beta"])
-```
-
-* This notebook was written by [Christopher Krapu](https://github.com/ckrapu) on September 6, 2020 and updated on April 1, 2021.
+## Watermark
 
 ```{code-cell} ipython3
 %load_ext watermark
 %watermark -n -u -v -iv -w
 ```
+
+:::{include} ../page_footer.md
+:::
