Skip to content

Commit 9b825b6

Browse files
authored
Internal architecture documentation (#450)
Closes #204
1 parent 36a35c1 commit 9b825b6

File tree

3 files changed

+123
-1
lines changed

3 files changed

+123
-1
lines changed
File renamed without changes.

docs/how-it-works/internals.md

Lines changed: 120 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,120 @@
1+
# Internal Architecture
2+
3+
> Last edited April 2024.
4+
5+
This document will outline Lonboard's internal architecture, and hopefully should be useful for potential contributors.
6+
7+
## Pydeck
8+
9+
Before we consider Lonboard, we should briefly cover how Pydeck manages data transit.
10+
11+
<!-- It's impossible to consider Lonboard's architecture in isolation because many of Lonboard's architectural decisions were chosen in explicit contrast to pydeck. Lonboard and pydeck both bind to deck.gl, but have significantly different APIs and characteristics. -->
12+
13+
Pydeck uses the loaders.gl ecosystem natively supported in deck.gl. This means that pydeck passes the user input directly into the `data` prop of any given layer. When `data` is a string, it's interpreted as a URL, passed into loaders.gl machinery to parse the given data format to a list of GeoJSON features. In order to do data-driven styling, users pass in strings like `"@@=properties.valuePerSqm"` to describe the JavaScript callback that should be applied to each GeoJSON feature.
14+
15+
In my opinion, this presents performance pitfalls and gives Pydeck a stifling amount of flexibility. The Python API isn't able to do much of anything because the user input could be _anything_ and users end up writing a sort of JavaScript in Python strings.
16+
17+
## Lonboard
18+
19+
### Goals and target user persona
20+
21+
While I was never a primary pydeck developer and can't speak to their goals, I can say that Lonboard has a well-formed idea of the target user.
22+
23+
When I was learning geospatial data processing in Python, I learned that GeoPandas, Shapely, and others were amazing tools. But I would perform operations and not have a _deep understanding_ of what the output looked like. I wanted some way to quickly get a picture of what my data looked like. This drove me to build [`keplergl_cli`](https://github.com/kylebarron/keplergl_cli), a single function or CLI command to visualize data in an interactive HTML map. The desire for this simplicity lives on in Lonboard's `viz` function.
24+
25+
This experience informs the target user of Lonboard:
26+
27+
- A Python user who does not know or want to know JavaScript.
28+
- Wants fast exploratory data analysis.
29+
- Already working with data in Python.
30+
- A reasonably fast internet connection on a desktop computer.
31+
32+
### Python classes
33+
34+
The `Map` class is the primary and only widget associated with JavaScript that renders anything. `Map` subclasses from `anywidget.AnyWidget`, which dynamically fetches Lonboard's JavaScript ESM bundle. The `Map` class emulates deck.gl's `Deck` class. It synchronizes map state and can be passed a sequence of `Layer` objects.
35+
36+
`Layer` classes are designed to map to each underlying deck.gl layer. Most `Layer` classes subclass from `BaseArrowLayer`, which stores a `pyarrow.Table` in the dataclass. This data object gets serialized to Parquet and rendered in a layer provided by `@geoarrow/deck.gl-layers` as described later on. The primary API for each of these Arrow-based layers is `from_geopandas`, which converts a `GeoDataFrame` to a `pyarrow.Table` with a [GeoArrow](https://geoarrow.org/)-represented geometry column. A few layers are not Arrow-based and do not subclass from `BaseArrowLayer`: primarily `BitmapLayer` and `BitmapTileLayer`. These pass user data input directly into a core deck.gl layer.
37+
38+
These `Layer` classes subclass from `ipywidgets.Widget` but are _not_ associated with any JavaScript of their own. But by being widgets themselves, their data is synced with JavaScript and Python data changes are propagated as events to the core `Map` object.
39+
40+
Each layer has parameters that map to deck.gl accessors and render properties. Each of these has validation on the Python side to catch invalid input early. Accessors are modeled with ["trait" objects](https://developmentseed.org/lonboard/latest/api/traits/). Scalar accessors are stored as a scalar int/string/float. Function accessors are evaluated in Python and stored as a `pyarrow.Array`.
41+
42+
A variety of [layer extensions](https://developmentseed.org/lonboard/latest/api/layer-extensions/) are supported, which map to deck.gl's upstream layer extensions. These layer extensions dynamically add more parameters to the layers on which they're assigned.
43+
44+
45+
### Data loading
46+
47+
While pydeck allows users to pass in URLs to remote files, at the moment this is not possible in Lonboard. With only a few exceptions (e.g. the `BitmapLayer` and `BitmapTileLayer`), _only_ data originating in Python is supported in Lonboard. This reduction in scope vastly simplifies Lonboard's internals while not providing much of a hurdle for Lonboard's core audience.
48+
49+
Jupyter provides a WebSocket connection between Python and JavaScript. Jupyter widget state from Python classes is serialized over this WebSocket and received in JavaScript. While most widgets synchronize objects through JSON, widgets fully support binary data as well, which Lonboard utilizes.
50+
51+
#### Simplicity of code
52+
53+
Supporting only a single data provenance and format makes the internals of Lonboard much more maintainable. Lonboard's JavaScript bindings only need to include a single data parser and can optimize performance for Parquet and Arrow data.
54+
55+
#### Parquet
56+
57+
Parquet was chosen as the underlying data transmission format for its excellent compression, fast read and write speeds, compatibility with GeoArrow and, in turn, compatibility with deck.gl's binary API.
58+
59+
I tested with data from the speed test example (3 million points). Saving GeoArrow to Parquet was 135x faster than saving a GeoDataFrame to GeoJSON. The resulting file was 26x smaller. And parsing the Parquet to Arrow on the frontend was 5.6x faster than `JSON.parse` (including moving the Arrow from WebAssembly to JS).
60+
61+
Note that what is sent over the wire is _not_ GeoParquet. GeoParquet 1.0 requires special metadata and WKB-encoded geometries. Using that would require code on the frontend to parse WKB-encoded geometries into GeoArrow. The upcoming GeoParquet 1.1 includes support for GeoArrow but only its "separated" coordinates, where `x` and `y` coordinate values are in two different buffers (i.e. `xxxx`, `yyyy`). deck.gl's binary API currently does not support passing separated coordinates. Therefore Lonboard uses the "interleaved" coordinate variant, where `x` and `y` coordinates are stored in the same buffer (i.e. `xyxyxyxy`).
62+
63+
64+
#### Bootstrapping GeoArrow
65+
66+
For performance reasons, Lonboard is very focused on using GeoArrow throughout the pipeline in both Python and JavaScript. At the time when Lonboard was started, there were no user-friendly tools for converting data to GeoArrow in client-side JavaScript. By supporting only data from Python and only Parquet data, we could focus on making data conversions from GeoPandas to GeoArrow in Python efficient and stable. This enables GeoArrow to be "just an implementation detail" and out of any primary public APIs while also working to bootstrap a GeoArrow ecosystem in Python and JavaScript.
67+
68+
#### Authentication
69+
70+
Since all data is loaded via Python over the Jupyter-provided WebSocket, Lonboard doesn't need to implement any sort of authentication itself. The user is able to use whatever credentials they're familiar with, say for S3, to access data. Those credentials never need to be passed to the browser.
71+
72+
In contrast, having the JavaScript side load data directly from a URL requires some way for users to specify authorization headers or API keys for access to private datasets.
73+
74+
### Layer Creation
75+
76+
With the exception of the `BitmapLayer`, no _direct_ deck.gl layers are used. deck.gl layers are used _exclusively_ through the [`@geoarrow/deck.gl-layers`](https://github.com/geoarrow/deck.gl-layers) glue library. This library connects GeoArrow data to the low-level binary data API of each deck.gl layer. **All** accessors are passed in directly as binary buffers.
77+
78+
This means that in JavaScript the data _never_ leaves an Arrow binary representation. It's parsed from Parquet to Arrow in WebAssembly, the Arrow buffers are copied out of WebAssembly memory, but never parsed to JSON strings or JavaScript objects.
79+
80+
### Accessors and data-driven rendering
81+
82+
Loading data from Python also helps to simplify data-driven rendering.
83+
84+
#### No function serialization
85+
86+
If we can make the data movement between Python and JavaScript fast enough, we can serialize _computed rendering properties_ rather than the function to generate them. Lonboard serializes data, not functions.
87+
88+
Any accessor used to style data is computed in Python, stored as a binary array, serialized to Parquet and sent to the frontend. Other than parsing the input Parquet buffers to Arrow, no data transformations or modifications are done on the JavaScript side.
89+
90+
In Pydeck users had to pass something like
91+
92+
```py
93+
getRadius="@@=properties.valuePerSqm"
94+
```
95+
96+
which would be expanded into
97+
98+
```js
99+
getRadius: (object) => object.properties.valuePerSqm
100+
```
101+
102+
on the JavaScript side.
103+
104+
I find this syntax to be horrible UX. The user is no longer writing Python, they need to learn a new DSL just for this.
105+
106+
RGB Colors for 3 million points take up 9MiB in memory (or 12MiB for RGBA). In the example of speed test data, 12MiB of RGBA data compressed down to 4MiB for transit. Float32 radii values for 3 million points take up 12MiB in memory and 9MiB compressed.
107+
108+
Even for users on remote Python environments, this takes no more than a couple seconds to download, and then can render almost instantly. For non-point datasets with a fewer number of geometries, the byte size of accessors are even smaller relative to the geometry buffers.
109+
110+
#### Update styling without re-sending data
111+
112+
With Lonboard's widget architecture, any accessors other than the geometry are stored separately to the main Arrow table. This means that all accessors and rendering properties can be updated in isolation. The core table does not need to be synced again when a property is updated.
113+
114+
#### Uses familiar objects
115+
116+
Accessors are defined in terms of Numpy arrays, Pandas Series or PyArrow Arrays, which are familiar to Python data scientists.
117+
118+
#### Full access to Python
119+
120+
Because we serialize _data_ to JavaScript instead of a function in a custom DSL, the user has the full spectrum of Python available to them. They can use one of matplotlib's many [normalization helpers](https://matplotlib.org/stable/api/_as_gen/matplotlib.colors.Normalize.html) before [applying a colormap](https://developmentseed.org/lonboard/latest/api/colormap/#lonboard.colormap.apply_continuous_cmap). Or they can use apply some ML model and use its outputs for point radii.

mkdocs.yml

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -73,7 +73,9 @@ nav:
7373
- Performance: performance.md
7474
- Changelog: CHANGELOG.md
7575
- Alternatives: alternatives.md
76-
- "How it works?": how-it-works.md
76+
- "How it works?":
77+
- how-it-works/index.md
78+
- how-it-works/internals.md
7779

7880
watch:
7981
- lonboard

0 commit comments

Comments
 (0)