
Commit 4d73c34

Further expansion of guide
1 parent 16bd351 commit 4d73c34

6 files changed

+193
-7
lines changed

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
1+
{
2+
"label": "Data exploration",
3+
"collapsible": true,
4+
"collapsed": true
5+
}

docs/use-cases/AI_ML/jupyter-notebook.md renamed to docs/use-cases/AI_ML/data-exploration/jupyter-notebook.md

Lines changed: 5 additions & 0 deletions
@@ -26,6 +26,11 @@ In this guide, you will learn how you can explore a dataset on ClickHouse Cloud
2626
- a virtual environment
2727
- a working ClickHouse Cloud service and your [connection details](/cloud/guides/sql-console/gather-connection-details)
2828

29+
:::tip
30+
If you don't yet have a ClickHouse Cloud account, you can [sign up](https://console.clickhouse.cloud/signUp?loc=docs-juypter-chdb) for
31+
a trial and get $300 in free credits to begin.
32+
:::
33+
2934
**What you'll learn:**
3035
- Connect to ClickHouse Cloud from Jupyter notebooks using chDB
3136
- Query remote datasets and convert results to Pandas DataFrames

docs/use-cases/AI_ML/marimo-notebook.md renamed to docs/use-cases/AI_ML/data-exploration/marimo-notebook.md

Lines changed: 183 additions & 7 deletions
@@ -1,5 +1,5 @@
11
---
2-
slug: /use-cases/AI/jupyter-notebook
2+
slug: /use-cases/AI/marimo-notebook
33
sidebar_label: 'Exploring data with Marimo notebooks and chDB'
44
title: 'Exploring data with Marimo notebooks and chDB'
55
description: 'This guide explains how to set up and use chDB to explore data from ClickHouse Cloud or local files in Marimo notebooks'
@@ -13,6 +13,9 @@ import image_2 from '@site/static/images/use-cases/AI_ML/jupyter/2.png';
1313
import image_3 from '@site/static/images/use-cases/AI_ML/jupyter/3.png';
1414
import image_4 from '@site/static/images/use-cases/AI_ML/Marimo/4.png';
1515
import image_5 from '@site/static/images/use-cases/AI_ML/Marimo/5.png';
16+
import image_6 from '@site/static/images/use-cases/AI_ML/Marimo/6.png';
17+
import image_7 from '@site/static/images/use-cases/AI_ML/Marimo/7.gif';
18+
import image_8 from '@site/static/images/use-cases/AI_ML/Marimo/8.gif';
1619

1720
In this guide, you will learn how to explore a dataset on ClickHouse Cloud in a Marimo notebook with the help of [chDB](/docs/chdb), a fast in-process SQL OLAP engine powered by ClickHouse.
1821

@@ -21,11 +24,15 @@ In this guide, you will learn how you can explore a dataset on ClickHouse Cloud
2124
- a virtual environment
2225
- a working ClickHouse Cloud service and your [connection details](/docs/cloud/guides/sql-console/gather-connection-details)
2326

27+
:::tip
28+
If you don't yet have a ClickHouse Cloud account, you can [sign up](https://console.clickhouse.cloud/signUp?loc=docs-marimo-chdb) for
29+
a trial and get $300 in free credits to begin.
30+
:::
31+
2432
**What you'll learn:**
2533
- Connect to ClickHouse Cloud from Marimo notebooks using chDB
2634
- Query remote datasets and convert results to Pandas DataFrames
27-
- Combine cloud data with local CSV files for analysis
28-
- Visualize data using Plotly in Marimo's reactive environment
35+
- Visualize data using Plotly in Marimo
2936
- Leverage Marimo's reactive execution model for interactive data exploration
3037

3138
We'll be using the UK Property Price dataset which is available on ClickHouse Cloud as one of the starter datasets.
@@ -58,8 +65,8 @@ From a terminal run the following command to add your username and password as e
5865

5966
```bash
6067
export CLICKHOUSE_CLOUD_HOSTNAME=<HOSTNAME>
61-
export CLICKHOUSE_USER=default
62-
export CLICKHOUSE_PASSWORD=your_actual_password
68+
export CLICKHOUSE_CLOUD_USER=default
69+
export CLICKHOUSE_CLOUD_PASSWORD=your_actual_password
6370
```
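
Once the variables are exported, it can help to confirm they are visible to Python before running any queries. The following is a minimal sketch, assuming you want the notebook to report missing settings early. The helper name `missing_connection_vars` is hypothetical and is not part of chDB or Marimo.

```python
import os

# The three variable names exported in the step above.
REQUIRED_VARS = (
    "CLICKHOUSE_CLOUD_HOSTNAME",
    "CLICKHOUSE_CLOUD_USER",
    "CLICKHOUSE_CLOUD_PASSWORD",
)


def missing_connection_vars(environ=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


missing = missing_connection_vars()
if missing:
    # Hypothetical message; adapt it to your own setup.
    print("Set these variables before starting the notebook: " + ", ".join(missing))
```

An empty list means all three values are available to `os.environ.get(...)` in later cells.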
6471

6572
:::note
@@ -125,7 +132,7 @@ ClickHouse's [remoteSecure](/docs/sql-reference/table-functions/remote) function
125132

126133
You can instruct chDB to return this data in-process as a Pandas DataFrame, which is a convenient and familiar way of working with data.
127134

128-
### Querying ClickHouse Cloud data
135+
### Querying ClickHouse Cloud data {#querying-clickhouse-cloud-data}
129136

130137
Create a new cell with the following query to fetch the UK price paid data from your ClickHouse Cloud service and turn it into a `pandas.DataFrame`:
131138

@@ -168,4 +175,173 @@ Depending on the size of your data, this could take a few seconds.
168175

169176
In this case we return an average price point per year, and filter by `town='LONDON'`.
170177

171-
The result is then stored as a DataFrame in a variable called `df`.
178+
The result is then stored as a DataFrame in a variable called `df`.
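
To make the aggregation concrete, here is a plain-Python sketch of what a query of this shape computes. It assumes the query filters by town, groups by `toYear(date)`, and returns `round(avg(price))`. The sample rows are invented for illustration and are not taken from the dataset.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows: (date, price, town). Not real dataset values.
rows = [
    (date(2020, 3, 1), 400_000, "LONDON"),
    (date(2020, 9, 15), 500_000, "LONDON"),
    (date(2021, 1, 10), 620_000, "LONDON"),
    (date(2021, 5, 2), 150_000, "LEEDS"),
]

prices_by_year = defaultdict(list)
for sale_date, price, town in rows:
    if town == "LONDON":  # WHERE town = 'LONDON'
        prices_by_year[sale_date.year].append(price)  # GROUP BY toYear(date)

# round(avg(price)) per year, sorted like ORDER BY year
yearly_average = [
    (year, round(sum(prices) / len(prices)))
    for year, prices in sorted(prices_by_year.items())
]
```

In the guide itself this work happens inside ClickHouse, so only the small aggregated result travels back to the notebook.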
179+
180+
### Visualizing the data {#visualizing-the-data}
181+
182+
With the data now available to us in a familiar form, let's explore how property prices in London have changed over time.
183+
184+
Marimo works particularly well with interactive plotting libraries like Plotly.
185+
In a new cell, create an interactive chart:
186+
187+
```python
188+
fig = px.line(
189+
df,
190+
x='year',
191+
y='price',
192+
title='Average Property Prices in London Over Time',
193+
labels={'price': 'Average Price (£)', 'year': 'Year'}
194+
)
195+
196+
fig.update_traces(mode='lines+markers')
197+
fig.update_layout(hovermode='x unified')
198+
fig
199+
```
200+
201+
Perhaps unsurprisingly, property prices in London have increased substantially over time.
202+
203+
<Image size="md" img={image_6} alt="Marimo data visualization"/>
204+
205+
One of Marimo's strengths is its reactive execution model. Let's create an interactive widget to select different towns dynamically.
206+
207+
### Interactive town selection {#interactive-town-selection}
208+
209+
In a new cell, create a dropdown to select different towns:
210+
211+
```python
212+
town_selector = mo.ui.dropdown(
213+
options=['LONDON', 'MANCHESTER', 'BIRMINGHAM', 'LEEDS', 'LIVERPOOL'],
214+
value='LONDON',
215+
label='Select a town:'
216+
)
217+
town_selector
218+
```
219+
220+
In another cell, create a query that reacts to the town selection. When you change the dropdown, this cell will automatically re-execute:
221+
222+
```python
223+
query_reactive = f"""
224+
SELECT
225+
toYear(date) AS year,
226+
round(avg(price)) AS price
227+
FROM remoteSecure(
228+
'{os.environ.get("CLICKHOUSE_CLOUD_HOSTNAME")}',
229+
'default.pp_complete',
230+
'{os.environ.get("CLICKHOUSE_CLOUD_USER")}',
231+
'{os.environ.get("CLICKHOUSE_CLOUD_PASSWORD")}'
232+
)
233+
WHERE town = '{town_selector.value}'
234+
GROUP BY year
235+
ORDER BY year
236+
"""
237+
238+
df_reactive = chdb.query(query_reactive, "DataFrame")
239+
df_reactive
240+
```
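
The query above interpolates the widget value straight into the SQL string. That is fine while the value comes from a fixed dropdown, but if you later swap in a free-text input you may want to validate it first. The following is a minimal sketch under that assumption. `ALLOWED_TOWNS` and `town_filter` are hypothetical names, not part of chDB or Marimo.

```python
# Hypothetical allow-list mirroring the dropdown options used in this guide.
ALLOWED_TOWNS = {"LONDON", "MANCHESTER", "BIRMINGHAM", "LEEDS", "LIVERPOOL"}


def town_filter(town):
    """Return a WHERE clause for `town`, rejecting values outside the allow-list."""
    if town not in ALLOWED_TOWNS:
        raise ValueError(f"Unexpected town: {town!r}")
    return f"WHERE town = '{town}'"


clause = town_filter("LEEDS")
```

You could then place `{town_filter(town_selector.value)}` in the f-string instead of the hand-written `WHERE` line.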
241+
242+
Now create a chart that updates automatically when you change the town.
243+
You can move the chart cell above the dynamic DataFrame cell so that the chart appears
244+
directly below the cell with the dropdown.
245+
246+
```python
247+
fig_reactive = px.line(
248+
df_reactive,
249+
x='year',
250+
y='price',
251+
title=f'Average Property Prices in {town_selector.value} Over Time',
252+
labels={'price': 'Average Price (£)', 'year': 'Year'}
253+
)
254+
255+
fig_reactive.update_traces(mode='lines+markers')
256+
fig_reactive.update_layout(hovermode='x unified')
257+
fig_reactive
258+
```
259+
260+
Now when you select a town from the dropdown, the chart updates dynamically:
261+
262+
<Image size="md" img={image_7} alt="Marimo dynamic chart"/>
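
To build intuition for why the query and chart cells re-run when the dropdown changes, here is a toy model of dependency tracking. It is an illustrative sketch only and does not reflect Marimo's actual implementation. The cell names and the `cells_to_rerun` helper are hypothetical.

```python
from collections import deque

# Each hypothetical cell lists the variables it reads and the ones it defines.
cells = {
    "dropdown": {"reads": set(), "defines": {"town_selector"}},
    "query": {"reads": {"town_selector"}, "defines": {"df_reactive"}},
    "chart": {"reads": {"df_reactive", "town_selector"}, "defines": {"fig_reactive"}},
    "static_chart": {"reads": {"df"}, "defines": {"fig"}},
}


def cells_to_rerun(changed_cell):
    """Return the cells downstream of `changed_cell`, in breadth-first order."""
    order = []
    seen = {changed_cell}
    queue = deque([changed_cell])
    while queue:
        current = queue.popleft()
        defined = cells[current]["defines"]
        for name, cell in cells.items():
            # A cell is downstream if it reads any variable the current cell defines.
            if name not in seen and cell["reads"] & defined:
                seen.add(name)
                order.append(name)
                queue.append(name)
    return order


rerun = cells_to_rerun("dropdown")
```

In this toy graph, changing the dropdown marks the query and chart cells for re-execution, while the static London chart is left alone because it reads none of the affected variables.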
263+
264+
### Exploring price distributions with interactive box plots {#exploring-price-distributions}
265+
266+
Let's dive deeper into the data by examining the distribution of property prices in London for different years.
267+
A box and whisker plot will show us the median, quartiles, and outliers, giving us a much better understanding than just the average price.
268+
First, let's create a year slider that will let us interactively explore different years.
269+
270+
In a new cell, add the following:
271+
272+
```python
273+
year_slider = mo.ui.slider(
274+
start=1995,
275+
stop=2024,
276+
value=2020,
277+
step=1,
278+
label='Select Year:',
279+
show_value=True
280+
)
281+
year_slider
282+
```
283+
284+
Now, let's query the individual property prices for the selected year.
285+
Note that we're not aggregating here - we want all the individual transactions to build our distribution:
286+
287+
```python
288+
query_distribution = f"""
289+
SELECT
290+
price,
291+
toYear(date) AS year
292+
FROM remoteSecure(
293+
'{os.environ.get("CLICKHOUSE_CLOUD_HOSTNAME")}',
294+
'default.pp_complete',
295+
'{os.environ.get("CLICKHOUSE_CLOUD_USER")}',
296+
'{os.environ.get("CLICKHOUSE_CLOUD_PASSWORD")}'
297+
)
298+
WHERE town = 'LONDON'
299+
AND toYear(date) = {year_slider.value}
300+
AND price > 0
301+
AND price < 5000000
302+
"""
303+
304+
df_distribution = chdb.query(query_distribution, "DataFrame")
305+
306+
# Create an interactive box plot
307+
fig_box = go.Figure()
308+
309+
fig_box.add_trace(
310+
go.Box(
311+
y=df_distribution['price'],
312+
name=f'London {year_slider.value}',
313+
boxmean='sd', # Show mean and standard deviation
314+
marker_color='lightblue',
315+
boxpoints='outliers' # Show outlier points
316+
)
317+
)
318+
319+
fig_box.update_layout(
320+
title=f'Distribution of Property Prices in London ({year_slider.value})',
321+
yaxis=dict(
322+
title='Price (£)',
323+
tickformat=',.0f'
324+
),
325+
showlegend=False,
326+
height=600
327+
)
328+
329+
fig_box
330+
```
331+
If you select the options button in the top right-hand corner of the cell, you can hide
332+
the code.
333+
As you move the slider, the plot will automatically update thanks to Marimo's reactive execution:
334+
335+
<Image size="md" img={image_8} alt="Marimo dynamic box plot"/>
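
If you want to see the numbers behind a box plot, the sketch below computes the median, quartiles, and the conventional 1.5 × IQR fences using only the standard library. The `box_summary` helper and the sample prices are hypothetical. Plotly may use a different quartile interpolation, so treat the values as an approximation of what the chart shows.

```python
import statistics


def box_summary(prices):
    """Return quartiles, 1.5 * IQR fences, and outliers for a list of prices."""
    q1, median, q3 = statistics.quantiles(prices, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [p for p in prices if p < lower_fence or p > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }


# Hypothetical prices, not taken from the dataset.
sample_prices = [250_000, 310_000, 420_000, 480_000, 650_000, 4_900_000]
summary = box_summary(sample_prices)
```

A single very expensive sale lands outside the upper fence here, which is the kind of point `boxpoints='outliers'` draws individually.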
336+
337+
## Summary {#summary}
338+
339+
This guide demonstrated how you can use chDB to explore your data in ClickHouse Cloud using Marimo notebooks.
340+
Using the UK Property Price dataset, we showed how to query remote ClickHouse Cloud data with the `remoteSecure()` function, and convert results directly to Pandas DataFrames for analysis and visualization.
341+
Through chDB and Marimo's reactive execution model, data scientists can use ClickHouse's SQL capabilities alongside familiar Python tools like Pandas and Plotly. Interactive widgets and automatic dependency tracking make exploratory analysis more efficient and reproducible.
342+
343+
344+
345+
346+
347+