Commit 59d620b

Merge pull request #4543 from ClickHouse/marimo_notebook_guide
AI/ML: Marimo notebook, chdb and Cloud integration guide
2 parents 6c4c4e8 + 9cfa57e commit 59d620b

9 files changed

+350
-1
lines changed
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
1+
{
2+
"label": "Data exploration",
3+
"collapsible": true,
4+
"collapsed": true
5+
}

docs/use-cases/AI_ML/jupyter-notebook.md renamed to docs/use-cases/AI_ML/data-exploration/jupyter-notebook.md

Lines changed: 6 additions & 1 deletion
@@ -1,6 +1,6 @@
11
---
22
slug: /use-cases/AI/jupyter-notebook
3-
sidebar_label: 'Exploring data in Jupyter notebooks with chDB'
3+
sidebar_label: 'Exploring data with Jupyter notebooks and chDB'
44
title: 'Exploring data in Jupyter notebooks with chDB'
55
description: 'This guide explains how to set up and use chDB to explore data from ClickHouse Cloud or local files in Jupyter notebooks'
66
keywords: ['ML', 'Jupyter', 'chDB', 'pandas']
@@ -26,6 +26,11 @@ In this guide, you will learn how you can explore a dataset on ClickHouse Cloud
2626
- a virtual environment
2727
- a working ClickHouse Cloud service and your [connection details](/cloud/guides/sql-console/gather-connection-details)
2828

29+
:::tip
30+
If you don't yet have a ClickHouse Cloud account, you can [sign up](https://console.clickhouse.cloud/signUp?loc=docs-juypter-chdb) for
31+
a trial and get $300 in free credits to begin.
32+
:::
33+
2934
**What you'll learn:**
3035
- Connect to ClickHouse Cloud from Jupyter notebooks using chDB
3136
- Query remote datasets and convert results to Pandas DataFrames
Lines changed: 336 additions & 0 deletions
@@ -0,0 +1,336 @@
1+
---
2+
slug: /use-cases/AI/marimo-notebook
3+
sidebar_label: 'Exploring data with Marimo notebooks and chDB'
4+
title: 'Exploring data with Marimo notebooks and chDB'
5+
description: 'This guide explains how to set up and use chDB to explore data from ClickHouse Cloud or local files in Marimo notebooks'
6+
keywords: ['ML', 'Marimo', 'chDB', 'pandas']
7+
doc_type: 'guide'
8+
---
9+
10+
import Image from '@theme/IdealImage';
11+
import image_1 from '@site/static/images/use-cases/AI_ML/jupyter/1.png';
12+
import image_2 from '@site/static/images/use-cases/AI_ML/jupyter/2.png';
13+
import image_3 from '@site/static/images/use-cases/AI_ML/jupyter/3.png';
14+
import image_4 from '@site/static/images/use-cases/AI_ML/Marimo/4.png';
15+
import image_5 from '@site/static/images/use-cases/AI_ML/Marimo/5.png';
16+
import image_6 from '@site/static/images/use-cases/AI_ML/Marimo/6.png';
17+
import image_7 from '@site/static/images/use-cases/AI_ML/Marimo/7.gif';
18+
import image_8 from '@site/static/images/use-cases/AI_ML/Marimo/8.gif';
19+
20+
In this guide, you will learn how to explore a dataset on ClickHouse Cloud in a Marimo notebook with the help of [chDB](/docs/chdb), a fast in-process SQL OLAP engine powered by ClickHouse.
21+
22+
**Prerequisites:**
23+
- Python 3.8 or higher
24+
- a virtual environment
25+
- a working ClickHouse Cloud service and your [connection details](/docs/cloud/guides/sql-console/gather-connection-details)
26+
27+
:::tip
28+
If you don't yet have a ClickHouse Cloud account, you can [sign up](https://console.clickhouse.cloud/signUp?loc=docs-marimo-chdb) for
29+
a trial and get $300 in free credits to begin.
30+
:::
31+
32+
**What you'll learn:**
33+
- Connect to ClickHouse Cloud from Marimo notebooks using chDB
34+
- Query remote datasets and convert results to Pandas DataFrames
35+
- Visualize data using Plotly in Marimo
36+
- Leverage Marimo's reactive execution model for interactive data exploration
37+
38+
We'll be using the UK Property Price dataset which is available on ClickHouse Cloud as one of the starter datasets.
39+
It contains data about the prices that houses were sold for in the United Kingdom from 1995 to 2024.
40+
41+
## Setup {#setup}
42+
43+
### Loading the dataset {#loading-the-dataset}
44+
45+
To add this dataset to an existing ClickHouse Cloud service, log in to [console.clickhouse.cloud](https://console.clickhouse.cloud/) with your account details.
46+
47+
In the left-hand menu, click on `Data sources`. Then click `Predefined sample data`:
48+
49+
<Image size="md" img={image_1} alt="Add example data set"/>
50+
51+
Select `Get started` in the UK property price paid data (4GB) card:
52+
53+
<Image size="md" img={image_2} alt="Select UK price paid dataset"/>
54+
55+
Then click `Import dataset`:
56+
57+
<Image size="md" img={image_3} alt="Import UK price paid dataset"/>
58+
59+
ClickHouse will automatically create the `pp_complete` table in the `default` database and fill the table with 28.92 million rows of price point data.
60+
61+
### Setting up credentials {#setting-up-credentials}
62+
63+
To reduce the likelihood of exposing your credentials, we recommend adding your Cloud hostname, username, and password as environment variables on your local machine.
64+
From a terminal, run the following commands to set them:
65+
66+
```bash
67+
export CLICKHOUSE_CLOUD_HOSTNAME=<HOSTNAME>
68+
export CLICKHOUSE_CLOUD_USER=default
69+
export CLICKHOUSE_CLOUD_PASSWORD=your_actual_password
70+
```
71+
72+
:::note
73+
The environment variables above persist only as long as your terminal session.
74+
To set them permanently, add them to your shell configuration file.
75+
:::
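Before running any queries, it can help to confirm that the variables are actually visible to Python, because `os.environ.get()` returns `None` for an unset variable and that value would otherwise be interpolated into the query as the text `None`. The following is a minimal sketch; the helper name `missing_credentials` is hypothetical and not part of chDB or Marimo.

```python
import os

# The three variable names exported in the step above
REQUIRED_VARS = (
    "CLICKHOUSE_CLOUD_HOSTNAME",
    "CLICKHOUSE_CLOUD_USER",
    "CLICKHOUSE_CLOUD_PASSWORD",
)


def missing_credentials(environ=os.environ):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper for illustration only.
    """
    return [name for name in REQUIRED_VARS if not environ.get(name)]


missing = missing_credentials()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All ClickHouse Cloud credentials are set")
```

If any names are reported as missing, re-export them in the same terminal session that you use to launch Marimo.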
76+
77+
### Installing Marimo {#installing-marimo}
78+
79+
Now activate your virtual environment.
80+
From within your virtual environment, install the following packages that we will be using in this guide:
81+
82+
```bash
83+
pip install chdb pandas plotly marimo
84+
```
85+
86+
Create a new Marimo notebook with the following command:
87+
88+
```bash
89+
marimo edit clickhouse_exploration.py
90+
```
91+
92+
A new browser window should open with the Marimo interface on localhost:2718:
93+
94+
<Image size="md" img={image_4} alt="Marimo interface"/>
95+
96+
Marimo notebooks are stored as pure Python files, making them easy to version control and share with others.
97+
98+
## Importing dependencies {#installing-dependencies}
99+
100+
In a new cell, import the required packages:
101+
102+
```python
103+
import marimo as mo
104+
import chdb
105+
import pandas as pd
106+
import os
107+
import plotly.express as px
108+
import plotly.graph_objects as go
109+
```
110+
111+
If you hover your mouse over the cell you will see two circles with the "+" symbol appear.
112+
You can click these to add new cells.
113+
114+
Add a new cell and run a simple query to check that everything is set up correctly:
115+
116+
```python
117+
result = chdb.query("SELECT 'Hello ClickHouse from Marimo!'", "DataFrame")
118+
result
119+
```
120+
121+
You should see the result shown underneath the cell you just ran:
122+
123+
<Image size="md" img={image_5} alt="Marimo hello world"/>
124+
125+
## Exploring the data {#exploring-the-data}
126+
127+
With the UK price paid data set up and chDB up and running in a Marimo notebook, we can now get started exploring our data.
128+
Let's imagine we are interested in checking how price has changed with time for a specific area in the UK such as the capital city, London.
129+
ClickHouse's [`remoteSecure`](/docs/sql-reference/table-functions/remote) function allows you to easily retrieve the data from ClickHouse Cloud.
130+
You can instruct chDB to return this data in-process as a Pandas DataFrame, which is a convenient and familiar way of working with data.
131+
132+
### Querying ClickHouse Cloud data {#querying-clickhouse-cloud-data}
133+
134+
Create a new cell with the following query to fetch the UK price paid data from your ClickHouse Cloud service and turn it into a `pandas.DataFrame`:
135+
136+
```python
137+
query = f"""
138+
SELECT
139+
toYear(date) AS year,
140+
round(avg(price)) AS price,
141+
bar(price, 0, 1000000, 80)
142+
FROM remoteSecure(
143+
'{os.environ.get("CLICKHOUSE_CLOUD_HOSTNAME")}',
144+
'default.pp_complete',
145+
'{os.environ.get("CLICKHOUSE_CLOUD_USER")}',
146+
'{os.environ.get("CLICKHOUSE_CLOUD_PASSWORD")}'
147+
)
148+
WHERE town = 'LONDON'
149+
GROUP BY year
150+
ORDER BY year
151+
"""
152+
153+
df = chdb.query(query, "DataFrame")
154+
df.head()
155+
```
156+
157+
In the snippet above, `chdb.query(query, "DataFrame")` runs the specified query and outputs the result as a Pandas DataFrame.
158+
159+
In the query we are using the [`remoteSecure`](/sql-reference/table-functions/remote) function to connect to ClickHouse Cloud.
160+
161+
The `remoteSecure` function takes the following parameters:
162+
- a connection string
163+
- the name of the database and table to use
164+
- your username
165+
- your password
166+
167+
As a security best practice, you should prefer using environment variables for the username and password parameters rather than specifying them directly in the function, although this is possible if you wish.
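Because the same four arguments are repeated in every query in this guide, one option is to build the `remoteSecure(...)` expression in a single place. The sketch below is an illustration under assumptions: `remote_secure_clause` and `quote_literal` are hypothetical helper names, and the values in `example_env` are placeholders rather than real connection details. Reading the variables with `environ[...]` instead of `.get()` raises a `KeyError` when one is missing, rather than silently inserting the text `None`.

```python
import os


def quote_literal(value):
    """Escape backslashes and single quotes for use inside a SQL string literal."""
    return value.replace("\\", "\\\\").replace("'", "\\'")


def remote_secure_clause(table, environ=os.environ):
    """Build a remoteSecure(...) table expression from environment variables.

    Hypothetical helper: raises KeyError if any variable is missing.
    """
    host = quote_literal(environ["CLICKHOUSE_CLOUD_HOSTNAME"])
    user = quote_literal(environ["CLICKHOUSE_CLOUD_USER"])
    password = quote_literal(environ["CLICKHOUSE_CLOUD_PASSWORD"])
    return f"remoteSecure('{host}', '{quote_literal(table)}', '{user}', '{password}')"


# Placeholder values for demonstration only; use os.environ in the notebook
example_env = {
    "CLICKHOUSE_CLOUD_HOSTNAME": "example-host",
    "CLICKHOUSE_CLOUD_USER": "default",
    "CLICKHOUSE_CLOUD_PASSWORD": "example-password",
}
print(remote_secure_clause("default.pp_complete", example_env))
```

In a notebook cell, the returned string could be interpolated after `FROM` in place of the inline `remoteSecure(...)` call. Avoid printing the clause built from your real environment, since it contains your password.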
168+
169+
The `remoteSecure` function connects to the remote ClickHouse Cloud service, runs the query and returns the result.
170+
Depending on the size of your data, this could take a few seconds.
171+
172+
In this case we return an average price point per year, and filter by `town='LONDON'`.
173+
The result is then stored as a DataFrame in a variable called `df`.
174+
175+
### Visualizing the data {#visualizing-the-data}
176+
177+
With the data now available to us in a familiar form, let's explore how prices of property in London have changed with time.
178+
179+
Marimo works particularly well with interactive plotting libraries like Plotly.
180+
In a new cell, create an interactive chart:
181+
182+
```python
183+
fig = px.line(
184+
df,
185+
x='year',
186+
y='price',
187+
title='Average Property Prices in London Over Time',
188+
labels={'price': 'Average Price (£)', 'year': 'Year'}
189+
)
190+
191+
fig.update_traces(mode='lines+markers')
192+
fig.update_layout(hovermode='x unified')
193+
fig
194+
```
195+
196+
Perhaps unsurprisingly, property prices in London have increased substantially over time.
197+
198+
<Image size="md" img={image_6} alt="Marimo data visualization"/>
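To put a number on "increased substantially", you could compute the year-over-year percentage change with Pandas. The sketch below uses a small DataFrame of made-up values with the same `year` and `price` columns as `df`; the figures are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Made-up illustrative values, NOT taken from the UK price paid dataset
example_df = pd.DataFrame(
    {"year": [2000, 2001, 2002], "price": [100000.0, 110000.0, 121000.0]}
)

# pct_change() compares each row with the previous one, so sort by year first.
# assign() returns a new DataFrame instead of modifying example_df in place.
df_change = example_df.sort_values("year").assign(
    yoy_change_pct=lambda d: (d["price"].pct_change() * 100).round(1)
)
print(df_change)
```

The first row has no previous year to compare against, so its change is `NaN`. In the notebook you would apply the same two calls to `df` rather than `example_df`.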
199+
200+
One of Marimo's strengths is its reactive execution model. Let's create an interactive widget to select different towns dynamically.
201+
202+
### Interactive town selection {#interactive-town-selection}
203+
204+
In a new cell, create a dropdown to select different towns:
205+
206+
```python
207+
town_selector = mo.ui.dropdown(
208+
options=['LONDON', 'MANCHESTER', 'BIRMINGHAM', 'LEEDS', 'LIVERPOOL'],
209+
value='LONDON',
210+
label='Select a town:'
211+
)
212+
town_selector
213+
```
214+
215+
In another cell, create a query that reacts to the town selection. When you change the dropdown, this cell will automatically re-execute:
216+
217+
```python
218+
query_reactive = f"""
219+
SELECT
220+
toYear(date) AS year,
221+
round(avg(price)) AS price
222+
FROM remoteSecure(
223+
'{os.environ.get("CLICKHOUSE_CLOUD_HOSTNAME")}',
224+
'default.pp_complete',
225+
'{os.environ.get("CLICKHOUSE_CLOUD_USER")}',
226+
'{os.environ.get("CLICKHOUSE_CLOUD_PASSWORD")}'
227+
)
228+
WHERE town = '{town_selector.value}'
229+
GROUP BY year
230+
ORDER BY year
231+
"""
232+
233+
df_reactive = chdb.query(query_reactive, "DataFrame")
234+
df_reactive
235+
```
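The dropdown restricts `town_selector.value` to a fixed list, so interpolating it into the query is low risk here. If you later swap the dropdown for a free-text input, it would be safer to validate the value before building the SQL string. A minimal sketch of an allow-list check follows; `ALLOWED_TOWNS` and `safe_town` are hypothetical names introduced for illustration.

```python
# The same towns offered by the dropdown above
ALLOWED_TOWNS = {"LONDON", "MANCHESTER", "BIRMINGHAM", "LEEDS", "LIVERPOOL"}


def safe_town(value):
    """Return the town unchanged if it is on the allow-list, else raise ValueError."""
    if value not in ALLOWED_TOWNS:
        raise ValueError(f"Unexpected town: {value!r}")
    return value


# In the notebook you would pass town_selector.value instead of a literal
where_clause = f"WHERE town = '{safe_town('LEEDS')}'"
print(where_clause)
```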
236+
237+
Now create a chart that updates automatically when you change the town.
238+
You can move the chart above the dynamic dataframe so that it appears
239+
below the cell with the dropdown.
240+
241+
```python
242+
fig_reactive = px.line(
243+
df_reactive,
244+
x='year',
245+
y='price',
246+
title=f'Average Property Prices in {town_selector.value} Over Time',
247+
labels={'price': 'Average Price (£)', 'year': 'Year'}
248+
)
249+
250+
fig_reactive.update_traces(mode='lines+markers')
251+
fig_reactive.update_layout(hovermode='x unified')
252+
fig_reactive
253+
```
254+
255+
Now when you select a town from the drop-down the chart will update dynamically:
256+
257+
<Image size="md" img={image_7} alt="Marimo dynamic chart"/>
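Conceptually, the chart updates because Marimo tracks which variables each cell defines and which it reads, and re-runs the cells that depend on a changed variable. The toy sketch below illustrates that idea in plain Python for the cells in this guide. It is a simplified model, not Marimo's actual implementation, and the `cells` dictionary and `cells_to_rerun` function are hypothetical.

```python
# Each toy "cell" declares the variables it defines and the ones it reads
cells = {
    "dropdown": {"defines": {"town_selector"}, "reads": set()},
    "query": {"defines": {"df_reactive"}, "reads": {"town_selector"}},
    "chart": {"defines": {"fig_reactive"}, "reads": {"df_reactive", "town_selector"}},
    "slider": {"defines": {"year_slider"}, "reads": set()},
}


def cells_to_rerun(changed_variable):
    """Return the cells that depend, directly or indirectly, on a changed variable."""
    stale_variables = {changed_variable}
    to_rerun = []
    progress = True
    while progress:
        progress = False
        for name, cell in cells.items():
            # A cell is stale if it reads any stale variable
            if name not in to_rerun and cell["reads"] & stale_variables:
                to_rerun.append(name)
                # Everything this cell defines is now stale too
                stale_variables |= cell["defines"]
                progress = True
    return to_rerun


print(cells_to_rerun("town_selector"))
```

Changing the dropdown marks the query and chart cells as stale, while the unrelated slider cell is left alone.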
258+
259+
### Exploring price distributions with interactive box plots {#exploring-price-distributions}
260+
261+
Let's dive deeper into the data by examining the distribution of property prices in London for different years.
262+
A box and whisker plot will show us the median, quartiles, and outliers, giving us a much better understanding than just the average price.
263+
First, let's create a year slider that will let us interactively explore different years.
264+
265+
In a new cell, add the following:
266+
267+
```python
268+
year_slider = mo.ui.slider(
269+
start=1995,
270+
stop=2024,
271+
value=2020,
272+
step=1,
273+
label='Select Year:',
274+
show_value=True
275+
)
276+
year_slider
277+
```
278+
279+
Now, let's query the individual property prices for the selected year.
280+
Note that we're not aggregating here - we want all the individual transactions to build our distribution:
281+
282+
```python
283+
query_distribution = f"""
284+
SELECT
285+
price,
286+
toYear(date) AS year
287+
FROM remoteSecure(
288+
'{os.environ.get("CLICKHOUSE_CLOUD_HOSTNAME")}',
289+
'default.pp_complete',
290+
'{os.environ.get("CLICKHOUSE_CLOUD_USER")}',
291+
'{os.environ.get("CLICKHOUSE_CLOUD_PASSWORD")}'
292+
)
293+
WHERE town = 'LONDON'
294+
AND toYear(date) = {year_slider.value}
295+
AND price > 0
296+
AND price < 5000000
297+
"""
298+
299+
df_distribution = chdb.query(query_distribution, "DataFrame")
300+
301+
# Create an interactive box plot
302+
fig_box = go.Figure()
303+
304+
fig_box.add_trace(
305+
go.Box(
306+
y=df_distribution['price'],
307+
name=f'London {year_slider.value}',
308+
boxmean='sd', # Show mean and standard deviation
309+
marker_color='lightblue',
310+
boxpoints='outliers' # Show outlier points
311+
)
312+
)
313+
314+
fig_box.update_layout(
315+
title=f'Distribution of Property Prices in London ({year_slider.value})',
316+
yaxis=dict(
317+
title='Price (£)',
318+
tickformat=',.0f'
319+
),
320+
showlegend=False,
321+
height=600
322+
)
323+
324+
fig_box
325+
```
326+
If you select the options button in the top right-hand corner of the cell, you can hide
327+
the code.
328+
As you move the slider, the plot will automatically update thanks to Marimo's reactive execution:
329+
330+
<Image size="md" img={image_8} alt="Marimo dynamic box plot"/>
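To make the box plot's summary statistics concrete, the sketch below computes the median, quartiles, and outliers for a handful of made-up prices using only the Python standard library. It assumes the common 1.5 × IQR rule for outliers; Plotly may use a slightly different quartile method, so its values can differ from these.

```python
import statistics

# Made-up illustrative prices, NOT taken from the UK price paid dataset
prices = [150000, 200000, 250000, 300000, 350000, 400000, 450000, 2000000]

# Split the sorted data into four equal parts: Q1, median (Q2), and Q3
q1, median, q3 = statistics.quantiles(prices, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the height of the box

# A common convention: points beyond 1.5 * IQR from the box count as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [p for p in prices if p < lower_fence or p > upper_fence]

print(f"median={median}, Q1={q1}, Q3={q3}, outliers={outliers}")
```

A single very expensive sale is flagged as an outlier and barely moves the median, which is why the box plot gives a more robust picture than the average alone.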
331+
332+
## Summary {#summary}
333+
334+
This guide demonstrated how you can use chDB to explore your data in ClickHouse Cloud using Marimo notebooks.
335+
Using the UK Property Price dataset, we showed how to query remote ClickHouse Cloud data with the `remoteSecure()` function, and convert results directly to Pandas DataFrames for analysis and visualization.
336+
With chDB and Marimo's reactive execution model, data scientists can use ClickHouse's SQL capabilities alongside familiar Python tools like Pandas and Plotly. Interactive widgets and automatic dependency tracking make exploratory analysis more efficient and reproducible.

scripts/aspell-dict-file.txt

Lines changed: 3 additions & 0 deletions
@@ -1124,6 +1124,9 @@ keypair
11241124
detections
11251125
--docs/integrations/data-visualization/dot-and-clickhouse.md--
11261126
Hashboard
1127+
--docs/use-cases/AI_ML/data-exploration/marimo-notebook.md--
1128+
Plotly
1129+
quartiles
11271130
--docs/integrations/language-clients/python/advanced-usage.md--
11281131
AsyncClient
11291132
BinaryIO