Commit 04019d6
second pass before requesting review
1 parent 7dafb87 commit 04019d6

1 file changed: +45 −29 lines

docs/use-cases/AI_ML/jupyter-notebook.md (45 additions & 29 deletions)
````diff
@@ -1,9 +1,9 @@
 ---
 slug: /use-cases/AI/jupyter-notebook
-sidebar_label: 'Exploring data in Jupyter notebooks with chdb'
-title: 'Exploring data in Jupyter notebooks with chdb'
-description: 'This guide explains how to setup and use chdb to explore data from ClickHouse Cloud or local files in Jupyer notebooks'
-keywords: ['ML', 'Jupyer', 'chdb', 'pandas']
+sidebar_label: 'Exploring data in Jupyter notebooks with chDB'
+title: 'Exploring data in Jupyter notebooks with chDB'
+description: 'This guide explains how to set up and use chDB to explore data from ClickHouse Cloud or local files in Jupyter notebooks'
+keywords: ['ML', 'Jupyter', 'chDB', 'pandas']
 doc_type: 'guide'
 ---
````
````diff
@@ -18,14 +18,20 @@ import image_7 from '@site/static/images/use-cases/AI_ML/jupyter/7.png';
 import image_8 from '@site/static/images/use-cases/AI_ML/jupyter/8.png';
 import image_9 from '@site/static/images/use-cases/AI_ML/jupyter/9.png';

-# Exploring data with Jupyter notebooks and chdb
+# Exploring data with Jupyter notebooks and chDB

-In this guide, you will learn how you can explore a dataset on ClickHouse Cloud data in Jupyter notebook with the help of [chdb](/chdb) - a fast in-process SQL OLAP Engine powered by ClickHouse.
+In this guide, you will learn how to explore a dataset on ClickHouse Cloud in a Jupyter notebook with the help of [chDB](/chdb) - a fast in-process SQL OLAP engine powered by ClickHouse.

-Pre-requisites:
+**Prerequisites**:
 - a virtual environment
 - a working ClickHouse Cloud service and your [connection details](/cloud/guides/sql-console/gather-connection-details)

+**What you'll learn:**
+- Connect to ClickHouse Cloud from Jupyter notebooks using chDB
+- Query remote datasets and convert results to Pandas DataFrames
+- Combine cloud data with local CSV files for analysis
+- Visualize data using matplotlib
+
 We'll be using the UK Property Price dataset which is available on ClickHouse Cloud as one of the starter datasets.
 It contains data about the prices that houses were sold for in the United Kingdom from 1995 to 2024.

````
````diff
@@ -47,7 +53,7 @@ Then click `Import dataset`:

 ClickHouse will automatically create the `pp_complete` table in the `default` database and fill the table with 28.92 million rows of price point data.

-In order to reduce the likelihood of exposing your credentials, we recommend to add your Cloud username and password as environment variables.
+To reduce the likelihood of exposing your credentials, we recommend adding your Cloud username and password as environment variables on your local machine.
 From a terminal run the following command to add your username and password as environment variables:

 ```bash
````
````diff
@@ -57,7 +63,7 @@ export CLICKHOUSE_PASSWORD=your_actual_password

 :::note
 The environment variables above persist only as long as your terminal session.
-To set them permanently, for Linux or MacOS you'll want to set these permanently.
+To set them permanently, add them to your shell configuration file (for example `~/.bashrc` or `~/.zshrc`).
 :::

 Now activate your virtual environment.
````
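Once exported, the variables can be read back from a notebook cell with the standard library. A minimal sketch, assuming the `CLICKHOUSE_USER`/`CLICKHOUSE_PASSWORD` names used in the `export` commands above:

```python
import os

# Read the Cloud credentials exported in the terminal session that
# launched Jupyter. Using .get() with a default avoids a KeyError
# when a variable was never set.
username = os.environ.get("CLICKHOUSE_USER", "default")
password = os.environ.get("CLICKHOUSE_PASSWORD", "")

# Fail early with a clear message rather than at query time.
if not password:
    print("Warning: CLICKHOUSE_PASSWORD is not set; remote queries will fail.")
```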
````diff
@@ -67,7 +73,7 @@ From within your virtual environment, install Jupyter Notebook with the followin
 pip install notebook
 ```

-and launch Jupyter Notebook with the following command:
+Then launch Jupyter Notebook with the following command:

 ```python
 jupyter notebook
````
````diff
@@ -83,13 +89,13 @@ Select any Python kernel available to you, in this example we will select the `i

 <Image size="md" img={image_5} alt="Select kernel"/>

-In a blank cell, you can type the following command to install chdb which we will be using connect to our remote ClickHouse Cloud instance:
+In a blank cell, type the following command to install chDB, which we will use to connect to our remote ClickHouse Cloud instance:

 ```python
 pip install chdb
 ```

-You can now import chdb and run a simple query to check that everything is set up correctly:
+You can now import chDB and run a simple query to check that everything is set up correctly:

 ```python
 import chdb
````
````diff
@@ -100,11 +106,11 @@ print(result)

 ## Exploring the data {#exploring-the-data}

-With the UK price paid data set up and chdb up and running in a Jupyter notebook, we can now get started exploring our data.
+With the UK price paid data set up and chDB up and running in a Jupyter notebook, we can now start exploring our data.

-Let's imagine we are interested in checking how price has changed with time for a specific area in the UK, for instance: London.
-ClickHouse's [`remoteSecure`](/sql-reference/table-functions/remote) function allows us to easily retrieve the data from ClickHouse Cloud.
-We can instruct chdb to return it in process as a Pandas data frame - which is a convenient and familiar way of working with data.
+Let's imagine we are interested in how prices have changed over time in a specific area of the UK, such as the capital city, London.
+ClickHouse's [`remoteSecure`](/sql-reference/table-functions/remote) function allows you to easily retrieve the data from ClickHouse Cloud.
+You can instruct chDB to return this data in-process as a Pandas DataFrame, which is a convenient and familiar way of working with data.

 Write the following query to fetch the UK price paid data from your ClickHouse Cloud service and turn it into a `pandas.DataFrame`:

````
````diff
@@ -127,7 +133,7 @@ SELECT
     toYear(date) AS year,
     avg(price) AS avg_price
 FROM remoteSecure(
-'ztztn4astx.europe-west4.gcp.clickhouse.cloud',
+'****.europe-west4.gcp.clickhouse.cloud',
 default.pp_complete,
 '{username}',
 '{password}'
````
````diff
@@ -141,17 +147,20 @@ df = chdb.query(query, "DataFrame")
 df.head()
 ```

-In the snippet above, `chdb.query(query, "DataFrame")` runs the specified query and outputs the result to the terminal as a pandas DataFrame.
+In the snippet above, `chdb.query(query, "DataFrame")` runs the specified query and returns the result as a Pandas DataFrame.
 In the query we are using the `remoteSecure` function to connect to ClickHouse Cloud.
 The `remoteSecure` functions takes as parameters:
 - a connection string
 - the name of the database and table to use
-- your user name
+- your username
 - your password

-As a security best practice, you should should prefer using environment variables for the username and password parameters rather than specifying them directly in the function, although this is possible if you wish.
+As a security best practice, you should prefer using environment variables for the username and password parameters rather than specifying them directly in the function, although this is possible if you wish.

-The `remoteSecure` function connects to the remote ClickHouse Cloud service, runs the query and returns the result. Depending on the size of your data this could take a few seconds. In this case we return an average price point per year, and filter by `town='LONDON'`. The result is then stored as a DataFrame in a variable called `df`.
+The `remoteSecure` function connects to the remote ClickHouse Cloud service, runs the query and returns the result.
+Depending on the size of your data, this could take a few seconds.
+In this case we return an average price point per year, and filter by `town='LONDON'`.
+The result is then stored as a DataFrame in a variable called `df`.

 `df.head` displays only the first few rows of the returned data:

````
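The `{username}`/`{password}` placeholders suggest the query string is assembled with Python f-string interpolation. A hedged sketch of how that assembly might look (the host name is a placeholder, and the `chdb.query` call is left commented out so the cell does not require a live service):

```python
import os

# Placeholder connection details; substitute your own service host.
host = "your-service.region.provider.clickhouse.cloud"
username = os.environ.get("CLICKHOUSE_USER", "default")
password = os.environ.get("CLICKHOUSE_PASSWORD", "")

# Parameter order mirrors remoteSecure: connection string,
# database.table, username, password.
query = f"""
SELECT
    toYear(date) AS year,
    avg(price) AS avg_price
FROM remoteSecure('{host}', default.pp_complete, '{username}', '{password}')
WHERE town = 'LONDON'
GROUP BY year
ORDER BY year
"""

# df = chdb.query(query, "DataFrame")  # uncomment to run against your service
```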
````diff
@@ -170,7 +179,7 @@ dtype: object
 ```

 Notice that while `date` is of type `Date` in ClickHouse, in the resulting data frame it is of type `uint16`.
-chdb automatically infers the most appropriate type when returning the DataFrame.
+chDB automatically infers the most appropriate type when returning the DataFrame.

 With the data now available to us in a familiar form, let's explore how prices of property in London have changed with time.

````
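The guide's plotting code is not shown in this diff. As a stand-in, a minimal matplotlib sketch of a price-by-year line chart; the values below are illustrative, not the guide's actual query results:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; unnecessary inside a notebook
import matplotlib.pyplot as plt

# Illustrative yearly averages standing in for the `df` returned by chdb.
years = [1995, 2000, 2005, 2010, 2015, 2019]
avg_price = [150_000, 220_000, 300_000, 350_000, 600_000, 1_000_000]

fig, ax = plt.subplots()
ax.plot(years, avg_price)
ax.set_xlabel("year")
ax.set_ylabel("average price (£)")
ax.set_title("Average London property price by year")
```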
````diff
@@ -194,10 +203,11 @@ plt.show()

 <Image size="md" img={image_7} alt="dataframe preview"/>

-Perhaps unsurprisingly, property prices in London have massively increased over time.
+Perhaps unsurprisingly, property prices in London have increased substantially over time.

-A colleague has sent us a .csv file with additional housing related variables.
-Let's plot some of these against the housing prices and see if we can discover any interesting correlations.
+A fellow data scientist has sent us a .csv file with additional housing-related variables and is curious how
+the number of houses sold in London has changed over time.
+Let's plot some of these against the housing prices and see if we can discover any correlation.

 You can use the `file` table engine to read files directly on your local machine.
 In a new cell, run the following command to make a new DataFrame from the local .csv file.
````
````diff
@@ -207,7 +217,7 @@ query = f"""
 SELECT
     toYear(date) AS year,
     sum(houses_sold)*1000
-FROM file('/Users/sstruw/Desktop/housing_in_london_monthly_variables.csv')
+FROM file('/Users/datasci/Desktop/housing_in_london_monthly_variables.csv')
 WHERE area = 'city of london' AND houses_sold IS NOT NULL
 GROUP BY toYear(date)
 ORDER BY year;
````
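Before plotting sales against prices, the two result sets need to be aligned on `year`. A sketch using Pandas; the frame names and values are illustrative stand-ins, not the guide's actual data:

```python
import pandas as pd

# Illustrative stand-ins: yearly average prices from ClickHouse Cloud
# and yearly sales volumes from the local CSV.
prices = pd.DataFrame({"year": [1995, 1996, 1997],
                       "avg_price": [150_000.0, 160_000.0, 175_000.0]})
sales = pd.DataFrame({"year": [1995, 1996],
                      "houses_sold": [160_000, 180_000]})

# An inner join on the shared year column keeps only years present in
# both sources, which is what a combined plot needs.
combined = prices.merge(sales, on="year", how="inner")
print(combined)
```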
````diff
@@ -265,9 +275,15 @@ plt.show()

 <Image size="md" img={image_9} alt="Plot of remote data set and local data set"/>

-It looks like housing prices in London have steadily risen over the years, while the number of houses sold has fluctuated greatly over time but generally trends downwards, at times even dropping below 1995 levels.
-Yikes!
+From the plotted data, we see that sales started at around 160,000 in 1995 and surged quickly, peaking at around 540,000 in 1999.
+After that, volumes declined sharply through the mid-2000s, dropping severely during the 2007-2008 financial crisis and falling to around 140,000.
+Prices, on the other hand, showed steady, consistent growth from about £150,000 in 1995 to around £300,000 by 2005.
+Growth accelerated significantly after 2012, rising steeply from roughly £400,000 to over £1,000,000 by 2019.
+Unlike sales volume, prices showed minimal impact from the 2008 crisis and maintained an upward trajectory. Yikes!

 ## Summary {#summary}

-Whilst your average London-based data scientist may not be able to afford their own home any time soon, chdb allows you to easily work with data from multiple sources like ClickHouse Cloud and local CSV files easily in Jupyter notebook using the libraries you know and love like Pandas and matplotlib.
+This guide demonstrated how chDB enables seamless data exploration in Jupyter notebooks by connecting ClickHouse Cloud with local data sources.
+Using the UK Property Price dataset, we showed how to query remote ClickHouse Cloud data with the `remoteSecure()` function, read local CSV files with the `file()` table engine, and convert results directly to Pandas DataFrames for analysis and visualization.
+Through chDB, data scientists can leverage ClickHouse's powerful SQL capabilities alongside familiar Python tools like Pandas and matplotlib, making it easy to combine multiple data sources for comprehensive analysis.
+While many a London-based data scientist may not be able to afford their own home or apartment any time soon, at least they can analyze the market that priced them out!
````
