
Commit a491854

add nessie catalog support
1 parent b6e22b1 commit a491854

File tree

4 files changed: +288 −1 lines changed

docs/integrations/index.mdx

Lines changed: 1 addition & 0 deletions

@@ -246,6 +246,7 @@ We are actively compiling this list of ClickHouse integrations below, so it's no
 |Redis|<Redissvg alt="Redis logo" style={{width: '3rem', 'height': '3rem'}}/>|Data ingestion|Allows ClickHouse to use [Redis](https://redis.io/) as a dictionary source.|[Documentation](/sql-reference/dictionaries/index.md#redis)|
 |Redpanda|<Image img={redpanda} alt="Redpanda logo" size="logo"/>|Data ingestion|Redpanda is the streaming data platform for developers. It's API-compatible with Apache Kafka, but 10x faster, much easier to use, and more cost-effective.|[Blog](https://redpanda.com/blog/real-time-olap-database-clickhouse-redpanda)|
 |REST Catalog||Data ingestion|Integration with the REST Catalog specification for Iceberg tables, supporting multiple catalog providers including Tabular.io.|[Documentation](/use-cases/data-lake/rest-catalog)|
+|Nessie||Data ingestion|Integration with Nessie, an open-source transactional catalog for data lakes with Git-like data version control.|[Documentation](/use-cases/data-lake/nessie-catalog)|
 |Rust|<Image img={rust} size="logo" alt="Rust logo"/>|Language client|A typed client for ClickHouse|[Documentation](/integrations/language-clients/rust.md)|
 |SQLite|<Sqlitesvg alt="Sqlite logo" style={{width: '3rem', 'height': '3rem'}}/>|Data ingestion|Allows importing and exporting data to SQLite, and supports queries to SQLite tables directly from ClickHouse.|[Documentation](/engines/table-engines/integrations/sqlite)|
 |Superset|<Supersetsvg alt="Superset logo" style={{width: '3rem'}}/>|Data visualization|Explore and visualize your ClickHouse data with Apache Superset.|[Documentation](/integrations/data-visualization/superset-and-clickhouse.md)|

docs/use-cases/data_lake/index.md

Lines changed: 1 addition & 0 deletions

@@ -14,3 +14,4 @@ ClickHouse supports integration with multiple catalogs (Unity, Glue, REST, Polar
 | [Querying data in S3 using ClickHouse and the Glue Data Catalog](/use-cases/data-lake/glue-catalog) | Query your data in S3 buckets using ClickHouse and the Glue Data Catalog. |
 | [Querying data in S3 using ClickHouse and the Unity Data Catalog](/use-cases/data-lake/unity-catalog) | Query your data using the Unity Catalog. |
 | [Querying data in S3 using ClickHouse and the REST Catalog](/use-cases/data-lake/rest-catalog) | Query your data using the REST Catalog (Tabular.io). |
+| [Querying data in S3 using ClickHouse and the Nessie Catalog](/use-cases/data-lake/nessie-catalog) | Query your data using the Nessie Catalog with Git-like data version control. |

docs/use-cases/data_lake/nessie_catalog.md

Lines changed: 284 additions & 0 deletions
@@ -0,0 +1,284 @@

---
slug: /use-cases/data-lake/nessie-catalog
sidebar_label: 'Nessie Catalog'
title: 'Nessie Catalog'
pagination_prev: null
pagination_next: null
description: 'In this guide, we will walk you through the steps to query your data using ClickHouse and the Nessie Catalog.'
keywords: ['Nessie', 'REST', 'Transactional', 'Data Lake', 'Iceberg', 'Git-like']
show_related_blogs: true
---

import ExperimentalBadge from '@theme/badges/ExperimentalBadge';

<ExperimentalBadge/>

:::note
Integration with the Nessie Catalog works with Iceberg tables only.
The integration supports AWS S3 as well as other cloud storage providers.
:::

ClickHouse supports integration with multiple catalogs (Unity, Glue, REST, Polaris, etc.). This guide walks you through the steps to query your data using ClickHouse and the [Nessie](https://projectnessie.org/) catalog.

Nessie is an open-source transactional catalog for data lakes that provides:
- **Git-inspired** data version control with branches and commits
- **Cross-table transactions** and visibility guarantees
- A **REST API** compliant with the Iceberg REST catalog specification
- An **open data lake** approach supporting Hive, Spark, Dremio, Trino, and more
- **Production-ready** deployment on Docker or Kubernetes

:::note
As this feature is experimental, you will need to enable it with:
`SET allow_experimental_database_iceberg = 1;`
:::

## Local Development Setup {#local-development-setup}

For local development and testing, you can use a containerized Nessie setup. This approach is ideal for learning, prototyping, and development environments.

### Prerequisites {#local-prerequisites}

1. **Docker and Docker Compose**: ensure Docker is installed and running
2. **Sample setup**: the official Nessie docker-compose setup described below

### Setting up Local Nessie Catalog {#setting-up-local-nessie-catalog}

You can use the official [Nessie docker-compose setup](https://projectnessie.org/guides/setting-up/), which provides a complete environment with Nessie, an in-memory version store, and MinIO for object storage.

**Step 1:** Create a new folder in which to run the example, then create a file `docker-compose.yml` with the following configuration:

```yaml
version: '3.8'

services:
  nessie:
    image: ghcr.io/projectnessie/nessie:latest
    ports:
      - "19120:19120"
    environment:
      - nessie.version.store.type=IN_MEMORY
      - nessie.catalog.default-warehouse=warehouse
      - nessie.catalog.warehouses.warehouse.location=s3://my-bucket/
      - nessie.catalog.service.s3.default-options.endpoint=http://minio:9000/
      - nessie.catalog.service.s3.default-options.access-key=urn:nessie-secret:quarkus:nessie.catalog.secrets.access-key
      - nessie.catalog.service.s3.default-options.path-style-access=true
      - nessie.catalog.service.s3.default-options.auth-type=STATIC
      - nessie.catalog.secrets.access-key.name=admin
      - nessie.catalog.secrets.access-key.secret=password
      - nessie.catalog.service.s3.default-options.region=us-east-1
      - nessie.server.authentication.enabled=false
    depends_on:
      minio:
        condition: service_healthy
    networks:
      - iceberg_net

  minio:
    image: quay.io/minio/minio
    ports:
      - "9002:9000"
      - "9003:9001"
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
      - MINIO_REGION=us-east-1
    healthcheck:
      test: ["CMD", "mc", "ready", "local"]
      interval: 5s
      timeout: 10s
      retries: 5
      start_period: 30s
    entrypoint: >
      /bin/sh -c "
      minio server /data --console-address ':9001' &
      sleep 10;
      mc alias set myminio http://localhost:9000 admin password;
      mc mb myminio/my-bucket --ignore-existing;
      tail -f /dev/null"
    networks:
      - iceberg_net

  clickhouse:
    image: clickhouse/clickhouse-server:head
    container_name: nessie-clickhouse
    user: '0:0'  # Ensures root permissions
    ports:
      - "8123:8123"
      - "9000:9000"
    volumes:
      - clickhouse_data:/var/lib/clickhouse
      - ./clickhouse/data_import:/var/lib/clickhouse/data_import  # Mount dataset folder
    networks:
      - iceberg_net
    environment:
      - CLICKHOUSE_DB=default
      - CLICKHOUSE_USER=default
      - CLICKHOUSE_DO_NOT_CHOWN=1
      - CLICKHOUSE_PASSWORD=
    depends_on:
      nessie:
        condition: service_started
      minio:
        condition: service_healthy

volumes:
  clickhouse_data:

networks:
  iceberg_net:
    driver: bridge
```
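
A note on the port mappings: MinIO's S3 API and console are published on host ports 9002 and 9003 because ClickHouse already claims 8123 (HTTP) and 9000 (native protocol) on the host, while Nessie's REST endpoint is exposed on its default port 19120. Inside the `iceberg_net` network, the services address each other by container name.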

**Step 2:** Run the following command to start the services:

```bash
docker compose up -d
```

**Step 3:** Wait for all services to be ready. You can follow the logs with:

```bash
docker compose logs -f
```

:::note
The Nessie setup uses an in-memory version store and requires that sample data be loaded into the Iceberg tables first. Make sure the environment has created and populated the tables before attempting to query them through ClickHouse.
:::

### Connecting to Local Nessie Catalog {#connecting-to-local-nessie-catalog}

Connect to your ClickHouse container:

```bash
docker exec -it nessie-clickhouse clickhouse-client
```

Then create the database connection to the Nessie catalog:

```sql
SET allow_experimental_database_iceberg = 1;

CREATE DATABASE demo
ENGINE = DataLakeCatalog('http://nessie:19120/iceberg', 'admin', 'password')
SETTINGS catalog_type = 'rest', storage_endpoint = 'http://minio:9002/my-bucket', warehouse = 'warehouse'
```
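
As a quick sanity check (a minimal sketch, assuming the `demo` database created above), you can confirm that the catalog-backed database was registered:

```sql
-- The new entry should be listed alongside the built-in databases
SHOW DATABASES;

-- Its engine should be reported as DataLakeCatalog
SELECT name, engine
FROM system.databases
WHERE name = 'demo';
```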

## Querying Nessie catalog tables using ClickHouse {#querying-nessie-catalog-tables-using-clickhouse}

Now that the connection is in place, you can start querying via the Nessie catalog. For example:

```sql
USE demo;

SHOW TABLES;
```

If your setup includes sample data (such as the taxi dataset), you should see tables like:

```sql title="Response"
┌─name──────────┐
│ default.taxis │
└───────────────┘
```

:::note
If you don't see any tables, this usually means:
1. The environment hasn't created the sample tables yet
2. The Nessie catalog service isn't fully initialized
3. The sample data loading process hasn't completed

You can check the Nessie logs to see the catalog activity:
```bash
docker compose logs nessie
```
:::

To query a table (if available):

```sql
SELECT count(*) FROM `default.taxis`;
```

```sql title="Response"
┌─count()─┐
│ 2171187 │
└─────────┘
```

:::note Backticks required
Backticks are required because ClickHouse doesn't support more than one namespace, so the Iceberg namespace (`default`) and table name (`taxis`) are combined into a single identifier.
:::
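
The same combined identifier is used when you qualify the table with its ClickHouse database instead of relying on `USE demo`:

```sql
-- `default` is the Iceberg namespace, `taxis` the table; together they form one identifier
SELECT count(*) FROM demo.`default.taxis`;
```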

To inspect the table DDL:

```sql
SHOW CREATE TABLE `default.taxis`;
```

```sql title="Response"
┌─statement───────────────────────────────────────────────────────────────────────────────┐
│ CREATE TABLE demo.`default.taxis`                                                        │
│ (                                                                                        │
│     `VendorID` Nullable(Int64),                                                          │
│     `tpep_pickup_datetime` Nullable(DateTime64(6)),                                      │
│     `tpep_dropoff_datetime` Nullable(DateTime64(6)),                                     │
│     `passenger_count` Nullable(Float64),                                                 │
│     `trip_distance` Nullable(Float64),                                                   │
│     `RatecodeID` Nullable(Float64),                                                      │
│     `store_and_fwd_flag` Nullable(String),                                               │
│     `PULocationID` Nullable(Int64),                                                      │
│     `DOLocationID` Nullable(Int64),                                                      │
│     `payment_type` Nullable(Int64),                                                      │
│     `fare_amount` Nullable(Float64),                                                     │
│     `extra` Nullable(Float64),                                                           │
│     `mta_tax` Nullable(Float64),                                                         │
│     `tip_amount` Nullable(Float64),                                                      │
│     `tolls_amount` Nullable(Float64),                                                    │
│     `improvement_surcharge` Nullable(Float64),                                           │
│     `total_amount` Nullable(Float64),                                                    │
│     `congestion_surcharge` Nullable(Float64),                                            │
│     `airport_fee` Nullable(Float64)                                                      │
│ )                                                                                        │
│ ENGINE = Iceberg('http://localhost:9002/my-bucket/default/taxis/', 'admin', '[HIDDEN]') │
└──────────────────────────────────────────────────────────────────────────────────────────┘
```
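
With the schema in hand, you can query the table like any other. A small sketch, assuming the taxi sample data shown above:

```sql
-- Average fare and trip distance per payment type
SELECT
    payment_type,
    avg(fare_amount)   AS avg_fare,
    avg(trip_distance) AS avg_distance
FROM `default.taxis`
GROUP BY payment_type
ORDER BY payment_type;
```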

## Loading data from your Data Lake into ClickHouse {#loading-data-from-your-data-lake-into-clickhouse}

If you need to load data from the Nessie catalog into ClickHouse, start by creating a local ClickHouse table:

```sql
CREATE TABLE taxis
(
    `VendorID` Int64,
    `tpep_pickup_datetime` DateTime64(6),
    `tpep_dropoff_datetime` DateTime64(6),
    `passenger_count` Float64,
    `trip_distance` Float64,
    `RatecodeID` Float64,
    `store_and_fwd_flag` String,
    `PULocationID` Int64,
    `DOLocationID` Int64,
    `payment_type` Int64,
    `fare_amount` Float64,
    `extra` Float64,
    `mta_tax` Float64,
    `tip_amount` Float64,
    `tolls_amount` Float64,
    `improvement_surcharge` Float64,
    `total_amount` Float64,
    `congestion_surcharge` Float64,
    `airport_fee` Float64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(tpep_pickup_datetime)
ORDER BY (VendorID, tpep_pickup_datetime, PULocationID, DOLocationID);
```
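
A brief note on the table design: `PARTITION BY toYYYYMM(tpep_pickup_datetime)` groups rows into one partition per calendar month, and the `ORDER BY` key leads with `VendorID` and pickup time, which suits queries that filter by vendor and time range.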

Then load the data from your Nessie catalog table via an `INSERT INTO SELECT`:

```sql
INSERT INTO taxis
SELECT * FROM demo.`default.taxis`;
```
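
To verify the load and see how the monthly partitioning laid the data out, run a quick check against the `taxis` table:

```sql
-- Row count should match the source catalog table
SELECT count(*) FROM taxis;

-- One entry per month, as defined by PARTITION BY toYYYYMM(tpep_pickup_datetime)
SELECT partition, sum(rows) AS row_count
FROM system.parts
WHERE active AND database = currentDatabase() AND table = 'taxis'
GROUP BY partition
ORDER BY partition;
```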

sidebars.js

Lines changed: 2 additions & 1 deletion
@@ -168,7 +168,8 @@ const sidebars = {
       items: [
         "use-cases/data_lake/glue_catalog",
         "use-cases/data_lake/unity_catalog",
-        "use-cases/data_lake/rest_catalog"
+        "use-cases/data_lake/rest_catalog",
+        "use-cases/data_lake/nessie_catalog"
       ]
     },
     {
