
Commit ad918d5

mongo: add a quick-start page for working with json

1 parent b883f9f

2 files changed: +316 -0

---
title: 'Working with JSON in ClickHouse'
sidebar_label: 'Working with JSON'
slug: /integrations/clickpipes/mongodb/quickstart
description: 'Common patterns for working with JSON data replicated from MongoDB to ClickHouse via ClickPipes'
---

# Working with JSON in ClickHouse

This guide provides common patterns for working with JSON data replicated from MongoDB to ClickHouse via ClickPipes.

Suppose we created a collection `t1` in MongoDB to track customer orders:

```javascript
db.t1.insertOne({
  "order_id": "ORD-001234",
  "customer_id": 98765,
  "status": "completed",
  "total_amount": 299.97,
  "order_date": new Date(),
  "shipping": {
    "method": "express",
    "city": "Seattle",
    "cost": 19.99
  },
  "items": [
    {
      "category": "electronics",
      "price": 149.99
    },
    {
      "category": "accessories",
      "price": 24.99
    }
  ]
})
```

The MongoDB CDC connector replicates MongoDB documents to ClickHouse using the native JSON data type. The replicated table `t1` in ClickHouse will contain the following row:

```shell
Row 1:
──────
_id:                "68a4df4b9fe6c73b541703b0"
_full_document:     {"_id":"68a4df4b9fe6c73b541703b0","customer_id":98765,"items":[{"category":"electronics","price":149.99},{"category":"accessories","price":24.99}],"order_date":"2025-08-19T20:32:11.705Z","order_id":"ORD-001234","shipping":{"city":"Seattle","cost":19.99,"method":"express"},"status":"completed","total_amount":299.97}
_peerdb_synced_at:  2025-08-19 20:50:42.005000000
_peerdb_is_deleted: 0
_peerdb_version:    0
```

## Table schema {#table-schema}

The replicated tables use this standard schema:

```shell
┌─name───────────────┬─type──────────┐
│ _id                │ String        │
│ _full_document     │ JSON          │
│ _peerdb_synced_at  │ DateTime64(9) │
│ _peerdb_version    │ Int64         │
│ _peerdb_is_deleted │ Int8          │
└────────────────────┴───────────────┘
```

- `_id`: Primary key from MongoDB
- `_full_document`: MongoDB document replicated as JSON data type
- `_peerdb_synced_at`: Records when the row was last synced
- `_peerdb_version`: Tracks the version of the row; incremented when the row is updated or deleted
- `_peerdb_is_deleted`: Marks whether the row is deleted
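
You can verify the schema of your own replicated table with a plain `DESCRIBE` (a minimal check; the full output also includes default-expression and codec columns not shown above):

```sql
DESCRIBE TABLE t1;
```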

### ReplacingMergeTree table engine {#replacingmergetree-table-engine}

ClickPipes maps MongoDB collections into ClickHouse using the `ReplacingMergeTree` table engine family. With this engine, updates are modeled as inserts with a newer version (`_peerdb_version`) of the document for a given primary key (`_id`), enabling efficient handling of updates, replaces, and deletes as versioned inserts.
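
For illustration, suppose the order's `status` is later updated in MongoDB. The update arrives as a second row for the same `_id` with a higher `_peerdb_version` (the query and commented output below are an illustrative sketch; the updated value is hypothetical):

```sql
SELECT _id, _peerdb_version, _full_document.status FROM t1;

-- 68a4df4b9fe6c73b541703b0    0    completed
-- 68a4df4b9fe6c73b541703b0    1    shipped      <- hypothetical update, not yet merged away
```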

`ReplacingMergeTree` clears out duplicates asynchronously in the background. To guarantee the absence of duplicates for the same row, use the [`FINAL` modifier](https://clickhouse.com/docs/sql-reference/statements/select/from#final-modifier). For example:

```sql
SELECT * FROM t1 FINAL;
```

### Handling deletes {#handling-deletes}

Deletes from MongoDB are propagated as new rows marked as deleted using the `_peerdb_is_deleted` column. You typically want to filter these out in your queries:

```sql
SELECT * FROM t1 FINAL WHERE _peerdb_is_deleted = 0;
```

You can also create a row-level policy to automatically filter out deleted rows instead of specifying the filter in each query:

```sql
CREATE ROW POLICY policy_name ON t1
FOR SELECT USING _peerdb_is_deleted = 0;
```
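
With the policy in place, queries from the users or roles it applies to no longer need the explicit filter (a sketch; row policies only take effect for the users and roles they are assigned to):

```sql
SELECT * FROM t1 FINAL; -- deleted rows are filtered out by the row policy
```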

## Querying JSON data {#querying-json-data}

You can directly query JSON fields using dot syntax:

```sql
SELECT
    _full_document.order_id,
    _full_document.shipping.method
FROM t1;
```

Result:
```shell
┌─_full_document.order_id─┬─_full_document.shipping.method─┐
│ ORD-001234              │ express                        │
└─────────────────────────┴────────────────────────────────┘
```

### Dynamic type {#dynamic-type}

In ClickHouse, each field in a JSON column has the `Dynamic` type. The Dynamic type allows ClickHouse to store values of any type without knowing the type in advance. You can verify this with the `toTypeName` function:

```sql
SELECT toTypeName(_full_document.customer_id) AS type FROM t1;
```

Result:
```shell
┌─type────┐
│ Dynamic │
└─────────┘
```

To examine the underlying data type(s) of a field, use the `dynamicType` function. Note that different rows can hold different data types for the same field name:

```sql
SELECT dynamicType(_full_document.customer_id) AS type FROM t1;
```

Result:
```shell
┌─type──┐
│ Int64 │
└───────┘
```
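
For example, if some documents stored `customer_id` as a string while others stored it as a number, you could inspect the distribution of types across rows (a sketch; the mixed types here are hypothetical):

```sql
SELECT
    dynamicType(_full_document.customer_id) AS type,
    count() AS rows
FROM t1
GROUP BY type;
```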

[Regular functions](https://clickhouse.com/docs/sql-reference/functions/regular-functions) work with the Dynamic type just as they do with regular columns:

**Example 1: Date parsing**
```sql
SELECT parseDateTimeBestEffortOrNull(_full_document.order_date) AS order_date FROM t1;
```

Result:
```shell
┌─order_date──────────┐
│ 2025-08-19 20:32:11 │
└─────────────────────┘
```

**Example 2: Conditional logic**
```sql
SELECT multiIf(
    _full_document.total_amount < 100, 'less_than_100',
    _full_document.total_amount < 1000, 'less_than_1000',
    '1000+') AS spendings
FROM t1;
```

Result:
```shell
┌─spendings──────┐
│ less_than_1000 │
└────────────────┘
```

**Example 3: Array operations**
```sql
SELECT length(_full_document.items) AS item_count FROM t1;
```

Result:
```shell
┌─item_count─┐
│          2 │
└────────────┘
```

### Field casting {#field-casting}

[Aggregation functions](https://clickhouse.com/docs/sql-reference/aggregate-functions/combinators) in ClickHouse don't work with the Dynamic type directly. For example, attempting to use the `sum` function on a Dynamic field raises the following error:

```sql
SELECT sum(_full_document.shipping.cost) AS shipping_cost FROM t1;
-- DB::Exception: Illegal type Dynamic of argument for aggregate function sum. (ILLEGAL_TYPE_OF_ARGUMENT)
```

To use aggregation functions, cast the field to the appropriate type with the `CAST` function or `::` syntax:

```sql
SELECT sum(_full_document.shipping.cost::Float32) AS shipping_cost FROM t1;
```

Result:
```shell
┌─shipping_cost─┐
│         19.99 │
└───────────────┘
```

:::note
Casting from the Dynamic type to the underlying data type (determined by `dynamicType`) is very performant, as ClickHouse already stores the value in its underlying type internally.
:::

## Flattening JSON {#flattening-json}

### Normal view {#normal-view}

You can create normal views on top of the JSON table to encapsulate flattening/casting/transformation logic, so you can query the data much like a relational table. Normal views are lightweight, as they only store the query itself, not the underlying data. For example:

```sql
CREATE VIEW v1 AS
SELECT
    CAST(_full_document._id, 'String') AS object_id,
    CAST(_full_document.order_id, 'String') AS order_id,
    CAST(_full_document.customer_id, 'Int64') AS customer_id,
    CAST(_full_document.status, 'String') AS status,
    CAST(_full_document.total_amount, 'Decimal64(2)') AS total_amount,
    CAST(parseDateTime64BestEffortOrNull(_full_document.order_date, 3), 'DATETIME(3)') AS order_date,
    CAST(_full_document.shipping.method, 'String') AS shipping_method,
    CAST(_full_document.shipping.city, 'String') AS shipping_city,
    CAST(_full_document.shipping.cost, 'Decimal64(2)') AS shipping_cost,
    _full_document.items AS items
FROM t1 FINAL
WHERE _peerdb_is_deleted = 0;
```

This view will have the following schema:

```shell
┌─name────────────┬─type───────────┐
│ object_id       │ String         │
│ order_id        │ String         │
│ customer_id     │ Int64          │
│ status          │ String         │
│ total_amount    │ Decimal(18, 2) │
│ order_date      │ DateTime64(3)  │
│ shipping_method │ String         │
│ shipping_city   │ String         │
│ shipping_cost   │ Decimal(18, 2) │
│ items           │ Dynamic        │
└─────────────────┴────────────────┘
```

You can now query the view just as you would a flattened table:

```sql
SELECT
    customer_id,
    sum(total_amount)
FROM v1
WHERE shipping_city = 'Seattle'
GROUP BY customer_id
ORDER BY customer_id DESC
LIMIT 10;
```

### Refreshable materialized view {#refreshable-materialized-view}

You can also create [refreshable materialized views](https://clickhouse.com/docs/materialized-view/refreshable-materialized-view), which let you schedule a query that deduplicates rows and stores the results in a flattened destination table. With each scheduled refresh, the destination table is replaced with the latest query results.

The key advantage of this method is that the query using the `FINAL` keyword runs only once during the refresh, eliminating the need for subsequent queries on the destination table to use `FINAL`.

The drawback is that the data in the destination table is only as up-to-date as the most recent refresh. For many use cases, refresh intervals ranging from several minutes to a few hours provide a good balance between data freshness and query performance.

```sql
CREATE TABLE flattened_t1 (
    `_id` String,
    `order_id` String,
    `customer_id` Int64,
    `status` String,
    `total_amount` Decimal(18, 2),
    `order_date` DateTime64(3),
    `shipping_method` String,
    `shipping_city` String,
    `shipping_cost` Decimal(18, 2),
    `items` Dynamic
)
ENGINE = ReplacingMergeTree()
PRIMARY KEY _id
ORDER BY _id;

CREATE MATERIALIZED VIEW mv1 REFRESH EVERY 1 HOUR TO flattened_t1 AS
SELECT
    CAST(_full_document._id, 'String') AS _id,
    CAST(_full_document.order_id, 'String') AS order_id,
    CAST(_full_document.customer_id, 'Int64') AS customer_id,
    CAST(_full_document.status, 'String') AS status,
    CAST(_full_document.total_amount, 'Decimal64(2)') AS total_amount,
    CAST(parseDateTime64BestEffortOrNull(_full_document.order_date, 3), 'DATETIME(3)') AS order_date,
    CAST(_full_document.shipping.method, 'String') AS shipping_method,
    CAST(_full_document.shipping.city, 'String') AS shipping_city,
    CAST(_full_document.shipping.cost, 'Decimal64(2)') AS shipping_cost,
    _full_document.items AS items
FROM t1 FINAL
WHERE _peerdb_is_deleted = 0;
```
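
To check when the materialized view last refreshed and when the next refresh is due, you can query the `system.view_refreshes` system table (a quick sketch; the exact set of columns varies across ClickHouse versions):

```sql
SELECT view, status, last_success_time, next_refresh_time
FROM system.view_refreshes
WHERE view = 'mv1';
```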

You can now query the table `flattened_t1` directly without the `FINAL` modifier:

```sql
SELECT
    customer_id,
    sum(total_amount)
FROM flattened_t1
WHERE shipping_city = 'Seattle'
GROUP BY customer_id
ORDER BY customer_id DESC
LIMIT 10;
```

sidebars.js (+1 -0)

```diff
@@ -810,6 +810,7 @@ const sidebars = {
       items: [
         "integrations/data-ingestion/clickpipes/mongodb/index",
         "integrations/data-ingestion/clickpipes/mongodb/datatypes",
+        "integrations/data-ingestion/clickpipes/mongodb/quickstart",
         {
           type: "category",
           label: "Operations",
```
