
Commit 0dc1521

Merge pull request #4399 from ClickHouse/faq
Update documentation for MongoDB clickpipes
2 parents f162940 + 973941a commit 0dc1521

File tree

6 files changed: +160 -39 lines changed

docs/integrations/data-ingestion/clickpipes/mongodb/faq.md

Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
---
sidebar_label: 'FAQ'
description: 'Frequently asked questions about ClickPipes for MongoDB.'
slug: /integrations/clickpipes/mongodb/faq
sidebar_position: 2
title: 'ClickPipes for MongoDB FAQ'
---

# ClickPipes for MongoDB FAQ
### Can I query for individual fields in the JSON datatype? {#can-i-query-for-individual-fields-in-the-json-datatype}

For direct field access, such as `{"user_id": 123}`, you can use **dot notation**:

```sql
SELECT doc.user_id AS user_id FROM your_table;
```

For direct field access of nested object fields, such as `{"address": { "city": "San Francisco", "state": "CA" }}`, use the `^` operator:

```sql
SELECT doc.^address.city AS city FROM your_table;
```

For aggregations, cast the field to the appropriate type with the `CAST` function or `::` syntax:

```sql
SELECT sum(doc.shipping.cost::Float32) AS total_shipping_cost FROM t1;
```

To learn more about working with JSON, see our [Working with JSON guide](./quickstart).
### How do I flatten the nested MongoDB documents in ClickHouse? {#how-do-i-flatten-the-nested-mongodb-documents-in-clickhouse}

MongoDB documents are replicated as the JSON type in ClickHouse by default, preserving their nested structure. If you want to flatten this data into columns, you have several options:

1. **Normal Views**: Use normal views to encapsulate flattening logic.
2. **Materialized Views**: For smaller datasets, you can use refreshable materialized views with the [`FINAL` modifier](/sql-reference/statements/select/from#final-modifier) to periodically flatten and deduplicate data. For larger datasets, we recommend using incremental materialized views without `FINAL` to flatten the data in real time, and then deduplicating at query time.
3. **Query-time Access**: Instead of flattening, use dot notation to access nested fields directly in queries.

A minimal sketch of option 1 follows; for detailed examples, see our [Working with JSON guide](./quickstart).
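The sketch below assumes a replicated table `t1` with a JSON column `doc` and the standard `_peerdb_*` metadata columns, as described in the quickstart; the view name `orders_flat` and the selected fields are illustrative:

```sql
-- Encapsulate flattening logic in a normal view; FINAL deduplicates versioned rows
CREATE VIEW orders_flat AS
SELECT
    CAST(doc._id, 'String') AS _id,
    CAST(doc.order_id, 'String') AS order_id,
    CAST(doc.total_amount, 'Decimal64(2)') AS total_amount
FROM t1 FINAL
WHERE _peerdb_is_deleted = 0;
```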
### Can I connect MongoDB databases that don't have a public IP or are in private networks? {#can-i-connect-mongodb-databases-that-dont-have-a-public-ip-or-are-in-private-networks}

We support AWS PrivateLink for connecting to MongoDB databases that don't have a public IP or are in private networks. Azure Private Link and GCP Private Service Connect are currently not supported.
### What happens if I delete a database/table from my MongoDB database? {#what-happens-if-i-delete-a-database-table-from-my-mongodb-database}

When you delete a database/table from MongoDB, ClickPipes will continue running, but the dropped database/table will stop replicating changes. The corresponding tables in ClickHouse are preserved.
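If a preserved table is no longer needed, you can drop it manually in ClickHouse; `your_table` here is a placeholder name:

```sql
-- Remove the orphaned destination table in ClickHouse
DROP TABLE IF EXISTS your_table;
```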
### How does the MongoDB CDC Connector handle transactions? {#how-does-mongodb-cdc-connector-handle-transactions}

Each document change within a transaction is processed individually. Changes are applied in the order they appear in the oplog, and only committed changes are replicated to ClickHouse. If a MongoDB transaction is rolled back, those changes won't appear in the change stream.

For more examples, see our [Working with JSON guide](./quickstart).
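Because every committed change lands in ClickHouse as a versioned insert, you can observe this ordering directly (a sketch against the standard schema described in the quickstart):

```sql
-- Each committed change appears as a new row with a higher _peerdb_version
SELECT _id, _peerdb_version, _peerdb_synced_at
FROM t1
ORDER BY _id, _peerdb_version;
```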
### How do I handle the `resume of change stream was not possible, as the resume point may no longer be in the oplog.` error? {#resume-point-may-no-longer-be-in-the-oplog-error}

This error typically occurs when the oplog is truncated and the ClickPipe is unable to resume the change stream at the expected point. To resolve this issue, [resync the ClickPipe](./resync.md). To prevent this issue from recurring, we recommend [increasing the oplog retention period](./source/atlas#enable-oplog-retention) (or [here](./source/generic#enable-oplog-retention) if you are on self-managed MongoDB).
### How is replication managed? {#how-is-replication-managed}

We use MongoDB's native Change Streams API to track changes in the database. The Change Streams API provides a resumable stream of database changes by leveraging MongoDB's oplog (operations log). The ClickPipe uses MongoDB's resume tokens to track its position in the oplog and ensure every change is replicated to ClickHouse.
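To check how fresh the replicated data is, you can query the sync metadata that ClickPipes writes to each row (a sketch; `t1` stands in for any replicated table):

```sql
-- _peerdb_synced_at records when each row was last synced
SELECT max(_peerdb_synced_at) AS last_synced_at FROM t1;
```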
### Which read preference should I use? {#which-read-preference-should-i-use}

The right read preference depends on your specific use case. If you want to minimize the load on your primary node, we recommend the `secondaryPreferred` read preference. If you want to optimize ingestion latency, we recommend the `primaryPreferred` read preference. For more details, see the [MongoDB documentation](https://www.mongodb.com/docs/manual/core/read-preference/#read-preference-modes-1).
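For reference, the read preference can be set through the MongoDB connection string; the host and credentials below are hypothetical:

```
mongodb+srv://user:password@cluster0.example.net/?readPreference=secondaryPreferred
```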
### Does the MongoDB ClickPipe support Sharded Cluster? {#does-the-mongodb-clickpipe-support-sharded-cluster}

Yes, the MongoDB ClickPipe supports both Replica Set and Sharded Cluster deployments.

docs/integrations/data-ingestion/clickpipes/mongodb/index.md

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ import Image from '@theme/IdealImage';
 <BetaBadge/>

 :::info
-Currently, ingesting data from MongoDB to ClickHouse Cloud via ClickPipes is in Private Preview.
+Ingesting data from MongoDB to ClickHouse Cloud via ClickPipes is in public beta.
 :::

 :::note

docs/integrations/data-ingestion/clickpipes/mongodb/quickstart.md

Lines changed: 91 additions & 37 deletions
@@ -42,7 +42,7 @@ MongoDB CDC Connector replicates MongoDB documents to ClickHouse using the nativ
 Row 1:
 ──────
 _id: "68a4df4b9fe6c73b541703b0"
-_full_document: {"_id":"68a4df4b9fe6c73b541703b0","customer_id":"98765","items":[{"category":"electronics","price":149.99},{"category":"accessories","price":24.99}],"order_date":"2025-08-19T20:32:11.705Z","order_id":"ORD-001234","shipping":{"city":"Seattle","cost":19.99,"method":"express"},"status":"completed","total_amount":299.97}
+doc: {"_id":"68a4df4b9fe6c73b541703b0","customer_id":"98765","items":[{"category":"electronics","price":149.99},{"category":"accessories","price":24.99}],"order_date":"2025-08-19T20:32:11.705Z","order_id":"ORD-001234","shipping":{"city":"Seattle","cost":19.99,"method":"express"},"status":"completed","total_amount":299.97}
 _peerdb_synced_at: 2025-08-19 20:50:42.005000000
 _peerdb_is_deleted: 0
 _peerdb_version: 0
@@ -55,15 +55,15 @@ The replicated tables use this standard schema:
 ```shell
 ┌─name───────────────┬─type──────────┐
 │ _id                │ String        │
-│ _full_document     │ JSON          │
+│ doc                │ JSON          │
 │ _peerdb_synced_at  │ DateTime64(9) │
 │ _peerdb_version    │ Int64         │
 │ _peerdb_is_deleted │ Int8          │
 └────────────────────┴───────────────┘
 ```

 - `_id`: Primary key from MongoDB
-- `_full_document`: MongoDB document replicated as JSON data type
+- `doc`: MongoDB document replicated as JSON data type
 - `_peerdb_synced_at`: Records when the row was last synced
 - `_peerdb_version`: Tracks the version of the row; incremented when the row is updated or deleted
 - `_peerdb_is_deleted`: Marks whether the row is deleted
@@ -72,7 +72,7 @@ The replicated tables use this standard schema:

 ClickPipes maps MongoDB collections into ClickHouse using the `ReplacingMergeTree` table engine family. With this engine, updates are modeled as inserts with a newer version (`_peerdb_version`) of the document for a given primary key (`_id`), enabling efficient handling of updates, replaces, and deletes as versioned inserts.

-`ReplacingMergeTree` clears out duplicates asynchronously in the background. To guarantee the absence of duplicates for the same row, use the [`FINAL` modifier](https://clickhouse.com/docs/sql-reference/statements/select/from#final-modifier). For example:
+`ReplacingMergeTree` clears out duplicates asynchronously in the background. To guarantee the absence of duplicates for the same row, use the [`FINAL` modifier](/sql-reference/statements/select/from#final-modifier). For example:

 ```sql
 SELECT * FROM t1 FINAL;
@@ -99,21 +99,21 @@ You can directly query JSON fields using dot syntax:

 ```sql title="Query"
 SELECT
-    _full_document.order_id,
-    _full_document.shipping.method
+    doc.order_id,
+    doc.shipping.method
 FROM t1;
 ```

 ```shell title="Result"
-┌─_full_document.order_id─┬─_full_document.shipping.method─┐
-│ ORD-001234              │ express                        │
-└─────────────────────────┴────────────────────────────────┘
+┌─doc.order_id─┬─doc.shipping.method─┐
+│ ORD-001234   │ express             │
+└──────────────┴─────────────────────┘
 ```

 When querying _nested object fields_ using dot syntax, make sure to add the [`^`](https://clickhouse.com/docs/sql-reference/data-types/newjson#reading-json-sub-objects-as-sub-columns) operator:

 ```sql title="Query"
-SELECT _full_document.^shipping as shipping_info FROM t1;
+SELECT doc.^shipping as shipping_info FROM t1;
 ```

 ```shell title="Result"
@@ -127,7 +127,7 @@ SELECT _full_document.^shipping as shipping_info FROM t1;
 In ClickHouse, each field in JSON has `Dynamic` type. Dynamic type allows ClickHouse to store values of any type without knowing the type in advance. You can verify this with the `toTypeName` function:

 ```sql title="Query"
-SELECT toTypeName(_full_document.customer_id) AS type FROM t1;
+SELECT toTypeName(doc.customer_id) AS type FROM t1;
 ```

 ```shell title="Result"
@@ -139,7 +139,7 @@ SELECT toTypeName(_full_document.customer_id) AS type FROM t1;
 To examine the underlying data type(s) for a field, you can check with the `dynamicType` function. Note that it's possible to have different data types for the same field name in different rows:

 ```sql title="Query"
-SELECT dynamicType(_full_document.customer_id) AS type FROM t1;
+SELECT dynamicType(doc.customer_id) AS type FROM t1;
 ```

 ```shell title="Result"
@@ -153,7 +153,7 @@ SELECT dynamicType(_full_document.customer_id) AS type FROM t1;
 **Example 1: Date parsing**

 ```sql title="Query"
-SELECT parseDateTimeBestEffortOrNull(_full_document.order_date) AS order_date FROM t1;
+SELECT parseDateTimeBestEffortOrNull(doc.order_date) AS order_date FROM t1;
 ```

 ```shell title="Result"
@@ -166,8 +166,8 @@ SELECT parseDateTimeBestEffortOrNull(_full_document.order_date) AS order_date FR

 ```sql title="Query"
 SELECT multiIf(
-    _full_document.total_amount < 100, 'less_than_100',
-    _full_document.total_amount < 1000, 'less_than_1000',
+    doc.total_amount < 100, 'less_than_100',
+    doc.total_amount < 1000, 'less_than_1000',
     '1000+') AS spendings
 FROM t1;
 ```
@@ -181,7 +181,7 @@ FROM t1;
 **Example 3: Array operations**

 ```sql title="Query"
-SELECT length(_full_document.items) AS item_count FROM t1;
+SELECT length(doc.items) AS item_count FROM t1;
 ```

 ```shell title="Result"
@@ -195,14 +195,14 @@ SELECT length(_full_document.items) AS item_count FROM t1;
 [Aggregation functions](https://clickhouse.com/docs/sql-reference/aggregate-functions/combinators) in ClickHouse don't work with dynamic type directly. For example, if you attempt to directly use the `sum` function on a dynamic type, you get the following error:

 ```sql
-SELECT sum(_full_document.shipping.cost) AS shipping_cost FROM t1;
+SELECT sum(doc.shipping.cost) AS shipping_cost FROM t1;
 -- DB::Exception: Illegal type Dynamic of argument for aggregate function sum. (ILLEGAL_TYPE_OF_ARGUMENT)
 ```

 To use aggregation functions, cast the field to the appropriate type with the `CAST` function or `::` syntax:

 ```sql title="Query"
-SELECT sum(_full_document.shipping.cost::Float32) AS shipping_cost FROM t1;
+SELECT sum(doc.shipping.cost::Float32) AS shipping_cost FROM t1;
 ```

 ```shell title="Result"
@@ -224,14 +224,14 @@ You can create normal views on top of the JSON table to encapsulate flattening/c
 ```sql
 CREATE VIEW v1 AS
 SELECT
-    CAST(_full_document._id, 'String') AS object_id,
-    CAST(_full_document.order_id, 'String') AS order_id,
-    CAST(_full_document.customer_id, 'Int64') AS customer_id,
-    CAST(_full_document.status, 'String') AS status,
-    CAST(_full_document.total_amount, 'Decimal64(2)') AS total_amount,
-    CAST(parseDateTime64BestEffortOrNull(_full_document.order_date, 3), 'DATETIME(3)') AS order_date,
-    _full_document.^shipping AS shipping_info,
-    _full_document.items AS items
+    CAST(doc._id, 'String') AS object_id,
+    CAST(doc.order_id, 'String') AS order_id,
+    CAST(doc.customer_id, 'Int64') AS customer_id,
+    CAST(doc.status, 'String') AS status,
+    CAST(doc.total_amount, 'Decimal64(2)') AS total_amount,
+    CAST(parseDateTime64BestEffortOrNull(doc.order_date, 3), 'DATETIME(3)') AS order_date,
+    doc.^shipping AS shipping_info,
+    doc.items AS items
 FROM t1 FINAL
 WHERE _peerdb_is_deleted = 0;
 ```
@@ -266,11 +266,11 @@ LIMIT 10;

 ### Refreshable materialized view {#refreshable-materialized-view}

-You can also create [Refreshable Materialized Views](https://clickhouse.com/docs/materialized-view/refreshable-materialized-view), which enable you to schedule query execution for deduplicating rows and storing the results in a flattened destination table. With each scheduled refresh, the destination table is replaced with the latest query results.
+You can create [Refreshable Materialized Views](https://clickhouse.com/docs/materialized-view/refreshable-materialized-view), which enable you to schedule query execution for deduplicating rows and storing the results in a flattened destination table. With each scheduled refresh, the destination table is replaced with the latest query results.

 The key advantage of this method is that the query using the `FINAL` keyword runs only once during the refresh, eliminating the need for subsequent queries on the destination table to use `FINAL`.

-However, a drawback is that the data in the destination table is only as up-to-date as the most recent refresh. For many use cases, refresh intervals ranging from several minutes to a few hours provide a good balance between data freshness and query performance.
+A drawback is that the data in the destination table is only as up-to-date as the most recent refresh. For many use cases, refresh intervals ranging from several minutes to a few hours provide a good balance between data freshness and query performance.

 ```sql
 CREATE TABLE flattened_t1 (
@@ -287,16 +287,16 @@ ENGINE = ReplacingMergeTree()
 PRIMARY KEY _id
 ORDER BY _id;

-CREATE MATERIALIZED VIEW mv1 REFRESH EVERY 1 HOUR TO flattened_t1 AS
+CREATE MATERIALIZED VIEW rmv REFRESH EVERY 1 HOUR TO flattened_t1 AS
 SELECT
-    CAST(_full_document._id, 'String') AS _id,
-    CAST(_full_document.order_id, 'String') AS order_id,
-    CAST(_full_document.customer_id, 'Int64') AS customer_id,
-    CAST(_full_document.status, 'String') AS status,
-    CAST(_full_document.total_amount, 'Decimal64(2)') AS total_amount,
-    CAST(parseDateTime64BestEffortOrNull(_full_document.order_date, 3), 'DATETIME(3)') AS order_date,
-    _full_document.^shipping AS shipping_info,
-    _full_document.items AS items
+    CAST(doc._id, 'String') AS _id,
+    CAST(doc.order_id, 'String') AS order_id,
+    CAST(doc.customer_id, 'Int64') AS customer_id,
+    CAST(doc.status, 'String') AS status,
+    CAST(doc.total_amount, 'Decimal64(2)') AS total_amount,
+    CAST(parseDateTime64BestEffortOrNull(doc.order_date, 3), 'DATETIME(3)') AS order_date,
+    doc.^shipping AS shipping_info,
+    doc.items AS items
 FROM t1 FINAL
 WHERE _peerdb_is_deleted = 0;
 ```
@@ -313,3 +313,57 @@ GROUP BY customer_id
 ORDER BY customer_id DESC
 LIMIT 10;
 ```
+
+### Incremental materialized view {#incremental-materialized-view}
+
+If you want to access flattened columns in real time, you can create [Incremental Materialized Views](https://clickhouse.com/docs/materialized-view/incremental-materialized-view). If your table has frequent updates, it's not recommended to use the `FINAL` modifier in your materialized view, as every update will trigger a merge. Instead, you can deduplicate the data at query time by building a normal view on top of the materialized view.
+
+```sql
+CREATE TABLE flattened_t1 (
+    `_id` String,
+    `order_id` String,
+    `customer_id` Int64,
+    `status` String,
+    `total_amount` Decimal(18, 2),
+    `order_date` DateTime64(3),
+    `shipping_info` JSON,
+    `items` Dynamic,
+    `_peerdb_version` Int64,
+    `_peerdb_synced_at` DateTime64(9),
+    `_peerdb_is_deleted` Int8
+)
+ENGINE = ReplacingMergeTree()
+PRIMARY KEY _id
+ORDER BY _id;
+
+CREATE MATERIALIZED VIEW imv TO flattened_t1 AS
+SELECT
+    CAST(doc._id, 'String') AS _id,
+    CAST(doc.order_id, 'String') AS order_id,
+    CAST(doc.customer_id, 'Int64') AS customer_id,
+    CAST(doc.status, 'String') AS status,
+    CAST(doc.total_amount, 'Decimal64(2)') AS total_amount,
+    CAST(parseDateTime64BestEffortOrNull(doc.order_date, 3), 'DATETIME(3)') AS order_date,
+    doc.^shipping AS shipping_info,
+    doc.items,
+    _peerdb_version,
+    _peerdb_synced_at,
+    _peerdb_is_deleted
+FROM t1;
+
+CREATE VIEW flattened_t1_final AS
+SELECT * FROM flattened_t1 FINAL WHERE _peerdb_is_deleted = 0;
+```
+
+You can now query the view `flattened_t1_final` as follows:
+
+```sql
+SELECT
+    customer_id,
+    sum(total_amount)
+FROM flattened_t1_final
+WHERE shipping_info.city = 'Seattle'
+GROUP BY customer_id
+ORDER BY customer_id DESC
+LIMIT 10;
+```

docs/integrations/data-ingestion/clickpipes/mysql/index.md

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ import Image from '@theme/IdealImage';
 <BetaBadge/>

 :::info
-Currently, ingesting data from MySQL to ClickHouse Cloud via ClickPipes is in public beta.
+Ingesting data from MySQL to ClickHouse Cloud via ClickPipes is in public beta.
 :::

 You can use ClickPipes to ingest data from your source MySQL database into ClickHouse Cloud. The source MySQL database can be hosted on-premises or in the cloud using services like Amazon RDS, Google Cloud SQL, and others.

scripts/aspell-dict-file.txt

Lines changed: 2 additions & 0 deletions
@@ -1103,3 +1103,5 @@ explorative
 ReceiveMessage
 --docs/integrations/data-ingestion/clickpipes/postgres/postgres_generated_columns.md--
 RelationMessage
+--docs/integrations/data-ingestion/clickpipes/mongodb/faq.md--
+resumable

sidebars.js

Lines changed: 1 addition & 0 deletions
@@ -630,6 +630,7 @@ const sidebars = {
       "integrations/data-ingestion/clickpipes/mongodb/datatypes",
       "integrations/data-ingestion/clickpipes/mongodb/quickstart",
       "integrations/data-ingestion/clickpipes/mongodb/lifecycle",
+      "integrations/data-ingestion/clickpipes/mongodb/faq",
       {
         type: "category",
         label: "Operations",
