---
title: 'Working with JSON in ClickHouse'
sidebar_label: 'Working with JSON'
slug: /integrations/clickpipes/mongodb/quickstart
description: 'Common patterns for working with JSON data replicated from MongoDB to ClickHouse via ClickPipes'
---

# Working with JSON in ClickHouse

This guide provides common patterns for working with JSON data replicated from MongoDB to ClickHouse via ClickPipes.

Suppose we created a collection `t1` in MongoDB to track customer orders:

```javascript
db.t1.insertOne({
  "order_id": "ORD-001234",
  "customer_id": 98765,
  "status": "completed",
  "total_amount": 299.97,
  "order_date": new Date(),
  "shipping": {
    "method": "express",
    "city": "Seattle",
    "cost": 19.99
  },
  "items": [
    {
      "category": "electronics",
      "price": 149.99
    },
    {
      "category": "accessories",
      "price": 24.99
    }
  ]
})
```

The MongoDB CDC connector replicates MongoDB documents to ClickHouse using the native JSON data type. The replicated table `t1` in ClickHouse contains the following row:

```shell
Row 1:
──────
_id: "68a4df4b9fe6c73b541703b0"
_full_document: {"_id":"68a4df4b9fe6c73b541703b0","customer_id":98765,"items":[{"category":"electronics","price":149.99},{"category":"accessories","price":24.99}],"order_date":"2025-08-19T20:32:11.705Z","order_id":"ORD-001234","shipping":{"city":"Seattle","cost":19.99,"method":"express"},"status":"completed","total_amount":299.97}
_peerdb_synced_at: 2025-08-19 20:50:42.005000000
_peerdb_is_deleted: 0
_peerdb_version: 0
```

## Table schema {#table-schema}

The replicated tables use this standard schema:

```shell
┌─name───────────────┬─type──────────┐
│ _id                │ String        │
│ _full_document     │ JSON          │
│ _peerdb_synced_at  │ DateTime64(9) │
│ _peerdb_version    │ Int64         │
│ _peerdb_is_deleted │ Int8          │
└────────────────────┴───────────────┘
```

- `_id`: Primary key from MongoDB
- `_full_document`: MongoDB document replicated as JSON data type
- `_peerdb_synced_at`: Records when the row was last synced
- `_peerdb_version`: Tracks the version of the row; incremented when the row is updated or deleted
- `_peerdb_is_deleted`: Marks whether the row is deleted

### ReplacingMergeTree table engine {#replacingmergetree-table-engine}

ClickPipes maps MongoDB collections into ClickHouse using the `ReplacingMergeTree` table engine family. With this engine, updates are modeled as inserts with a newer version (`_peerdb_version`) of the document for a given primary key (`_id`), enabling efficient handling of updates, replaces, and deletes as versioned inserts.

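For reference, the destination table that ClickPipes creates looks roughly like the sketch below. This DDL is illustrative rather than exact; in particular, using `_peerdb_version` as the `ReplacingMergeTree` version column is an assumption based on the update behavior described above:

```sql
-- Illustrative sketch, not the exact DDL generated by ClickPipes.
-- _peerdb_version is assumed to be the version column, so during background
-- merges the row with the highest version wins for each _id.
CREATE TABLE t1
(
    `_id` String,
    `_full_document` JSON,
    `_peerdb_synced_at` DateTime64(9),
    `_peerdb_version` Int64,
    `_peerdb_is_deleted` Int8
)
ENGINE = ReplacingMergeTree(_peerdb_version)
PRIMARY KEY _id
ORDER BY _id;
```
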
`ReplacingMergeTree` removes duplicates asynchronously in the background. To guarantee that a query sees only the latest version of each row, use the [`FINAL` modifier](https://clickhouse.com/docs/sql-reference/statements/select/from#final-modifier). For example:

```sql
SELECT * FROM t1 FINAL;
```

### Handling deletes {#handling-deletes}

Deletes from MongoDB are propagated as new rows marked as deleted using the `_peerdb_is_deleted` column. You typically want to filter these out in your queries:

```sql
SELECT * FROM t1 FINAL WHERE _peerdb_is_deleted = 0;
```

You can also create a row-level policy to automatically filter out deleted rows instead of specifying the filter in each query:

```sql
CREATE ROW POLICY policy_name ON t1
FOR SELECT USING _peerdb_is_deleted = 0;
```
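With the policy in place, the deleted-row filter is applied implicitly, so queries on the table no longer need the explicit `WHERE` clause:

```sql
-- The row policy automatically restricts SELECTs on t1 to rows where
-- _peerdb_is_deleted = 0, so this returns only live rows.
SELECT * FROM t1 FINAL;
```
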

## Querying JSON data {#querying-json-data}

You can directly query JSON fields using dot syntax:

```sql
SELECT
    _full_document.order_id,
    _full_document.shipping.method
FROM t1;
```

Result:
```shell
┌─_full_document.order_id─┬─_full_document.shipping.method─┐
│ ORD-001234              │ express                        │
└─────────────────────────┴────────────────────────────────┘
```


### Dynamic type {#dynamic-type}

In ClickHouse, each field in a JSON column has the `Dynamic` type, which allows ClickHouse to store values of any type without knowing the type in advance. You can verify this with the `toTypeName` function:

```sql
SELECT toTypeName(_full_document.customer_id) AS type FROM t1;
```

Result:
```shell
┌─type────┐
│ Dynamic │
└─────────┘
```

To examine the underlying data type(s) of a field, use the `dynamicType` function. Note that the same field name can hold different data types in different rows:

```sql
SELECT dynamicType(_full_document.customer_id) AS type FROM t1;
```

Result:
```shell
┌─type──┐
│ Int64 │
└───────┘
```
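
To see how the same field can carry different underlying types across rows, here is a small self-contained sketch; the table `dyn_demo` is hypothetical and used only for illustration:

```sql
-- Hypothetical demo table, separate from the replicated schema.
CREATE TABLE dyn_demo (doc JSON) ENGINE = Memory;

-- One document stores customer_id as a number, the other as a string.
INSERT INTO dyn_demo VALUES ('{"customer_id": 98765}'), ('{"customer_id": "98765"}');

-- dynamicType reports Int64 for the first row and String for the second.
SELECT dynamicType(doc.customer_id) AS type FROM dyn_demo;
```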

[Regular functions](https://clickhouse.com/docs/sql-reference/functions/regular-functions) work for dynamic type just like they do for regular columns:

**Example 1: Date parsing**
```sql
SELECT parseDateTimeBestEffortOrNull(_full_document.order_date) AS order_date FROM t1;
```

Result:
```shell
┌─order_date──────────┐
│ 2025-08-19 20:32:11 │
└─────────────────────┘
```

**Example 2: Conditional logic**
```sql
SELECT multiIf(
    _full_document.total_amount < 100, 'less_than_100',
    _full_document.total_amount < 1000, 'less_than_1000',
    '1000+') AS spendings
FROM t1;
```

Result:
```shell
┌─spendings──────┐
│ less_than_1000 │
└────────────────┘
```

**Example 3: Array operations**
```sql
SELECT length(_full_document.items) AS item_count FROM t1;
```

Result:
```shell
┌─item_count─┐
│          2 │
└────────────┘
```
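
To drill into the array of item objects, recent ClickHouse versions expose a `[]` subcolumn syntax on the JSON type that returns an array of values for a nested path. A sketch, assuming that syntax is available in your version:

```sql
-- Assumes JSON array subcolumn syntax: items[].price reads the price field
-- from every element of the items array, and arrayJoin unrolls the resulting
-- array into one row per item.
SELECT arrayJoin(_full_document.items[].price) AS item_price
FROM t1;
```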

### Field casting {#field-casting}

[Aggregate functions](https://clickhouse.com/docs/sql-reference/aggregate-functions/combinators) in ClickHouse don't work with the dynamic type directly. For example, attempting to use the `sum` function directly on a dynamic-typed field produces the following error:

```sql
SELECT sum(_full_document.shipping.cost) AS shipping_cost FROM t1;
-- DB::Exception: Illegal type Dynamic of argument for aggregate function sum. (ILLEGAL_TYPE_OF_ARGUMENT)
```

To use aggregate functions, cast the field to the appropriate type with the `CAST` function or `::` syntax:

```sql
SELECT sum(_full_document.shipping.cost::Float32) AS shipping_cost FROM t1;
```

Result:
```shell
┌─shipping_cost─┐
│         19.99 │
└───────────────┘
```

:::note
Casting from dynamic type to the underlying data type (determined by `dynamicType`) is very performant, as ClickHouse already stores the value in its underlying type internally.
:::

## Flattening JSON {#flattening-json}

### Normal view {#normal-view}

You can create normal views on top of the JSON table to encapsulate the flattening, casting, and transformation logic, letting you query the data as you would a relational table. Normal views are lightweight, as they store only the query itself, not the underlying data. For example:

```sql
CREATE VIEW v1 AS
SELECT
    CAST(_full_document._id, 'String') AS object_id,
    CAST(_full_document.order_id, 'String') AS order_id,
    CAST(_full_document.customer_id, 'Int64') AS customer_id,
    CAST(_full_document.status, 'String') AS status,
    CAST(_full_document.total_amount, 'Decimal64(2)') AS total_amount,
    CAST(parseDateTime64BestEffortOrNull(_full_document.order_date, 3), 'DateTime64(3)') AS order_date,
    CAST(_full_document.shipping.method, 'String') AS shipping_method,
    CAST(_full_document.shipping.city, 'String') AS shipping_city,
    CAST(_full_document.shipping.cost, 'Decimal64(2)') AS shipping_cost,
    _full_document.items AS items
FROM t1 FINAL
WHERE _peerdb_is_deleted = 0;
```

This view will have the following schema:

```shell
┌─name────────────┬─type───────────┐
│ object_id       │ String         │
│ order_id        │ String         │
│ customer_id     │ Int64          │
│ status          │ String         │
│ total_amount    │ Decimal(18, 2) │
│ order_date      │ DateTime64(3)  │
│ shipping_method │ String         │
│ shipping_city   │ String         │
│ shipping_cost   │ Decimal(18, 2) │
│ items           │ Dynamic        │
└─────────────────┴────────────────┘
```

You can now query the view similar to how you would query a flattened table:

```sql
SELECT
    customer_id,
    sum(total_amount)
FROM v1
WHERE shipping_city = 'Seattle'
GROUP BY customer_id
ORDER BY customer_id DESC
LIMIT 10;
```

### Refreshable materialized view {#refreshable-materialized-view}

You can also create [refreshable materialized views](https://clickhouse.com/docs/materialized-view/refreshable-materialized-view), which enable you to schedule query execution for deduplicating rows and storing the results in a flattened destination table. With each scheduled refresh, the destination table is replaced with the latest query results.

The key advantage of this method is that the query using the `FINAL` keyword runs only once during the refresh, eliminating the need for subsequent queries on the destination table to use `FINAL`.

However, a drawback is that the data in the destination table is only as up-to-date as the most recent refresh. For many use cases, refresh intervals ranging from several minutes to a few hours provide a good balance between data freshness and query performance.

```sql
CREATE TABLE flattened_t1 (
    `_id` String,
    `order_id` String,
    `customer_id` Int64,
    `status` String,
    `total_amount` Decimal(18, 2),
    `order_date` DateTime64(3),
    `shipping_method` String,
    `shipping_city` String,
    `shipping_cost` Decimal(18, 2),
    `items` Dynamic
)
ENGINE = ReplacingMergeTree()
PRIMARY KEY _id
ORDER BY _id;

CREATE MATERIALIZED VIEW mv1 REFRESH EVERY 1 HOUR TO flattened_t1 AS
SELECT
    CAST(_full_document._id, 'String') AS _id,
    CAST(_full_document.order_id, 'String') AS order_id,
    CAST(_full_document.customer_id, 'Int64') AS customer_id,
    CAST(_full_document.status, 'String') AS status,
    CAST(_full_document.total_amount, 'Decimal64(2)') AS total_amount,
    CAST(parseDateTime64BestEffortOrNull(_full_document.order_date, 3), 'DateTime64(3)') AS order_date,
    CAST(_full_document.shipping.method, 'String') AS shipping_method,
    CAST(_full_document.shipping.city, 'String') AS shipping_city,
    CAST(_full_document.shipping.cost, 'Decimal64(2)') AS shipping_cost,
    _full_document.items AS items
FROM t1 FINAL
WHERE _peerdb_is_deleted = 0;
```

You can now query the table `flattened_t1` directly without the `FINAL` modifier:

```sql
SELECT
    customer_id,
    sum(total_amount)
FROM flattened_t1
WHERE shipping_city = 'Seattle'
GROUP BY customer_id
ORDER BY customer_id DESC
LIMIT 10;
```