---
description: How to backfill large datasets in ClickHouse
keywords: [materialized views, backfilling, inserting data, resilient data load]
---

import nullTableMV from '@site/static/images/data-modeling/null_table_mv.png';

# Backfilling Data
Whether new to ClickHouse or responsible for an existing deployment, users will invariably need to backfill tables with historical data. In some cases, this is relatively simple but can become more complex when materialized views need to be populated. This guide documents some processes for this task that users can apply to their use case.
This guide assumes users are already familiar with the concept of [Incremental Materialized Views](/materialized-view/incremental-materialized-view).

## Example dataset {#example-dataset}
Throughout this guide, we use a PyPI dataset. Each row in this dataset represents a Python package download using a tool such as `pip`.
For our example, the subset covers a single day, `2024-12-17`, and is available publicly at `https://datasets-documentation.s3.eu-west-3.amazonaws.com/pypi/2024-12-17/`. Users can query it with:
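For instance, a minimal sketch using the [s3 table function](/sql-reference/table-functions/s3) - the Parquet format and file glob here are assumptions for illustration:

```sql
-- Count the rows in the public subset. The exact file names in the bucket
-- may differ; adjust the glob accordingly.
SELECT count()
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/pypi/2024-12-17/*.parquet')
```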
The full PyPI dataset, consisting of over 1 trillion rows, is available in our public demo environment, [clickpy.clickhouse.com](https://clickpy.clickhouse.com).

## Backfilling scenarios {#backfilling-scenarios}
Backfilling is typically needed when a stream of data is being consumed from a point in time. This data is being inserted into ClickHouse tables with [incremental materialized views](/materialized-view/incremental-materialized-view), triggering on blocks as they are inserted. These views may transform the data prior to insert or compute aggregates, sending the results to target tables for later use in downstream applications.
We will attempt to cover the following scenarios:
1. **Backfilling data with existing data ingestion** - New data is being loaded, and historical data needs to be backfilled. This historical data has been identified.
2. **Adding materialized views to existing tables** - New materialized views need to be added to a setup for which historical data has been populated and data is already streaming.
We assume data will be backfilled from object storage. In all cases, we aim to avoid pauses in data insertion.
```
Peak memory usage: 682.38 KiB.
```
Suppose we wish to load another subset `{101..200}`. While we could insert directly into `pypi`, we can do this backfill in isolation by creating duplicate tables.
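As a sketch, assuming the subset's files follow a hypothetical `file_{n}.parquet` naming:

```sql
-- Duplicate the main table's schema and engine.
CREATE TABLE pypi_v2 AS pypi;

-- Load the new subset into the duplicate, leaving pypi untouched.
INSERT INTO pypi_v2
SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/pypi/2024-12-17/file_{101..200}.parquet');
```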
Should the backfill fail, we have not impacted our main tables and can simply [truncate](/managing-data/truncate) our duplicate tables and repeat.
Importantly, the `MOVE PARTITION` operation is both lightweight (exploiting hard links) and atomic, i.e. it either fails or succeeds with no intermediate state.
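For example, a sketch assuming `pypi` and `pypi_v2` are unpartitioned (in which case the single partition is addressed with `tuple()`):

```sql
-- Atomically move all parts from the duplicate table into the main table.
-- tuple() is the partition expression for an unpartitioned MergeTree table.
ALTER TABLE pypi_v2 MOVE PARTITION tuple() TO TABLE pypi
```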
We exploit this process heavily in our backfilling scenarios below.
Notice how this process requires users to choose the size of each insert operation.
Larger inserts, i.e. more rows, will mean fewer `MOVE PARTITION` operations are required. However, this must be balanced against the cost of recovery in the event of an insert failure, e.g. due to a network interruption. Users can complement this process by batching files to reduce the risk. This can be performed with either range queries, e.g. `WHERE timestamp BETWEEN '2024-12-17 09:00:00' AND '2024-12-17 10:00:00'`, or glob patterns.
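For example, here is a sketch of batching via glob patterns, again assuming the hypothetical `file_{n}.parquet` naming:

```sql
-- Load in two smaller batches, moving partitions to the main table after
-- each successful insert before starting the next.
INSERT INTO pypi_v2 SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/pypi/2024-12-17/file_{101..150}.parquet');

INSERT INTO pypi_v2 SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/pypi/2024-12-17/file_{151..200}.parquet');
```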
ClickPipes uses this approach when loading data from object storage, automatically creating duplicate tables for both the main table and any materialized views.

## Scenario 1: Backfilling data with existing data ingestion {#scenario-1-backfilling-data-with-existing-data-ingestion}
In this scenario, we assume that the data to backfill is not in an isolated bucket, and thus filtering is required. Data is already being inserted, and a timestamp or monotonically increasing column can be identified from which historical data needs to be backfilled.
This process involves the following steps:
If the historical data is in an isolated bucket, the above time filter is not required. If a time or monotonic column is unavailable, isolate your historical data.
:::note Just use ClickPipes in ClickHouse Cloud
ClickHouse Cloud users should use ClickPipes for restoring historical backups if the data can be isolated in its own bucket (and a filter is not required). As well as parallelizing the load with multiple workers, thus reducing the load time, ClickPipes automates the above process - creating duplicate tables for both the main table and materialized views.
:::
## Scenario 2: Adding materialized views to existing tables {#scenario-2-adding-materialized-views-to-existing-tables}
Our simplest approach involves the following steps:
This can be further enhanced to target subsets of data in step (2) and/or use a duplicate target table for the materialized view (attach partitions to the original once the insert is complete) for easier recovery after failure.
Consider the following materialized view, which computes the most popular projects per hour.
```sql
CREATE TABLE pypi_downloads_per_day
(
    `hour` DateTime,
    `project` String,
    `count` Int64
)
ENGINE = SummingMergeTree
ORDER BY (project, hour)

CREATE MATERIALIZED VIEW pypi_downloads_per_day_mv TO pypi_downloads_per_day
AS SELECT
    toStartOfHour(timestamp) AS hour,
    project, count() AS count
FROM pypi WHERE timestamp >= '2024-12-17 09:00:00'
GROUP BY hour, project
```
Once this view is added, we can backfill all data for the materialized view prior to this date.
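A sketch of this backfill, assuming a direct `INSERT INTO SELECT` against the main table with the inverse time filter:

```sql
-- Compute the aggregation for all rows prior to the view's start time and
-- write the results into the target table.
INSERT INTO pypi_downloads_per_day
SELECT toStartOfHour(timestamp) AS hour, project, count() AS count
FROM pypi
WHERE timestamp < '2024-12-17 09:00:00'
GROUP BY hour, project
```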
In our case, this is a relatively lightweight aggregation that completes in a matter of seconds.
Often a materialized view's query can be more complex (not uncommon, as otherwise users wouldn't use a view!) and consume resources. In rarer cases, the resources required for the query exceed those of the server. This highlights one of the advantages of ClickHouse materialized views - they are incremental and don't process the entire dataset in one go!

In this case, users have several options:
1. Modify your query to backfill ranges, e.g. `WHERE timestamp BETWEEN '2024-12-17 08:00:00' AND '2024-12-17 09:00:00'`, `WHERE timestamp BETWEEN '2024-12-17 07:00:00' AND '2024-12-17 08:00:00'`, etc.
2. Use a [Null table engine](/engines/table-engines/special/null) to fill the materialized view. This replicates the typical incremental population of a materialized view, executing its query over blocks of data (of configurable size).
The [Null table engine](/engines/table-engines/special/null) provides a storage engine that does not persist the data it receives.

Importantly, any materialized views attached to the table engine still execute over blocks of data as it is inserted, sending their results to a target table. These blocks are of a configurable size. While larger blocks can potentially be more efficient (and faster to process), they consume more resources (principally memory). Use of this table engine means we can build our materialized view incrementally, i.e. a block at a time, avoiding the need to hold the entire aggregation in memory.
Here, we create a Null table, `pypi_v2`, to receive the rows that will be used to build our materialized view. Note how we limit the schema to only the columns we need. Our materialized view performs an aggregation over rows inserted into this table (one block at a time), sending the results to our target table, `pypi_downloads_per_day`.
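A sketch, assuming the view needs only `timestamp` and `project` (the view name here is hypothetical):

```sql
-- Rows inserted into a Null table trigger attached views but are not stored.
CREATE TABLE pypi_v2
(
    `timestamp` DateTime,
    `project` String
)
ENGINE = Null;

-- Aggregate each inserted block and send the results to the target table.
CREATE MATERIALIZED VIEW pypi_downloads_per_day_mv_v2 TO pypi_downloads_per_day
AS SELECT
    toStartOfHour(timestamp) AS hour,
    project,
    count() AS count
FROM pypi_v2
GROUP BY hour, project;
```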
:::note
We have used `pypi_downloads_per_day` as our target table here. For additional resiliency, users could create a duplicate table, `pypi_downloads_per_day_v2`, and use this as the target table of the view, as shown in previous examples. On completion of the insert, partitions in `pypi_downloads_per_day_v2` could, in turn, be moved to `pypi_downloads_per_day`. This would allow recovery in the case our insert fails due to memory issues or server interruptions, i.e. we just truncate `pypi_downloads_per_day_v2`, tune settings, and retry.
:::
To populate this materialized view, we simply insert the relevant data to backfill into `pypi_v2` from `pypi`.
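As a sketch (the time filter mirrors the view definition above):

```sql
-- Insert only the columns the Null table defines; each block triggers the view.
INSERT INTO pypi_v2
SELECT timestamp, project
FROM pypi
WHERE timestamp < '2024-12-17 09:00:00'
```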
Notice our memory usage here is `639.47 MiB`.
Several factors will determine the performance and resources used in the above scenario. We recommend readers understand the insert mechanics documented in detail [here](/integrations/s3/performance#using-threads-for-reads) prior to attempting to tune. In summary:
- **Read Parallelism** - The number of threads used to read. Controlled through [`max_threads`](/operations/settings/settings#max_threads). In ClickHouse Cloud this is determined by the instance size, defaulting to the number of vCPUs. Increasing this value may improve read performance at the expense of greater memory usage.
- **Insert Parallelism** - The number of insert threads used to insert. Controlled through [`max_insert_threads`](/operations/settings/settings#max_insert_threads). In ClickHouse Cloud this is determined by the instance size (between 2 and 4) and is set to 1 in OSS. Increasing this value may improve performance at the expense of greater memory usage.
- **Insert Block Size** - Data is processed in a loop where it is pulled, parsed, and formed into in-memory insert blocks based on the [partitioning key](/engines/table-engines/mergetree-family/custom-partitioning-key). These blocks are sorted, optimized, compressed, and written to storage as new [data parts](/parts). The size of the insert block, controlled by the settings [`min_insert_block_size_rows`](/operations/settings/settings#min_insert_block_size_rows) and [`min_insert_block_size_bytes`](/operations/settings/settings#min_insert_block_size_bytes) (uncompressed), impacts memory usage and disk I/O. Larger blocks use more memory but create fewer parts, reducing I/O and background merges. These settings represent minimum thresholds (whichever is reached first triggers a flush).
- **Materialized view block size** - As well as the above mechanics for the main insert, prior to insertion into materialized views, blocks are also squashed for more efficient processing. The size of these blocks is determined by the settings [`min_insert_block_size_bytes_for_materialized_views`](/operations/settings/settings#min_insert_block_size_bytes_for_materialized_views) and [`min_insert_block_size_rows_for_materialized_views`](/operations/settings/settings#min_insert_block_size_rows_for_materialized_views). Larger blocks allow more efficient processing at the expense of greater memory usage. By default, these settings revert to the values of the source table settings [`min_insert_block_size_rows`](/operations/settings/settings#min_insert_block_size_rows) and [`min_insert_block_size_bytes`](/operations/settings/settings#min_insert_block_size_bytes), respectively.
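As a sketch, these settings can be applied per insert; the values below are illustrative, not recommendations:

```sql
INSERT INTO pypi_v2
SELECT timestamp, project
FROM pypi
WHERE timestamp < '2024-12-17 09:00:00'
SETTINGS max_threads = 8,
         max_insert_threads = 4,
         min_insert_block_size_rows_for_materialized_views = 10000000
```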
import denormalizationDiagram from '@site/static/images/data-modeling/denormalization-diagram.png';
import denormalizationSchema from '@site/static/images/data-modeling/denormalization-schema.png';

# Denormalizing Data
Data denormalization is a technique in ClickHouse that uses flattened tables to help minimize query latency by avoiding joins.
Denormalizing data involves intentionally reversing the normalization process to improve read performance.
This process reduces the need for complex joins at query time and can significantly speed up read operations, making it ideal for applications with heavy read requirements and complex queries. However, it can increase the complexity of write operations and maintenance, as any changes to the duplicated data must be propagated across all instances to maintain consistency.