
Commit 487d2cf

Merge pull request #3407 from ClickHouse/translate_ja
move more images to static
2 parents 94ca9a9 + 33dd8ec commit 487d2cf


42 files changed (+112 / -113 lines)

docs/data-modeling/backfilling.md

Lines changed: 17 additions & 18 deletions
@@ -5,6 +5,8 @@ description: How to use backfill large datasets in ClickHouse
keywords: [materialized views, backfilling, inserting data, resilient data load]
---

+import nullTableMV from '@site/static/images/data-modeling/null_table_mv.png';
+
# Backfilling Data

Whether new to ClickHouse or responsible for an existing deployment, users will invariably need to backfill tables with historical data. In some cases, this is relatively simple but can become more complex when materialized views need to be populated. This guide documents some processes for this task that users can apply to their use case.
@@ -15,7 +17,7 @@ This guide assumes users are already familiar with the concept of [Incremental M

## Example dataset {#example-dataset}

Throughout this guide, we use a PyPI dataset. Each row in this dataset represents a Python package download using a tool such as `pip`.

For example, the subset covers a single day - `2024-12-17` - and is available publicly at `https://datasets-documentation.s3.eu-west-3.amazonaws.com/pypi/2024-12-17/`. Users can query with:

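The query itself falls outside this hunk. As a minimal hedged sketch, assuming the files are Parquet and using the `s3` table function:

```sql
-- Hypothetical query against the public bucket above; the `*.parquet` glob is an assumption.
SELECT count()
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/pypi/2024-12-17/*.parquet')
```
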
@@ -66,12 +68,12 @@ The full PyPI dataset, consisting of over 1 trillion rows, is available in our p

## Backfilling scenarios {#backfilling-scenarios}

Backfilling is typically needed when a stream of data is being consumed from a point in time. This data is being inserted into ClickHouse tables with [incremental materialized views](/materialized-view/incremental-materialized-view), triggering on blocks as they are inserted. These views may be transforming the data prior to insert or computing aggregates and sending results to target tables for later use in downstream applications.

We will attempt to cover the following scenarios:

1. **Backfilling data with existing data ingestion** - New data is being loaded, and historical data needs to be backfilled. This historical data has been identified.
2. **Adding materialized views to existing tables** - New materialized views need to be added to a setup for which historical data has been populated and data is already streaming.

We assume data will be backfilled from object storage. In all cases, we aim to avoid pauses in data insertion.

@@ -141,7 +143,7 @@ FROM pypi_downloads
Peak memory usage: 682.38 KiB.
```

Suppose we wish to load another subset `{101..200}`. While we could insert directly into `pypi`, we can do this backfill in isolation by creating duplicate tables.

Should the backfill fail, we have not impacted our main tables and can simply [truncate](/managing-data/truncate) our duplicate tables and repeat.

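The duplicate tables referenced here are defined outside this hunk. A hedged sketch of the pattern, with names assumed for illustration:

```sql
-- Hypothetical duplicates mirroring the main table and the materialized view's target table.
CREATE TABLE pypi_v2 AS pypi;
CREATE TABLE pypi_downloads_v2 AS pypi_downloads;
-- The `{101..200}` subset can then be loaded into `pypi_v2` in isolation.
```
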
@@ -236,9 +238,9 @@ FROM pypi_v2

Importantly, the `MOVE PARTITION` operation is both lightweight (exploiting hard links) and atomic, i.e. it either fails or succeeds with no intermediate state.

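As a hedged illustration of the operation (the partition expression depends on the table's `PARTITION BY` clause; `tuple()` is assumed here for unpartitioned tables):

```sql
-- Move all parts of the assumed duplicate tables into the main tables.
ALTER TABLE pypi_v2 MOVE PARTITION tuple() TO TABLE pypi;
ALTER TABLE pypi_downloads_v2 MOVE PARTITION tuple() TO TABLE pypi_downloads;
```
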
We exploit this process heavily in our backfilling scenarios below.

Notice how this process requires users to choose the size of each insert operation.

Larger inserts, i.e. more rows, will mean fewer `MOVE PARTITION` operations are required. However, this must be balanced against the cost of recovering in the event of an insert failure, e.g. due to a network interruption. Users can complement this process with batching files to reduce the risk. This can be performed with either range queries, e.g. `WHERE timestamp BETWEEN 2024-12-17 09:00:00 AND 2024-12-17 10:00:00`, or glob patterns. For example,

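The original example sits outside this hunk. As a hedged sketch, with the bucket layout and file names assumed purely for illustration:

```sql
-- Hypothetical: restrict each insert to a batch of files via a numeric range glob.
INSERT INTO pypi_v2 SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/pypi/2024-12-17/file_{000..099}.parquet')
```
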
@@ -258,7 +260,7 @@ ClickPipes uses this approach when loading data from object storage, automatical

## Scenario 1: Backfilling data with existing data ingestion {#scenario-1-backfilling-data-with-existing-data-ingestion}

In this scenario, we assume that the data to backfill is not in an isolated bucket and thus filtering is required. Data is already being inserted, and a timestamp or monotonically increasing column can be identified from which historical data needs to be backfilled.

This process involves the following steps:

@@ -317,7 +319,7 @@ ALTER TABLE pypi_downloads
If the historical data is in an isolated bucket, the above time filter is not required. If a time or monotonic column is unavailable, isolate your historical data.

:::note Just use ClickPipes in ClickHouse Cloud
ClickHouse Cloud users should use ClickPipes for restoring historical backups if the data can be isolated in its own bucket (and a filter is not required). As well as parallelizing the load with multiple workers, thus reducing the load time, ClickPipes automates the above process - creating duplicate tables for both the main table and materialized views.
:::

## Scenario 2: Adding materialized views to existing tables {#scenario-2-adding-materialized-views-to-existing-tables}
@@ -339,7 +341,7 @@ Our simplest approach involves the following steps:

This can be further enhanced to target subsets of data in step (2) and/or use a duplicate target table for the materialized view (attach partitions to the original once the insert is complete) for easier recovery after failure.

Consider the following materialized view, which computes the most popular projects per hour.

```sql
CREATE TABLE pypi_downloads_per_day
@@ -372,7 +374,7 @@ AS SELECT
    project, count() AS count
FROM pypi WHERE timestamp >= '2024-12-17 09:00:00'
GROUP BY hour, project
```

Once this view is added, we can backfill all data for the materialized view prior to this date.

@@ -403,7 +405,7 @@ In our case, this is a relatively lightweight aggregation that completes in unde

Often a materialized view's query can be more complex (not uncommon, as otherwise users wouldn't use a view!) and consume resources. In rarer cases, the resources for the query are beyond that of the server. This highlights one of the advantages of ClickHouse materialized views - they are incremental and don't process the entire dataset in one go!

In this case, users have several options:

1. Modify your query to backfill ranges e.g. `WHERE timestamp BETWEEN 2024-12-17 08:00:00 AND 2024-12-17 09:00:00`, `WHERE timestamp BETWEEN 2024-12-17 07:00:00 AND 2024-12-17 08:00:00` etc.
2. Use a [Null table engine](/engines/table-engines/special/null) to fill the materialized view. This replicates the typical incremental population of a materialized view, executing its query over blocks of data (of configurable size).
@@ -418,10 +420,7 @@ The [Null table engine](/engines/table-engines/special/null) provides a storage

Importantly, any materialized views attached to the table engine still execute over blocks of data as it is inserted - sending their results to a target table. These blocks are of a configurable size. While larger blocks can potentially be more efficient (and faster to process), they consume more resources (principally memory). Use of this table engine means we can build our materialized view incrementally, i.e. a block at a time, avoiding the need to hold the entire aggregation in memory.

-<img src={require('./images/null_table_mv.png').default}
-class='image'
-alt='Denormalization in ClickHouse'
-style={{width: '50%', background: 'none' }} />
+<img src={nullTableMV} class="image" alt="Denormalization in ClickHouse" style={{width: '50%', background: 'none'}} />

<br />

@@ -449,7 +448,7 @@ GROUP BY
Here, we create a Null table, `pypi_v2`, to receive the rows that will be used to build our materialized view. Note how we limit the schema to only the columns we need. Our materialized view performs an aggregation over rows inserted into this table (one block at a time), sending the results to our target table, `pypi_downloads_per_day`.
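
The table and view definitions precede this hunk. A hedged sketch of the pattern just described, with the column list and view name assumed to mirror the view defined earlier:

```sql
-- Hypothetical Null table holding only the columns the view needs; rows are not persisted.
CREATE TABLE pypi_v2
(
    timestamp DateTime,
    project String
)
ENGINE = Null;

-- Hypothetical materialized view aggregating each inserted block into the target table.
CREATE MATERIALIZED VIEW pypi_downloads_per_day_mv TO pypi_downloads_per_day
AS SELECT
    toStartOfHour(timestamp) AS hour,
    project,
    count() AS count
FROM pypi_v2
GROUP BY hour, project;
```
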
:::note
We have used `pypi_downloads_per_day` as our target table here. For additional resiliency, users could create a duplicate table, `pypi_downloads_per_day_v2`, and use this as the target table of the view, as shown in previous examples. On completion of the insert, partitions in `pypi_downloads_per_day_v2` could, in turn, be moved to `pypi_downloads_per_day`. This would allow recovery in the case our insert fails due to memory issues or server interruptions, i.e. we just truncate `pypi_downloads_per_day_v2`, tune settings, and retry.
:::

To populate this materialized view, we simply insert the relevant data to backfill into `pypi_v2` from `pypi`.
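
The insert statement itself is outside this hunk. A hedged sketch, assuming the view was created at `2024-12-17 09:00:00` as above and only earlier rows need replaying:

```sql
-- Replay historical rows through the Null table so the attached view
-- rebuilds the aggregates one block at a time.
INSERT INTO pypi_v2
SELECT timestamp, project
FROM pypi
WHERE timestamp < '2024-12-17 09:00:00'
```
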
@@ -467,8 +466,8 @@ Notice our memory usage here is `639.47 MiB`.

Several factors will determine the performance and resources used in the above scenario. We recommend readers understand the insert mechanics documented in detail [here](/integrations/s3/performance#using-threads-for-reads) prior to attempting to tune. In summary (an illustrative example of applying these settings follows the list):

- **Read Parallelism** - The number of threads used to read. Controlled through [`max_threads`](/operations/settings/settings#max_threads). In ClickHouse Cloud this is determined by the instance size, with it defaulting to the number of vCPUs. Increasing this value may improve read performance at the expense of greater memory usage.
- **Insert Parallelism** - The number of insert threads used to insert. Controlled through [`max_insert_threads`](/operations/settings/settings#max_insert_threads). In ClickHouse Cloud this is determined by the instance size (between 2 and 4) and is set to 1 in OSS. Increasing this value may improve performance at the expense of greater memory usage.
- **Insert Block Size** - Data is processed in a loop where it is pulled, parsed, and formed into in-memory insert blocks based on the [partitioning key](/engines/table-engines/mergetree-family/custom-partitioning-key). These blocks are sorted, optimized, compressed, and written to storage as new [data parts](/parts). The size of the insert block, controlled by settings [`min_insert_block_size_rows`](/operations/settings/settings#min_insert_block_size_rows) and [`min_insert_block_size_bytes`](/operations/settings/settings#min_insert_block_size_bytes) (uncompressed), impacts memory usage and disk I/O. Larger blocks use more memory but create fewer parts, reducing I/O and background merges. These settings represent minimum thresholds (whichever is reached first triggers a flush).
- **Materialized view block size** - As well as the above mechanics for the main insert, prior to insertion into materialized views, blocks are also squashed for more efficient processing. The size of these blocks is determined by the settings [`min_insert_block_size_bytes_for_materialized_views`](/operations/settings/settings#min_insert_block_size_bytes_for_materialized_views) and [`min_insert_block_size_rows_for_materialized_views`](/operations/settings/settings#min_insert_block_size_rows_for_materialized_views). Larger blocks allow more efficient processing at the expense of greater memory usage. By default, these settings revert to the values of the source table settings [`min_insert_block_size_rows`](/operations/settings/settings#min_insert_block_size_rows) and [`min_insert_block_size_bytes`](/operations/settings/settings#min_insert_block_size_bytes), respectively.
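
A hedged sketch of applying these settings on a per-insert basis; the values are illustrative only, not recommendations:

```sql
-- Hypothetical tuning; adjust against the available memory of your own instance.
INSERT INTO pypi_v2
SELECT timestamp, project
FROM pypi
WHERE timestamp < '2024-12-17 09:00:00'
SETTINGS
    max_threads = 8,
    max_insert_threads = 4,
    min_insert_block_size_rows = 1000000
```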

docs/data-modeling/denormalization.md

Lines changed: 5 additions & 8 deletions
@@ -5,6 +5,9 @@ description: How to use denormalization to improve query performance
keywords: [data denormalization, denormalize, query optimization]
---

+import denormalizationDiagram from '@site/static/images/data-modeling/denormalization-diagram.png';
+import denormalizationSchema from '@site/static/images/data-modeling/denormalization-schema.png';
+
# Denormalizing Data

Data denormalization is a technique in ClickHouse to use flattened tables to help minimize query latency by avoiding joins.
@@ -15,10 +18,7 @@ Denormalizing data involves intentionally reversing the normalization process to

This process reduces the need for complex joins at query time and can significantly speed up read operations, making it ideal for applications with heavy read requirements and complex queries. However, it can increase the complexity of write operations and maintenance, as any changes to the duplicated data must be propagated across all instances to maintain consistency.

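As a hedged illustration of the idea (table and column names are hypothetical, loosely based on the Stack Overflow schema used later in this guide):

```sql
-- A flattened posts table that duplicates selected user fields,
-- trading extra storage and write complexity for join-free reads.
CREATE TABLE posts_denormalized
(
    PostId UInt64,
    Title String,
    OwnerUserId UInt64,
    OwnerDisplayName String,  -- duplicated from the users table
    OwnerReputation UInt32    -- duplicated from the users table
)
ENGINE = MergeTree
ORDER BY PostId;
```
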
-<img src={require('./images/denormalization-diagram.png').default}
-class='image'
-alt='Denormalization in ClickHouse'
-style={{width: '100%', background: 'none' }} />
+<img src={denormalizationDiagram} class="image" alt="Denormalization in ClickHouse" style={{width: '100%', background: 'none'}} />

<br />

@@ -131,10 +131,7 @@ The main observation here is that aggregated vote statistics for each post would

Now let's consider our `Users` and `Badges`:

-<img src={require('./images/denormalization-schema.png').default}
-class='image'
-alt='Users and Badges schema'
-style={{width: '100%', background: 'none' }} />
+<img src={denormalizationSchema} class="image" alt="Users and Badges schema" style={{width: '100%', background: 'none'}} />

<p></p>
We first insert the data with the following command:
