Commit dcedcfe: managing data images
1 parent fb23006 commit dcedcfe
14 files changed: +68 -93 lines changed
docs/cloud/bestpractices/asyncinserts.md
Lines changed: 4 additions & 12 deletions

@@ -8,6 +8,7 @@ description: 'Describes how to use asynchronous inserts into ClickHouse as an al
  import asyncInsert01 from '@site/static/images/cloud/bestpractices/async-01.png';
  import asyncInsert02 from '@site/static/images/cloud/bestpractices/async-02.png';
  import asyncInsert03 from '@site/static/images/cloud/bestpractices/async-03.png';
+ import Image from '@theme/IdealImage';

  Inserting data into ClickHouse in large batches is a best practice. It saves compute cycles and disk I/O, and therefore it saves money. If your use case allows you to batch your inserts external to ClickHouse, then that is one option. If you would like ClickHouse to create the batches, then you can use the asynchronous INSERT mode described here.

@@ -17,10 +18,7 @@ By default, ClickHouse is writing data synchronously.
  Each insert sent to ClickHouse causes ClickHouse to immediately create a part containing the data from the insert.
  This is the default behavior when the async_insert setting is set to its default value of 0:

- <img src={asyncInsert01}
-   class="image"
-   alt="Asynchronous insert process - default synchronous inserts"
-   style={{width: '100%', background: 'none'}} />
+ <Image img={asyncInsert01} size="lg" alt="Asynchronous insert process - default synchronous inserts" background="white"/>

  By setting async_insert to 1, ClickHouse first stores the incoming inserts into an in-memory buffer before flushing them regularly to disk.

@@ -38,15 +36,9 @@ With the [wait_for_async_insert](/operations/settings/settings.md/#wait_for_asyn

  The following two diagrams illustrate the two settings for async_insert and wait_for_async_insert:

- <img src={asyncInsert02}
-   class="image"
-   alt="Asynchronous insert process - async_insert=1, wait_for_async_insert=1"
-   style={{width: '100%', background: 'none'}} />
+ <Image img={asyncInsert02} size="lg" alt="Asynchronous insert process - async_insert=1, wait_for_async_insert=1" background="white"/>

- <img src={asyncInsert03}
-   class="image"
-   alt="Asynchronous insert process - async_insert=1, wait_for_async_insert=0"
-   style={{width: '100%', background: 'none'}} />
+ <Image img={asyncInsert03} size="lg" alt="Asynchronous insert process - async_insert=1, wait_for_async_insert=0" background="white"/>

  ### Enabling asynchronous inserts {#enabling-asynchronous-inserts}
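The two settings this file documents can be tried directly from a SQL client. A minimal sketch, assuming a hypothetical pre-existing table `t` (not part of the page shown above):

```sql
-- Enable server-side batching for this session: small inserts are collected
-- into an in-memory buffer and flushed to disk as larger parts.
SET async_insert = 1;
-- 1 = block until the buffered data has been written to a part;
-- 0 = return as soon as the data is buffered (fire-and-forget).
SET wait_for_async_insert = 1;

INSERT INTO t VALUES (1, 'a'), (2, 'b');
```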

docs/cloud/bestpractices/partitioningkey.md
Lines changed: 3 additions & 10 deletions

@@ -5,27 +5,20 @@ title: 'Choose a Low Cardinality Partitioning Key'
  description: 'Page describing why you should choose a low cardinality partitioning key as a best practice'
  ---

+ import Image from '@theme/IdealImage';
  import partitioning01 from '@site/static/images/cloud/bestpractices/partitioning-01.png';
  import partitioning02 from '@site/static/images/cloud/bestpractices/partitioning-02.png';

- # Choose a Low Cardinality Partitioning Key
-
  When you send an insert statement (that should contain many rows - see [section above](/optimize/bulk-inserts)) to a table in ClickHouse Cloud, and that
  table is not using a [partitioning key](/engines/table-engines/mergetree-family/custom-partitioning-key.md) then all row data from that insert is written into a new part on storage:

- <img src={partitioning01}
-   class="image"
-   alt="Insert without partitioning key - one part created"
-   style={{width: '100%', background: 'none'}} />
+ <Image img={partitioning01} size="lg" alt="Insert without partitioning key - one part created" background="white"/>

  However, when you send an insert statement to a table in ClickHouse Cloud, and that table has a partitioning key, then ClickHouse:
  - checks the partitioning key values of the rows contained in the insert
  - creates one new part on storage per distinct partitioning key value
  - places the rows in the corresponding parts by partitioning key value

- <img src={partitioning02}
-   class="image"
-   alt="Insert with partitioning key - multiple parts created based on partitioning key values"
-   style={{width: '100%', background: 'none'}} />
+ <Image img={partitioning02} size="lg" alt="Insert with partitioning key - multiple parts created based on partitioning key values" background="white"/>

  Therefore, to minimize the number of write requests to the ClickHouse Cloud object storage, use a low cardinality partitioning key or avoid using any partitioning key for your table.
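For reference, a table whose partitioning expression is low cardinality looks like the following sketch; the table and column names are hypothetical and not taken from the page above:

```sql
-- toYYYYMM(event_time) produces at most a few distinct values per insert,
-- so each insert creates only a few parts on object storage.
CREATE TABLE events
(
    `event_time` DateTime,
    `user_id`    UInt64,
    `payload`    String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (user_id, event_time);
```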

docs/data-compression/compression-modes.md
Lines changed: 2 additions & 1 deletion

@@ -7,6 +7,7 @@ keywords: ['compression', 'codec', 'encoding', 'modes']
  ---

  import CompressionBlock from '@site/static/images/data-compression/ch_compression_block.png';
+ import Image from '@theme/IdealImage';

  # Compression modes

@@ -43,7 +44,7 @@ From [Facebook benchmarks](https://facebook.github.io/zstd/#benchmarks):
  | mode | byte | Compression mode |
  | compressed_data | binary | Block of compressed data |

- <img src={CompressionBlock} alt="Diagram illustrating ClickHouse compression block structure" />
+ <Image img={CompressionBlock} size="md" alt="Diagram illustrating ClickHouse compression block structure"/>

  Header is (raw_size + data_size + mode), raw size consists of len(header + compressed_data).

docs/data-modeling/backfilling.md
Lines changed: 2 additions & 1 deletion

@@ -6,6 +6,7 @@ keywords: ['materialized views', 'backfilling', 'inserting data', 'resilient dat
  ---

  import nullTableMV from '@site/static/images/data-modeling/null_table_mv.png';
+ import Image from '@theme/IdealImage';

  # Backfilling Data

@@ -420,7 +421,7 @@ The [Null table engine](/engines/table-engines/special/null) provides a storage

  Importantly, any materialized views attached to the table engine still execute over blocks of data as its inserted - sending their results to a target table. These blocks are of a configurable size. While larger blocks can potentially be more efficient (and faster to process), they consume more resources (principally memory). Use of this table engine means we can build our materialized view incrementally i.e. a block at a time, avoiding the need to hold the entire aggregation in memory.

- <img src={nullTableMV} class="image" alt="Denormalization in ClickHouse" style={{width: '50%', background: 'none'}} />
+ <Image img={nullTableMV} size="md" alt="Denormalization in ClickHouse"/>

  <br />
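The Null-engine pattern referenced in that hunk can be sketched as follows; the table and view names are illustrative assumptions rather than examples taken from the guide:

```sql
-- Rows inserted into the Null table are discarded once any attached
-- materialized views have processed each block.
CREATE TABLE imports (ts DateTime, user_id UInt64, value Float64)
ENGINE = Null;

-- Target table that actually persists the aggregated results.
CREATE TABLE daily_totals
(
    day     Date,
    user_id UInt64,
    total   Float64
)
ENGINE = SummingMergeTree
ORDER BY (day, user_id);

-- Runs block by block as data flows through `imports`, so the full
-- aggregation never has to be held in memory at once.
CREATE MATERIALIZED VIEW daily_totals_mv TO daily_totals AS
SELECT toDate(ts) AS day, user_id, sum(value) AS total
FROM imports
GROUP BY day, user_id;
```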

docs/data-modeling/denormalization.md
Lines changed: 3 additions & 2 deletions

@@ -7,6 +7,7 @@ keywords: ['data denormalization', 'denormalize', 'query optimization']

  import denormalizationDiagram from '@site/static/images/data-modeling/denormalization-diagram.png';
  import denormalizationSchema from '@site/static/images/data-modeling/denormalization-schema.png';
+ import Image from '@theme/IdealImage';

  # Denormalizing Data

@@ -18,7 +19,7 @@ Denormalizing data involves intentionally reversing the normalization process to

  This process reduces the need for complex joins at query time and can significantly speed up read operations, making it ideal for applications with heavy read requirements and complex queries. However, it can increase the complexity of write operations and maintenance, as any changes to the duplicated data must be propagated across all instances to maintain consistency.

- <img src={denormalizationDiagram} class="image" alt="Denormalization in ClickHouse" style={{width: '100%', background: 'none'}} />
+ <Image img={denormalizationDiagram} size="lg" alt="Denormalization in ClickHouse"/>

  <br />

@@ -131,7 +132,7 @@ The main observation here is that aggregated vote statistics for each post would

  Now let's consider our `Users` and `Badges`:

- <img src={denormalizationSchema} class="image" alt="Users and Badges schema" style={{width: '100%', background: 'none'}} />
+ <Image img={denormalizationSchema} size="lg" alt="Users and Badges schema"/>

  <p></p>
  We first insert the data with the following command:

docs/guides/best-practices/query-optimization.md
Lines changed: 36 additions & 34 deletions

@@ -6,10 +6,12 @@ description: 'A simple guide for query optimization that describe common path to
  ---

  import queryOptimizationDiagram1 from '@site/static/images/guides/best-practices/query_optimization_diagram_1.png';
+ import Image from '@theme/IdealImage';
+

  # A simple guide for query optimization

- This section aims to illustrate through common scenarios how to use different performance and optimization techniques, such as [analyzer](/operations/analyzer), [query profiling](/operations/optimizing-performance/sampling-query-profiler) or [avoid Nullable Columns](/optimize/avoid-nullable-columns), in order to improve your ClickHouse query performances.
+ This section aims to illustrate through common scenarios how to use different performance and optimization techniques, such as [analyzer](/operations/analyzer), [query profiling](/operations/optimizing-performance/sampling-query-profiler) or [avoid Nullable Columns](/optimize/avoid-nullable-columns), in order to improve your ClickHouse query performances.

  ## Understand query performance {#understand-query-performance}

@@ -67,12 +69,12 @@ AS SELECT * FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/n

  -- Insert data into table with inferred schema
  INSERT INTO trips_small_inferred
- SELECT *
+ SELECT *
  FROM s3Cluster
  ('default','https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/clickhouse-academy/nyc_taxi_2009-2010.parquet');
  ```

- Let's have a look to the table schema automatically inferred from the data.
+ Let's have a look to the table schema automatically inferred from the data.

  ```sql
  --- Display inferred table schema

@@ -98,7 +100,7 @@ CREATE TABLE nyc_taxi.trips_small_inferred
  `tolls_amount` Nullable(Float64),
  `total_amount` Nullable(Float64)
  )
- ORDER BY tuple()
+ ORDER BY tuple()
  ```

  ## Spot the slow queries {#spot-the-slow-queries}

@@ -111,7 +113,7 @@ For each executed query, ClickHouse logs statistics such as query execution time

  Therefore, the query log is a good place to start when investigating slow queries. You can easily spot the queries that take a long time to execute and display the resource usage information for each one.

- Let’s find the top five long-running queries on our NYC taxi dataset.
+ Let’s find the top five long-running queries on our NYC taxi dataset.

  ```sql
  -- Find top 5 long running queries from nyc_taxi database in the last 1 hour

@@ -153,19 +155,19 @@ Row 2:
  type: QueryFinish
  event_time: 2024-11-27 11:11:33
  query_duration_ms: 2026
- query: SELECT
+ query: SELECT
  payment_type,
  COUNT() AS trip_count,
  formatReadableQuantity(SUM(trip_distance)) AS total_distance,
  AVG(total_amount) AS total_amount_avg,
  AVG(tip_amount) AS tip_amount_avg
- FROM
+ FROM
  nyc_taxi.trips_small_inferred
- WHERE
+ WHERE
  pickup_datetime >= '2009-01-01' AND pickup_datetime < '2009-04-01'
- GROUP BY
+ GROUP BY
  payment_type
- ORDER BY
+ ORDER BY
  trip_count DESC;

  read_rows: 329044175

@@ -217,7 +219,7 @@ The field `query_duration_ms` indicates how long it took for that particular que
  You might also want to know which queries are stressing the system by examining the query that consumes the most memory or CPU.

  ```sql
- -- Top queries by memory usage
+ -- Top queries by memory usage
  SELECT
  type,
  event_time,

@@ -236,7 +238,7 @@ LIMIT 30

  Let’s isolate the long-running queries we found and rerun them a few times to understand the response time.

- At this point, it is essential to turn off the filesystem cache by setting the `enable_filesystem_cache` setting to 0 to improve reproducibility.
+ At this point, it is essential to turn off the filesystem cache by setting the `enable_filesystem_cache` setting to 0 to improve reproducibility.


  ```sql

@@ -260,22 +262,22 @@ FORMAT JSON
  Peak memory usage: 440.24 MiB.

  -- Run query 2
- SELECT
+ SELECT
  payment_type,
  COUNT() AS trip_count,
  formatReadableQuantity(SUM(trip_distance)) AS total_distance,
  AVG(total_amount) AS total_amount_avg,
  AVG(tip_amount) AS tip_amount_avg
- FROM
+ FROM
  nyc_taxi.trips_small_inferred
- WHERE
+ WHERE
  pickup_datetime >= '2009-01-01' AND pickup_datetime < '2009-04-01'
- GROUP BY
+ GROUP BY
  payment_type
- ORDER BY
+ ORDER BY
  trip_count DESC;

- ---
+ ---
  4 rows in set. Elapsed: 1.419 sec. Processed 329.04 million rows, 5.72 GB (231.86 million rows/s., 4.03 GB/s.)
  Peak memory usage: 546.75 MiB.

@@ -291,7 +293,7 @@ FORMAT JSON
  Peak memory usage: 451.53 MiB.
  ```

- Summarize in the table for easy reading.
+ Summarize in the table for easy reading.

  | Name | Elapsed | Rows processed | Peak memory |
  | ------- | --------- | -------------- | ----------- |

@@ -308,7 +310,7 @@ Let's understand a bit better what the queries achieve.
  None of these queries are doing very complex processing, except the first query that calculates the trip time on the fly every time the query executes. However, each of these queries takes more than one second to execute, which, in the ClickHouse world, is a very long time. We can also note the memory usage of these queries; more or less 400 Mb for each query is quite a lot of memory. Also, each query appears to read the same number of rows (i.e., 329.04 million). Let's quickly confirm how many rows are in this table.

  ```sql
- -- Count number of rows in table
+ -- Count number of rows in table
  SELECT count()
  FROM nyc_taxi.trips_small_inferred


@@ -319,7 +321,7 @@ Query id: 733372c5-deaf-4719-94e3-261540933b23
  └───────────┘
  ```

- The table contains 329.04 million rows, therefore each query is doing a full scan of the table.
+ The table contains 329.04 million rows, therefore each query is doing a full scan of the table.

  ### Explain statement {#explain-statement}

@@ -389,7 +391,7 @@ Query id: c7e11e7b-d970-4e35-936c-ecfc24e3b879

  Here, we can note the number of threads used to execute the query: 59 threads, which indicates a high parallelization. This speeds up the query, which would take longer to execute on a smaller machine. The number of threads running in parallel can explain the high volume of memory the query uses.

- Ideally, you would investigate all your slow queries the same way to identify unnecessary complex query plans and understand the number of rows read by each query and the resources consumed.
+ Ideally, you would investigate all your slow queries the same way to identify unnecessary complex query plans and understand the number of rows read by each query and the resources consumed.

  ## Methodology {#methodology}

@@ -407,7 +409,7 @@ Start by identifying your slow queries from query logs, then investigate potenti

  Once you have identified potential optimizations, it is recommended that you implement them one by one to better track how they affect performance. Below is a diagram describing the general approach.

- <img src={queryOptimizationDiagram1} class="image" />
+ <Image img={queryOptimizationDiagram1} size="lg" alt="Optimization workflow"/>

  _Finally, be cautious of outliers; it’s pretty common that a query might run slowly, either because a user tried an ad-hoc expensive query or the system was under stress for another reason. You can group by the field normalized_query_hash to identify expensive queries that are being executed regularly. Those are probably the ones you want to investigate._

@@ -417,7 +419,7 @@ Now that we have our framework to test, we can start optimizing.

  The best place to start is to look at how the data is stored. As for any database, the less data we read, the faster the query will be executed.

- Depending on how you ingested your data, you might have leveraged ClickHouse [capabilities](/interfaces/schema-inference) to infer the table schema based on the ingested data. While this is very practical to get started, if you want to optimize your query performance, you’ll need to review the data schema to best fit your use case.
+ Depending on how you ingested your data, you might have leveraged ClickHouse [capabilities](/interfaces/schema-inference) to infer the table schema based on the ingested data. While this is very practical to get started, if you want to optimize your query performance, you’ll need to review the data schema to best fit your use case.

  ### Nullable {#nullable}

@@ -426,7 +428,7 @@ As described in the [best practices documentation](/cloud/bestpractices/avoid-nu
  Running an SQL query that counts the rows with a NULL value can easily reveal the columns in your tables that actually need a Nullable value.

  ```sql
- -- Find non-null values columns
+ -- Find non-null values columns
  SELECT
  countIf(vendor_id IS NULL) AS vendor_id_nulls,
  countIf(pickup_datetime IS NULL) AS pickup_datetime_nulls,

@@ -471,7 +473,7 @@ An easy optimization to apply to Strings is to make best use of the LowCardinali

  An easy rule of thumb for determining which columns are good candidates for LowCardinality is that any column with less than 10,000 unique values is a perfect candidate.

- You can use the following SQL query to find columns with a low number of unique values.
+ You can use the following SQL query to find columns with a low number of unique values.

  ```sql
  -- Identify low cardinality columns

@@ -515,14 +517,14 @@ Query id: 4306a8e1-2a9c-4b06-97b4-4d902d2233eb
  └───────────────────┴───────────────────┘
  ```

- For dates, you should pick a precision that matches your dataset and is best suited to answering the queries you’re planning to run.
+ For dates, you should pick a precision that matches your dataset and is best suited to answering the queries you’re planning to run.

  ### Apply the optimizations {#apply-the-optimizations}

- Let’s create a new table to use the optimized schema and re-ingest the data.
+ Let’s create a new table to use the optimized schema and re-ingest the data.

  ```sql
- -- Create table with optimized data
+ -- Create table with optimized data
  CREATE TABLE trips_small_no_pk
  (
  `vendor_id` LowCardinality(String),

@@ -543,7 +545,7 @@ CREATE TABLE trips_small_no_pk
  )
  ORDER BY tuple();

- -- Insert the data
+ -- Insert the data
  INSERT INTO trips_small_no_pk SELECT * FROM trips_small_inferred
  ```

@@ -631,7 +633,7 @@ CREATE TABLE trips_small_pk
  )
  PRIMARY KEY (passenger_count, pickup_datetime, dropoff_datetime);

- -- Insert the data
+ -- Insert the data
  INSERT INTO trips_small_pk SELECT * FROM trips_small_inferred
  ```

@@ -741,7 +743,7 @@ We then rerun our queries. We compile the results from the three experiments to

  We can see significant improvement across the board in execution time and memory used.

- Query 2 benefits most from the primary key. Let’s have a look at how the query plan generated is different from before.
+ Query 2 benefits most from the primary key. Let’s have a look at how the query plan generated is different from before.

  ```sql
  EXPLAIN indexes = 1

@@ -780,6 +782,6 @@ Thanks to the primary key, only a subset of the table granules has been selected

  ## Next steps {#next-steps}

- Hopefully this guide gets a good understanding on how to investigate slow queries with ClickHouse and how to make them faster. To explore more on this topic, you can read more about [query analyzer](/operations/analyzer) and [profiling](/operations/optimizing-performance/sampling-query-profiler) to understand better how exactly ClickHouse is executing your query.
+ Hopefully this guide gets a good understanding on how to investigate slow queries with ClickHouse and how to make them faster. To explore more on this topic, you can read more about [query analyzer](/operations/analyzer) and [profiling](/operations/optimizing-performance/sampling-query-profiler) to understand better how exactly ClickHouse is executing your query.

- As you get more familiar with ClickHouse specificities, I would recommend to read about [partitioning keys](/optimize/partitioning-key) and [data skipping indexes](/optimize/skipping-indexes) to learn about more advanced techniques you can use to accelerate your queries.
+ As you get more familiar with ClickHouse specificities, I would recommend to read about [partitioning keys](/optimize/partitioning-key) and [data skipping indexes](/optimize/skipping-indexes) to learn about more advanced techniques you can use to accelerate your queries.
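The `normalized_query_hash` grouping mentioned in the outliers note above can be sketched as a query against `system.query_log`; the time window and ordering here are arbitrary assumptions:

```sql
-- Group executions of the same query shape and surface the ones that are
-- both recurring and expensive, rather than one-off ad-hoc outliers.
SELECT
    normalized_query_hash,
    any(query) AS example_query,
    count() AS executions,
    avg(query_duration_ms) AS avg_duration_ms,
    formatReadableSize(avg(memory_usage)) AS avg_memory
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time >= now() - INTERVAL 1 DAY
GROUP BY normalized_query_hash
ORDER BY avg_duration_ms DESC
LIMIT 10;
```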
