docs/cloud/bestpractices/asyncinserts.md (+4 −12)
@@ -8,6 +8,7 @@ description: 'Describes how to use asynchronous inserts into ClickHouse as an al
import asyncInsert01 from '@site/static/images/cloud/bestpractices/async-01.png';
import asyncInsert02 from '@site/static/images/cloud/bestpractices/async-02.png';
import asyncInsert03 from '@site/static/images/cloud/bestpractices/async-03.png';
+import Image from '@theme/IdealImage';
Inserting data into ClickHouse in large batches is a best practice. It saves compute cycles and disk I/O, and therefore it saves money. If your use case allows you to batch your inserts external to ClickHouse, then that is one option. If you would like ClickHouse to create the batches, then you can use the asynchronous INSERT mode described here.
@@ -17,10 +18,7 @@ By default, ClickHouse is writing data synchronously.
Each insert sent to ClickHouse causes ClickHouse to immediately create a part containing the data from the insert.
This is the default behavior when the async_insert setting is set to its default value of 0:
-<img src={asyncInsert01}
-     class="image"
-     alt="Asynchronous insert process - default synchronous inserts"
-     style={{width: '100%', background: 'none'}} />
+<Image img={asyncInsert01} size="lg" alt="Asynchronous insert process - default synchronous inserts" background="white"/>
By setting async_insert to 1, ClickHouse first stores the incoming inserts into an in-memory buffer before flushing them regularly to disk.
@@ -38,15 +36,9 @@ With the [wait_for_async_insert](/operations/settings/settings.md/#wait_for_asyn
The following two diagrams illustrate the two settings for async_insert and wait_for_async_insert:
-<img src={asyncInsert02}
-     class="image"
-     alt="Asynchronous insert process - async_insert=1, wait_for_async_insert=1"
-     style={{width: '100%', background: 'none'}} />
+<Image img={asyncInsert02} size="lg" alt="Asynchronous insert process - async_insert=1, wait_for_async_insert=1" background="white"/>
-<img src={asyncInsert03}
-     class="image"
-     alt="Asynchronous insert process - async_insert=1, wait_for_async_insert=0"
-     style={{width: '100%', background: 'none'}} />
+<Image img={asyncInsert03} size="lg" alt="Asynchronous insert process - async_insert=1, wait_for_async_insert=0" background="white"/>
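Outside the diff itself, a minimal sketch of how these two settings can be applied per insert (the table name and values below are hypothetical):

```sql
-- async_insert and wait_for_async_insert are the settings discussed above;
-- applied per query, ClickHouse buffers the rows and flushes them as a batch.
-- ingest_events is a hypothetical two-column table.
INSERT INTO ingest_events SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES (1, 'example payload');
```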
description: 'Page describing why you should choose a low cardinality partitioning key as a best practice'
---
+import Image from '@theme/IdealImage';
import partitioning01 from '@site/static/images/cloud/bestpractices/partitioning-01.png';
import partitioning02 from '@site/static/images/cloud/bestpractices/partitioning-02.png';
-# Choose a Low Cardinality Partitioning Key
-
When you send an insert statement (that should contain many rows - see [section above](/optimize/bulk-inserts)) to a table in ClickHouse Cloud, and that
table is not using a [partitioning key](/engines/table-engines/mergetree-family/custom-partitioning-key.md) then all row data from that insert is written into a new part on storage:
-<img src={partitioning01}
-     class="image"
-     alt="Insert without partitioning key - one part created"
-     style={{width: '100%', background: 'none'}} />
+<Image img={partitioning01} size="lg" alt="Insert without partitioning key - one part created" background="white"/>
However, when you send an insert statement to a table in ClickHouse Cloud, and that table has a partitioning key, then ClickHouse:
- checks the partitioning key values of the rows contained in the insert
- creates one new part on storage per distinct partitioning key value
- places the rows in the corresponding parts by partitioning key value
-<img src={partitioning02}
-     class="image"
-     alt="Insert with partitioning key - multiple parts created based on partitioning key values"
-     style={{width: '100%', background: 'none'}} />
+<Image img={partitioning02} size="lg" alt="Insert with partitioning key - multiple parts created based on partitioning key values" background="white"/>
Therefore, to minimize the number of write requests to the ClickHouse Cloud object storage, use a low cardinality partitioning key or avoid using any partitioning key for your table.
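As an illustration of that guidance (a sketch, not part of the diff; table and column names are hypothetical), a month-based partitioning key keeps the number of distinct partition values, and therefore the number of parts written per insert, low:

```sql
CREATE TABLE events_example
(
    event_time DateTime,
    user_id    UInt64,
    payload    String
)
ENGINE = MergeTree
-- toYYYYMM yields one value per month, so a typical insert creates few parts
PARTITION BY toYYYYMM(event_time)
ORDER BY (user_id, event_time);
```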
import nullTableMV from '@site/static/images/data-modeling/null_table_mv.png';
+import Image from '@theme/IdealImage';
# Backfilling Data
@@ -420,7 +421,7 @@ The [Null table engine](/engines/table-engines/special/null) provides a storage
Importantly, any materialized views attached to the table engine still execute over blocks of data as it is inserted, sending their results to a target table. These blocks are of a configurable size. While larger blocks can potentially be more efficient (and faster to process), they consume more resources (principally memory). Use of this table engine means we can build our materialized view incrementally, i.e. a block at a time, avoiding the need to hold the entire aggregation in memory.
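A minimal sketch of this pattern (not part of the diff; all table, view, and column names are hypothetical):

```sql
-- Inserts into the Null table store nothing, but the attached materialized
-- view still processes every inserted block and writes results to the target.
CREATE TABLE imports_null
(
    ts    DateTime,
    value Float64
)
ENGINE = Null;

CREATE TABLE daily_totals
(
    day   Date,
    total SimpleAggregateFunction(sum, Float64)
)
ENGINE = AggregatingMergeTree
ORDER BY day;

CREATE MATERIALIZED VIEW imports_mv TO daily_totals AS
SELECT
    toDate(ts) AS day,
    sum(value) AS total
FROM imports_null
GROUP BY day;
```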
import denormalizationDiagram from '@site/static/images/data-modeling/denormalization-diagram.png';
import denormalizationSchema from '@site/static/images/data-modeling/denormalization-schema.png';
+import Image from '@theme/IdealImage';
# Denormalizing Data
@@ -18,7 +19,7 @@ Denormalizing data involves intentionally reversing the normalization process to
This process reduces the need for complex joins at query time and can significantly speed up read operations, making it ideal for applications with heavy read requirements and complex queries. However, it can increase the complexity of write operations and maintenance, as any changes to the duplicated data must be propagated across all instances to maintain consistency.
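For illustration (a sketch, not part of the diff; the schema is hypothetical), a denormalized table duplicates the parent attributes onto every child row so that reads avoid the join:

```sql
CREATE TABLE orders_denormalized
(
    order_id         UInt64,
    order_date       Date,
    customer_id      UInt64,
    -- duplicated from a hypothetical customers table; must be kept in sync
    customer_name    String,
    customer_country String
)
ENGINE = MergeTree
ORDER BY (order_date, order_id);
```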
docs/guides/best-practices/query-optimization.md (+36 −34)
@@ -6,10 +6,12 @@ description: 'A simple guide for query optimization that describe common path to
---
import queryOptimizationDiagram1 from '@site/static/images/guides/best-practices/query_optimization_diagram_1.png';
+import Image from '@theme/IdealImage';
+
# A simple guide for query optimization
-This section aims to illustrate through common scenarios how to use different performance and optimization techniques, such as [analyzer](/operations/analyzer), [query profiling](/operations/optimizing-performance/sampling-query-profiler) or [avoid Nullable Columns](/optimize/avoid-nullable-columns), in order to improve your ClickHouse query performance.
+This section aims to illustrate through common scenarios how to use different performance and optimization techniques, such as [analyzer](/operations/analyzer), [query profiling](/operations/optimizing-performance/sampling-query-profiler) or [avoid Nullable Columns](/optimize/avoid-nullable-columns), in order to improve your ClickHouse query performance.
@@ -111,7 +113,7 @@ For each executed query, ClickHouse logs statistics such as query execution time
Therefore, the query log is a good place to start when investigating slow queries. You can easily spot the queries that take a long time to execute and display the resource usage information for each one.
-Let’s find the top five long-running queries on our NYC taxi dataset.
+Let’s find the top five long-running queries on our NYC taxi dataset.
```sql
-- Find top 5 long running queries from nyc_taxi database in the last 1 hour
@@ -153,19 +155,19 @@ Row 2:
type: QueryFinish
event_time: 2024-11-27 11:11:33
query_duration_ms: 2026
-query: SELECT
+query: SELECT
payment_type,
COUNT() AS trip_count,
formatReadableQuantity(SUM(trip_distance)) AS total_distance,
@@ -308,7 +310,7 @@ Let's understand a bit better what the queries achieve.
None of these queries are doing very complex processing, except the first query, which calculates the trip time on the fly every time the query executes. However, each of these queries takes more than one second to execute, which, in the ClickHouse world, is a very long time. We can also note the memory usage of these queries; roughly 400 MB for each query is quite a lot of memory. Also, each query appears to read the same number of rows (i.e., 329.04 million). Let's quickly confirm how many rows are in this table.
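The confirmation query itself is outside the lines shown in this diff; a sketch, assuming the nyc_taxi.trips_small_inferred table referenced elsewhere in the guide:

```sql
-- Confirm how many rows the slow queries are scanning
SELECT formatReadableQuantity(count()) AS total_rows
FROM nyc_taxi.trips_small_inferred;
```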
Here, we can note the number of threads used to execute the query: 59 threads, which indicates a high parallelization. This speeds up the query, which would take longer to execute on a smaller machine. The number of threads running in parallel can explain the high volume of memory the query uses.
-Ideally, you would investigate all your slow queries the same way to identify unnecessarily complex query plans and understand the number of rows read by each query and the resources consumed.
+Ideally, you would investigate all your slow queries the same way to identify unnecessarily complex query plans and understand the number of rows read by each query and the resources consumed.
## Methodology {#methodology}
@@ -407,7 +409,7 @@ Start by identifying your slow queries from query logs, then investigate potenti
Once you have identified potential optimizations, it is recommended that you implement them one by one to better track how they affect performance. Below is a diagram describing the general approach.
_Finally, be cautious of outliers; it’s pretty common that a query might run slowly, either because a user tried an ad-hoc expensive query or the system was under stress for another reason. You can group by the field normalized_query_hash to identify expensive queries that are being executed regularly. Those are probably the ones you want to investigate._
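A sketch of that grouping (not part of the diff; the ordering and limit are illustrative):

```sql
-- Group query_log entries by normalized query shape to surface the expensive,
-- regularly executed queries worth investigating.
SELECT
    normalized_query_hash,
    count() AS executions,
    round(avg(query_duration_ms)) AS avg_duration_ms,
    formatReadableSize(avg(memory_usage)) AS avg_memory
FROM system.query_log
WHERE type = 'QueryFinish'
GROUP BY normalized_query_hash
ORDER BY avg_duration_ms DESC
LIMIT 10;
```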
@@ -417,7 +419,7 @@ Now that we have our framework to test, we can start optimizing.
The best place to start is to look at how the data is stored. As for any database, the less data we read, the faster the query will be executed.
-Depending on how you ingested your data, you might have leveraged ClickHouse [capabilities](/interfaces/schema-inference) to infer the table schema based on the ingested data. While this is very practical to get started, if you want to optimize your query performance, you’ll need to review the data schema to best fit your use case.
+Depending on how you ingested your data, you might have leveraged ClickHouse [capabilities](/interfaces/schema-inference) to infer the table schema based on the ingested data. While this is very practical to get started, if you want to optimize your query performance, you’ll need to review the data schema to best fit your use case.
### Nullable {#nullable}
@@ -426,7 +428,7 @@ As described in the [best practices documentation](/cloud/bestpractices/avoid-nu
Running an SQL query that counts the rows with a NULL value can easily reveal the columns in your tables that actually need a Nullable value.
```sql
--- Find non-null values columns
+-- Find non-null values columns
SELECT
countIf(vendor_id IS NULL) AS vendor_id_nulls,
countIf(pickup_datetime IS NULL) AS pickup_datetime_nulls,
@@ -471,7 +473,7 @@ An easy optimization to apply to Strings is to make best use of the LowCardinali
An easy rule of thumb for determining which columns are good candidates for LowCardinality is that any column with less than 10,000 unique values is a perfect candidate.
-You can use the following SQL query to find columns with a low number of unique values.
+You can use the following SQL query to find columns with a low number of unique values.
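That query falls outside the lines shown in this diff; a minimal sketch, assuming a couple of columns from the NYC taxi table used in this guide:

```sql
-- Columns with a low number of unique values are LowCardinality candidates
SELECT
    uniq(vendor_id)    AS vendor_id_uniq,
    uniq(payment_type) AS payment_type_uniq
FROM trips_small_inferred;
```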
INSERT INTO trips_small_pk SELECT * FROM trips_small_inferred
```
@@ -741,7 +743,7 @@ We then rerun our queries. We compile the results from the three experiments to
We can see significant improvement across the board in execution time and memory used.
-Query 2 benefits most from the primary key. Let’s have a look at how the query plan generated is different from before.
+Query 2 benefits most from the primary key. Let’s have a look at how the query plan generated is different from before.
```sql
EXPLAIN indexes = 1
@@ -780,6 +782,6 @@ Thanks to the primary key, only a subset of the table granules has been selected
## Next steps {#next-steps}
-Hopefully this guide gives you a good understanding of how to investigate slow queries with ClickHouse and how to make them faster. To explore more on this topic, you can read more about the [query analyzer](/operations/analyzer) and [profiling](/operations/optimizing-performance/sampling-query-profiler) to better understand how exactly ClickHouse executes your queries.
+Hopefully this guide gives you a good understanding of how to investigate slow queries with ClickHouse and how to make them faster. To explore more on this topic, you can read more about the [query analyzer](/operations/analyzer) and [profiling](/operations/optimizing-performance/sampling-query-profiler) to better understand how exactly ClickHouse executes your queries.
-As you get more familiar with ClickHouse specifics, I would recommend reading about [partitioning keys](/optimize/partitioning-key) and [data skipping indexes](/optimize/skipping-indexes) to learn about more advanced techniques you can use to accelerate your queries.
+As you get more familiar with ClickHouse specifics, I would recommend reading about [partitioning keys](/optimize/partitioning-key) and [data skipping indexes](/optimize/skipping-indexes) to learn about more advanced techniques you can use to accelerate your queries.