
Commit df2f939

Merge pull request #4563 from jangdan/minor
More minor improvements
2 parents 64e36cd + 4d9d8e2 commit df2f939

18 files changed: +88 -88 lines changed

docs/best-practices/_snippets/_async_inserts.md

Lines changed: 7 additions & 7 deletions
@@ -1,13 +1,13 @@
 import Image from '@theme/IdealImage';
 import async_inserts from '@site/static/images/bestpractices/async_inserts.png';

-Asynchronous inserts in ClickHouse provide a powerful alternative when client-side batching isn't feasible. This is especially valuable in observability workloads, where hundreds or thousands of agents send data continuously - logs, metrics, traces - often in small, real-time payloads. Buffering data client-side in these environments increases complexity, requiring a centralized queue to ensure sufficiently large batches can be sent.
+Asynchronous inserts in ClickHouse provide a powerful alternative when client-side batching isn't feasible. This is especially valuable in observability workloads, where hundreds or thousands of agents send data continuously—logs, metrics, traces—often in small, real-time payloads. Buffering data client-side in these environments increases complexity, requiring a centralized queue to ensure sufficiently large batches can be sent.

 :::note
 Sending many small batches in synchronous mode is not recommended, leading to many parts being created. This will lead to poor query performance and ["too many part"](/knowledgebase/exception-too-many-parts) errors.
 :::

-Asynchronous inserts shift batching responsibility from the client to the server by writing incoming data to an in-memory buffer, then flushing it to storage based on configurable thresholds. This approach significantly reduces part creation overhead, lowers CPU usage, and ensures ingestion remains efficient - even under high concurrency.
+Asynchronous inserts shift batching responsibility from the client to the server by writing incoming data to an in-memory buffer, then flushing it to storage based on configurable thresholds. This approach significantly reduces part creation overhead, lowers CPU usage, and ensures ingestion remains efficient—even under high concurrency.

 The core behavior is controlled via the [`async_insert`](/operations/settings/settings#async_insert) setting.

@@ -19,15 +19,15 @@ When enabled (1), inserts are buffered and only written to disk once one of the
 (2) a time threshold elapses (async_insert_busy_timeout_ms) or
 (3) a maximum number of insert queries accumulate (async_insert_max_query_number).

-This batching process is invisible to clients and helps ClickHouse efficiently merge insert traffic from multiple sources. However, until a flush occurs, the data cannot be queried. Importantly, there are multiple buffers per insert shape and settings combination, and in clusters, buffers are maintained per node - enabling fine-grained control across multi-tenant environments. Insert mechanics are otherwise identical to those described for [synchronous inserts](/best-practices/selecting-an-insert-strategy#synchronous-inserts-by-default).
+This batching process is invisible to clients and helps ClickHouse efficiently merge insert traffic from multiple sources. However, until a flush occurs, the data cannot be queried. Importantly, there are multiple buffers per insert shape and settings combination, and in clusters, buffers are maintained per node—enabling fine-grained control across multi-tenant environments. Insert mechanics are otherwise identical to those described for [synchronous inserts](/best-practices/selecting-an-insert-strategy#synchronous-inserts-by-default).

 ### Choosing a return mode {#choosing-a-return-mode}

 The behavior of asynchronous inserts is further refined using the [`wait_for_async_insert`](/operations/settings/settings#wait_for_async_insert) setting.

 When set to 1 (the default), ClickHouse only acknowledges the insert after the data is successfully flushed to disk. This ensures strong durability guarantees and makes error handling straightforward: if something goes wrong during the flush, the error is returned to the client. This mode is recommended for most production scenarios, especially when insert failures must be tracked reliably.

-[Benchmarks](https://clickhouse.com/blog/asynchronous-data-inserts-in-clickhouse) show it scales well with concurrency - whether you're running 200 or 500 clients- thanks to adaptive inserts and stable part creation behavior.
+[Benchmarks](https://clickhouse.com/blog/asynchronous-data-inserts-in-clickhouse) show it scales well with concurrency—whether you're running 200 or 500 clients—thanks to adaptive inserts and stable part creation behavior.

 Setting `wait_for_async_insert = 0` enables "fire-and-forget" mode. Here, the server acknowledges the insert as soon as the data is buffered, without waiting for it to reach storage.

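As a quick illustration of the settings discussed in this hunk, a minimal sketch of the recommended combination; the table name and values are hypothetical and not part of the change:

```sql
-- Buffer on the server and return only once the flush has completed
INSERT INTO logs (ts, level, message)
SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES (now(), 'INFO', 'service started');
```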
@@ -39,9 +39,9 @@ Our strong recommendation is to use `async_insert=1,wait_for_async_insert=1` if

 ### Deduplication and reliability {#deduplication-and-reliability}

-By default, ClickHouse performs automatic deduplication for synchronous inserts, which makes retries safe in failure scenarios. However, this is disabled for asynchronous inserts unless explicitly enabled (this should not be enabled if you have dependent materialized views - [see issue](https://github.com/ClickHouse/ClickHouse/issues/66003)).
+By default, ClickHouse performs automatic deduplication for synchronous inserts, which makes retries safe in failure scenarios. However, this is disabled for asynchronous inserts unless explicitly enabled (this should not be enabled if you have dependent materialized views—[see issue](https://github.com/ClickHouse/ClickHouse/issues/66003)).

-In practice, if deduplication is turned on and the same insert is retried - due to, for instance, a timeout or network drop - ClickHouse can safely ignore the duplicate. This helps maintain idempotency and avoids double-writing data. Still, it's worth noting that insert validation and schema parsing happen only during buffer flush - so errors (like type mismatches) will only surface at that point.
+In practice, if deduplication is turned on and the same insert is retried—due to, for instance, a timeout or network drop—ClickHouse can safely ignore the duplicate. This helps maintain idempotency and avoids double-writing data. Still, it's worth noting that insert validation and schema parsing happen only during buffer flush—so errors (like type mismatches) will only surface at that point.

 ### Enabling asynchronous inserts {#enabling-asynchronous-inserts}

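A hedged sketch of the opt-in deduplication mentioned above, using the `async_insert_deduplicate` setting; the table and payload are hypothetical, and this should stay off when dependent materialized views exist, per the note in the hunk:

```sql
-- Retried async inserts with identical blocks can then be ignored server-side
INSERT INTO logs (ts, level, message)
SETTINGS async_insert = 1, wait_for_async_insert = 1, async_insert_deduplicate = 1
VALUES (now(), 'INFO', 'retry-safe payload');
```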
@@ -57,7 +57,7 @@ Asynchronous inserts can be enabled for a particular user, or for a specific que
 ```
 - You can also specify asynchronous insert settings as connection parameters when using a ClickHouse programming language client.

-As an example, this is how you can do that within a JDBC connection string when you use the ClickHouse Java JDBC driver for connecting to ClickHouse Cloud :
+As an example, this is how you can do that within a JDBC connection string when you use the ClickHouse Java JDBC driver for connecting to ClickHouse Cloud:
 ```bash
 "jdbc:ch://HOST.clickhouse.cloud:8443/?user=default&password=PASSWORD&ssl=true&custom_http_params=async_insert=1,wait_for_async_insert=1"
 ```
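Connection parameters are one option; the same settings can also be attached to a user profile. A sketch assuming a hypothetical `ingest_user` account:

```sql
-- Make asynchronous inserts the default for a dedicated ingest account
ALTER USER ingest_user SETTINGS async_insert = 1, wait_for_async_insert = 1;
```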

docs/best-practices/_snippets/_avoid_mutations.md

Lines changed: 3 additions & 3 deletions
@@ -1,10 +1,10 @@
-In ClickHouse, **mutations** refer to operations that modify or delete existing data in a table - typically using `ALTER TABLE ... DELETE` or `ALTER TABLE ... UPDATE`. While these statements may appear similar to standard SQL operations, they are fundamentally different under the hood.
+In ClickHouse, **mutations** refer to operations that modify or delete existing data in a table—typically using `ALTER TABLE ... DELETE` or `ALTER TABLE ... UPDATE`. While these statements may appear similar to standard SQL operations, they are fundamentally different under the hood.

-Rather than modifying rows in place, mutations in ClickHouse are asynchronous background processes that rewrite entire [data parts](/parts) affected by the change. This approach is necessary due to ClickHouse's column-oriented, immutable storage model, but it can lead to significant I/O and resource usage.
+Rather than modifying rows in place, mutations in ClickHouse are asynchronous background processes that rewrite entire [data parts](/parts) affected by the change. This approach is necessary due to ClickHouse's column-oriented, immutable storage model, and it can lead to significant I/O and resource usage.

 When a mutation is issued, ClickHouse schedules the creation of new **mutated parts**, leaving the original parts untouched until the new ones are ready. Once ready, the mutated parts atomically replace the originals. However, because the operation rewrites entire parts, even a minor change (such as updating a single row) may result in large-scale rewrites and excessive write amplification.

-For large datasets, this can produce a substantial spike in disk I/O and degrade overall cluster performance. Unlike merges, mutations can't be rolled back once submitted and will continue to execute even after server restarts unless explicitly cancelled - see [`KILL MUTATION`](/sql-reference/statements/kill#kill-mutation).
+For large datasets, this can produce a substantial spike in disk I/O and degrade overall cluster performance. Unlike merges, mutations can't be rolled back once submitted and will continue to execute even after server restarts unless explicitly cancelled—see [`KILL MUTATION`](/sql-reference/statements/kill#kill-mutation).

 Mutations are **totally ordered**: they apply to data inserted before the mutation was issued, while newer data remains unaffected. They do not block inserts but can still overlap with other ongoing queries. A SELECT running during a mutation may read a mix of mutated and unmutated parts, which can lead to inconsistent views of the data during execution. ClickHouse executes mutations in parallel per part, which can further intensify memory and CPU usage, especially when complex subqueries (like x IN (SELECT ...)) are involved.

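For concreteness, a sketch of the statements this snippet is describing; the table, column, and mutation id are hypothetical:

```sql
-- Even a single-row change runs as a mutation that rewrites the affected parts
ALTER TABLE events UPDATE status = 'archived' WHERE id = 42;

-- Watch progress, and cancel a runaway mutation if necessary
SELECT mutation_id, command, is_done
FROM system.mutations
WHERE table = 'events' AND NOT is_done;

KILL MUTATION WHERE database = 'default' AND table = 'events' AND mutation_id = 'mutation_3.txt';
```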
docs/best-practices/_snippets/_avoid_nullable_columns.md

Lines changed: 1 addition & 1 deletion
@@ -25,4 +25,4 @@ ENGINE = MergeTree
 ORDER BY x
 ```

-Consider your use case, a default value may be inappropriate.
+Consider your use case; a default value may be inappropriate.

docs/best-practices/_snippets/_avoid_optimize_final.md

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ Normally, ClickHouse avoids merging parts larger than ~150 GB (configurable via

 * It may try to merge **multiple 150 GB parts** into one massive part
 * This could result in **long merge times**, **memory pressure**, or even **out-of-memory errors**
-* These large parts may become challenging to merge i.e. attempts to merge them further fails for the reasons stated above. In cases where merges are required for correct query time behavior, this can result in undesired consequences e.g. [duplicates accumulating for a ReplacingMergeTree](/guides/developer/deduplication#using-replacingmergetree-for-upserts), increasing query time performance.
+* These large parts may become challenging to merge, i.e. attempts to merge them further fails for the reasons stated above. In cases where merges are required for correct query time behavior, this can result in undesired consequences such as [duplicates accumulating for a ReplacingMergeTree](/guides/developer/deduplication#using-replacingmergetree-for-upserts), diminishing query time performance.

 ## Let background merges do the work {#let-background-merges-do-the-work}

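A sketch of how one might check whether background merges are keeping up before considering `OPTIMIZE FINAL`; the table name is hypothetical:

```sql
-- Active part count and size per partition; a steadily growing count suggests merges are falling behind
SELECT partition, count() AS active_parts, formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE active AND table = 'events'
GROUP BY partition
ORDER BY active_parts DESC;
```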
docs/best-practices/_snippets/_bulk_inserts.md

Lines changed: 1 addition & 1 deletion
@@ -7,5 +7,5 @@ We recommend inserting data in batches of at least 1,000 rows, and ideally betwe
 If you're unable to batch data client-side, ClickHouse supports asynchronous inserts that shift batching to the server ([see](/best-practices/selecting-an-insert-strategy#asynchronous-inserts)).

 :::tip
-Regardless of the size of your inserts, we recommend keeping the number of insert queries around one insert query per second. The reason for that recommendation is that the created parts are merged to larger parts in the background (in order to optimize your data for read queries), and sending too many insert queries per second can lead to situations where the background merging can't keep up with the number of new parts. However, you can use a higher rate of insert queries per second when you use asynchronous inserts (see asynchronous inserts).
+Regardless of the size of your inserts, we recommend keeping the number of insert queries around one insert query per second. The reason for this recommendation is that the created parts are merged to larger parts in the background (in order to optimize your data for read queries), and sending too many insert queries per second can lead to situations where the background merging can't keep up with the number of new parts. However, you can use a higher rate of insert queries per second when you use asynchronous inserts (see asynchronous inserts).
 :::
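A minimal sketch of the batching advice above, with a hypothetical table; a real batch should be in the 1,000 to 100,000 row range rather than three rows:

```sql
-- One multi-row INSERT rather than many single-row statements
INSERT INTO logs (ts, level, message) VALUES
    (now(), 'INFO', 'started'),
    (now(), 'WARN', 'retrying'),
    (now(), 'INFO', 'done');
```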

docs/best-practices/json_type.md

Lines changed: 8 additions & 8 deletions
@@ -25,23 +25,23 @@ If your data structure is known and consistent, there is rarely a need for the J
 * **Predictable nesting**: use Tuple, Array, or Nested types for these structures.
 * **Predictable structure with varying types**: consider Dynamic or Variant types instead.

-You can also mix approaches - for example, use static columns for predictable top-level fields and a single JSON column for a dynamic section of the payload.
+You can also mix approaches—for example, use static columns for predictable top-level fields and a single JSON column for a dynamic section of the payload.

 ## Considerations and tips for using JSON {#considerations-and-tips-for-using-json}

 The JSON type enables efficient columnar storage by flattening paths into subcolumns. But with flexibility comes responsibility. To use it effectively:

-* **Specify path types** using [hints in the column definition](/sql-reference/data-types/newjson) to specify types for known sub columns, avoiding unnecessary type inference.
+* **Specify path types** using [hints in the column definition](/sql-reference/data-types/newjson) to specify types for known subcolumns, avoiding unnecessary type inference.
 * **Skip paths** if you don't need the values, with [SKIP and SKIP REGEXP](/sql-reference/data-types/newjson) to reduce storage and improve performance.
-* **Avoid setting [`max_dynamic_paths`](/sql-reference/data-types/newjson#reaching-the-limit-of-dynamic-paths-inside-json) too high** - large values increase resource consumption and reduce efficiency. As a rule of thumb, keep it below 10,000.
+* **Avoid setting [`max_dynamic_paths`](/sql-reference/data-types/newjson#reaching-the-limit-of-dynamic-paths-inside-json) too high**—large values increase resource consumption and reduce efficiency. As a rule of thumb, keep it below 10,000.

 :::note Type hints
-Type hints offer more than just a way to avoid unnecessary type inference - they eliminate storage and processing indirection entirely. JSON paths with type hints are always stored just like traditional columns, bypassing the need for [**discriminator columns**](https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse#storage-extension-for-dynamically-changing-data) or dynamic resolution during query time. This means that with well-defined type hints, nested JSON fields achieve the same performance and efficiency as if they were modeled as top-level fields from the outset. As a result, for datasets that are mostly consistent but still benefit from the flexibility of JSON, type hints provide a convenient way to preserve performance without needing to restructure your schema or ingest pipeline.
+Type hints offer more than just a way to avoid unnecessary type inference—they eliminate storage and processing indirection entirely. JSON paths with type hints are always stored just like traditional columns, bypassing the need for [**discriminator columns**](https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse#storage-extension-for-dynamically-changing-data) or dynamic resolution during query time. This means that with well-defined type hints, nested JSON fields achieve the same performance and efficiency as if they were modeled as top-level fields from the outset. As a result, for datasets that are mostly consistent but still benefit from the flexibility of JSON, type hints provide a convenient way to preserve performance without needing to restructure your schema or ingest pipeline.
 :::

 ## Advanced features {#advanced-features}

-* JSON columns **can be used in primary keys** like any other columns. Codecs cannot be specified for a sub-column.
+* JSON columns **can be used in primary keys** like any other columns. Codecs cannot be specified for a subcolumn.
 * They support introspection via functions like [`JSONAllPathsWithTypes()` and `JSONDynamicPaths()`](/sql-reference/data-types/newjson#introspection-functions).
 * You can read nested sub-objects using the `.^` syntax.
 * Query syntax may differ from standard SQL and may require special casting or operators for nested fields.
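A hedged sketch of the type hints and skip rules referenced in the bullets above; the table and paths are hypothetical and the parameter syntax follows the linked JSON type reference:

```sql
-- Typed hints avoid inference for known paths; SKIP / SKIP REGEXP drop paths that are never queried
CREATE TABLE events_json
(
    payload JSON(
        user_id UInt64,
        event_time DateTime,
        SKIP debug.trace,
        SKIP REGEXP '^tmp\\.'
    )
)
ENGINE = MergeTree
ORDER BY tuple();
```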
@@ -156,7 +156,7 @@ INSERT INTO arxiv FORMAT JSONEachRow
 {"id":"2101.11408","submitter":"Daniel Lemire","authors":"Daniel Lemire","title":"Number Parsing at a Gigabyte per Second","comments":"Software at https://github.com/fastfloat/fast_float and\n https://github.com/lemire/simple_fastfloat_benchmark/","journal-ref":"Software: Practice and Experience 51 (8), 2021","doi":"10.1002/spe.2984","report-no":null,"categories":"cs.DS cs.MS","license":"http://creativecommons.org/licenses/by/4.0/","abstract":"With disks and networks providing gigabytes per second ....\n","versions":[{"created":"Mon, 11 Jan 2021 20:31:27 GMT","version":"v1"},{"created":"Sat, 30 Jan 2021 23:57:29 GMT","version":"v2"}],"update_date":"2022-11-07","authors_parsed":[["Lemire","Daniel",""]]}
 ```

-Suppose another column called `tags` is added. If this was simply a list of strings we could model as an `Array(String)`, but let's assume users can add arbitrary tag structures with mixed types (notice score is a string or integer). Our modified JSON document:
+Suppose another column called `tags` is added. If this was simply a list of strings we could model this as an `Array(String)`, but let's assume users can add arbitrary tag structures with mixed types (notice `score` is a string or integer). Our modified JSON document:

 ```sql
 {
@@ -222,7 +222,7 @@ ORDER BY doc.update_date
 ```

 :::note
-We provide a type hint for the `update_date` column in the JSON definition, as we use it in the ordering/primary key. This helps ClickHouse to know that this column won't be null and ensures it knows which `update_date` sub-column to use (there may be multiple for each type, so this is ambiguous otherwise).
+We provide a type hint for the `update_date` column in the JSON definition, as we use it in the ordering/primary key. This helps ClickHouse to know that this column won't be null and ensures it knows which `update_date` subcolumn to use (there may be multiple for each type, so this is ambiguous otherwise).
 :::

 We can insert into this table and view the subsequently inferred schema using the [`JSONAllPathsWithTypes`](/sql-reference/functions/json-functions#JSONAllPathsWithTypes) function and [`PrettyJSONEachRow`](/interfaces/formats/PrettyJSONEachRow) output format:
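The hunk cuts off before the query it introduces; as an illustration only (not the snippet's own example), an introspection call against the `doc` column implied by the `ORDER BY doc.update_date` context might look like:

```sql
-- Inspect the JSON paths and types inferred after the insert
SELECT JSONAllPathsWithTypes(doc)
FROM arxiv
FORMAT PrettyJSONEachRow;
```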
@@ -295,7 +295,7 @@ INSERT INTO arxiv FORMAT JSONEachRow
 {"id":"2101.11408","submitter":"Daniel Lemire","authors":"Daniel Lemire","title":"Number Parsing at a Gigabyte per Second","comments":"Software at https://github.com/fastfloat/fast_float and\n https://github.com/lemire/simple_fastfloat_benchmark/","journal-ref":"Software: Practice and Experience 51 (8), 2021","doi":"10.1002/spe.2984","report-no":null,"categories":"cs.DS cs.MS","license":"http://creativecommons.org/licenses/by/4.0/","abstract":"With disks and networks providing gigabytes per second ....\n","versions":[{"created":"Mon, 11 Jan 2021 20:31:27 GMT","version":"v1"},{"created":"Sat, 30 Jan 2021 23:57:29 GMT","version":"v2"}],"update_date":"2022-11-07","authors_parsed":[["Lemire","Daniel",""]],"tags":{"tag_1":{"name":"ClickHouse user","score":"A+","comment":"A good read, applicable to ClickHouse"},"28_03_2025":{"name":"professor X","score":10,"comment":"Didn't learn much","updates":[{"name":"professor X","comment":"Wolverine found more interesting"}]}}}
 ```

-We can now infer the types of the sub column tags.
+We can now infer the types of the subcolumn `tags`.

 ```sql
 SELECT JSONAllPathsWithTypes(tags)
