Update using_data_skipping_indices.md

jangdan · jangdan · commit 3b7ec5655446 · 2025-10-12T21:42:34.000+09:00
diff --git a/docs/best-practices/using_data_skipping_indices.md b/docs/best-practices/using_data_skipping_indices.md
@@ -17,16 +17,16 @@ Data skipping indices should be considered when previous best practices have bee
 
 These types of indices can be used to accelerate query performance if used carefully with an understanding of how they work.
 
-ClickHouse provides a powerful mechanism called **data skipping indices** that can dramatically reduce the amount of data scanned during query execution - particularly when the primary key isn't helpful for a specific filter condition. Unlike traditional databases that rely on row-based secondary indexes (like B-trees), ClickHouse is a column-store and doesn't store row locations in a way that supports such structures. Instead, it uses skip indexes, which help it avoid reading blocks of data guaranteed not to match a query's filtering conditions.
+ClickHouse provides a powerful mechanism called **data skipping indices** that can dramatically reduce the amount of data scanned during query execution — particularly when the primary key isn't helpful for a specific filter condition. Unlike traditional databases that rely on row-based secondary indexes (like B-trees), ClickHouse is a column-store and doesn't store row locations in a way that supports such structures. Instead, it uses skip indexes, which help it avoid reading blocks of data guaranteed not to match a query's filtering conditions.
 
-Skip indexes work by storing metadata about blocks of data - such as min/max values, value sets, or Bloom filter representations- and using this metadata during query execution to determine which data blocks can be skipped entirely. They apply only to the [MergeTree family](/engines/table-engines/mergetree-family/mergetree) of table engines and are defined using an expression, an index type, a name, and a granularity that defines the size of each indexed block. These indexes are stored alongside the table data and are consulted when the query filter matches the index expression.
+Skip indexes work by storing metadata about blocks of data — such as min/max values, value sets, or Bloom filter representations — and using this metadata during query execution to determine which data blocks can be skipped entirely. They apply only to the [MergeTree family](/engines/table-engines/mergetree-family/mergetree) of table engines and are defined using an expression, an index type, a name, and a granularity that defines the size of each indexed block. These indexes are stored alongside the table data and are consulted when the query filter matches the index expression.
 
 There are several types of data skipping indexes, each suited to different types of queries and data distributions:
 
 * **minmax**: Tracks the minimum and maximum value of an expression per block. Ideal for range queries on loosely sorted data.
 * **set(N)**: Tracks a set of values up to a specified size N for each block. Effective on columns with low cardinality per blocks.
 * **bloom_filter**: Probabilistically determines if a value exists in a block, allowing fast approximate filtering for set membership. Effective for optimizing queries looking for the “needle in a haystack”, where a positive match is needed.
-* **tokenbf_v1 / ngrambf_v1**: Specialized Bloom filter variants designed for searching tokens or character sequences in strings - particularly useful for log data or text search use cases.
+* **tokenbf_v1 / ngrambf_v1**: Specialized Bloom filter variants designed for searching tokens or character sequences in strings — particularly useful for log data or text search use cases.
 
 While powerful, skip indexes must be used with care. They only provide benefit when they eliminate a meaningful number of data blocks, and can actually introduce overhead if the query or data structure doesn't align. If even a single matching value exists in a block, that entire block must still be read.
 
@@ -40,12 +40,12 @@ In general, data skipping indices are best applied after ensuring proper primary
 
 Always:
 
-1. test skip indexes on real data with realistic queries. Try different index types and granularity values.
+1. Test skip indexes on real data with realistic queries. Try different index types and granularity values.
 2. Evaluate their impact using tools like send_logs_level='trace' and `EXPLAIN indexes=1` to view index effectiveness. 
-3. Always evaluate the size of an index and how it is impacted by granularity. Reducing granularity size often will improve performance to a point, resulting in more granules being filtered and needing to be scanned. However, as index size increases with lower granularity performance can also degrade. Measure the performance and index size for various granularity data points. This is particularly pertinent on bloom filter indexes.
+3. Always evaluate the size of an index and how it is impacted by granularity. Reducing granularity size often will improve performance to a point, resulting in more granules being filtered and needing to be scanned. However, as index size increases with lower granularity, performance can also degrade. Measure the performance and index size for various granularity data points. This is particularly pertinent on bloom filter indexes.
 
 <p/>
-**When used appropriately, skip indexes can provide a substantial performance boost - when used blindly, they can add unnecessary cost.**
+**When used appropriately, skip indexes can provide a substantial performance boost — when used blindly, they can add unnecessary cost.**
 
 For a more detailed guide on Data Skipping Indices see [here](/sql-reference/statements/alter/skipping-index).
 
@@ -98,7 +98,7 @@ WHERE (CreationDate > '2009-01-01') AND (ViewCount > 10000000)
 1 row in set. Elapsed: 0.720 sec. Processed 59.55 million rows, 230.23 MB (82.66 million rows/s., 319.56 MB/s.)
 ```
 
-This query is able to exclude some of the rows (and granules) using the primary index. However, the majority of rows still need to be read as indicated by the above response and following `EXPLAIN indexes=1`:
+This query is able to exclude some of the rows (and granules) using the primary index. However, the majority of rows still need to be read as indicated by the above response and the following `EXPLAIN indexes = 1`:
 
 ```sql
 EXPLAIN indexes = 1
@@ -138,13 +138,13 @@ LIMIT 1
 25 rows in set. Elapsed: 0.070 sec.
 ```
 
-A simple analysis shows that `ViewCount` is correlated with the `CreationDate` (a primary key) as one might expect - the longer a post exists, the more time it has to be viewed.
+A simple analysis shows that `ViewCount` is correlated with the `CreationDate` (a primary key) as one might expect — the longer a post exists, the more time it has to be viewed.
 
 ```sql
 SELECT toDate(CreationDate) AS day, avg(ViewCount) AS view_count FROM stackoverflow.posts WHERE day > '2009-01-01'  GROUP BY day
 ```
 
-This therefore makes a logical choice for a data skipping index. Given the numeric type, a min_max index makes sense. We add an index using the following `ALTER TABLE` commands - first adding it, then "materializing it".
+This therefore makes a logical choice for a data skipping index. Given the numeric type, a minmax index makes sense. We add an index using the following `ALTER TABLE` commands — first adding it, then "materializing it".
 
 ```sql
 ALTER TABLE stackoverflow.posts
@@ -153,7 +153,7 @@ ALTER TABLE stackoverflow.posts
 ALTER TABLE stackoverflow.posts MATERIALIZE INDEX view_count_idx;
 ```
 
-This index could have also been added during initial table creation. The schema with the min max index defined as part of the DDL:
+This index could have also been added during initial table creation. The schema with the minmax index defined as part of the DDL:
 
 ```sql
 CREATE TABLE stackoverflow.posts
@@ -191,7 +191,7 @@ The following animation illustrates how our minmax skipping index is built for t
 
 <Image img={building_skipping_indices} size="lg" alt="Building skipping indices"/>
 
-Repeating our earlier query shows significant performance improvements. Notice all the reduced number of rows scanned:
+Repeating our earlier query shows significant performance improvements. Notice the reduced number of rows scanned:
 
 ```sql
 SELECT count()
@@ -205,7 +205,7 @@ WHERE (CreationDate > '2009-01-01') AND (ViewCount > 10000000)
 1 row in set. Elapsed: 0.012 sec. Processed 39.11 thousand rows, 321.39 KB (3.40 million rows/s., 27.93 MB/s.)
 ```
 
-An `EXPLAIN indexes=1` confirms use of the index.
+An `EXPLAIN indexes = 1` confirms use of the index.
 
 ```sql
 EXPLAIN indexes = 1