Commit 3b7ec56

Update using_data_skipping_indices.md
1 parent b4bfbc9 commit 3b7ec56

1 file changed: +12 -12 lines changed

docs/best-practices/using_data_skipping_indices.md

Lines changed: 12 additions & 12 deletions
@@ -17,16 +17,16 @@ Data skipping indices should be considered when previous best practices have bee

 These types of indices can be used to accelerate query performance if used carefully with an understanding of how they work.

-ClickHouse provides a powerful mechanism called **data skipping indices** that can dramatically reduce the amount of data scanned during query execution - particularly when the primary key isn't helpful for a specific filter condition. Unlike traditional databases that rely on row-based secondary indexes (like B-trees), ClickHouse is a column-store and doesn't store row locations in a way that supports such structures. Instead, it uses skip indexes, which help it avoid reading blocks of data guaranteed not to match a query's filtering conditions.
+ClickHouse provides a powerful mechanism called **data skipping indices** that can dramatically reduce the amount of data scanned during query execution, particularly when the primary key isn't helpful for a specific filter condition. Unlike traditional databases that rely on row-based secondary indexes (like B-trees), ClickHouse is a column-store and doesn't store row locations in a way that supports such structures. Instead, it uses skip indexes, which help it avoid reading blocks of data guaranteed not to match a query's filtering conditions.

-Skip indexes work by storing metadata about blocks of data - such as min/max values, value sets, or Bloom filter representations- and using this metadata during query execution to determine which data blocks can be skipped entirely. They apply only to the [MergeTree family](/engines/table-engines/mergetree-family/mergetree) of table engines and are defined using an expression, an index type, a name, and a granularity that defines the size of each indexed block. These indexes are stored alongside the table data and are consulted when the query filter matches the index expression.
+Skip indexes work by storing metadata about blocks of data, such as min/max values, value sets, or Bloom filter representations, and using this metadata during query execution to determine which data blocks can be skipped entirely. They apply only to the [MergeTree family](/engines/table-engines/mergetree-family/mergetree) of table engines and are defined using an expression, an index type, a name, and a granularity that defines the size of each indexed block. These indexes are stored alongside the table data and are consulted when the query filter matches the index expression.

 There are several types of data skipping indexes, each suited to different types of queries and data distributions:

 * **minmax**: Tracks the minimum and maximum value of an expression per block. Ideal for range queries on loosely sorted data.
 * **set(N)**: Tracks a set of values up to a specified size N for each block. Effective on columns with low cardinality per blocks.
 * **bloom_filter**: Probabilistically determines if a value exists in a block, allowing fast approximate filtering for set membership. Effective for optimizing queries looking for the “needle in a haystack”, where a positive match is needed.
-* **tokenbf_v1 / ngrambf_v1**: Specialized Bloom filter variants designed for searching tokens or character sequences in strings - particularly useful for log data or text search use cases.
+* **tokenbf_v1 / ngrambf_v1**: Specialized Bloom filter variants designed for searching tokens or character sequences in strings, particularly useful for log data or text search use cases.

 While powerful, skip indexes must be used with care. They only provide benefit when they eliminate a meaningful number of data blocks, and can actually introduce overhead if the query or data structure doesn't align. If even a single matching value exists in a block, that entire block must still be read.
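
Editor's sketch (not part of the commit): the index types listed above could be declared roughly as follows. The table `skip_index_demo`, its columns, the index names, and every numeric parameter are hypothetical and chosen only for illustration; suitable values depend on the data and should be measured rather than copied.

```sql
-- Hypothetical table; names and parameters are assumptions, not taken from the guide.
CREATE TABLE skip_index_demo
(
    `service` LowCardinality(String),
    `ts` DateTime,
    `latency_ms` UInt32,
    `status_code` UInt16,
    `user_id` UInt64,
    `message` String,
    -- minmax: stores the min and max of the expression per indexed block
    INDEX latency_minmax latency_ms TYPE minmax GRANULARITY 1,
    -- set(N): stores up to N distinct values per indexed block
    INDEX status_set status_code TYPE set(100) GRANULARITY 2,
    -- bloom_filter(p): p is the target false-positive rate
    INDEX user_bf user_id TYPE bloom_filter(0.01) GRANULARITY 4,
    -- tokenbf_v1(filter size in bytes, number of hash functions, seed)
    INDEX message_tokens message TYPE tokenbf_v1(8192, 3, 0) GRANULARITY 4,
    -- ngrambf_v1(ngram length, filter size in bytes, number of hash functions, seed)
    INDEX message_ngrams message TYPE ngrambf_v1(3, 8192, 3, 0) GRANULARITY 4
)
ENGINE = MergeTree
ORDER BY (service, ts);
```

`GRANULARITY n` here means each index entry covers `n` primary-index granules, which is the "size of each indexed block" mentioned above.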

@@ -40,12 +40,12 @@ In general, data skipping indices are best applied after ensuring proper primary

 Always:

-1. test skip indexes on real data with realistic queries. Try different index types and granularity values.
+1. Test skip indexes on real data with realistic queries. Try different index types and granularity values.
 2. Evaluate their impact using tools like send_logs_level='trace' and `EXPLAIN indexes=1` to view index effectiveness.
-3. Always evaluate the size of an index and how it is impacted by granularity. Reducing granularity size often will improve performance to a point, resulting in more granules being filtered and needing to be scanned. However, as index size increases with lower granularity performance can also degrade. Measure the performance and index size for various granularity data points. This is particularly pertinent on bloom filter indexes.
+3. Always evaluate the size of an index and how it is impacted by granularity. Reducing granularity size often will improve performance to a point, resulting in more granules being filtered and needing to be scanned. However, as index size increases with lower granularity, performance can also degrade. Measure the performance and index size for various granularity data points. This is particularly pertinent on bloom filter indexes.

 <p/>
-**When used appropriately, skip indexes can provide a substantial performance boost - when used blindly, they can add unnecessary cost.**
+**When used appropriately, skip indexes can provide a substantial performance boost; when used blindly, they can add unnecessary cost.**

 For a more detailed guide on Data Skipping Indices see [here](/sql-reference/statements/alter/skipping-index).
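
Editor's sketch (not part of the commit): one way to carry out the checks in the list above. It assumes the `stackoverflow.posts` table used elsewhere in this guide and a recent ClickHouse version in which `system.data_skipping_indices` exposes the size columns shown; treat it as a starting point rather than a prescribed procedure.

```sql
-- Show server trace logs in the client session; the trace output
-- includes messages about granules dropped by skip indexes.
SET send_logs_level = 'trace';

-- Compare on-disk index size across granularity settings.
SELECT
    name,
    type,
    granularity,
    formatReadableSize(data_compressed_bytes) AS compressed_size,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed_size
FROM system.data_skipping_indices
WHERE (database = 'stackoverflow') AND (table = 'posts');
```

Re-running the size query after each change of granularity, alongside timings of the target queries, gives the "performance and index size for various granularity data points" that item 3 asks for.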

@@ -98,7 +98,7 @@ WHERE (CreationDate > '2009-01-01') AND (ViewCount > 10000000)
 1 row in set. Elapsed: 0.720 sec. Processed 59.55 million rows, 230.23 MB (82.66 million rows/s., 319.56 MB/s.)
 ```

-This query is able to exclude some of the rows (and granules) using the primary index. However, the majority of rows still need to be read as indicated by the above response and following `EXPLAIN indexes=1`:
+This query is able to exclude some of the rows (and granules) using the primary index. However, the majority of rows still need to be read as indicated by the above response and the following `EXPLAIN indexes = 1`:

 ```sql
 EXPLAIN indexes = 1
@@ -138,13 +138,13 @@ LIMIT 1
 25 rows in set. Elapsed: 0.070 sec.
 ```

-A simple analysis shows that `ViewCount` is correlated with the `CreationDate` (a primary key) as one might expect - the longer a post exists, the more time it has to be viewed.
+A simple analysis shows that `ViewCount` is correlated with the `CreationDate` (a primary key) as one might expect: the longer a post exists, the more time it has to be viewed.

 ```sql
 SELECT toDate(CreationDate) AS day, avg(ViewCount) AS view_count FROM stackoverflow.posts WHERE day > '2009-01-01' GROUP BY day
 ```

-This therefore makes a logical choice for a data skipping index. Given the numeric type, a min_max index makes sense. We add an index using the following `ALTER TABLE` commands - first adding it, then "materializing it".
+This therefore makes a logical choice for a data skipping index. Given the numeric type, a minmax index makes sense. We add an index using the following `ALTER TABLE` commands, first adding it, then "materializing it".

 ```sql
 ALTER TABLE stackoverflow.posts
@@ -153,7 +153,7 @@ ALTER TABLE stackoverflow.posts
 ALTER TABLE stackoverflow.posts MATERIALIZE INDEX view_count_idx;
 ```

-This index could have also been added during initial table creation. The schema with the min max index defined as part of the DDL:
+This index could have also been added during initial table creation. The schema with the minmax index defined as part of the DDL:

 ```sql
 CREATE TABLE stackoverflow.posts
@@ -191,7 +191,7 @@ The following animation illustrates how our minmax skipping index is built for t

 <Image img={building_skipping_indices} size="lg" alt="Building skipping indices"/>

-Repeating our earlier query shows significant performance improvements. Notice all the reduced number of rows scanned:
+Repeating our earlier query shows significant performance improvements. Notice the reduced number of rows scanned:

 ```sql
 SELECT count()
@@ -205,7 +205,7 @@ WHERE (CreationDate > '2009-01-01') AND (ViewCount > 10000000)
 1 row in set. Elapsed: 0.012 sec. Processed 39.11 thousand rows, 321.39 KB (3.40 million rows/s., 27.93 MB/s.)
 ```

-An `EXPLAIN indexes=1` confirms use of the index.
+An `EXPLAIN indexes = 1` confirms use of the index.

 ```sql
 EXPLAIN indexes = 1
