docs/best-practices/using_data_skipping_indices.md
12 additions & 12 deletions
@@ -17,16 +17,16 @@ Data skipping indices should be considered when previous best practices have bee
These types of indices can be used to accelerate query performance if used carefully with an understanding of how they work.
-
ClickHouse provides a powerful mechanism called **data skipping indices** that can dramatically reduce the amount of data scanned during query execution - particularly when the primary key isn't helpful for a specific filter condition. Unlike traditional databases that rely on row-based secondary indexes (like B-trees), ClickHouse is a column-store and doesn't store row locations in a way that supports such structures. Instead, it uses skip indexes, which help it avoid reading blocks of data guaranteed not to match a query's filtering conditions.
+
ClickHouse provides a powerful mechanism called **data skipping indices** that can dramatically reduce the amount of data scanned during query execution — particularly when the primary key isn't helpful for a specific filter condition. Unlike traditional databases that rely on row-based secondary indexes (like B-trees), ClickHouse is a column-store and doesn't store row locations in a way that supports such structures. Instead, it uses skip indexes, which help it avoid reading blocks of data guaranteed not to match a query's filtering conditions.
-
Skip indexes work by storing metadata about blocks of data - such as min/max values, value sets, or Bloom filter representations- and using this metadata during query execution to determine which data blocks can be skipped entirely. They apply only to the [MergeTree family](/engines/table-engines/mergetree-family/mergetree) of table engines and are defined using an expression, an index type, a name, and a granularity that defines the size of each indexed block. These indexes are stored alongside the table data and are consulted when the query filter matches the index expression.
+
Skip indexes work by storing metadata about blocks of data — such as min/max values, value sets, or Bloom filter representations — and using this metadata during query execution to determine which data blocks can be skipped entirely. They apply only to the [MergeTree family](/engines/table-engines/mergetree-family/mergetree) of table engines and are defined using an expression, an index type, a name, and a granularity that defines the size of each indexed block. These indexes are stored alongside the table data and are consulted when the query filter matches the index expression.
There are several types of data skipping indexes, each suited to different types of queries and data distributions:
* **minmax**: Tracks the minimum and maximum value of an expression per block. Ideal for range queries on loosely sorted data.
* **set(N)**: Tracks a set of values up to a specified size N for each block. Effective on columns with low cardinality per block.
* **bloom_filter**: Probabilistically determines if a value exists in a block, allowing fast approximate filtering for set membership. Effective for optimizing queries looking for the “needle in a haystack”, where a positive match is needed.
-
* **tokenbf_v1 / ngrambf_v1**: Specialized Bloom filter variants designed for searching tokens or character sequences in strings - particularly useful for log data or text search use cases.
+
* **tokenbf_v1 / ngrambf_v1**: Specialized Bloom filter variants designed for searching tokens or character sequences in strings — particularly useful for log data or text search use cases.
While powerful, skip indexes must be used with care. They only provide benefit when they eliminate a meaningful number of data blocks, and can actually introduce overhead if the query or data structure doesn't align. If even a single matching value exists in a block, that entire block must still be read.
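As a rough sketch of how the index types listed above are declared, the following DDL attaches one index of each kind to a table. The table `example_logs`, its columns, the index names, and every parameter value (`set(100)`, the `0.01` false-positive rate, the `tokenbf_v1` arguments, the granularities) are illustrative assumptions rather than recommendations from this guide; suitable values depend on the data and should be measured.

```sql
-- Hypothetical table, used only for illustration.
CREATE TABLE example_logs
(
    ts         DateTime,
    status     UInt16,
    trace_id   String,
    message    String,
    latency_ms UInt32,

    -- minmax: stores the min and max of the expression per indexed block (range filters)
    INDEX latency_idx latency_ms TYPE minmax GRANULARITY 1,
    -- set(N): stores up to N distinct values per indexed block (here N = 100)
    INDEX status_idx status TYPE set(100) GRANULARITY 4,
    -- bloom_filter: probabilistic membership test; 0.01 is the target false-positive rate
    INDEX trace_idx trace_id TYPE bloom_filter(0.01) GRANULARITY 4,
    -- tokenbf_v1(filter size in bytes, number of hash functions, random seed)
    INDEX message_idx message TYPE tokenbf_v1(8192, 3, 0) GRANULARITY 4
)
ENGINE = MergeTree
ORDER BY ts;
```

With a definition like this, an equality filter on `trace_id` can consult `trace_idx`, but a block is skipped only when the index rules out a match for every row in that block.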
@@ -40,12 +40,12 @@ In general, data skipping indices are best applied after ensuring proper primary
Always:
-
1. test skip indexes on real data with realistic queries. Try different index types and granularity values.
+
1. Test skip indexes on real data with realistic queries. Try different index types and granularity values.
2. Evaluate their impact using tools like send_logs_level='trace' and `EXPLAIN indexes=1` to view index effectiveness.
-
3. Always evaluate the size of an index and how it is impacted by granularity. Reducing granularity size often will improve performance to a point, resulting in more granules being filtered and needing to be scanned. However, as index size increases with lower granularity performance can also degrade. Measure the performance and index size for various granularity data points. This is particularly pertinent on bloom filter indexes.
+
3. Always evaluate the size of an index and how it is impacted by granularity. Reducing granularity size often will improve performance to a point, resulting in more granules being filtered and needing to be scanned. However, as index size increases with lower granularity, performance can also degrade. Measure the performance and index size for various granularity data points. This is particularly pertinent on bloom filter indexes.
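The checks in the list above might look like the following in `clickhouse-client`. This is a sketch under assumptions: the `SELECT` is reconstructed around the filter used in this guide's example, and the size query assumes a ClickHouse version whose `system.data_skipping_indices` table exposes `data_compressed_bytes` and `data_uncompressed_bytes`.

```sql
-- Stream server-side trace logs to the client for subsequent queries.
SET send_logs_level = 'trace';

-- Show which indexes are applied and how many parts and granules remain after each.
-- Assumed query shape; only the WHERE clause is taken from the guide.
EXPLAIN indexes = 1
SELECT count()
FROM stackoverflow.posts
WHERE (CreationDate > '2009-01-01') AND (ViewCount > 10000000);

-- Inspect the on-disk size of each skip index to compare granularity settings.
SELECT
    name,
    type,
    granularity,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.data_skipping_indices
WHERE database = 'stackoverflow' AND table = 'posts';
```

Re-running the last two statements after changing an index's granularity gives the paired measurements (granules read versus index size) that the third point asks for.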
<p/>
-
**When used appropriately, skip indexes can provide a substantial performance boost - when used blindly, they can add unnecessary cost.**
+
**When used appropriately, skip indexes can provide a substantial performance boost — when used blindly, they can add unnecessary cost.**
For a more detailed guide on Data Skipping Indices see [here](/sql-reference/statements/alter/skipping-index).
@@ -98,7 +98,7 @@ WHERE (CreationDate > '2009-01-01') AND (ViewCount > 10000000)
1 row in set. Elapsed: 0.720 sec. Processed 59.55 million rows, 230.23 MB (82.66 million rows/s., 319.56 MB/s.)
```
-
This query is able to exclude some of the rows (and granules) using the primary index. However, the majority of rows still need to be read as indicated by the above response and following `EXPLAIN indexes=1`:
+
This query is able to exclude some of the rows (and granules) using the primary index. However, the majority of rows still need to be read as indicated by the above response and the following `EXPLAIN indexes = 1`:
```sql
EXPLAIN indexes = 1
@@ -138,13 +138,13 @@ LIMIT 1
25 rows in set. Elapsed: 0.070 sec.
```
-
A simple analysis shows that `ViewCount` is correlated with the `CreationDate` (a primary key) as one might expect - the longer a post exists, the more time it has to be viewed.
+
A simple analysis shows that `ViewCount` is correlated with the `CreationDate` (a primary key) as one might expect — the longer a post exists, the more time it has to be viewed.
```sql
SELECT toDate(CreationDate) AS day, avg(ViewCount) AS view_count FROM stackoverflow.posts WHERE day > '2009-01-01' GROUP BY day
```
-
This therefore makes a logical choice for a data skipping index. Given the numeric type, a min_max index makes sense. We add an index using the following `ALTER TABLE` commands - first adding it, then "materializing it".
+
This therefore makes a logical choice for a data skipping index. Given the numeric type, a minmax index makes sense. We add an index using the following `ALTER TABLE` commands — first adding it, then "materializing it".
```sql
ALTER TABLE stackoverflow.posts
@@ -153,7 +153,7 @@ ALTER TABLE stackoverflow.posts
ALTER TABLE stackoverflow.posts MATERIALIZE INDEX view_count_idx;
```
-
This index could have also been added during initial table creation. The schema with the min max index defined as part of the DDL:
+
This index could have also been added during initial table creation. The schema with the minmax index defined as part of the DDL:
```sql
CREATE TABLE stackoverflow.posts
@@ -191,7 +191,7 @@ The following animation illustrates how our minmax skipping index is built for t