Commit 8ab041b

Merge pull request #4473 from ClickHouse/3763-skipping-index-examples
Adding Data Skipping Examples Crosslinking pages related
2 parents 1becfaa + a9c0730 commit 8ab041b

5 files changed

+245
-2
lines changed

docs/best-practices/using_data_skipping_indices.md

Lines changed: 7 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -13,7 +13,7 @@ import Image from '@theme/IdealImage';
1313
import building_skipping_indices from '@site/static/images/bestpractices/building_skipping_indices.gif';
1414
import using_skipping_indices from '@site/static/images/bestpractices/using_skipping_indices.gif';
1515

16-
Data skipping indices should be considered when previous best practices have been followed i.e. types are optimized, a good primary key has been selected and materialized views have been exploited.
16+
Data skipping indices should be considered once previous best practices have been followed, i.e., types are optimized, a good primary key has been selected, and materialized views have been exploited. If you're new to skipping indices, [this guide](/optimize/skipping-indexes) is a good place to start.
1717

1818
These types of indices can be used to accelerate query performance if used carefully with an understanding of how they work.
1919

@@ -251,3 +251,9 @@ WHERE (CreationDate > '2009-01-01') AND (ViewCount > 10000000)
251251
We also show an animation of how the minmax skipping index prunes all row blocks that cannot possibly contain matches for the `ViewCount` > 10,000,000 predicate in our example query:
252252

253253
<Image img={using_skipping_indices} size="lg" alt="Using skipping indices"/>
254+
255+
## Related docs {#related-docs}
256+
- [Data skipping indices guide](/optimize/skipping-indexes)
257+
- [Data skipping index examples](/optimize/skipping-indexes/examples)
258+
- [Manipulating data skipping indices](/sql-reference/statements/alter/skipping-index)
259+
- [System table information](/operations/system-tables/data_skipping_indices)
Lines changed: 221 additions & 0 deletions
@@ -0,0 +1,221 @@
1+
---
2+
slug: /optimize/skipping-indexes/examples
3+
sidebar_label: 'Data Skipping Indexes - Examples'
4+
sidebar_position: 2
5+
description: 'Consolidated Skip Index Examples'
6+
title: 'Data Skipping Index Examples'
7+
doc_type: 'guide'
8+
---
9+
10+
# Data skipping index examples {#data-skipping-index-examples}
11+
12+
This page consolidates ClickHouse data skipping index examples, showing how to declare each type, when to use them, and how to verify they're applied. All features work with [MergeTree-family tables](/engines/table-engines/mergetree-family/mergetree).
13+
14+
**Index syntax:**
15+
16+
```sql
17+
INDEX name expr TYPE type(...) [GRANULARITY N]
18+
```
ClickHouse supports five skip index types:
20+
21+
| Index Type | Description |
22+
|------------|-------------|
23+
| **minmax** | Tracks minimum and maximum values in each granule |
24+
| **set(N)** | Stores up to N distinct values per granule |
25+
| **bloom_filter([false_positive_rate])** | Probabilistic filter for existence checks |
26+
| **ngrambf_v1** | N-gram bloom filter for substring searches |
27+
| **tokenbf_v1** | Token-based bloom filter for full-text searches |
28+
29+
Each section provides examples with sample data and demonstrates how to verify index usage in query execution.
30+
31+
## MinMax index {#minmax-index}
32+
33+
The `minmax` index is best for range predicates on loosely sorted data or columns correlated with `ORDER BY`.
34+
35+
```sql
36+
-- Define in CREATE TABLE
37+
CREATE TABLE events
38+
(
39+
ts DateTime,
40+
user_id UInt64,
41+
value UInt32,
42+
INDEX ts_minmax ts TYPE minmax GRANULARITY 1
43+
)
44+
ENGINE=MergeTree
45+
ORDER BY ts;
46+
47+
-- Or add later and materialize
48+
ALTER TABLE events ADD INDEX ts_minmax ts TYPE minmax GRANULARITY 1;
49+
ALTER TABLE events MATERIALIZE INDEX ts_minmax;
50+
51+
-- Query that benefits from the index
52+
SELECT count() FROM events WHERE ts >= now() - 3600;
53+
54+
-- Verify usage
55+
EXPLAIN indexes = 1
56+
SELECT count() FROM events WHERE ts >= now() - 3600;
57+
```
58+
59+
See a [worked example](/best-practices/use-data-skipping-indices-where-appropriate#example) with `EXPLAIN` and pruning.
60+
61+
## Set index {#set-index}
62+
63+
Use the `set` index when local (per-block) cardinality is low; it is not helpful if each block has many distinct values.
64+
65+
```sql
66+
ALTER TABLE events ADD INDEX user_set user_id TYPE set(100) GRANULARITY 1;
67+
ALTER TABLE events MATERIALIZE INDEX user_set;
68+
69+
SELECT * FROM events WHERE user_id IN (101, 202);
70+
71+
EXPLAIN indexes = 1
72+
SELECT * FROM events WHERE user_id IN (101, 202);
73+
```
74+
75+
A creation/materialization workflow and the before/after effect are shown in the [basic operation guide](/optimize/skipping-indexes#basic-operation).
76+
77+
## Generic Bloom filter (scalar) {#generic-bloom-filter-scalar}
78+
79+
The `bloom_filter` index is good for "needle in a haystack" equality/`IN` membership checks. It accepts an optional false-positive rate parameter (default 0.025).
80+
81+
```sql
82+
ALTER TABLE events ADD INDEX value_bf value TYPE bloom_filter(0.01) GRANULARITY 3;
83+
ALTER TABLE events MATERIALIZE INDEX value_bf;
84+
85+
SELECT * FROM events WHERE value IN (7, 42, 99);
86+
87+
EXPLAIN indexes = 1
88+
SELECT * FROM events WHERE value IN (7, 42, 99);
89+
```
90+
91+
## N-gram Bloom filter (ngrambf\_v1) for substring search {#n-gram-bloom-filter-ngrambf-v1-for-substring-search}
92+
93+
The `ngrambf_v1` index splits strings into n-grams. It works well for `LIKE '%...%'` queries. It supports String/FixedString/Map (via mapKeys/mapValues), as well as tunable size, hash count, and seed. See the documentation for [N-gram bloom filter](/engines/table-engines/mergetree-family/mergetree#n-gram-bloom-filter) for further details.
94+
95+
```sql
96+
-- Create index for substring search
97+
ALTER TABLE logs ADD INDEX msg_ngram msg TYPE ngrambf_v1(3, 10000, 3, 7) GRANULARITY 1;
98+
ALTER TABLE logs MATERIALIZE INDEX msg_ngram;
99+
100+
-- Substring search
101+
SELECT count() FROM logs WHERE msg LIKE '%timeout%';
102+
103+
EXPLAIN indexes = 1
104+
SELECT count() FROM logs WHERE msg LIKE '%timeout%';
105+
```
106+
107+
[This guide](/use-cases/observability/schema-design#bloom-filters-for-text-search) shows practical examples and when to use token vs ngram.
108+
109+
**Parameter optimization helpers:**
110+
111+
The four ngrambf\_v1 parameters (n-gram size, bitmap size, hash functions, seed) significantly impact performance and memory usage. Use these functions to calculate optimal bitmap size and hash function count based on your expected n-gram volume and desired false positive rate:
112+
113+
```sql
114+
CREATE FUNCTION bfEstimateFunctions AS
115+
(total_grams, bits) -> round((bits / total_grams) * log(2));
116+
117+
CREATE FUNCTION bfEstimateBmSize AS
118+
(total_grams, p_false) -> ceil((total_grams * log(p_false)) / log(1 / pow(2, log(2))));
119+
120+
-- Example sizing for 4300 ngrams, p_false = 0.0001
121+
SELECT bfEstimateBmSize(4300, 0.0001) / 8 AS size_bytes; -- ~10304
122+
SELECT bfEstimateFunctions(4300, bfEstimateBmSize(4300, 0.0001)) AS k; -- ~13
123+
```
124+
125+
See [parameter docs](/engines/table-engines/mergetree-family/mergetree#n-gram-bloom-filter) for complete tuning guidance.
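
The two helper functions above appear to implement the standard Bloom filter sizing formulas. As a sketch, with symbols introduced here for illustration only ($n$ = expected number of n-grams per indexed block, $p$ = target false-positive rate, $m$ = filter size in bits, $k$ = number of hash functions), and `log` read as the natural logarithm:

```latex
\begin{aligned}
m &= \left\lceil \frac{n \ln p}{\ln\!\left(1 / 2^{\ln 2}\right)} \right\rceil
   = \left\lceil \frac{-\,n \ln p}{(\ln 2)^2} \right\rceil
   && \text{(bfEstimateBmSize)} \\
k &= \operatorname{round}\!\left(\frac{m}{n}\,\ln 2\right)
   && \text{(bfEstimateFunctions)} \\[4pt]
\text{For } n &= 4300,\ p = 0.0001: \\
m &= \left\lceil \frac{4300 \times 9.2103}{0.48045} \right\rceil = 82432 \text{ bits} = 10304 \text{ bytes} \\
k &= \operatorname{round}(19.17 \times 0.6931) = \operatorname{round}(13.29) = 13
\end{aligned}
```

The result agrees with the `~10304` bytes and `~13` hash functions noted in the comments of the SQL example.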
126+
127+
## Token Bloom filter (tokenbf\_v1) for word-based search {#token-bloom-filter-tokenbf-v1-for-word-based-search}
128+
129+
`tokenbf_v1` indexes tokens separated by non-alphanumeric characters. Use it with [`hasToken`](/sql-reference/functions/string-search-functions#hastoken), `LIKE` word patterns, or equality/`IN` predicates. It supports `String`/`FixedString`/`Map` types.
130+
131+
See [Token bloom filter](/engines/table-engines/mergetree-family/mergetree#token-bloom-filter) and [Bloom filter types](/optimize/skipping-indexes#skip-index-types) pages for more details.
132+
133+
```sql
134+
ALTER TABLE logs ADD INDEX msg_token lower(msg) TYPE tokenbf_v1(10000, 7, 7) GRANULARITY 1;
135+
ALTER TABLE logs MATERIALIZE INDEX msg_token;
136+
137+
-- Word search (case-insensitive via lower)
138+
SELECT count() FROM logs WHERE hasToken(lower(msg), 'exception');
139+
140+
EXPLAIN indexes = 1
141+
SELECT count() FROM logs WHERE hasToken(lower(msg), 'exception');
142+
```
143+
144+
See observability examples and guidance on token vs ngram [here](/use-cases/observability/schema-design#bloom-filters-for-text-search).
145+
146+
## Add indexes during CREATE TABLE (multiple examples) {#add-indexes-during-create-table-multiple-examples}
147+
148+
Skipping indexes also support composite expressions and `Map`/`Tuple`/`Nested` types. This is demonstrated in the example below:
149+
150+
```sql
151+
CREATE TABLE t
152+
(
153+
u64 UInt64,
154+
s String,
155+
m Map(String, String),
156+
157+
INDEX idx_bf u64 TYPE bloom_filter(0.01) GRANULARITY 3,
158+
INDEX idx_minmax u64 TYPE minmax GRANULARITY 1,
159+
INDEX idx_set u64 * length(s) TYPE set(1000) GRANULARITY 4,
160+
INDEX idx_ngram s TYPE ngrambf_v1(3, 10000, 3, 7) GRANULARITY 1,
161+
INDEX idx_token mapKeys(m) TYPE tokenbf_v1(10000, 7, 7) GRANULARITY 1
162+
)
163+
ENGINE = MergeTree
164+
ORDER BY u64;
165+
```
166+
167+
## Materializing on existing data and verifying {#materializing-on-existing-data-and-verifying}
168+
169+
Adding an index only affects newly written data parts. To build it for existing parts, use `MATERIALIZE INDEX`, then inspect pruning with `EXPLAIN` or trace logs, as shown below:
170+
171+
```sql
172+
ALTER TABLE t MATERIALIZE INDEX idx_bf;
173+
174+
EXPLAIN indexes = 1
175+
SELECT count() FROM t WHERE u64 IN (123, 456);
176+
177+
-- Optional: detailed pruning info
178+
SET send_logs_level = 'trace';
179+
```
180+
181+
This [worked minmax example](/best-practices/use-data-skipping-indices-where-appropriate#example) demonstrates EXPLAIN output structure and pruning counts.
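
Index definitions and their on-disk footprint can also be checked through the `system.data_skipping_indices` table linked under related docs. The query below is a minimal sketch that assumes the example table `t` lives in the current database; adjust the filter for your own schema.

```sql
-- Sketch: list the skip indexes defined on table `t` and their storage size.
-- Assumption: `t` is in the current database.
SELECT
    name,
    type,
    expr,
    granularity,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.data_skipping_indices
WHERE database = currentDatabase()
  AND table = 't';
```

Comparing these sizes against the pruning reported by `EXPLAIN indexes = 1` gives a rough sense of whether an index earns its storage cost.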
182+
183+
## When to use and when to avoid skipping indexes {#when-use-and-when-to-avoid}
184+
185+
**Use skip indexes when:**
186+
187+
* Filter values are sparse within data blocks
188+
* Strong correlation exists with `ORDER BY` columns or data ingestion patterns group similar values together
189+
* Performing text searches on large log datasets (`ngrambf_v1`/`tokenbf_v1` types)
190+
191+
**Avoid skip indexes when:**
192+
193+
* Most blocks likely contain at least one matching value (blocks will be read regardless)
194+
* Filtering on high-cardinality columns with no correlation to data ordering
195+
196+
:::note Important considerations
197+
If a value appears even once in a data block, ClickHouse must read the entire block. Test indexes with realistic datasets and adjust granularity and type-specific parameters based on actual performance measurements.
198+
:::
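
If measurements show that an index does not prune enough granules to justify its cost, it can be removed again. A minimal sketch, reusing the `user_set` index from the `set` example above:

```sql
-- Sketch: drop a skip index that testing showed to be ineffective.
-- `events` and `user_set` are the example names used earlier on this page.
ALTER TABLE events DROP INDEX user_set;
```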
199+
200+
## Temporarily ignore or force indexes {#temporarily-ignore-or-force-indexes}
201+
202+
Disable specific indexes by name for individual queries during testing and troubleshooting. A companion setting, `force_data_skipping_indices`, makes a query fail unless the listed indexes are used. See [`ignore_data_skipping_indices`](/operations/settings/settings#ignore_data_skipping_indices).
203+
204+
```sql
205+
-- Ignore an index by name
206+
SELECT * FROM logs
207+
WHERE hasToken(lower(msg), 'exception')
208+
SETTINGS ignore_data_skipping_indices = 'msg_token';
209+
```
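
The opposite check can be sketched with the `force_data_skipping_indices` setting, which rejects the query if the named index is not used. The example assumes the `msg_token` index defined in the token Bloom filter section exists on `logs`.

```sql
-- Sketch: require that the msg_token index is used; the query errors out otherwise.
SELECT count() FROM logs
WHERE hasToken(lower(msg), 'exception')
SETTINGS force_data_skipping_indices = 'msg_token';
```

Running the same query once with `ignore_data_skipping_indices` and once with `force_data_skipping_indices` is one way to compare timings with and without the index.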
210+
211+
## Notes and caveats {#notes-and-caveats}
212+
213+
* Skipping indexes are only supported on [MergeTree-family tables](/engines/table-engines/mergetree-family/mergetree); pruning happens at the granule/block level.
214+
* Bloom-filter-based indexes are probabilistic (false positives cause extra reads but won't skip valid data).
215+
* Bloom filters and other skip indexes should be validated with `EXPLAIN` and tracing; adjust granularity to balance pruning vs. index size.
216+
217+
## Related docs {#related-docs}
218+
- [Data skipping index guide](/optimize/skipping-indexes)
219+
- [Best practices guide](/best-practices/use-data-skipping-indices-where-appropriate)
220+
- [Manipulating data skipping indices](/sql-reference/statements/alter/skipping-index)
221+
- [System table information](/operations/system-tables/data_skipping_indices)

docs/guides/best-practices/skipping-indexes.md

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -218,3 +218,9 @@ data skipping index behavior is not easily predictable. Adding them to a table i
218218
that for any number of reasons don't benefit from the index. They should always be tested on real world type of data, and testing should
219219
include variations of the type, granularity size and other parameters. Testing will often reveal patterns and pitfalls that aren't obvious from
220220
thought experiments alone.
221+
222+
## Related docs {#related-docs}
223+
- [Best practices guide](/best-practices/use-data-skipping-indices-where-appropriate)
224+
- [Data skipping index examples](/optimize/skipping-indexes/examples)
225+
- [Manipulating data skipping indices](/sql-reference/statements/alter/skipping-index)
226+
- [System table information](/operations/system-tables/data_skipping_indices)

scripts/aspell-ignore/en/aspell-dict.txt

Lines changed: 1 addition & 0 deletions
@@ -1334,6 +1334,7 @@ Trino
13341334
Tsai
13351335
Tunable
13361336
Tukey
1337+
Tunable
13371338
TwoColumnList
13381339
TypeScript
13391340
UBSan

sidebars.js

Lines changed: 10 additions & 1 deletion
@@ -1122,7 +1122,16 @@ const sidebars = {
11221122
"guides/best-practices/sparse-primary-indexes",
11231123
"guides/best-practices/query-parallelism",
11241124
"guides/best-practices/partitioningkey",
1125-
"guides/best-practices/skipping-indexes",
1125+
{
1126+
type: "category",
1127+
label: "Data Skipping Indexes",
1128+
collapsed: true,
1129+
collapsible: true,
1130+
link: { type: "doc", id: "guides/best-practices/skipping-indexes" },
1131+
items: [
1132+
"guides/best-practices/skipping-indexes-examples"
1133+
],
1134+
},
11261135
"guides/best-practices/prewhere",
11271136
"guides/best-practices/bulkinserts",
11281137
"guides/best-practices/asyncinserts",
