description: Learn how Tantivy indexing works in OpenObserve, including full-text and secondary indexes, query behaviors with AND and OR operators, and how to verify index usage.
4
4
---
5
-
This document explains Tantivy indexing in OpenObserve, the types of indexes it builds, how to use the correct query patterns, and how to verify and configure indexing.
5
+
This document explains Tantivy indexing in OpenObserve, the types of indexes it builds, how to use the correct query patterns for both single-stream and multi-stream queries, and how to verify and configure indexing.
6
6
7
7
> Tantivy indexing is an open-source feature in OpenObserve.
8
8
9
9
## What is Tantivy?
10
10
Tantivy is the inverted index library used in OpenObserve to accelerate searches. An inverted index keeps a map of values or tokens and the row IDs of the records that contain them. When a user searches for a value, the query can use this index to go directly to the matching rows instead of scanning every log record.
11
11
12
+
## Index types
12
13
Tantivy builds two kinds of indexes in OpenObserve:
13
14
14
-
## Full-text index
15
+
### Full-text index
15
16
Used for fields such as `body` or `message` that contain sentences or long text. The field value is split into tokens, and each token is mapped to the records that contain it.
16
17
17
18
**Example log records** <br>
@@ -22,102 +23,103 @@ For fields such as `body` or `message` that contain sentences or long text. The
22
23
23
24
The log body `POST /api/metrics error` is stored as tokens `POST`, `api`, `metrics`, `error`. A search for `error` looks up that token in the index and immediately finds the matching records.
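The token lookup described above can be sketched in a few lines of Python. This is a simplified illustration, not OpenObserve's or Tantivy's implementation: the `build_full_text_index` helper, the regex tokenizer, and the sample records are all assumptions made for the example.

```python
import re
from collections import defaultdict


def build_full_text_index(records):
    """Map each token to the sorted row IDs whose text contains it."""
    index = defaultdict(set)
    for row_id, text in records.items():
        # Simplified tokenizer (assumption): split on anything that is
        # not a letter or digit. Real tokenizers are more elaborate.
        for token in re.findall(r"[A-Za-z0-9]+", text):
            index[token].add(row_id)
    return {token: sorted(rows) for token, rows in index.items()}


# Hypothetical log bodies keyed by row ID.
records = {
    1: "POST /api/metrics error",
    2: "GET /health ok",
    3: "POST /api/login error",
}

index = build_full_text_index(records)

# Searching for "error" is a single dictionary lookup, not a scan.
rows_with_error = index.get("error", [])
```

Here `rows_with_error` holds the IDs of rows 1 and 3, found without reading row 2 at all.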
24
25
25
-
## Secondary index
26
-
For fields that represent a single exact value. For example, `k8s_namespace_name`. In this case, the entire field value is treated as one token and indexed.
26
+
### Secondary index
27
+
Used for fields that hold a single exact value, such as `kubernetes_namespace_name`. The entire field value is treated as one token and indexed.
For `k8s_namespace_name`, the index might look like:
36
+
For `kubernetes_namespace_name`, the index might look like:
36
37
37
38
- `ingress-nginx` > [Row 1, Row 3]
38
39
- `ziox` > [Row 2]
39
40
- `cert-manager` > [Row 4]
40
41
41
-
A query for `k8s_namespace_name = 'ingress-nginx'` retrieves those rows directly, without scanning unrelated records. By keeping these indexes, Tantivy avoids full scans across millions or billions of records. This results in queries that return in milliseconds rather than seconds.
42
+
A query for `kubernetes_namespace_name = 'ingress-nginx'` retrieves those rows directly, without scanning unrelated records. By keeping these indexes, Tantivy avoids full scans across millions or billions of records. This results in queries that return in milliseconds rather than seconds.
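As a rough illustration of the mapping above, the following Python sketch indexes the whole field value as a single key. The `build_secondary_index` helper and the four sample rows are hypothetical and only mirror the example index shown earlier.

```python
from collections import defaultdict


def build_secondary_index(rows, field):
    """Map each exact field value to the row IDs (1-based) that hold it."""
    index = defaultdict(list)
    for row_id, row in enumerate(rows, start=1):
        # The whole value is one token: no splitting, exact match only.
        index[row[field]].append(row_id)
    return dict(index)


# Hypothetical rows that reproduce the example index.
rows = [
    {"kubernetes_namespace_name": "ingress-nginx"},
    {"kubernetes_namespace_name": "ziox"},
    {"kubernetes_namespace_name": "ingress-nginx"},
    {"kubernetes_namespace_name": "cert-manager"},
]

index = build_secondary_index(rows, "kubernetes_namespace_name")

# Equivalent of: WHERE kubernetes_namespace_name = 'ingress-nginx'
matches = index.get("ingress-nginx", [])
```

Because the key is the complete value, a lookup for `ingress` alone would find nothing, which is why `=` and `IN (...)` are the natural operators for this index type.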
42
43
43
44
## Configure Environment Variable
44
45
To enable Tantivy indexing, configure the following environment variable:
45
-
```
46
+
```bash
46
47
ZO_ENABLE_INVERTED_INDEX=true
47
48
```
48
49
49
50
## Query behavior
50
-
Tantivy optimizes queries differently based on whether the field is full-text or secondary. Using the right operator for each field type ensures the query is served from the index instead of scanning logs.
51
-
52
-
### Full-text index scenarios
51
+
Tantivy optimizes queries differently based on whether the field is full-text or secondary, and whether the query operates on a single stream or multiple streams. Using the right operator for each field type ensures the query is served from the index instead of scanning logs.
53
52
54
-
**Correct usage** <br>
53
+
## Single-stream queries
54
+
A single-stream query retrieves data from one stream without using JOIN operations or subqueries that involve multiple streams.
55
55
56
-
- Use `match_all()` for full-text index fields such as `body` or `message`:
57
-
```sql
58
-
-- Return logs whose body contains the token "error"
59
-
WHERE match_all('error');
60
-
```
61
-
- Use `NOT` with `match_all()`:
62
-
```sql
63
-
-- Exclude logs whose body contains the token "error"
64
-
WHERE NOT match_all('error');
65
-
```
56
+
### Full-text index scenarios
66
57
67
-
**Inefficient usage** <br>
68
-
```sql
69
-
-- Forces full string equality, bypasses token index
70
-
WHERE body ='error';
71
-
```
58
+
!!! info "Correct usage:"
59
+
- Use `match_all()` for full-text index fields such as `body` or `message`:
60
+
```sql linenums="1"
61
+
-- Return logs whose body contains the token "error"
62
+
WHERE match_all('error');
63
+
```
64
+
- Use `NOT` with `match_all()`:
65
+
```sql linenums="1"
66
+
-- Exclude logs whose body contains the token "error"
67
+
WHERE NOT match_all('error');
68
+
```
69
+
70
+
!!! warning "Inefficient usage:"
71
+
```sql linenums="1"
72
+
-- Forces full string equality, bypasses token index
73
+
WHERE body = 'error';
74
+
```
72
75
73
76
### Secondary index scenarios
74
77
75
-
**Correct usage**
76
-
77
-
- Use `=` or `IN (...)` for secondary index fields such as `k8s_namespace_name`, `k8s_pod_name`, or `k8s_container_name`.
78
-
```sql
79
-
-- Single value
80
-
WHERE k8s_namespace_name ='ingress-nginx';
81
-
82
-
-- Multiple values
83
-
WHERE k8s_namespace_name IN ('ingress-nginx', 'ziox', 'cert-manager');
84
-
```
85
-
- Use NOT with `=` or `IN (...)`
86
-
```sql
87
-
-- Exclude one exact value
88
-
WHERE NOT (k8s_namespace_name ='ingress-nginx');
89
-
90
-
-- Exclude multiple values
91
-
WHERE k8s_namespace_name NOT IN ('ziox', 'cert-manager');
92
-
```
93
-
94
-
**Inefficient usage**
95
-
```sql
96
-
-- Treated as a token search, no advantage over '='
97
-
WHERE match_all('ingress-nginx');
98
-
```
78
+
!!! info "Correct usage:"
79
+
- Use `=` or `IN (...)` for secondary index fields such as `kubernetes_namespace_name`, `kubernetes_pod_name`, or `kubernetes_container_name`.
80
+
```sql linenums="1"
81
+
-- Single value
82
+
WHERE kubernetes_namespace_name = 'ingress-nginx';
83
+
84
+
-- Multiple values
85
+
WHERE kubernetes_namespace_name IN ('ingress-nginx', 'ziox', 'cert-manager');
86
+
```
87
+
- Use `NOT` with `=` or `IN (...)`:
88
+
```sql linenums="1"
89
+
-- Exclude one exact value
90
+
WHERE NOT (kubernetes_namespace_name = 'ingress-nginx');
91
+
92
+
-- Exclude multiple values
93
+
WHERE kubernetes_namespace_name NOT IN ('ziox', 'cert-manager');
94
+
```
95
+
96
+
!!! warning "Inefficient usage:"
97
+
```sql linenums="1"
98
+
-- Treated as a token search, no advantage over '='
99
+
WHERE match_all('ingress-nginx');
100
+
```
99
101
100
102
### Mixed scenarios
101
103
102
104
When a query combines full-text and secondary fields, apply the best operator for each part.
WHERE match_all('error') AND k8s_namespace_name = 'ingress-nginx';
135
+
WHERE match_all('error') AND kubernetes_namespace_name = 'ingress-nginx';
134
136
135
137
-- Mixed: one side indexable, one not
136
138
WHERE match_all('error') AND body LIKE '%error%';
@@ -142,20 +144,145 @@ WHERE match_all('error') AND body LIKE '%error%';
142
144
- If any branch is not indexable, the entire OR is not indexable. The query runs in DataFusion.
143
145
144
146
**Examples**
145
-
```sql
147
+
```sql linenums="1"
146
148
-- Fast: both indexable
147
-
WHERE match_all('error') OR k8s_namespace_name = 'ziox';
149
+
WHERE match_all('error') OR kubernetes_namespace_name = 'ziox';
148
150
149
151
-- Slower: the LIKE branch is not indexable, so the whole OR runs in DataFusion
150
152
WHERE match_all('error') OR body LIKE '%error%';
151
153
```
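The OR rule can be modelled as a small Python sketch. It assumes each branch either yields a set of row IDs from an index or yields `None` when it cannot be served from an index; the `or_row_ids` helper and the sample row sets are hypothetical.

```python
def or_row_ids(branches):
    """Union the row IDs of all OR branches.

    Each branch is a set of row IDs from the index, or None when the
    branch is not indexable. One None makes the whole OR non-indexable.
    """
    if any(rows is None for rows in branches):
        return None  # caller must fall back to a scan
    result = set()
    for rows in branches:
        result |= rows
    return sorted(result)


error_rows = {1, 3}  # hypothetical result of match_all('error')
ziox_rows = {2}      # hypothetical result of namespace = 'ziox'
like_rows = None     # LIKE '%error%' has no index to consult

fast = or_row_ids([error_rows, ziox_rows])
slow = or_row_ids([error_rows, like_rows])
```

`fast` is the union of both posting lists, while `slow` is `None`: a row that matches only the `LIKE` branch would be missed by the index, so the index result alone cannot answer the query.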
152
154
153
155
**NOT with grouped conditions** <br>
154
-
```sql
156
+
```sql linenums="1"
155
157
-- Exclude when either namespace = ziox OR body contains error
156
-
WHERE NOT (k8s_namespace_name = 'ziox' OR match_all('error'));
158
+
WHERE NOT (kubernetes_namespace_name = 'ziox' OR match_all('error'));
159
+
```
160
+
161
+
## Multi-stream queries
162
+
A multi-stream query combines data from two or more streams using JOIN operations or subqueries that convert to JOINs internally. OpenObserve applies Tantivy indexing to both sides of a JOIN to accelerate data retrieval.
163
+
164
+
### What are multi-stream queries?
165
+
When a subquery converts to a JOIN, OpenObserve combines data from two sources. In a JOIN operation:
166
+
167
+
- The left table is the first table in the JOIN operation. It is the base table that the query starts with.
168
+
- The right table is the second table in the JOIN operation. It provides additional data that is matched against the left table based on a join condition.
169
+
170
+
171
+
The query engine reads rows from the left table and, for each row, looks up matching rows in the right table using the join condition.
172
+
173
+
Example:
174
+
```sql linenums="1"
175
+
SELECT t1.id FROM t1 JOIN t2 ON t1.id = t2.id
176
+
```
177
+
In this query:
178
+
179
+
- `t1` is the left table. It is the base table.
180
+
- `t2` is the right table. It is the table being matched.
181
+
- The join condition `t1.id = t2.id` determines which rows from both tables are combined.
182
+
183
+
When a query includes a subquery in a WHERE clause with an IN operator, OpenObserve converts it to a JOIN operation. For example:
184
+
```sql linenums="1"
185
+
SELECT kubernetes_namespace_name
186
+
FROM default
187
+
WHERE kubernetes_namespace_name IN (
188
+
SELECT DISTINCT kubernetes_namespace_name
189
+
FROM default
190
+
WHERE kubernetes_container_name = 'ziox'
191
+
);
192
+
```
193
+
This query internally converts to a JOIN where:
194
+
195
+
- The left table is the outer query, selecting `kubernetes_namespace_name` from `default`.
196
+
- The right table is the subquery, selecting distinct `kubernetes_namespace_name` values where `kubernetes_container_name` is `ziox`.
197
+
198
+
Tantivy can use indexes on both the left table and the right table to accelerate the query.
199
+
200
+
### How indexing works in multi-stream queries
201
+
When OpenObserve executes a multi-stream query:
202
+
203
+
1. The query optimizer identifies indexable conditions on both the left table and the right table of the JOIN.
204
+
2. Tantivy retrieves row identifiers from the index for each table independently.
205
+
3. The query engine combines the results based on the JOIN condition.
206
+
4. If both tables use indexes, the query avoids scanning unrelated records entirely.
207
+
208
+
For example,
209
+
```sql linenums="1"
210
+
SELECT DISTINCT kubernetes_namespace_name
211
+
FROM default
212
+
WHERE kubernetes_pod_name = 'ziox-ingester-0'
213
+
AND kubernetes_namespace_name IN (
214
+
SELECT DISTINCT kubernetes_namespace_name
215
+
FROM default
216
+
WHERE kubernetes_container_name = 'ziox'
217
+
)
218
+
ORDER BY kubernetes_namespace_name DESC
219
+
LIMIT 10;
157
220
```
158
221
222
+
In this query, the subquery uses the secondary index on `kubernetes_container_name` to find matching namespaces, while the outer query uses the secondary index on `kubernetes_pod_name`. Both sides benefit from Tantivy indexing, eliminating the need for full table scans.
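The four steps listed earlier can be sketched in Python for this query. This is a conceptual model only: the `build_index` helper and the sample rows are hypothetical, and the real engine operates on index files and partitions rather than in-memory dictionaries.

```python
from collections import defaultdict


def build_index(rows, field):
    """Secondary index: exact field value -> set of row IDs."""
    index = defaultdict(set)
    for row_id, row in rows.items():
        index[row[field]].add(row_id)
    return index


# Hypothetical contents of the `default` stream.
rows = {
    1: {"kubernetes_namespace_name": "ziox",
        "kubernetes_pod_name": "ziox-ingester-0",
        "kubernetes_container_name": "ziox"},
    2: {"kubernetes_namespace_name": "ingress-nginx",
        "kubernetes_pod_name": "nginx-0",
        "kubernetes_container_name": "nginx"},
    3: {"kubernetes_namespace_name": "staging",
        "kubernetes_pod_name": "ziox-ingester-0",
        "kubernetes_container_name": "sidecar"},
    4: {"kubernetes_namespace_name": "ziox",
        "kubernetes_pod_name": "ziox-querier-0",
        "kubernetes_container_name": "ziox"},
}

pod_index = build_index(rows, "kubernetes_pod_name")
container_index = build_index(rows, "kubernetes_container_name")

# Steps 1-2: each side resolves its own condition from its own index.
left_ids = pod_index.get("ziox-ingester-0", set())  # outer query
right_ids = container_index.get("ziox", set())      # subquery

# Step 3: combine on the join key (the namespace name).
right_namespaces = {rows[i]["kubernetes_namespace_name"] for i in right_ids}
left_namespaces = {rows[i]["kubernetes_namespace_name"] for i in left_ids}

# DISTINCT, ORDER BY ... DESC, LIMIT 10
result = sorted(left_namespaces & right_namespaces, reverse=True)[:10]
```

Row 2 is never touched by either lookup, which corresponds to step 4: when both sides are served from an index, unrelated records are not scanned.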
223
+
224
+
### Index optimizer for subqueries
225
+
OpenObserve includes an index optimizer that identifies opportunities to use indexes within subqueries and JOIN operations. This optimizer works automatically when queries include subqueries with conditions on indexed fields.
226
+
227
+
```sql linenums="1"
228
+
select kubernetes_namespace_name,
229
+
array_agg(distinct kubernetes_container_name) as container_name
230
+
from default
231
+
where log like '%zinc%'
232
+
and kubernetes_namespace_name in (
233
+
select distinct kubernetes_namespace_name
234
+
from default
235
+
order by kubernetes_namespace_name limit 10000)
236
+
group by kubernetes_namespace_name
237
+
order by kubernetes_namespace_name
238
+
limit 10
239
+
```
240
+
The subquery portion uses the secondary index on `kubernetes_namespace_name` to accelerate retrieval of the first 10,000 distinct namespaces. The outer query filters by `log LIKE '%zinc%'`, which cannot use an index, but the subquery still benefits from indexing.
241
+
242
+
### match_all in multi-stream queries
243
+
The `match_all()` function is supported in multi-stream queries with specific limitations. OpenObserve checks whether the full-text index field exists in the stream before applying `match_all()`.
244
+
245
+
!!! info "Supported scenarios:"
246
+
Use `match_all()` in subqueries that filter a single stream:
247
+
```sql linenums="1"
248
+
SELECT *
249
+
FROM (
250
+
SELECT *
251
+
FROM default
252
+
WHERE match_all('error')
253
+
) AS filtered_logs;
254
+
```
255
+
Use `match_all()` in both the outer query and a subquery with an IN condition:
256
+
```sql linenums="1"
257
+
SELECT *
258
+
FROM default
259
+
WHERE id IN (
260
+
SELECT id
261
+
FROM default
262
+
WHERE match_all('error')
263
+
)
264
+
AND match_all('critical');
265
+
```
266
+
In this example, both the subquery and outer query apply full-text search using `match_all()`, and both leverage the full-text index to retrieve matching row identifiers.
267
+
268
+
!!! warning "Unsupported scenarios:"
269
+
Do not use `match_all()` outside a subquery when the subquery contains aggregation or grouping:
270
+
271
+
```sql linenums="1"
272
+
SELECT *
273
+
FROM (
274
+
SELECT kubernetes_namespace_name, COUNT(*)
275
+
FROM default
276
+
GROUP BY kubernetes_namespace_name
277
+
ORDER BY COUNT(*)
278
+
) AS aggregated
279
+
WHERE match_all('error');
280
+
```
281
+
In this case, `match_all('error')` cannot determine which stream to search because the subquery has already aggregated the data.
282
+
283
+
### Partitioned search with inverted index
284
+
OpenObserve searches individual partitions using the inverted index when executing multi-stream queries. This behavior ensures that queries distribute efficiently across partitions and leverage indexing at the partition level.
285
+
159
286
## Verify if a query is using Tantivy
160
287
To confirm whether a query used the Tantivy inverted index:
161
288
@@ -164,4 +291,5 @@ To confirm whether a query used the Tantivy inverted index:
164
291
3. Under `took_detail`, check the value of `idx_took`:
165
292
166
293
- If `idx_took` is greater than `0`, the query used the inverted index.
167
-
- If `idx_took` is `0`, the query did not use the inverted index.
294
+
- If `idx_took` is `0`, the query did not use the inverted index.
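If the search response is available as parsed JSON, the same check can be scripted. The sketch below assumes only what is described above, namely a `took_detail` object containing an `idx_took` value; the `used_inverted_index` helper and the sample responses are hypothetical.

```python
def used_inverted_index(response):
    """Return True when took_detail.idx_took is greater than 0."""
    took_detail = response.get("took_detail") or {}
    return took_detail.get("idx_took", 0) > 0


# Hypothetical parsed responses; real responses carry more fields.
indexed_response = {"took_detail": {"idx_took": 12}}
scanned_response = {"took_detail": {"idx_took": 0}}

indexed = used_inverted_index(indexed_response)
scanned = used_inverted_index(scanned_response)
```

A response without `took_detail` is treated as not having used the index, which is a conservative choice for this sketch rather than documented behavior.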