Commit c6f6875

docs: add multi-stream query support to Tantivy indexing documentation (#233)
1 parent 4d66496 commit c6f6875

2 files changed

+206
-74
lines changed

docs/operator-guide/config-management.md

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,7 @@
1+
---
2+
title: Custom Configuration File and Dynamic Reloading in OpenObserve
3+
description: Learn how to use custom config paths and dynamic config reloading in OpenObserve to apply changes without restarts.
4+
---
15
This guide explains how to use custom configuration file locations and
26
dynamic configuration reloading in OpenObserve to manage deployments
37
without system restarts.

docs/user-guide/performance/tantivy-index.md

Lines changed: 202 additions & 74 deletions
@@ -2,16 +2,17 @@
22
title: Tantivy Indexing in OpenObserve
33
description: Learn how Tantivy indexing works in OpenObserve, including full-text and secondary indexes, query behaviors with AND and OR operators, and how to verify index usage.
44
---
5-
This document explains Tantivy indexing in OpenObserve, the types of indexes it builds, how to use the correct query patterns, and how to verify and configure indexing.
5+
This document explains Tantivy indexing in OpenObserve, the types of indexes it builds, how to use the correct query patterns for both single-stream and multi-stream queries, and how to verify and configure indexing.
66

77
> Tantivy indexing is an open-source feature in OpenObserve.
88
99
## What is Tantivy?
1010
Tantivy is the inverted index library used in OpenObserve to accelerate searches. An inverted index keeps a map of values or tokens and the row IDs of the records that contain them. When a user searches for a value, the query can use this index to go directly to the matching rows instead of scanning every log record.
1111

12+
## Index types
1213
Tantivy builds two kinds of indexes in OpenObserve:
1314

14-
## Full-text index
15+
### Full-text index
1516
For fields such as `body` or `message` that contain sentences or long text. The field is split into tokens, and each token is mapped to the records that contain it.
1617

1718
**Example log records** <br>
@@ -22,102 +23,103 @@ For fields such as `body` or `message` that contain sentences or long text. The
2223

2324
The log body `POST /api/metrics error` is stored as tokens `POST`, `api`, `metrics`, `error`. A search for `error` looks up that token in the index and immediately finds the matching records.
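The token-to-record mapping described above can be illustrated with a minimal Python sketch. This is not how Tantivy is implemented. The tokenizer (splitting on non-alphanumeric characters and keeping case) and the sample records other than `POST /api/metrics error` are assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into tokens on any run of non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", text) if tok]


def build_full_text_index(records):
    """Map each token to the sorted row IDs whose body contains it."""
    index = defaultdict(set)
    for row_id, body in records.items():
        for token in tokenize(body):
            index[token].add(row_id)
    return {token: sorted(row_ids) for token, row_ids in index.items()}


# Hypothetical records; only the first body is taken from the text above.
records = {
    1: "POST /api/metrics error",
    2: "GET /api/health ok",
    3: "POST /api/login error",
}

index = build_full_text_index(records)
print(tokenize(records[1]))  # ['POST', 'api', 'metrics', 'error']
print(index["error"])        # [1, 3]
```

A lookup for `error` touches only the posting list for that token, which is the reason a token search avoids reading every record.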
2425

25-
## Secondary index
26-
For fields that represent a single exact value. For example, `k8s_namespace_name`. In this case, the entire field value is treated as one token and indexed.
26+
### Secondary index
27+
For fields that represent a single exact value. For example, `kubernetes_namespace_name`. In this case, the entire field value is treated as one token and indexed.
2728

2829
**Example log records** <br>
2930

30-
- Row 1: `k8s_namespace_name = ingress-nginx`
31-
- Row 2: `k8s_namespace_name = ziox`
32-
- Row 3: `k8s_namespace_name = ingress-nginx`
33-
- Row 4: `k8s_namespace_name = cert-manager`
31+
- Row 1: `kubernetes_namespace_name = ingress-nginx`
32+
- Row 2: `kubernetes_namespace_name = ziox`
33+
- Row 3: `kubernetes_namespace_name = ingress-nginx`
34+
- Row 4: `kubernetes_namespace_name = cert-manager`
3435

35-
For `k8s_namespace_name`, the index might look like:
36+
For `kubernetes_namespace_name`, the index might look like:
3637

3738
- `ingress-nginx` > [Row 1, Row 3]
3839
- `ziox` > [Row 2]
3940
- `cert-manager` > [Row 4]
4041

41-
A query for `k8s_namespace_name = 'ingress-nginx'` retrieves those rows directly, without scanning unrelated records. By keeping these indexes, Tantivy avoids full scans across millions or billions of records. This results in queries that return in milliseconds rather than seconds.
42+
A query for `kubernetes_namespace_name = 'ingress-nginx'` retrieves those rows directly, without scanning unrelated records. By keeping these indexes, Tantivy avoids full scans across millions or billions of records. This results in queries that return in milliseconds rather than seconds.
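As a rough illustration of the lookup described above, the following Python sketch builds the same value-to-rows map from the four example rows. The function names `build_secondary_index`, `lookup_equals`, and `lookup_in` are hypothetical and are not part of OpenObserve or Tantivy.

```python
from collections import defaultdict

# The four example rows from this section: row ID -> kubernetes_namespace_name.
rows = {
    1: "ingress-nginx",
    2: "ziox",
    3: "ingress-nginx",
    4: "cert-manager",
}


def build_secondary_index(values_by_row):
    """Treat each whole field value as one token and map it to its row IDs."""
    index = defaultdict(list)
    for row_id in sorted(values_by_row):
        index[values_by_row[row_id]].append(row_id)
    return dict(index)


def lookup_equals(index, value):
    """Rows matching `field = value`."""
    return index.get(value, [])


def lookup_in(index, values):
    """Rows matching `field IN (values)`."""
    matched = set()
    for value in values:
        matched.update(index.get(value, []))
    return sorted(matched)


namespace_index = build_secondary_index(rows)
print(lookup_equals(namespace_index, "ingress-nginx"))       # [1, 3]
print(lookup_in(namespace_index, ["ziox", "cert-manager"]))  # [2, 4]
```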
4243

4344
## Configure Environment Variable
4445
To enable Tantivy indexing, configure the following environment variable:
45-
```
46+
```bash
4647
ZO_ENABLE_INVERTED_INDEX=true
4748
```
4849

4950
## Query behavior
50-
Tantivy optimizes queries differently based on whether the field is full-text or secondary. Using the right operator for each field type ensures the query is served from the index instead of scanning logs.
51-
52-
### Full-text index scenarios
51+
Tantivy optimizes queries differently based on whether the field is full-text or secondary, and whether the query operates on a single stream or multiple streams. Using the right operator for each field type ensures the query is served from the index instead of scanning logs.
5352

54-
**Correct usage** <br>
53+
## Single-stream queries
54+
A single-stream query retrieves data from one stream without using JOIN operations or subqueries that involve multiple streams.
5555

56-
- Use `match_all()` for full-text index fields such as `body` or `message`:
57-
```sql
58-
-- Return logs whose body contains the token "error"
59-
WHERE match_all('error');
60-
```
61-
- Use `NOT` with `match_all()`:
62-
```sql
63-
-- Exclude logs whose body contains the token "error"
64-
WHERE NOT match_all('error');
65-
```
56+
### Full-text index scenarios
6657

67-
**Inefficient usage** <br>
68-
```sql
69-
-- Forces full string equality, bypasses token index
70-
WHERE body = 'error';
71-
```
58+
!!! info "Correct usage:"
59+
- Use `match_all()` for full-text index fields such as `body` or `message`:
60+
```sql linenums="1"
61+
-- Return logs whose body contains the token "error"
62+
WHERE match_all('error');
63+
```
64+
- Use `NOT` with `match_all()`:
65+
```sql linenums="1"
66+
-- Exclude logs whose body contains the token "error"
67+
WHERE NOT match_all('error');
68+
```
69+
70+
!!! warning "Inefficient usage:"
71+
```sql linenums="1"
72+
-- Forces full string equality, bypasses token index
73+
WHERE body = 'error';
74+
```
7275

7376
### Secondary index scenarios
7477

75-
**Correct usage**
76-
77-
- Use `=` or `IN (...)` for secondary index fields such as `k8s_namespace_name`, `k8s_pod_name`, or `k8s_container_name`.
78-
```sql
79-
-- Single value
80-
WHERE k8s_namespace_name = 'ingress-nginx';
81-
82-
-- Multiple values
83-
WHERE k8s_namespace_name IN ('ingress-nginx', 'ziox', 'cert-manager');
84-
```
85-
- Use NOT with `=` or `IN (...)`
86-
```sql
87-
-- Exclude one exact value
88-
WHERE NOT (k8s_namespace_name = 'ingress-nginx');
89-
90-
-- Exclude multiple values
91-
WHERE k8s_namespace_name NOT IN ('ziox', 'cert-manager');
92-
```
93-
94-
**Inefficient usage**
95-
```sql
96-
-- Treated as a token search, no advantage over '='
97-
WHERE match_all('ingress-nginx');
98-
```
78+
!!! info "Correct usage:"
79+
- Use `=` or `IN (...)` for secondary index fields such as `kubernetes_namespace_name`, `kubernetes_pod_name`, or `kubernetes_container_name`.
80+
```sql linenums="1"
81+
-- Single value
82+
WHERE kubernetes_namespace_name = 'ingress-nginx';
83+
84+
-- Multiple values
85+
WHERE kubernetes_namespace_name IN ('ingress-nginx', 'ziox', 'cert-manager');
86+
```
87+
- Use `NOT` with `=` or `IN (...)`:
88+
```sql linenums="1"
89+
-- Exclude one exact value
90+
WHERE NOT (kubernetes_namespace_name = 'ingress-nginx');
91+
92+
-- Exclude multiple values
93+
WHERE kubernetes_namespace_name NOT IN ('ziox', 'cert-manager');
94+
```
95+
96+
!!! warning "Inefficient usage:"
97+
```sql linenums="1"
98+
-- Treated as a token search, no advantage over '='
99+
WHERE match_all('ingress-nginx');
100+
```
99101

100102
### Mixed scenarios
101103

102104
When a query combines full-text and secondary fields, apply the best operator for each part.
103105

104-
**Correct usage**
106+
!!! info "Correct usage:"
105107

106-
```sql
107-
WHERE match_all('error')
108-
AND k8s_namespace_name = 'ingress-nginx';
109-
```
108+
```sql linenums="1"
109+
WHERE match_all('error')
110+
AND kubernetes_namespace_name = 'ingress-nginx';
111+
```
110112

111-
- `match_all('error')` uses full-text index.
112-
- `k8s_namespace_name = 'ingress-nginx'` uses secondary index.
113+
- `match_all('error')` uses full-text index.
114+
- `kubernetes_namespace_name = 'ingress-nginx'` uses secondary index.
113115

114-
**Incorrect usage**
116+
!!! warning "Incorrect usage:"
115117

116-
```sql
117-
-- Both operators used incorrectly
118-
WHERE body = 'error'
119-
AND match_all('ingress-nginx');
120-
```
118+
```sql linenums="1"
119+
-- Both operators used incorrectly
120+
WHERE body = 'error'
121+
AND match_all('ingress-nginx');
122+
```
121123

122124
### AND and OR operator behavior
123125

@@ -128,9 +130,9 @@ WHERE body = 'error'
128130

129131

130132
**Examples**
131-
```sql
133+
```sql linenums="1"
132134
-- Fast: both sides indexable
133-
WHERE match_all('error') AND k8s_namespace_name = 'ingress-nginx';
135+
WHERE match_all('error') AND kubernetes_namespace_name = 'ingress-nginx';
134136

135137
-- Mixed: one side indexable, one not
136138
WHERE match_all('error') AND body LIKE '%error%';
@@ -142,20 +144,145 @@ WHERE match_all('error') AND body LIKE '%error%';
142144
- If any branch is not indexable, the entire OR is not indexable. The query runs in DataFusion.
143145

144146
**Examples**
145-
```sql
147+
```sql linenums="1"
146148
-- Fast: both indexable
147-
WHERE match_all('error') OR k8s_namespace_name = 'ziox';
149+
WHERE match_all('error') OR kubernetes_namespace_name = 'ziox';
148150

149151
-- Slower: the LIKE branch is not indexable, so the entire OR is not indexable
150152
WHERE match_all('error') OR body LIKE '%error%';
151153
```
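The OR rule above can be modelled with a small Python sketch. The tuple representation of predicates and the names `is_indexable` and `or_is_indexable` are hypothetical. The sketch also assumes `kubernetes_namespace_name` is configured as a secondary index field, and it is not OpenObserve's actual planner logic.

```python
# Assumed to be configured as a secondary index field in this sketch.
SECONDARY_INDEX_FIELDS = {"kubernetes_namespace_name"}


def is_indexable(predicate):
    """Return True if a single predicate could be served from an index."""
    kind = predicate[0]
    if kind == "match_all":
        return True  # full-text index lookup
    if kind in ("eq", "in"):
        return predicate[1] in SECONDARY_INDEX_FIELDS  # secondary index lookup
    return False  # for example LIKE, which needs a scan


def or_is_indexable(branches):
    """An OR is indexable only if every one of its branches is indexable."""
    return all(is_indexable(branch) for branch in branches)


fast = [("match_all", "error"), ("eq", "kubernetes_namespace_name", "ziox")]
slow = [("match_all", "error"), ("like", "body", "%error%")]

print(or_is_indexable(fast))  # True
print(or_is_indexable(slow))  # False
```

A single non-indexable branch is enough to disqualify the OR, because rows matching only that branch cannot be found through the index.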
152154

153155
**NOT with grouped conditions** <br>
154-
```sql
156+
```sql linenums="1"
155157
-- Exclude when either namespace = ziox OR body contains error
156-
WHERE NOT (k8s_namespace_name = 'ziox' OR match_all('error'));
158+
WHERE NOT (kubernetes_namespace_name = 'ziox' OR match_all('error'));
159+
```
160+
161+
## Multi-stream queries
162+
A multi-stream query combines data from two or more streams using JOIN operations or subqueries that convert to JOINs internally. OpenObserve applies Tantivy indexing to both sides of a JOIN to accelerate data retrieval.
163+
164+
### What are multi-stream queries?
165+
When a subquery converts to a JOIN, OpenObserve combines data from two sources. In a JOIN operation:
166+
167+
- The left table is the first table in the JOIN operation. It is the base table that the query starts with.
168+
- The right table is the second table in the JOIN operation. It provides additional data that is matched against the left table based on a join condition.
169+
170+
171+
The query engine reads rows from the left table, then for each row, it looks up matching rows in the right table using the join condition.
172+
173+
Example:
174+
```sql linenums="1"
175+
SELECT t1.id FROM t1 JOIN t2 ON t1.id = t2.id
176+
```
177+
In this query:
178+
179+
- `t1` is the left table. It is the base table.
180+
- `t2` is the right table. It is the table being matched.
181+
- The join condition `t1.id = t2.id` determines which rows from both tables are combined.
182+
183+
When a query includes a subquery in a WHERE clause with an IN operator, OpenObserve converts it to a JOIN operation. For example:
184+
```sql linenums="1"
185+
SELECT kubernetes_namespace_name
186+
FROM default
187+
WHERE kubernetes_namespace_name IN (
188+
SELECT DISTINCT kubernetes_namespace_name
189+
FROM default
190+
WHERE kubernetes_container_name = 'ziox'
191+
);
192+
```
193+
This query internally converts to a JOIN where:
194+
195+
- The left table is the outer query, selecting `kubernetes_namespace_name` from `default`.
196+
- The right table is the subquery, selecting distinct `kubernetes_namespace_name` values where `kubernetes_container_name` is `ziox`.
197+
198+
Tantivy can use indexes on both the left table and the right table to accelerate the query.
199+
200+
### How indexing works in multi-stream queries
201+
When OpenObserve executes a multi-stream query:
202+
203+
1. The query optimizer identifies indexable conditions on both the left table and the right table of the JOIN.
204+
2. Tantivy retrieves row identifiers from the index for each table independently.
205+
3. The query engine combines the results based on the JOIN condition.
206+
4. If both tables use indexes, the query avoids scanning unrelated records entirely.
207+
208+
For example:
209+
```sql linenums="1"
210+
SELECT DISTINCT kubernetes_namespace_name
211+
FROM default
212+
WHERE kubernetes_pod_name = 'ziox-ingester-0'
213+
AND kubernetes_namespace_name IN (
214+
SELECT DISTINCT kubernetes_namespace_name
215+
FROM default
216+
WHERE kubernetes_container_name = 'ziox'
217+
)
218+
ORDER BY kubernetes_namespace_name DESC
219+
LIMIT 10;
157220
```
158221

222+
In this query, the subquery uses the secondary index on `kubernetes_container_name` to find matching namespaces, while the outer query uses the secondary index on `kubernetes_pod_name`. Both sides benefit from Tantivy indexing, eliminating the need for full table scans.
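The flow for this query can be sketched in Python as follows. The sample rows and the `build_index` helper are hypothetical, and the sketch only models the idea of resolving each side through its own index before combining the results. It does not reflect OpenObserve internals.

```python
from collections import defaultdict

NS = "kubernetes_namespace_name"
POD = "kubernetes_pod_name"
CONTAINER = "kubernetes_container_name"

# Hypothetical rows in the `default` stream.
rows = {
    1: {NS: "ziox", POD: "ziox-ingester-0", CONTAINER: "ziox"},
    2: {NS: "ziox", POD: "ziox-querier-0", CONTAINER: "ziox"},
    3: {NS: "ingress-nginx", POD: "ziox-ingester-0", CONTAINER: "controller"},
    4: {NS: "cert-manager", POD: "cert-manager-0", CONTAINER: "cert-manager"},
}


def build_index(field):
    """Secondary index for one field: value -> row IDs."""
    index = defaultdict(list)
    for row_id, row in sorted(rows.items()):
        index[row[field]].append(row_id)
    return index


pod_index = build_index(POD)
container_index = build_index(CONTAINER)

# Right side (subquery): resolve row IDs from the container index,
# then collect the distinct namespaces of those rows.
right_ids = container_index.get("ziox", [])
right_namespaces = {rows[row_id][NS] for row_id in right_ids}

# Left side (outer query): resolve row IDs from the pod index.
left_ids = pod_index.get("ziox-ingester-0", [])

# Combine both sides on the IN condition, then apply DISTINCT, ORDER BY DESC, LIMIT 10.
matching = {rows[row_id][NS] for row_id in left_ids if rows[row_id][NS] in right_namespaces}
result = sorted(matching, reverse=True)[:10]
print(result)  # ['ziox']
```

Neither side iterates over all rows to evaluate its filter. Each side starts from the row IDs returned by its own index lookup.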
223+
224+
### Index optimizer for subqueries
225+
OpenObserve includes an index optimizer that identifies opportunities to use indexes within subqueries and JOIN operations. This optimizer works automatically when queries include subqueries with conditions on indexed fields.
226+
227+
```sql linenums="1"
228+
select kubernetes_namespace_name,
229+
array_agg(distinct kubernetes_container_name) as container_name
230+
from default
231+
where log like '%zinc%'
232+
and kubernetes_namespace_name in (
233+
select distinct kubernetes_namespace_name
234+
from default
235+
order by kubernetes_namespace_name limit 10000)
236+
group by kubernetes_namespace_name
237+
order by kubernetes_namespace_name
238+
limit 10
239+
```
240+
The subquery portion uses the secondary index on `kubernetes_namespace_name` to accelerate retrieval of the first 10,000 distinct namespaces. The outer query filters with `log LIKE '%zinc%'`, which cannot use an index, but the subquery still benefits from indexing.
241+
242+
### match_all in multi-stream queries
243+
The `match_all()` function is supported in multi-stream queries with specific limitations. OpenObserve checks whether the full-text index field exists in the stream before applying `match_all()`.
244+
245+
!!! info "Supported scenarios:"
246+
Use `match_all()` in subqueries that filter a single stream:
247+
```sql linenums="1"
248+
SELECT *
249+
FROM (
250+
SELECT *
251+
FROM default
252+
WHERE match_all('error')
253+
) AS filtered_logs;
254+
```
255+
Use `match_all()` in both the outer query and a subquery with an IN condition:
256+
```sql linenums="1"
257+
SELECT *
258+
FROM default
259+
WHERE id IN (
260+
SELECT id
261+
FROM default
262+
WHERE match_all('error')
263+
)
264+
AND match_all('critical');
265+
```
266+
In this example, both the subquery and outer query apply full-text search using `match_all()`, and both leverage the full-text index to retrieve matching row identifiers.
267+
268+
!!! warning "Unsupported scenarios:"
269+
Do not use `match_all()` outside a subquery when the subquery contains aggregation or grouping:
270+
271+
```sql linenums="1"
272+
SELECT *
273+
FROM (
274+
SELECT kubernetes_namespace_name, COUNT(*)
275+
FROM default
276+
GROUP BY kubernetes_namespace_name
277+
ORDER BY COUNT(*)
278+
) AS aggregated
279+
WHERE match_all('error');
280+
```
281+
In this case, `match_all('error')` cannot determine which stream to search because the subquery has already aggregated the data.
282+
283+
### Partitioned search with inverted index
284+
OpenObserve searches individual partitions using the inverted index when executing multi-stream queries. This behavior ensures that queries distribute efficiently across partitions and leverage indexing at the partition level.
285+
159286
## Verify if a query is using Tantivy
160287
To confirm whether a query used the Tantivy inverted index:
161288

@@ -164,4 +291,5 @@ To confirm whether a query used the Tantivy inverted index:
164291
3. Under `took_detail`, check the value of `idx_took`:
165292

166293
- If `idx_took` is greater than `0`, the query used the inverted index.
167-
- If `idx_took` is `0`, the query did not use the inverted index.
294+
- If `idx_took` is `0`, the query did not use the inverted index.
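The check above can also be scripted once the search response has been parsed into a dictionary. The following Python sketch assumes that `took_detail` is a top-level object in the response and that `idx_took` is numeric. The sample responses are made up for illustration, and the exact response shape may differ between versions.

```python
def used_inverted_index(response):
    """Return True if took_detail.idx_took is greater than 0."""
    took_detail = response.get("took_detail") or {}
    return took_detail.get("idx_took", 0) > 0


# Hypothetical parsed responses; only the field names come from the steps above.
sample_indexed = {"took": 42, "took_detail": {"idx_took": 7}}
sample_scanned = {"took": 310, "took_detail": {"idx_took": 0}}

print(used_inverted_index(sample_indexed))  # True
print(used_inverted_index(sample_scanned))  # False
```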
295+
