Commit 5b3763d

Copilot and waynexia authored
docs: Update greptime_identity pipeline for automatic flattening and max_nested_levels (#2197)
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: waynexia <15380403+waynexia@users.noreply.github.com>
1 parent 6ae80b4 commit 5b3763d

8 files changed, +174 -130 lines changed

docs/reference/pipeline/built-in-pipelines.md

Lines changed: 42 additions & 31 deletions
Original file line number | Diff line number | Diff line change
@@ -12,21 +12,22 @@ Additionally, the "greptime_" prefix of the pipeline name is reserved.
1212

1313
## `greptime_identity`
1414

15-
The `greptime_identity` pipeline is designed for writing JSON logs and automatically creates columns for each field in the JSON log.
15+
The `greptime_identity` pipeline is designed for writing JSON logs and automatically creates columns for each field in the JSON log. Nested JSON objects are automatically flattened into separate columns using dot notation.
1616

17-
- The first-level keys in the JSON log are used as column names.
18-
- An error is returned if the same field has different types.
19-
- Fields with `null` values are ignored.
20-
- If time index is not specified, an additional column, `greptime_timestamp`, is added to the table as the time index to indicate when the log was written.
17+
- Nested objects are automatically flattened (e.g., `{"a": {"b": 1}}` becomes column `a.b`)
18+
- Arrays are converted to JSON strings
19+
- An error is returned if the same field has different types
20+
- Fields with `null` values are ignored
21+
- If time index is not specified, an additional column, `greptime_timestamp`, is added to the table as the time index to indicate when the log was written
2122

2223
### Type conversion rules
2324

2425
- `string` -> `string`
2526
- `number` -> `int64` or `float64`
2627
- `boolean` -> `bool`
2728
- `null` -> ignore
28-
- `array` -> `json`
29-
- `object` -> `json`
29+
- `array` -> `string` (JSON-stringified)
30+
- `object` -> automatically flattened into separate columns (see [Flatten JSON objects](#flatten-json-objects))
3031

3132

3233
For example, if we have the following JSON data:
@@ -39,7 +40,7 @@ For example, if we have the following json data:
3940
]
4041
```
4142

42-
We'll merge the schema for each row of this batch to get the final schema. The table schema will be:
43+
We'll merge the schema for each row of this batch to get the final schema. Note that nested objects are automatically flattened into separate columns (e.g., `object.a`, `object.b`), and arrays are converted to JSON strings. The table schema will be:
4344

4445
```sql
4546
mysql> desc pipeline_logs;
@@ -49,26 +50,27 @@ mysql> desc pipeline_logs;
4950
| age | Int64 | | YES | | FIELD |
5051
| is_student | Boolean | | YES | | FIELD |
5152
| name | String | | YES | | FIELD |
52-
| object | Json | | YES | | FIELD |
53+
| object.a | Int64 | | YES | | FIELD |
54+
| object.b | Int64 | | YES | | FIELD |
5355
| score | Float64 | | YES | | FIELD |
5456
| company | String | | YES | | FIELD |
55-
| array | Json | | YES | | FIELD |
57+
| array | String | | YES | | FIELD |
5658
| greptime_timestamp | TimestampNanosecond | PRI | NO | | TIMESTAMP |
5759
+--------------------+---------------------+------+------+---------+---------------+
58-
8 rows in set (0.00 sec)
60+
9 rows in set (0.00 sec)
5961
```
6062

6163
The data will be stored in the table as follows:
6264

6365
```sql
6466
mysql> select * from pipeline_logs;
65-
+------+------------+---------+---------------+-------+---------+---------+----------------------------+
66-
| age | is_student | name | object | score | company | array | greptime_timestamp |
67-
+------+------------+---------+---------------+-------+---------+---------+----------------------------+
68-
| 22 | 1 | Charlie | NULL | 95.5 | NULL | [1,2,3] | 2024-10-18 09:35:48.333020 |
69-
| 21 | 0 | NULL | NULL | 85.5 | A | NULL | 2024-10-18 09:35:48.333020 |
70-
| 20 | 1 | Alice | {"a":1,"b":2} | 90.5 | NULL | NULL | 2024-10-18 09:35:48.333020 |
71-
+------+------------+---------+---------------+-------+---------+---------+----------------------------+
67+
+------+------------+---------+----------+----------+-------+---------+-----------+----------------------------+
68+
| age | is_student | name | object.a | object.b | score | company | array | greptime_timestamp |
69+
+------+------------+---------+----------+----------+-------+---------+-----------+----------------------------+
70+
| 22 | 1 | Charlie | NULL | NULL | 95.5 | NULL | [1,2,3] | 2024-10-18 09:35:48.333020 |
71+
| 21 | 0 | NULL | NULL | NULL | 85.5 | A | NULL | 2024-10-18 09:35:48.333020 |
72+
| 20 | 1 | Alice | 1 | 2 | 90.5 | NULL | NULL | 2024-10-18 09:35:48.333020 |
73+
+------+------------+---------+----------+----------+-------+---------+-----------+----------------------------+
7274
3 rows in set (0.01 sec)
7375
```
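
The schema merge described above can be sketched in Python. This is a minimal illustration, not GreptimeDB's implementation: `column_type` and `merge_schema` are hypothetical helper names, and the sample rows are reconstructed from the query output above rather than taken from the original request. The sketch assumes rows are already flattened, maps each JSON scalar to the column type shown in the `desc` output, skips `null` fields, and raises an error when a field appears with two different types.

```python
from typing import Any


def column_type(value: Any) -> str:
    """Map a JSON scalar to the column type names shown in the desc output."""
    # bool must be checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "Boolean"
    if isinstance(value, int):
        return "Int64"
    if isinstance(value, float):
        return "Float64"
    if isinstance(value, str):
        return "String"
    raise TypeError(f"unsupported value: {value!r}")


def merge_schema(rows: list[dict[str, Any]]) -> dict[str, str]:
    """Merge the per-row schemas of a batch into one column -> type mapping."""
    schema: dict[str, str] = {}
    for row in rows:
        for name, value in row.items():
            if value is None:
                continue  # fields with null values are ignored
            inferred = column_type(value)
            if schema.setdefault(name, inferred) != inferred:
                raise ValueError(
                    f"field {name!r} has conflicting types: "
                    f"{schema[name]} vs {inferred}"
                )
    return schema


# Hypothetical rows, already flattened, reconstructed from the table above.
rows = [
    {"age": 20, "is_student": True, "name": "Alice",
     "object.a": 1, "object.b": 2, "score": 90.5},
    {"age": 21, "is_student": False, "score": 85.5, "company": "A"},
    {"age": 22, "is_student": True, "name": "Charlie",
     "score": 95.5, "array": "[1,2,3]"},
]
schema = merge_schema(rows)
```

Columns missing from a given row simply stay `NULL` for that row, which is why every field column in the schema is nullable.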
7476

@@ -121,33 +123,38 @@ Here are some examples of using `custom_time_index` assuming the time variable is
121123

122124
### Flatten JSON objects
123125

124-
If flattening a JSON object into a single-level structure is needed, add the `x-greptime-pipeline-params` header to the request and set `flatten_json_object` to `true`.
126+
The `greptime_identity` pipeline **automatically flattens** nested JSON objects into a single-level structure. This behavior is always enabled and creates separate columns for each nested field using dot notation (e.g., `a.b.c`).
127+
128+
#### Controlling flattening depth
129+
130+
You can control how deeply nested objects are flattened using the `max_nested_levels` parameter in the `x-greptime-pipeline-params` header. The default value is 10 levels.
125131

126132
Here is a sample request:
127133

128134
```shell
129135
curl -X "POST" "http://localhost:4000/v1/ingest?db=<db-name>&table=<table-name>&pipeline_name=greptime_identity&version=<pipeline-version>" \
130136
-H "Content-Type: application/x-ndjson" \
131137
-H "Authorization: Basic {{authentication}}" \
132-
-H "x-greptime-pipeline-params: flatten_json_object=true" \
138+
-H "x-greptime-pipeline-params: max_nested_levels=5" \
133139
-d "$<log-items>"
134140
```
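
For readers scripting ingestion, the same request can be assembled with Python's standard library. This is a sketch under assumptions: `build_ingest_request` is a hypothetical helper, the database and table names are placeholders, the `version` query parameter from the curl example is omitted, and the credential string stands in for `{{authentication}}`. The request is only built here, not sent.

```python
import urllib.request
from urllib.parse import urlencode


def build_ingest_request(
    base_url: str,
    db: str,
    table: str,
    ndjson_body: str,
    max_nested_levels: int,
    basic_auth: str,
) -> urllib.request.Request:
    """Build (but do not send) an ingest request for greptime_identity."""
    query = urlencode(
        {"db": db, "table": table, "pipeline_name": "greptime_identity"}
    )
    return urllib.request.Request(
        f"{base_url}/v1/ingest?{query}",
        data=ndjson_body.encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/x-ndjson",
            "Authorization": f"Basic {basic_auth}",
            # Pipeline parameters travel in this header as key=value.
            "x-greptime-pipeline-params": f"max_nested_levels={max_nested_levels}",
        },
    )


req = build_ingest_request(
    "http://localhost:4000",
    db="public",              # placeholder database name
    table="pipeline_logs",    # placeholder table name
    ndjson_body='{"a": {"b": 1}}\n',
    max_nested_levels=5,
    basic_auth="<base64-credentials>",  # placeholder credential
)
# To send it against a running instance: urllib.request.urlopen(req)
```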
135141

136-
With this configuration, GreptimeDB will automatically flatten each field of the JSON object into separate columns. For example:
142+
When the maximum nesting level is reached, any remaining nested structure is converted to a JSON string and stored in a single column. For example, with `max_nested_levels=3`:
137143

138144
```JSON
139145
{
140146
"a": {
141147
"b": {
142-
"c": [1, 2, 3]
148+
"c": {
149+
"d": [1, 2, 3]
150+
}
143151
}
144152
},
145-
"d": [
153+
"e": [
146154
"foo",
147155
"bar"
148156
],
149-
"e": {
150-
"f": [7, 8, 9],
157+
"f": {
151158
"g": {
152159
"h": 123,
153160
"i": "hello",
@@ -163,14 +170,18 @@ Will be flattened to:
163170

164171
```json
165172
{
166-
"a.b.c": [1,2,3],
167-
"d": ["foo","bar"],
168-
"e.f": [7,8,9],
169-
"e.g.h": 123,
170-
"e.g.i": "hello",
171-
"e.g.j.k": true
173+
"a.b.c": "{\"d\":[1,2,3]}",
174+
"e": "[\"foo\",\"bar\"]",
175+
"f.g.h": 123,
176+
"f.g.i": "hello",
177+
"f.g.j": "{\"k\":true}"
172178
}
173179
```
174180

181+
Note that:
182+
- Arrays at any level are always converted to JSON strings (e.g., `"e"` becomes `"[\"foo\",\"bar\"]"`)
183+
- When the nesting level limit is reached (level 3 in this example), the remaining nested objects are converted to JSON strings (e.g., `"a.b.c"` and `"f.g.j"`)
184+
- Regular scalar values within the depth limit are stored as their native types (e.g., `"f.g.h"` as integer, `"f.g.i"` as string)
185+
175186

176187
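
The depth-limited flattening described in this section can be sketched in Python. This is an illustration of the documented behavior, not the pipeline's actual code: `flatten_record` is a hypothetical name, and the value of `j` in the sample is inferred from the flattened output (`"f.g.j": "{\"k\":true}"`). The sketch assumes a key's level is its position in the dotted path, so with `max_nested_levels=3` an object found at the third level is stringified instead of being descended into.

```python
import json
from typing import Any


def flatten_record(
    record: dict[str, Any], max_nested_levels: int = 10
) -> dict[str, Any]:
    """Flatten nested objects into dot-notation keys, up to a depth limit."""
    flat: dict[str, Any] = {}

    def walk(obj: dict[str, Any], prefix: str, level: int) -> None:
        for key, value in obj.items():
            name = f"{prefix}.{key}" if prefix else key
            if value is None:
                continue  # null fields are ignored
            if isinstance(value, list):
                # Arrays at any level become compact JSON strings.
                flat[name] = json.dumps(value, separators=(",", ":"))
            elif isinstance(value, dict):
                if level >= max_nested_levels:
                    # Depth limit reached: keep the rest as a JSON string.
                    flat[name] = json.dumps(value, separators=(",", ":"))
                else:
                    walk(value, name, level + 1)
            else:
                flat[name] = value  # scalars keep their native type

    walk(record, "", 1)
    return flat


sample = {
    "a": {"b": {"c": {"d": [1, 2, 3]}}},
    "e": ["foo", "bar"],
    "f": {"g": {"h": 123, "i": "hello", "j": {"k": True}}},
}
flattened = flatten_record(sample, max_nested_levels=3)
```

With the default limit of 10, the same input would instead produce the columns `a.b.c.d` (a JSON string, since it is an array) and `f.g.j.k` (a boolean).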

docs/user-guide/ingest-data/for-observability/vector.md

Lines changed: 1 addition & 1 deletion
@@ -142,7 +142,7 @@ password = "<password>"
142142

143143
[sinks.my_sink_id.extra_params]
144144
source = "vector"
145-
x-greptime-pipeline-params = "flatten_json_object=true"
145+
x-greptime-pipeline-params = "max_nested_levels=10"
146146
```
147147

148148
This example demonstrates how to use the `greptimedb_logs` sink to write generated demo log data to GreptimeDB. For more information, please refer to the [Vector greptimedb_logs sink](https://vector.dev/docs/reference/configuration/sinks/greptimedb_logs/) documentation.

i18n/zh/docusaurus-plugin-content-docs/current/reference/pipeline/built-in-pipelines.md

Lines changed: 43 additions & 32 deletions
@@ -11,21 +11,22 @@ GreptimeDB provides built-in Pipelines for common log formats, allowing you to directly use
1111

1212
## `greptime_identity`
1313

14-
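The `greptime_identity` pipeline is designed for writing JSON logs and automatically creates columns for each field in the JSON log.
14+
The `greptime_identity` pipeline is designed for writing JSON logs and automatically creates columns for each field in the JSON log. Nested JSON objects are automatically flattened into separate columns using dot notation.
1515

16-
- The first-level keys in the JSON log are used as column names.
17-
- An error is returned if the same field has different types.
18-
- Fields with `null` values are ignored.
19-
- If the time index is not specified manually, an additional column, `greptime_timestamp`, is added to the table as the time index to indicate when the log was written.
16+
- Nested objects are automatically flattened (e.g., `{"a": {"b": 1}}` becomes column `a.b`)
17+
- Arrays are converted to JSON strings
18+
- An error is returned if the same field has different types
19+
- Fields with `null` values are ignored
20+
- If the time index is not specified manually, an additional column, `greptime_timestamp`, is added to the table as the time index to indicate when the log was written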
2021

2122
### Type conversion rules
2223

2324
- `string` -> `string`
2425
- `number` -> `int64` or `float64`
2526
- `boolean` -> `bool`
2627
- `null` -> ignore
27-
- `array` -> `json`
28-
- `object` -> `json`
28+
- `array` -> `string` (JSON-stringified)
29+
- `object` -> automatically flattened into separate columns (see [Flatten JSON objects](#flatten-json-objects))
2930

3031
For example, if we have the following JSON data:
3132

@@ -37,7 +38,7 @@ GreptimeDB provides built-in Pipelines for common log formats, allowing you to directly use
3738
]
3839
```
3940

40-
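We'll merge the schema of each row in the batch to get the final schema. The table schema will be:
41+
We'll merge the schema of each row in the batch to get the final schema. Note that nested objects are automatically flattened into separate columns (e.g., `object.a` and `object.b`), and arrays are converted to JSON strings. The table schema will be: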
4142

4243
```sql
4344
mysql> desc pipeline_logs;
@@ -47,26 +48,27 @@ mysql> desc pipeline_logs;
4748
| age | Int64 | | YES | | FIELD |
4849
| is_student | Boolean | | YES | | FIELD |
4950
| name | String | | YES | | FIELD |
50-
| object | Json | | YES | | FIELD |
51+
| object.a | Int64 | | YES | | FIELD |
52+
| object.b | Int64 | | YES | | FIELD |
5153
| score | Float64 | | YES | | FIELD |
5254
| company | String | | YES | | FIELD |
53-
| array | Json | | YES | | FIELD |
55+
| array | String | | YES | | FIELD |
5456
| greptime_timestamp | TimestampNanosecond | PRI | NO | | TIMESTAMP |
5557
+--------------------+---------------------+------+------+---------+---------------+
56-
8 rows in set (0.00 sec)
58+
9 rows in set (0.00 sec)
5759
```
5860

5961
The data will be stored in the table as follows:
6062

6163
```sql
6264
mysql> select * from pipeline_logs;
63-
+------+------------+---------+---------------+-------+---------+---------+----------------------------+
64-
| age | is_student | name | object | score | company | array | greptime_timestamp |
65-
+------+------------+---------+---------------+-------+---------+---------+----------------------------+
66-
| 22 | 1 | Charlie | NULL | 95.5 | NULL | [1,2,3] | 2024-10-18 09:35:48.333020 |
67-
| 21 | 0 | NULL | NULL | 85.5 | A | NULL | 2024-10-18 09:35:48.333020 |
68-
| 20 | 1 | Alice | {"a":1,"b":2} | 90.5 | NULL | NULL | 2024-10-18 09:35:48.333020 |
69-
+------+------------+---------+---------------+-------+---------+---------+----------------------------+
65+
+------+------------+---------+----------+----------+-------+---------+-----------+----------------------------+
66+
| age | is_student | name | object.a | object.b | score | company | array | greptime_timestamp |
67+
+------+------------+---------+----------+----------+-------+---------+-----------+----------------------------+
68+
| 22 | 1 | Charlie | NULL | NULL | 95.5 | NULL | [1,2,3] | 2024-10-18 09:35:48.333020 |
69+
| 21 | 0 | NULL | NULL | NULL | 85.5 | A | NULL | 2024-10-18 09:35:48.333020 |
70+
| 20 | 1 | Alice | 1 | 2 | 90.5 | NULL | NULL | 2024-10-18 09:35:48.333020 |
71+
+------+------------+---------+----------+----------+-------+---------+-----------+----------------------------+
7072
3 rows in set (0.01 sec)
7173
```
7274

@@ -117,35 +119,40 @@ DESC pipeline_logs;
117119
- "2025-06-27T15:02:23.082253908Z": `custom_time_index=input_ts;datestr;%Y-%m-%dT%H:%M:%S%.9f%#z`
118120

119121

120-
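### Flatten json objects
122+
### Flatten JSON objects
121123

122-
If you want to flatten JSON objects into a single-level structure, add the `x-greptime-pipeline-params` header to the request and set `flatten_json_object` to `true`.
124+
The `greptime_identity` pipeline **automatically flattens** nested JSON objects into a single-level structure. This behavior is always enabled and creates separate columns for each nested field using dot notation (e.g., `a.b.c`).
125+
126+
#### Controlling flattening depth
127+
128+
You can control how deeply objects are flattened using the `max_nested_levels` parameter in the `x-greptime-pipeline-params` header. The default value is 10 levels.
123129

124130
Here is a sample request: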
125131

126132
```shell
127133
curl -X "POST" "http://localhost:4000/v1/ingest?db=<db-name>&table=<table-name>&pipeline_name=greptime_identity&version=<pipeline-version>" \
128134
-H "Content-Type: application/x-ndjson" \
129135
-H "Authorization: Basic {{authentication}}" \
130-
-H "x-greptime-pipeline-params: flatten_json_object=true" \
136+
-H "x-greptime-pipeline-params: max_nested_levels=5" \
131137
-d "$<log-items>"
132138
```
133139

134-
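With this configuration, GreptimeDB will automatically flatten each field of the JSON object into separate columns. For example
140+
When the maximum nesting level is reached, any remaining nested structure is converted to a JSON string and stored in a single column. For example, with `max_nested_levels=3`: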
135141

136142
```JSON
137143
{
138144
"a": {
139145
"b": {
140-
"c": [1, 2, 3]
146+
"c": {
147+
"d": [1, 2, 3]
148+
}
141149
}
142150
},
143-
"d": [
151+
"e": [
144152
"foo",
145153
"bar"
146154
],
147-
"e": {
148-
"f": [7, 8, 9],
155+
"f": {
149156
"g": {
150157
"h": 123,
151158
"i": "hello",
@@ -161,12 +168,16 @@ curl -X "POST" "http://localhost:4000/v1/ingest?db=<db-name>&table=<table-name>&
161168

162169
```json
163170
{
164-
"a.b.c": [1,2,3],
165-
"d": ["foo","bar"],
166-
"e.f": [7,8,9],
167-
"e.g.h": 123,
168-
"e.g.i": "hello",
169-
"e.g.j.k": true
171+
"a.b.c": "{\"d\":[1,2,3]}",
172+
"e": "[\"foo\",\"bar\"]",
173+
"f.g.h": 123,
174+
"f.g.i": "hello",
175+
"f.g.j": "{\"k\":true}"
170176
}
171177
```
172178

179+
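Note that:
180+
- Arrays at any level are always converted to JSON strings (e.g., `"e"` becomes `"[\"foo\",\"bar\"]"`)
181+
- When the nesting level limit is reached (level 3 in this example), the remaining nested objects are converted to JSON strings (e.g., `"a.b.c"` and `"f.g.j"`)
182+
- Regular scalar values within the depth limit are stored as their native types (e.g., `"f.g.h"` as integer, `"f.g.i"` as string)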
183+

i18n/zh/docusaurus-plugin-content-docs/current/user-guide/ingest-data/for-observability/vector.md

Lines changed: 1 addition & 1 deletion
@@ -149,7 +149,7 @@ password = "<password>"
149149

150150
[sinks.my_sink_id.extra_params]
151151
source = "vector"
152-
x-greptime-pipeline-params = "flatten_json_object=true"
152+
x-greptime-pipeline-params = "max_nested_levels=10"
153153
```
154154

155155
This example demonstrates how to use the `greptimedb_logs` sink to write generated demo log data to GreptimeDB. For more information, please refer to the [Vector greptimedb_logs sink](https://vector.dev/docs/reference/configuration/sinks/greptimedb_logs/) documentation.
