docs/docs/core/data_types.mdx
Lines changed: 96 additions & 12 deletions (+96 -12)
Original file line number
Diff line number
Diff line change
@@ -21,11 +21,16 @@ All you need to do is to make sure the data passed to functions and targets are
21
21
Each type in the CocoIndex type system is mapped to one or more types in Python.
22
22
When you define a [custom function](/docs/custom_ops/custom_functions), you need to annotate the data types of arguments and return values.
23
23
24
-
* When you pass a Python value to the engine (e.g. return values of a custom function), a specific type annotation is required.
25
-
The type annotation needs to be specific in describing the target data type, as it provides the ground truth of the data type in the flow.
24
+
* **When you pass a Python value to the engine (e.g. return values of a custom function)**, a **specific type annotation is required**.
25
+
The type annotation needs to be specific in describing the target data type, as it provides the **ground truth of the data type in the flow**.
26
26
27
-
* When you use a Python variable to bind to an engine value (e.g. arguments of a custom function),
28
-
the engine already knows the specific data type, so we don't require a specific type annotation, e.g. type annotations can be omitted, or you can use `Any` at any level.
27
+
This is critical because CocoIndex uses return type annotations to infer data types throughout the flow without processing any actual data. This enables:
* **When you use a Python variable to bind to an engine value (e.g. arguments of a custom function)**,
33
+
the engine already knows the specific data type, so we **don't require a specific type annotation**. Type annotations can be omitted, or you can use `Any` at any level.
29
34
When a specific type annotation is provided, it's still used as guidance to construct a Python value of a compatible type.
30
35
Otherwise, we will bind to a default Python type.
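The sketch below uses only Python's standard library to show why a specific return annotation matters: annotations can be read without ever calling the function. It illustrates the general idea only and is not CocoIndex's actual implementation; `count_words` and `describe_signature` are hypothetical names.

```python
from typing import Any, get_type_hints


def count_words(text: Any) -> int:
    """The argument is loosely annotated; the return type is specific."""
    return len(str(text).split())


def describe_signature(fn) -> dict[str, Any]:
    """Read the annotations of `fn`; the function body never runs."""
    return get_type_hints(fn)


hints = describe_signature(count_words)
print(hints["return"])  # the specific return type
print(hints["text"])    # the relaxed argument type
```

A return annotation of `Any` would carry no usable type information here, which is why a relaxed annotation is acceptable only on the argument side.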
31
36
@@ -85,13 +90,36 @@ It's useful to hold data without fixed schema known at flow definition time.
85
90
A vector type is a collection of elements of the same basic type.
86
91
Optionally, it can have a fixed dimension. Denoted as *Vector[Type]* or *Vector[Type, Dim]*, e.g. *Vector[Float32]* or *Vector[Float32, 384]*.
87
92
93
+
**When to specify vector dimension:**
94
+
Specify the dimension in return type annotations if you plan to export the vector to a target, as **most targets require a fixed vector dimension** for creating vector indexes. For example, use `cocoindex.Vector[cocoindex.Float32, typing.Literal[768]]` for 768-dimensional embeddings.
95
+
88
96
It supports the following Python types:
89
97
90
98
* `cocoindex.Vector[T]` or `cocoindex.Vector[T, typing.Literal[Dim]]`, e.g. `cocoindex.Vector[cocoindex.Float32]` or `cocoindex.Vector[cocoindex.Float32, typing.Literal[384]]`
91
99
  * The underlying Python type is `numpy.typing.NDArray[T]` where `T` is a numpy numeric type (`numpy.int64`, `numpy.float32` or `numpy.float64`) or array type (`numpy.typing.NDArray[T]`), or `list[T]` otherwise
92
100
* `numpy.typing.NDArray[T]` where `T` is a numpy numeric type or array type
93
101
* `list[T]`
94
102
103
+
**Example:**
104
+
105
+
```python
106
+
from typing import Literal
107
+
import cocoindex
108
+
109
+
# ✅ Good: Specify dimension for vectors that will be exported to targets
    return embedding  # numpy array or list of 768 floats
115
+
116
+
# ⚠️ Works but less precise: Vector without dimension
117
+
@cocoindex.op.function(behavior_version=1)
118
+
def embed_text_no_dim(text: str) -> list[float]:
119
+
    """Generate embedding without dimension specification."""
120
+
    return embedding
121
+
```
122
+
95
123
96
124
#### Union Types
97
125
@@ -146,8 +174,39 @@ class PersonModel(BaseModel):
146
174
All three examples (`Person`, `PersonTuple`, and `PersonModel`) are valid Struct types in CocoIndex, with identical schemas (three fields: `first_name` (Str), `last_name` (Str), `dob` (Date)).
147
175
Choose `dataclass` for mutable objects, `NamedTuple` for immutable lightweight structures, or `Pydantic` for data validation and serialization features.
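As a quick check of the "identical schemas" claim, the sketch below compares the declared fields of the dataclass and `NamedTuple` variants using only the standard library (the Pydantic variant is left out so the snippet has no third-party dependency). The two classes are restated so the snippet is self-contained, and `field_schema` is a hypothetical helper, not a CocoIndex API.

```python
import dataclasses
import datetime
from typing import NamedTuple, get_type_hints


@dataclasses.dataclass
class Person:
    first_name: str
    last_name: str
    dob: datetime.date


class PersonTuple(NamedTuple):
    first_name: str
    last_name: str
    dob: datetime.date


def field_schema(cls) -> list[tuple[str, type]]:
    """Hypothetical helper: (field name, annotated type) pairs in declaration order."""
    return list(get_type_hints(cls).items())


# Both representations declare the same three fields with the same types.
schemas_match = field_schema(Person) == field_schema(PersonTuple)
print(schemas_match)
```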
148
176
149
-
Besides, for arguments of custom functions, CocoIndex also supports using dictionaries (`dict[str, Any]`) to represent a *Struct* type.
150
-
It's the default Python type if you don't annotate the function argument with a specific type.
177
+
**Type annotations for Struct:**
178
+
179
+
- **For return values**: Must use a specific Struct type (dataclass, NamedTuple, or Pydantic model)
180
+
- **For arguments**: Can use `dict[str, Any]` or `Any` instead of a specific Struct type. `dict[str, Any]` is the default binding if you don't annotate the function argument with a specific type.
181
+
182
+
**Example:**
183
+
184
+
```python
185
+
from dataclasses import dataclass
186
+
from typing import Any
187
+
import datetime
188
+
189
+
@dataclass
190
+
class Person:
191
+
    first_name: str
192
+
    last_name: str
193
+
    dob: datetime.date
194
+
195
+
# ✅ Good: Specific return type, relaxed argument type
# return {"name": person.first_name} # dict[str, str] is not a CocoIndex type
209
+
```
151
210
152
211
### Table Types
153
212
@@ -165,9 +224,12 @@ Each key column must be a [key type](#key-types). When multiple key columns are
165
224
In Python, a *KTable* type is represented by `dict[K, V]`.
166
225
`K` represents the key and `V` represents the value for each row:
167
226
168
-
- `K` can be a Struct type (either a frozen dataclass or a `NamedTuple`) that contains all key parts as fields. This is the general way to model multi-part keys.
169
-
- When there is only a single key part and it is a basic type (e.g. `str`, `int`), you may use that basic type directly as the dictionary key instead of wrapping it in a Struct.
170
-
- `V` should be the type bound to a *Struct* representing the non-key value fields of each row.
227
+
- **`K` (key type)** can be:
228
+
  - A primitive [key type](#key-types) (e.g., `str`, `int`) for single-part keys
229
+
  - An immutable Struct type (frozen dataclass or `NamedTuple`) for multi-part composite keys
230
+
- **`V` (value type)** must be a Struct type representing the non-key value fields of each row
231
+
  - For return values: Must use a specific Struct type (dataclass, NamedTuple, or Pydantic model)
232
+
  - For arguments: Can use `dict[str, Any]` or `Any`
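The requirement that a composite key be an immutable Struct is consistent with how Python dictionaries work: keys must be hashable. The following is a minimal sketch using only the standard library; `PersonKey`, `PersonValue`, and `MutableKey` are hypothetical names, the sample values are illustrative, and nothing here calls CocoIndex itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonKey:
    """Multi-part key: frozen, so instances are hashable."""
    first_name: str
    last_name: str


@dataclass
class PersonValue:
    """Non-key value fields of each row."""
    nickname: str


@dataclass
class MutableKey:
    """Same fields, but not frozen: the default dataclass is unhashable."""
    first_name: str
    last_name: str


# A KTable-shaped value: dict[K, V] with a frozen dataclass as the key.
people: dict[PersonKey, PersonValue] = {
    PersonKey("Ada", "Lovelace"): PersonValue(nickname="ada"),
}
looked_up = people[PersonKey("Ada", "Lovelace")]

try:
    hash(MutableKey("Ada", "Lovelace"))
    mutable_key_is_hashable = True
except TypeError:
    mutable_key_is_hashable = False
```

A `NamedTuple` key behaves like `PersonKey` here, since tuples of hashable fields are themselves hashable.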
171
233
172
234
When a specific type annotation is not provided:
173
235
- For composite keys (multiple key parts), the key binds to a Python tuple of the key parts, e.g. `tuple[str, str]`.
@@ -210,9 +272,31 @@ If you don't annotate the function argument with a specific type, it's bound to
210
272
211
273
*LTable* is a *Table* type whose row order is preserved. *LTable* has no key column.
212
274
213
-
In Python, a *LTable* type is represented by `list[R]`, where `R` is the type binding to the *Struct* type representing the value fields of each row.
214
-
For example, you can use `list[Person]`, `list[PersonTuple]`, or `list[PersonModel]` to represent a *LTable* with 3 columns: `first_name` (*Str*), `last_name` (*Str*), `dob` (*Date*).
215
-
It's bound to `list[dict[str, Any]]` if you don't annotate the function argument with a specific type.
275
+
In Python, an *LTable* type is represented by `list[R]`, where **`R` must be a Struct type** representing the value fields of each row:
276
+
- **For return values**: Must use a specific Struct type (e.g., `list[Person]`, `list[PersonTuple]`, or `list[PersonModel]`)
277
+
- **For arguments**: Can use `list[dict[str, Any]]` or `list[Any]`. Defaults to `list[dict[str, Any]]` if you don't annotate the function argument.
278
+
279
+
For example, `list[Person]` represents an *LTable* with 3 columns: `first_name` (*Str*), `last_name` (*Str*), `dob` (*Date*).
280
+
281
+
**Example:**
282
+
283
+
```python
284
+
from dataclasses import dataclass
285
+
from typing import Any
286
+
import datetime
287
+
288
+
@dataclass
289
+
class Person:
290
+
    first_name: str
291
+
    last_name: str
292
+
    dob: datetime.date
293
+
294
+
# ✅ Good: Return type specifies list of specific Struct