Commit 267f131

docs: add clarification for data types (#1206)
1 parent c903601 commit 267f131

1 file changed: +96 -12 lines changed

docs/docs/core/data_types.mdx

Lines changed: 96 additions & 12 deletions
@@ -21,11 +21,16 @@ All you need to do is to make sure the data passed to functions and targets are
2121
Each type in CocoIndex type system is mapped to one or multiple types in Python.
2222
When you define a [custom function](/docs/custom_ops/custom_functions), you need to annotate the data types of arguments and return values.
2323

24-
* When you pass a Python value to the engine (e.g. return values of a custom function), a specific type annotation is required.
25-
The type annotation needs to be specific in describing the target data type, as it provides the ground truth of the data type in the flow.
24+
* **When you pass a Python value to the engine (e.g. return values of a custom function)**, a **specific type annotation is required**.
25+
The type annotation needs to be specific in describing the target data type, as it provides the **ground truth of the data type in the flow**.
2626

27-
* When you use a Python variable to bind to an engine value (e.g. arguments of a custom function),
28-
the engine already knows the specific data type, so we don't require a specific type annotation, e.g. type annotations can be omitted, or you can use `Any` at any level.
27+
This is critical because CocoIndex uses return type annotations to infer data types throughout the flow without processing any actual data. This enables:
28+
- Creating proper target schemas (e.g., vector indexes with fixed dimensions)
29+
- Type checking during flow definition
30+
- Clear documentation of data transformations
31+
32+
* **When you use a Python variable to bind to an engine value (e.g. arguments of a custom function)**,
33+
the engine already knows the specific data type, so we **don't require a specific type annotation**. Type annotations can be omitted, or you can use `Any` at any level.
2934
When a specific type annotation is provided, it's still used as guidance to construct a Python value of a compatible type.
3035
Otherwise, we will bind to a default Python type.
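
The idea that a return annotation can act as ground truth without any data being processed can be illustrated in plain Python. The following is a minimal sketch, not CocoIndex's implementation: `embed_text` is a hypothetical stand-in, `Annotated[list[float], Literal[768]]` merely imitates a fixed-dimension vector annotation, and `declared_dimension` is a helper invented for this example. It reads the dimension from the signature alone and never calls the function.

```python
from typing import Annotated, Literal, Optional, get_args, get_origin, get_type_hints


# Hypothetical stand-in for a custom function. The annotation imitates a
# fixed-dimension vector type; it is not the CocoIndex type itself.
def embed_text(text: str) -> Annotated[list[float], Literal[768]]:
    raise NotImplementedError("never called in this sketch")


def declared_dimension(fn) -> Optional[int]:
    """Read the dimension from fn's return annotation without calling fn."""
    return_type = get_type_hints(fn, include_extras=True).get("return")
    if get_origin(return_type) is not Annotated:
        return None  # no dimension metadata attached to the return type
    for meta in get_args(return_type)[1:]:
        if get_origin(meta) is Literal:
            return get_args(meta)[0]
    return None


print(declared_dimension(embed_text))  # 768
```

Because the dimension is available before any row is processed, a target schema could in principle be prepared at flow definition time, which is the benefit the bullets above describe.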
3136

@@ -85,13 +90,36 @@ It's useful to hold data without fixed schema known at flow definition time.
8590
A vector type is a collection of elements of the same basic type.
8691
Optionally, it can have a fixed dimension. It is denoted as *Vector[Type]* or *Vector[Type, Dim]*, e.g. *Vector[Float32]* or *Vector[Float32, 384]*.
8792

93+
**When to specify vector dimension:**
94+
Specify the dimension in return type annotations if you plan to export the vector to a target, as **most targets require a fixed vector dimension** for creating vector indexes. For example, use `cocoindex.Vector[cocoindex.Float32, typing.Literal[768]]` for 768-dimensional embeddings.
95+
8896
It supports the following Python types:
8997

9098
* `cocoindex.Vector[T]` or `cocoindex.Vector[T, typing.Literal[Dim]]`, e.g. `cocoindex.Vector[cocoindex.Float32]` or `cocoindex.Vector[cocoindex.Float32, typing.Literal[384]]`
9199
* The underlying Python type is `numpy.typing.NDArray[T]` where `T` is a numpy numeric type (`numpy.int64`, `numpy.float32` or `numpy.float64`) or array type (`numpy.typing.NDArray[T]`), or `list[T]` otherwise
92100
* `numpy.typing.NDArray[T]` where `T` is a numpy numeric type or array type
93101
* `list[T]`
94102

103+
**Example:**
104+
105+
```python
from typing import Literal
import cocoindex

# ✅ Good: Specify dimension for vectors that will be exported to targets
@cocoindex.op.function(behavior_version=1)
def embed_text(text: str) -> cocoindex.Vector[cocoindex.Float32, Literal[768]]:
    """Generate 768-dimensional embedding."""
    # ... embedding logic ...
    return embedding  # numpy array or list of 768 floats

# ⚠️ Works but less precise: Vector without dimension
@cocoindex.op.function(behavior_version=1)
def embed_text_no_dim(text: str) -> list[float]:
    """Generate embedding without dimension specification."""
    # ... embedding logic ...
    return embedding
```
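
Since a fixed dimension in the annotation is a promise about every returned value, it can help to check the length before returning. The sketch below is one possible guard in plain Python; `ensure_dimension` and `EMBEDDING_DIM` are hypothetical names for this example and are not part of CocoIndex.

```python
from typing import Sequence

EMBEDDING_DIM = 768  # hypothetical: must match the dimension in the return annotation


def ensure_dimension(values: Sequence[float], dim: int = EMBEDDING_DIM) -> list[float]:
    """Return values as a list of floats, or raise if the length is not dim."""
    if len(values) != dim:
        raise ValueError(f"expected {dim} elements, got {len(values)}")
    return [float(v) for v in values]
```

A function annotated with a 768-dimensional vector type could call such a guard on its result just before `return`, so that a mismatch surfaces in the function rather than later at the target.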
122+
95123

96124
#### Union Types
97125

@@ -146,8 +174,39 @@ class PersonModel(BaseModel):
146174
All three examples (`Person`, `PersonTuple`, and `PersonModel`) are valid Struct types in CocoIndex, with identical schemas (three fields: `first_name` (Str), `last_name` (Str), `dob` (Date)).
147175
Choose `dataclass` for mutable objects, `NamedTuple` for immutable lightweight structures, or `Pydantic` for data validation and serialization features.
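
To make "identical schemas" concrete, the field names and annotated Python types of two such declarations can be compared directly. This is a minimal sketch using only the standard library: `Person` and `PersonTuple` are re-declared here so the block is self-contained, `field_schema` is a hypothetical helper, and the Pydantic variant is left out to avoid a third-party dependency. It compares Python annotations only and does not show how CocoIndex itself derives a Struct schema.

```python
import datetime
from dataclasses import dataclass
from typing import NamedTuple, get_type_hints


@dataclass
class Person:
    first_name: str
    last_name: str
    dob: datetime.date


class PersonTuple(NamedTuple):
    first_name: str
    last_name: str
    dob: datetime.date


def field_schema(cls) -> list[tuple[str, type]]:
    """List (field name, annotated Python type) pairs in declaration order."""
    return list(get_type_hints(cls).items())


print(field_schema(Person) == field_schema(PersonTuple))  # True
```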
148176

149-
Besides, for arguments of custom functions, CocoIndex also supports using dictionaries (`dict[str, Any]`) to represent a *Struct* type.
150-
It's the default Python type if you don't annotate the function argument with a specific type.
177+
**Type annotations for Struct:**
178+
179+
- **For return values**: Must use a specific Struct type (dataclass, NamedTuple, or Pydantic model)
180+
- **For arguments**: Can use `dict[str, Any]` or `Any` instead of a specific Struct type. `dict[str, Any]` is the default binding if you don't annotate the function argument with a specific type.
181+
182+
**Example:**
183+
184+
```python
from dataclasses import dataclass
from typing import Any
import datetime

import cocoindex

@dataclass
class Person:
    first_name: str
    last_name: str
    dob: datetime.date

# ✅ Good: Specific return type, relaxed argument type
@cocoindex.op.function(behavior_version=1)
def process_person(person_data: dict[str, Any]) -> Person:
    """Argument can use dict[str, Any], return must be specific Struct."""
    return Person(
        first_name=person_data["first_name"],
        last_name=person_data["last_name"],
        dob=person_data["dob"]
    )

# ❌ Wrong: Return type is not a valid specific CocoIndex type
# @cocoindex.op.function(behavior_version=1)
# def bad_example(person: Person) -> dict[str, str]:
#     return {"name": person.first_name}  # dict[str, str] is not a CocoIndex type
```
151210

152211
### Table Types
153212

@@ -165,9 +224,12 @@ Each key column must be a [key type](#key-types). When multiple key columns are
165224
In Python, a *KTable* type is represented by `dict[K, V]`.
166225
`K` represents the key and `V` represents the value for each row:
167226

168-
- `K` can be a Struct type (either a frozen dataclass or a `NamedTuple`) that contains all key parts as fields. This is the general way to model multi-part keys.
169-
- When there is only a single key part and it is a basic type (e.g. `str`, `int`), you may use that basic type directly as the dictionary key instead of wrapping it in a Struct.
170-
- `V` should be the type bound to a *Struct* representing the non-key value fields of each row.
227+
- **`K` (key type)** can be:
228+
- A primitive [key type](#key-types) (e.g., `str`, `int`) for single-part keys
229+
- An immutable Struct type (frozen dataclass or `NamedTuple`) for multi-part composite keys
230+
- **`V` (value type)** must be a Struct type representing the non-key value fields of each row
231+
- For return values: Must use a specific Struct type (dataclass, NamedTuple, or Pydantic model)
232+
- For arguments: Can use `dict[str, Any]` or `Any`
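
The requirement that a composite key be an immutable Struct follows from how Python dictionaries work: keys must be hashable. The sketch below shows a `dict[K, V]` shaped like a *KTable* using only the standard library. `PersonKey`, `PersonInfo`, `MutableKey`, and `is_hashable` are hypothetical names for this example, and nothing here calls CocoIndex.

```python
import datetime
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonKey:
    """Hypothetical two-part key; frozen so instances are hashable."""
    first_name: str
    last_name: str


@dataclass
class PersonInfo:
    """Hypothetical non-key value fields of each row."""
    dob: datetime.date


# dict[K, V] with a frozen-dataclass key and a dataclass value.
people: dict[PersonKey, PersonInfo] = {
    PersonKey("Ada", "Lovelace"): PersonInfo(dob=datetime.date(1815, 12, 10)),
}

# Equal field values give equal hashes, so lookup with a fresh key works.
print(people[PersonKey("Ada", "Lovelace")].dob)  # 1815-12-10


@dataclass
class MutableKey:
    """Same fields, but not frozen: dataclasses make this unhashable."""
    first_name: str
    last_name: str


def is_hashable(obj: object) -> bool:
    try:
        hash(obj)
    except TypeError:
        return False
    return True


print(is_hashable(MutableKey("Ada", "Lovelace")))  # False
```

A non-frozen dataclass defines `__eq__` without `__hash__`, so it cannot serve as a dictionary key, which is why only frozen dataclasses and `NamedTuple` are listed for composite keys.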
171233

172234
When a specific type annotation is not provided:
173235
- For composite keys (multiple key parts), the key binds to a Python tuple of the key parts, e.g. `tuple[str, str]`.
@@ -210,9 +272,31 @@ If you don't annotate the function argument with a specific type, it's bound to
210272

211273
*LTable* is a *Table* type whose row order is preserved. *LTable* has no key column.
212274

213-
In Python, a *LTable* type is represented by `list[R]`, where `R` is the type binding to the *Struct* type representing the value fields of each row.
214-
For example, you can use `list[Person]`, `list[PersonTuple]`, or `list[PersonModel]` to represent a *LTable* with 3 columns: `first_name` (*Str*), `last_name` (*Str*), `dob` (*Date*).
215-
It's bound to `list[dict[str, Any]]` if you don't annotate the function argument with a specific type.
275+
In Python, an *LTable* type is represented by `list[R]`, where **`R` must be a Struct type** representing the value fields of each row:
276+
- **For return values**: Must use a specific Struct type (e.g., `list[Person]`, `list[PersonTuple]`, or `list[PersonModel]`)
277+
- **For arguments**: Can use `list[dict[str, Any]]` or `list[Any]`. Defaults to `list[dict[str, Any]]` if you don't annotate the function argument.
278+
279+
For example, `list[Person]` represents an *LTable* with 3 columns: `first_name` (*Str*), `last_name` (*Str*), `dob` (*Date*).
280+
281+
**Example:**
282+
283+
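```python
from dataclasses import dataclass
from typing import Any
import datetime

import cocoindex

@dataclass
class Person:
    first_name: str
    last_name: str
    dob: datetime.date

def _age_in_years(dob: datetime.date, today: datetime.date) -> int:
    """Whole years elapsed between dob and today."""
    return today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))

# ✅ Good: Return type specifies list of specific Struct
@cocoindex.op.function(behavior_version=1)
def filter_adults(people: list[Any]) -> list[Person]:
    """Filter people - argument relaxed, return type specific."""
    # Each row is assumed to arrive as a dict holding the Person fields (the
    # default binding); Person instances are built to match the return type.
    today = datetime.date.today()
    return [Person(**p) for p in people if _age_in_years(p["dob"], today) >= 18]
```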
216300

217301
## Key Types
218302
