Commit bd45876

Merge branch 'main' into function-timeout
2 parents 91b2f1a + 002eb00 commit bd45876

75 files changed: +1726 -674 lines changed

Cargo.lock

Lines changed: 2 additions & 2 deletions
Some generated files are not rendered by default.

README.md

Lines changed: 1 addition & 0 deletions
@@ -202,6 +202,7 @@ It defines an index flow like this:
 | [Custom Output Files](examples/custom_output_files) | Convert markdown files to HTML files and save them to a local directory, using *CocoIndex Custom Targets* |
 | [Patient intake form extraction](examples/patient_intake_extraction) | Use LLM to extract structured data from patient intake forms with different formats |
 | [HackerNews Trending Topics](examples/hn_trending_topics) | Extract trending topics from HackerNews threads and comments, using *CocoIndex Custom Source* and LLM |
+| [Patient Intake Form Extraction with BAML](examples/patient_intake_extraction_baml) | Extract structured data from patient intake forms using BAML |

 More coming and stay tuned 👀!

docs/docs/core/flow_methods.mdx

Lines changed: 38 additions & 40 deletions
@@ -13,9 +13,9 @@ After a flow is defined as discussed in [Flow Definition](/docs/core/flow_def),

 It can be achieved in two ways:

-* Use [CocoIndex CLI](/docs/core/cli).
+* Use [CocoIndex CLI](/docs/core/cli).

-* Use APIs provided by the library.
+* Use APIs provided by the library.
 You have a `cocoindex.Flow` object after defining the flow in your code, and you can interact with it later.

The following sections assume you have a flow `demo_flow`:
@@ -38,20 +38,20 @@ It creates a `demo_flow` object in `cocoindex.Flow` type.

 For a flow, its persistent backends need to be ready before it can run, including:

-* [Internal storage](/docs/core/basics#internal-storage) for CocoIndex.
-* Backend resources for targets exported by the flow, e.g. a table (in relational databases), a collection (in some vector databases), etc.
+* [Internal storage](/docs/core/basics#internal-storage) for CocoIndex.
+* Backend resources for targets exported by the flow, e.g. a table (in relational databases), a collection (in some vector databases), etc.

 The desired state of the backends for a flow is derived based on the flow definition itself.
 CocoIndex supports two types of actions to manage the persistent backends automatically:

-* *Setup* a flow, which will change the backends owned by the flow to the desired state, e.g. create new tables for new flow, drop an existing table if the corresponding target is gone, add new column to a target table if a new field is collected, etc. It's no-op if the backend states are already in the desired state.
+* *Setup* a flow, which will change the backends owned by the flow to the desired state, e.g. create new tables for new flow, drop an existing table if the corresponding target is gone, add new column to a target table if a new field is collected, etc. It's no-op if the backend states are already in the desired state.

-* *Drop* a flow, which will drop all backends owned by the flow. It's no-op if there are no existing backends owned by the flow (e.g. never setup or already dropped).
+* *Drop* a flow, which will drop all backends owned by the flow. It's no-op if there are no existing backends owned by the flow (e.g. never setup or already dropped).

 ### CLI

 `cocoindex setup` subcommand will setup all flows.
-`cocoindex update` and `cocoindex server` also provide a `--setup` option to setup the flow if needed before performing the main action of updating or starting the server.
+`cocoindex update` and `cocoindex server` also also setup the flow if needed before performing the main action of updating or starting the server, with prompt confirmation.

 `cocoindex drop` subcommand will drop all flows.

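Reviewer note: the setup/drop semantics in the hunk above (move the backends to the desired state, and do nothing when they are already there) can be illustrated with a small desired-state diff. This is only a sketch under stated assumptions: `plan_setup`, `plan_drop`, and the table names are hypothetical and are not part of the CocoIndex API, and real backends track more than table names.

```python
# Hypothetical illustration of desired-state reconciliation.
# None of these names come from CocoIndex itself.


def plan_setup(current: set, desired: set) -> dict:
    """Return the actions needed to move `current` backends to `desired`."""
    return {
        "create": sorted(desired - current),  # targets that are new in the flow
        "drop": sorted(current - desired),    # targets that no longer exist in the flow
    }


def plan_drop(current: set) -> dict:
    """Dropping a flow removes every backend it owns."""
    return {"create": [], "drop": sorted(current)}


# Hypothetical table names, used only for this example.
existing = {"doc_embeddings", "old_target"}
wanted = {"doc_embeddings", "code_embeddings"}

first_plan = plan_setup(existing, wanted)
# Once the backends match the desired state, setup has nothing left to do.
second_plan = plan_setup(wanted, wanted)
# Dropping a flow that owns no backends is also a no-op.
empty_drop = plan_drop(set())
```

Both `second_plan` and `empty_drop` contain empty action lists, which mirrors the "no-op" wording in the documentation text.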
@@ -62,8 +62,8 @@ CocoIndex supports two types of actions to manage the persistent backends automa

 `Flow` provides the following APIs to setup / drop individual flows:

-* `setup(report_to_stdout: bool = False)`: Setup the flow.
-* `drop(report_to_stdout: bool = False)`: Drop the flow.
+* `setup(report_to_stdout: bool = False)`: Setup the flow.
+* `drop(report_to_stdout: bool = False)`: Drop the flow.

 For example:

@@ -74,8 +74,8 @@ demo_flow.drop(report_to_stdout=True)

 We also provide the following asynchronous versions of the APIs:

-* `setup_async(report_to_stdout: bool = False)`: Setup the flow asynchronously.
-* `drop_async(report_to_stdout: bool = False)`: Drop the flow asynchronously.
+* `setup_async(report_to_stdout: bool = False)`: Setup the flow asynchronously.
+* `drop_async(report_to_stdout: bool = False)`: Drop the flow asynchronously.

 For example:

@@ -84,11 +84,10 @@ await demo_flow.setup_async(report_to_stdout=True)
 await demo_flow.drop_async(report_to_stdout=True)
 ```

-
 Besides, CocoIndex also provides APIs to setup / drop all flows at once:

-* `setup_all_flows(report_to_stdout: bool = False)`: Setup all flows.
-* `drop_all_flows(report_to_stdout: bool = False)`: Drop all flows.
+* `setup_all_flows(report_to_stdout: bool = False)`: Setup all flows.
+* `drop_all_flows(report_to_stdout: bool = False)`: Drop all flows.

 For example:

@@ -113,12 +112,12 @@ If you want to remove the flow from the current process, you can call `demo_flow
 The major goal of a flow is to perform the transformations on source data and build/update data in the target.
 This action has two modes:

-* **One time update.**
+* **One time update.**
 It builds/update the target data based on source data up to the current moment.
 After the target data is at least as fresh as the source data when update starts, it's done.
 It fits into situations that you need to access the fresh target data at certain time points.

-* **Live update.**
+* **Live update.**
 During live update, a one time update is performed first, then it continuously captures changes from the source data and updates the target data accordingly.
 It's long-running and only stops when being aborted explicitly.
 It fits into situations that you need to access the fresh target data continuously in most of the time.
@@ -133,7 +132,7 @@ This is to achieve best efficiency.

 Besides major update modes, CocoIndex also support the following options:

-* **Reexport targets**.
+* **Reexport targets**.
 When this is enabled, even if both of the source data and flow definition are not changed, CocoIndex will still reprocess and reexport the targets.
 It's helpful when you want to reload the target data, e.g. after some data loss.
 Note that when this is enabled on live update mode, reexport only happens for the initial one time update.
@@ -153,7 +152,7 @@ cocoindex update main
 With a `--setup` option, it will also setup the flow first if needed.

 ```sh
-cocoindex update --setup main
+cocoindex update main
 ```

 With a `--reexport` option, it will reexport the targets even if there's no change.
@@ -169,7 +168,6 @@ cocoindex update --reexport main.py

 The `Flow.update()` method creates/updates data in the target.

-
 ```python
 stats = demo_flow.update()
 print(stats)
@@ -207,9 +205,9 @@ await demo_flow.update_async(reexport_targets=True)

 A data source may enable one or multiple *change capture mechanisms*:

-* Configured with a [refresh interval](flow_def#refresh-interval), which is generally applicable to all data sources.
+* Configured with a [refresh interval](flow_def#refresh-interval), which is generally applicable to all data sources.

-* Specific data sources also provide their specific change capture mechanisms.
+* Specific data sources also provide their specific change capture mechanisms.
 For example, [`Postgres` source](../sources/postgres) listens to PostgreSQL's change notifications, [`AmazonS3` source](../sources/amazons3) watches S3 bucket's change events, and [`GoogleDrive` source](../sources/googledrive) allows polling recent modified files.
 See documentations for specific data sources.

@@ -236,13 +234,13 @@ Otherwise, it falls back to the same behavior as one time update, and will finis
 To perform live update, you need to create a `cocoindex.FlowLiveUpdater` object using the `cocoindex.Flow` object.
 It takes an optional `cocoindex.FlowLiveUpdaterOptions` option, with the following fields:

-* `live_mode` (type: `bool`, default: `True`):
+* `live_mode` (type: `bool`, default: `True`):
 Whether to perform live update for data sources with change capture mechanisms.
 It has no effect for data sources without any change capture mechanism.

-* `print_stats` (type: `bool`, default: `False`): Whether to print stats during update.
+* `print_stats` (type: `bool`, default: `False`): Whether to print stats during update.

-* `reexport_targets` (type: `bool`, default: `False`): Whether to reexport the targets even if there's no change.
+* `reexport_targets` (type: `bool`, default: `False`): Whether to reexport the targets even if there's no change.

 Note that `cocoindex.FlowLiveUpdater` provides a unified interface for both one-time update and live update.
 It only performs live update when `live_mode` is `True`, and only for sources with change capture mechanisms enabled.
@@ -257,26 +255,26 @@ my_updater = cocoindex.FlowLiveUpdater(

 A `FlowLiveUpdater` object supports the following methods:

-* `start()`: Start the updater.
+* `start()`: Start the updater.
 CocoIndex will continuously capture changes from the source data and update the target data accordingly in background threads managed by the engine.

-* `abort()`: Abort the updater.
+* `abort()`: Abort the updater.

-* `wait()`: Wait for the updater to finish. It only unblocks in one of the following cases:
-* The updater was aborted.
-* A one time update is done, and live update is not enabled:
+* `wait()`: Wait for the updater to finish. It only unblocks in one of the following cases:
+* The updater was aborted.
+* A one time update is done, and live update is not enabled:
 either `live_mode` is `False`, or all data sources have no change capture mechanisms enabled.

-* `next_status_updates()`: Get the next status updates.
+* `next_status_updates()`: Get the next status updates.
 It blocks until there's a new status updates, including the processing finishes for a bunch of source updates, and live updater stops (aborted, or no more sources to process).
 You can continuously call this method in a loop to get the latest status updates and react accordingly.

 It returns a `cocoindex.FlowUpdaterStatusUpdates` object, with the following properties:
-* `active_sources`: Names of sources that are still active, i.e. not stopped processing. If it's empty, it means the updater is stopped.
-* `updated_sources`: Names of sources with updates since last time.
+* `active_sources`: Names of sources that are still active, i.e. not stopped processing. If it's empty, it means the updater is stopped.
+* `updated_sources`: Names of sources with updates since last time.
 You can check this to see which sources have recent updates and get processed.

-* `update_stats()`: It returns the stats of the updater.
+* `update_stats()`: It returns the stats of the updater.

 This snippets shows the lifecycle of a live updater:

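Reviewer note: the lifecycle snippet that the documentation refers to lies outside this hunk and is not reproduced here. As a rough sketch of the polling pattern the hunk describes (call `next_status_updates()` in a loop, stop once `active_sources` is empty), the following uses stand-in classes instead of the real library. `StatusUpdates`, `StubUpdater`, `poll_until_stopped`, and the source name `files` are hypothetical; only the method and property names `next_status_updates`, `active_sources`, and `updated_sources` are taken from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class StatusUpdates:
    """Hypothetical stand-in for cocoindex.FlowUpdaterStatusUpdates."""

    active_sources: list = field(default_factory=list)
    updated_sources: list = field(default_factory=list)


class StubUpdater:
    """Hypothetical stand-in that replays a scripted sequence of status updates."""

    def __init__(self, script):
        self._script = iter(script)

    def next_status_updates(self) -> StatusUpdates:
        # The real method is described as blocking until new updates arrive;
        # this stub simply returns the next scripted item.
        return next(self._script)


def poll_until_stopped(updater) -> list:
    """Collect updated source names until no source is active any more."""
    seen = []
    while True:
        updates = updater.next_status_updates()
        seen.extend(updates.updated_sources)
        if not updates.active_sources:
            # Empty `active_sources` means the updater has stopped.
            break
    return seen


script = [
    StatusUpdates(active_sources=["files"], updated_sources=["files"]),
    StatusUpdates(active_sources=["files"], updated_sources=[]),
    StatusUpdates(active_sources=[], updated_sources=["files"]),
]
processed = poll_until_stopped(StubUpdater(script))
```

With a real updater, the same loop shape would presumably run between `start()` and `wait()`, but that wiring is not shown in this hunk and is left out of the sketch.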
@@ -331,7 +329,7 @@ with cocoindex.FlowLiveUpdater(demo_flow) as my_updater:

 CocoIndex also provides asynchronous versions of APIs for blocking operations, including:

-* `start_async()` and `wait_async()`, e.g.
+* `start_async()` and `wait_async()`, e.g.

 ```python
 my_updater = cocoindex.FlowLiveUpdater(demo_flow)
@@ -347,7 +345,7 @@ CocoIndex also provides asynchronous versions of APIs for blocking operations, i
 print(my_updater.update_stats())
 ```

-* `next_status_updates_async()`, e.g.
+* `next_status_updates_async()`, e.g.

 ```python
 while True:
@@ -356,7 +354,7 @@ CocoIndex also provides asynchronous versions of APIs for blocking operations, i
 ...
 ```

-* Async context manager, e.g.
+* Async context manager, e.g.

 ```python
 async with cocoindex.FlowLiveUpdater(demo_flow) as my_updater:
@@ -376,8 +374,8 @@ CocoIndex allows you to run the transformations defined by the flow without upda
 The `cocoindex evaluate` subcommand runs the transformation and dumps flow outputs.
 It takes the following options:

-* `--output-dir` (optional): The directory to dump the result to. If not provided, it will use `eval_{flow_name}_{timestamp}`.
-* `--no-cache` (optional): By default, we use already-cached intermediate data if available.
+* `--output-dir` (optional): The directory to dump the result to. If not provided, it will use `eval_{flow_name}_{timestamp}`.
+* `--no-cache` (optional): By default, we use already-cached intermediate data if available.
 This flag will turn it off.
 Note that we only read existing cached data without updating the cache, even if it's turned on.

@@ -396,8 +394,8 @@ The `evaluate_and_dump()` method runs the transformation and dumps flow outputs

 It takes a `EvaluateAndDumpOptions` dataclass as input to configure, with the following fields:

-* `output_dir` (type: `str`, required): The directory to dump the result to.
-* `use_cache` (type: `bool`, default: `True`): Use already-cached intermediate data if available.
+* `output_dir` (type: `str`, required): The directory to dump the result to.
+* `use_cache` (type: `bool`, default: `True`): Use already-cached intermediate data if available.
 Note that we only read existing cached data without updating the cache, even if it's turned on.

 Example:

docs/docs/custom_ops/custom_functions.mdx

Lines changed: 22 additions & 0 deletions
@@ -145,6 +145,8 @@ Custom functions take the following additional parameters:
 * `batching: bool`: Whether the executor will consume requests in batch.
 See the [Batching](#batching) section below for details.

+* `max_batch_size: int | None`: The maximum batch size for the executor.
+
 * `behavior_version: int`: The version of the behavior of the function.
 When the version is changed, the function will be re-executed even if cache is enabled.
 It's required to be set if `cache` is `True`.
@@ -221,5 +223,25 @@ class ComputeSomethingExecutor:
 ...
 ```

+### Controlling Batch Size
+
+You can control the maximum batch size using the `max_batch_size` parameter. This is useful for:
+* Limiting memory usage when processing large batches
+* Reducing latency by flushing batches before they grow too large
+* Working with APIs that have request size limits
+
+```python
+@cocoindex.op.function(batching=True, max_batch_size=32)
+def compute_something(args: list[str]) -> list[str]:
+    ...
+```
+
+With `max_batch_size` set, a batch will be flushed when either:
+
+1. No ongoing batches are running
+2. The pending batch size reaches `max_batch_size`
+
+This ensures that requests don't wait indefinitely for a batch to fill up, while still allowing efficient batch processing.
+
 </TabItem>
 </Tabs>

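Reviewer note: the two flush conditions added in "Controlling Batch Size" can be modeled with a small toy queue. This is a sketch of the rule as stated in the documentation text, not CocoIndex's actual batching code. `BatchQueue`, `submit`, and `batch_done` are hypothetical names, and flushing the pending items when the last running batch completes is an assumption about how "no ongoing batches are running" is applied.

```python
from typing import List, Optional


class BatchQueue:
    """Toy model of the documented flush rule (hypothetical, not the library's code)."""

    def __init__(self, max_batch_size: Optional[int] = None) -> None:
        self.max_batch_size = max_batch_size
        self.pending: List[str] = []
        self.ongoing = 0  # number of batches currently executing
        self.flushed: List[List[str]] = []  # batches in the order they were flushed

    def _flush(self) -> None:
        if self.pending:
            self.flushed.append(self.pending)
            self.pending = []
            self.ongoing += 1

    def submit(self, item: str) -> None:
        self.pending.append(item)
        no_ongoing = self.ongoing == 0
        full = (
            self.max_batch_size is not None
            and len(self.pending) >= self.max_batch_size
        )
        # Flush when nothing is running, or when the pending batch is full.
        if no_ongoing or full:
            self._flush()

    def batch_done(self) -> None:
        self.ongoing -= 1
        # Assumption: once no batch is running, pending items go out immediately.
        if self.ongoing == 0:
            self._flush()


q = BatchQueue(max_batch_size=3)
q.submit("a")            # nothing running, so "a" is flushed on its own
for item in ("b", "c", "d"):
    q.submit(item)       # accumulates while the first batch runs; flushes at size 3
q.submit("e")            # waits: two batches are still running
q.batch_done()
q.batch_done()           # last running batch finished, so "e" is flushed
```

In this model a request never waits for a batch to fill up unless another batch is already running, which matches the closing sentence of the added section.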
docs/docs/examples/examples/00_codebase_index.md

Lines changed: 17 additions & 5 deletions
@@ -19,9 +19,11 @@ import { GitHubButton, YouTubeButton, DocumentationButton } from '../../../src/c
 ![Codebase Index](/img/examples/codebase_index/cover.png)

 ## Overview
+
 In this tutorial, we will build codebase index. [CocoIndex](https://github.com/cocoindex-io/cocoindex) provides built-in support for codebase chunking, with native Tree-sitter support. It works with large codebases, and can be updated in near real-time with incremental processing - only reprocess what's changed.

 ## Use Cases
+
 A wide range of applications can be built with an effective codebase index that is always up-to-date.

 - Semantic code context for AI coding agents like Claude, Codex, Gemini CLI.
@@ -45,13 +47,16 @@ The flow is composed of the following steps:
 - Store in a vector database for retrieval

 ## Setup
+
 - Install Postgres, follow [installation guide](https://cocoindex.io/docs/getting_started/installation#-install-postgres).
 - Install CocoIndex
+
 ```bash
 pip install -U cocoindex
 ```

-## Add the codebase as a source.
+## Add the codebase as a source
+
 We will index the CocoIndex codebase. Here we use the `LocalFile` source to ingest files from the CocoIndex codebase root directory.

 ```python
@@ -72,7 +77,6 @@ def code_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoind
 `flow_builder.add_source` will create a table with sub fields (`filename`, `content`).
 <DocumentationButton url="https://cocoindex.io/docs/sources" text="Sources" />

-
 ## Process each file and collect the information

 ### Extract the extension of a filename
@@ -90,6 +94,7 @@ def extract_extension(filename: str) -> str:
 <DocumentationButton url="https://cocoindex.io/docs/custom_ops/custom_functions" text="Custom Function" margin="0 0 16px 0" />

 ### Split the file into chunks
+
 We use the `SplitRecursively` function to split the file into chunks. `SplitRecursively` is CocoIndex building block, with native integration with Tree-sitter. You need to pass in the language to the `language` parameter if you are processing code.

 ```python
@@ -100,11 +105,13 @@ with data_scope["files"].row() as file:
 cocoindex.functions.SplitRecursively(),
 language=file["extension"], chunk_size=1000, chunk_overlap=300)
 ```
+
 <DocumentationButton url="https://cocoindex.io/docs/ops/functions#splitrecursively" text="SplitRecursively" margin="0 0 16px 0" />

 ![SplitRecursively](/img/examples/codebase_index/chunk.png)

 ### Embed the chunks
+
 We use `SentenceTransformerEmbed` to embed the chunks.

 ```python
@@ -146,6 +153,7 @@ code_embeddings.export(
 We use [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity) to measure the similarity between the query and the indexed data.

 ## Query the index
+
 We match against user-provided text by a SQL query, reusing the embedding operation in the indexing flow.

 ```python
@@ -197,18 +205,21 @@ if __name__ == "__main__":
 ## Run the index setup & update

 - Install dependencies
+
 ```bash
 pip install -e .
 ```

 - Setup and update the index
+
 ```sh
-cocoindex update --setup main
+cocoindex update main
 ```
-You'll see the index updates state in the terminal

+You'll see the index updates state in the terminal

 ## Test the query
+
 At this point, you can start the CocoIndex server and develop your RAG runtime against the data. To test your index, you could

 ``` bash
@@ -219,14 +230,15 @@ When you see the prompt, you can enter your search query. for example: spec.
 The returned results - each entry contains score (Cosine Similarity), filename, and the code snippet that get matched.

 ## CocoInsight
+
 To get a better understanding of the indexing flow, you can use CocoInsight to help the development step by step.
 To spin up, it is super easy.

 ```
 cocoindex server main.py -ci
 ```
-Follow the url from the terminal - `https://cocoindex.io/cocoinsight` to access the CocoInsight.

+Follow the url from the terminal - `https://cocoindex.io/cocoinsight` to access the CocoInsight.

 ## Supported Languages
