README.md: 1 addition & 0 deletions
@@ -202,6 +202,7 @@ It defines an index flow like this:
 |[Custom Output Files](examples/custom_output_files)| Convert markdown files to HTML files and save them to a local directory, using *CocoIndex Custom Targets*|
 |[Patient intake form extraction](examples/patient_intake_extraction)| Use LLM to extract structured data from patient intake forms with different formats |
 |[HackerNews Trending Topics](examples/hn_trending_topics)| Extract trending topics from HackerNews threads and comments, using *CocoIndex Custom Source* and LLM |
+|[Patient Intake Form Extraction with BAML](examples/patient_intake_extraction_baml)| Extract structured data from patient intake forms using BAML |
docs/docs/core/flow_methods.mdx: 38 additions & 40 deletions
@@ -13,9 +13,9 @@ After a flow is defined as discussed in [Flow Definition](/docs/core/flow_def),
 
 It can be achieved in two ways:
 
-*Use [CocoIndex CLI](/docs/core/cli).
+* Use [CocoIndex CLI](/docs/core/cli).
 
-*Use APIs provided by the library.
+* Use APIs provided by the library.
 You have a `cocoindex.Flow` object after defining the flow in your code, and you can interact with it later.
 
 The following sections assume you have a flow `demo_flow`:
@@ -38,20 +38,20 @@ It creates a `demo_flow` object in `cocoindex.Flow` type.
 
 For a flow, its persistent backends need to be ready before it can run, including:
 
-*[Internal storage](/docs/core/basics#internal-storage) for CocoIndex.
-*Backend resources for targets exported by the flow, e.g. a table (in relational databases), a collection (in some vector databases), etc.
+* [Internal storage](/docs/core/basics#internal-storage) for CocoIndex.
+* Backend resources for targets exported by the flow, e.g. a table (in relational databases), a collection (in some vector databases), etc.
 
 The desired state of the backends for a flow is derived based on the flow definition itself.
 CocoIndex supports two types of actions to manage the persistent backends automatically:
 
-**Setup* a flow, which will change the backends owned by the flow to the desired state, e.g. create new tables for new flow, drop an existing table if the corresponding target is gone, add new column to a target table if a new field is collected, etc. It's no-op if the backend states are already in the desired state.
+* *Setup* a flow, which will change the backends owned by the flow to the desired state, e.g. create new tables for new flow, drop an existing table if the corresponding target is gone, add new column to a target table if a new field is collected, etc. It's no-op if the backend states are already in the desired state.
 
-**Drop* a flow, which will drop all backends owned by the flow. It's no-op if there are no existing backends owned by the flow (e.g. never setup or already dropped).
+* *Drop* a flow, which will drop all backends owned by the flow. It's no-op if there are no existing backends owned by the flow (e.g. never setup or already dropped).
 
 ### CLI
 
 `cocoindex setup` subcommand will setup all flows.
-`cocoindex update` and `cocoindex server` also provide a `--setup` option to setup the flow if needed before performing the main action of updating or starting the server.
+`cocoindex update` and `cocoindex server` also setup the flow if needed before performing the main action of updating or starting the server, with prompt confirmation.
 
 `cocoindex drop` subcommand will drop all flows.
 
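The setup and drop semantics described in this hunk (move the backends to the desired state, and do nothing when they are already there) can be pictured as a small reconciliation step. The sketch below is a toy model only: `Schema`, `plan_setup`, and `plan_drop` are hypothetical names invented for illustration and are not CocoIndex APIs, and real backends involve far more than table and column names.

```python
from typing import Dict, List, Set

# Illustrative only: a toy reconciliation over "table name -> column names".
# None of these names come from CocoIndex.
Schema = Dict[str, Set[str]]


def plan_setup(current: Schema, desired: Schema) -> List[str]:
    """Return the actions needed to move `current` to `desired`."""
    actions: List[str] = []
    # Targets that are new in the flow definition need a table.
    for table in sorted(desired.keys() - current.keys()):
        actions.append(f"create table {table}")
    # Tables whose target is gone from the flow definition are dropped.
    for table in sorted(current.keys() - desired.keys()):
        actions.append(f"drop table {table}")
    # Existing tables gain columns for newly collected fields.
    for table in sorted(desired.keys() & current.keys()):
        for column in sorted(desired[table] - current[table]):
            actions.append(f"add column {table}.{column}")
    return actions


def plan_drop(current: Schema) -> List[str]:
    """Dropping a flow is the same as reconciling towards an empty state."""
    return plan_setup(current, {})


current = {"docs": {"id", "text"}, "old_target": {"id"}}
desired = {"docs": {"id", "text", "embedding"}, "chunks": {"id"}}

setup_actions = plan_setup(current, desired)
noop_actions = plan_setup(desired, desired)  # already in the desired state
drop_actions = plan_drop(current)
print(setup_actions, noop_actions, drop_actions)
```

Planning against the desired state, rather than replaying a fixed list of steps, is what makes a repeated setup a no-op in this toy model.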
@@ -62,8 +62,8 @@ CocoIndex supports two types of actions to manage the persistent backends automa
 
 `Flow` provides the following APIs to setup / drop individual flows:
 
-*`setup(report_to_stdout: bool = False)`: Setup the flow.
-*`drop(report_to_stdout: bool = False)`: Drop the flow.
+* `setup(report_to_stdout: bool = False)`: Setup the flow.
+* `drop(report_to_stdout: bool = False)`: Drop the flow.

 Besides, CocoIndex also provides APIs to setup / drop all flows at once:
 
-*`setup_all_flows(report_to_stdout: bool = False)`: Setup all flows.
-*`drop_all_flows(report_to_stdout: bool = False)`: Drop all flows.
+* `setup_all_flows(report_to_stdout: bool = False)`: Setup all flows.
+* `drop_all_flows(report_to_stdout: bool = False)`: Drop all flows.
 
 For example:
 
@@ -113,12 +112,12 @@ If you want to remove the flow from the current process, you can call `demo_flow
 The major goal of a flow is to perform the transformations on source data and build/update data in the target.
 This action has two modes:
 
-***One time update.**
+* **One time update.**
 It builds/updates the target data based on source data up to the current moment.
 After the target data is at least as fresh as the source data when update starts, it's done.
 It fits into situations that you need to access the fresh target data at certain time points.
 
-***Live update.**
+* **Live update.**
 During live update, a one time update is performed first, then it continuously captures changes from the source data and updates the target data accordingly.
 It's long-running and only stops when being aborted explicitly.
 It fits into situations that you need to access the fresh target data continuously in most of the time.
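The difference between the two modes can be illustrated with a toy model. This is a conceptual sketch under stated assumptions, not CocoIndex code: `ToyFlow`, `one_time_update`, and `live_update` are hypothetical names, and the change stream is a finite list here, whereas a real live update keeps running until it is aborted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical stand-ins for illustration only; none of these are CocoIndex APIs.
Change = Tuple[str, str]  # (key, new source value)


@dataclass
class ToyFlow:
    source: Dict[str, str]
    transform: Callable[[str], str]
    target: Dict[str, str] = field(default_factory=dict)

    def one_time_update(self) -> None:
        """Bring the target up to date with the source as of this call, then return."""
        snapshot = dict(self.source)
        for key, value in snapshot.items():
            self.target[key] = self.transform(value)
        # Remove target rows whose source rows no longer exist.
        for key in list(self.target):
            if key not in snapshot:
                del self.target[key]

    def live_update(self, changes: Iterable[Change]) -> None:
        """Run a one time update first, then apply captured changes as they arrive."""
        self.one_time_update()
        for key, value in changes:  # a real change stream would not end on its own
            self.source[key] = value
            self.target[key] = self.transform(value)


flow = ToyFlow(source={"a.md": "hello"}, transform=str.upper)
flow.one_time_update()
one_time_result = dict(flow.target)

flow.live_update(changes=[("b.md", "world"), ("a.md", "hi")])
live_result = dict(flow.target)
print(one_time_result, live_result)
```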
@@ -133,7 +132,7 @@ This is to achieve best efficiency.
 
 Besides major update modes, CocoIndex also supports the following options:
 
-***Reexport targets**.
+* **Reexport targets**.
 When this is enabled, even if both of the source data and flow definition are not changed, CocoIndex will still reprocess and reexport the targets.
 It's helpful when you want to reload the target data, e.g. after some data loss.
 Note that when this is enabled on live update mode, reexport only happens for the initial one time update.
@@ -153,7 +152,7 @@ cocoindex update main
 With a `--setup` option, it will also setup the flow first if needed.
 
 ```sh
-cocoindex update --setup main
+cocoindex update main
 ```
 
 With a `--reexport` option, it will reexport the targets even if there's no change.
 A `FlowLiveUpdater` object supports the following methods:
 
-*`start()`: Start the updater.
+* `start()`: Start the updater.
 CocoIndex will continuously capture changes from the source data and update the target data accordingly in background threads managed by the engine.
 
-*`abort()`: Abort the updater.
+* `abort()`: Abort the updater.
 
-*`wait()`: Wait for the updater to finish. It only unblocks in one of the following cases:
-* The updater was aborted.
-* A one time update is done, and live update is not enabled:
+* `wait()`: Wait for the updater to finish. It only unblocks in one of the following cases:
+* The updater was aborted.
+* A one time update is done, and live update is not enabled:
 either `live_mode` is `False`, or all data sources have no change capture mechanisms enabled.
 
-*`next_status_updates()`: Get the next status updates.
+* `next_status_updates()`: Get the next status updates.
 It blocks until there's a new status update, including the processing finishes for a bunch of source updates, and live updater stops (aborted, or no more sources to process).
 You can continuously call this method in a loop to get the latest status updates and react accordingly.
 
 It returns a `cocoindex.FlowUpdaterStatusUpdates` object, with the following properties:
-*`active_sources`: Names of sources that are still active, i.e. not stopped processing. If it's empty, it means the updater is stopped.
-*`updated_sources`: Names of sources with updates since last time.
+* `active_sources`: Names of sources that are still active, i.e. not stopped processing. If it's empty, it means the updater is stopped.
+* `updated_sources`: Names of sources with updates since last time.
 You can check this to see which sources have recent updates and get processed.
 
-*`update_stats()`: It returns the stats of the updater.
+* `update_stats()`: It returns the stats of the updater.
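The polling pattern described for `next_status_updates()` can be sketched without a running engine by using stand-in classes. In the sketch below, only the names `next_status_updates`, `active_sources`, and `updated_sources` come from the documentation in this diff; `FakeStatusUpdates`, `FakeLiveUpdater`, and `drain` are hypothetical helpers that replay a scripted sequence instead of blocking on real source changes.

```python
from dataclasses import dataclass
from typing import List


# Stand-ins for illustration only. The method and property names mirror the
# documented ones, but these classes are NOT the CocoIndex implementation.
@dataclass
class FakeStatusUpdates:
    active_sources: List[str]
    updated_sources: List[str]


class FakeLiveUpdater:
    def __init__(self, scripted: List[FakeStatusUpdates]) -> None:
        self._scripted = iter(scripted)

    def next_status_updates(self) -> FakeStatusUpdates:
        # The documented method blocks until something changes;
        # this fake simply replays the next scripted status.
        return next(self._scripted)


def drain(updater: FakeLiveUpdater) -> List[str]:
    """Poll until no source is active, collecting names of updated sources."""
    seen: List[str] = []
    while True:
        updates = updater.next_status_updates()
        seen.extend(updates.updated_sources)
        if not updates.active_sources:  # empty means the updater has stopped
            return seen


updater = FakeLiveUpdater([
    FakeStatusUpdates(active_sources=["files", "tickets"], updated_sources=["files"]),
    FakeStatusUpdates(active_sources=["files"], updated_sources=["tickets"]),
    FakeStatusUpdates(active_sources=[], updated_sources=[]),
])
processed = drain(updater)
print(processed)
```

Checking `active_sources` after handling `updated_sources` means the final batch of updates is not dropped when the updater stops.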
 
 This snippet shows the lifecycle of a live updater:
 
@@ -331,7 +329,7 @@ with cocoindex.FlowLiveUpdater(demo_flow) as my_updater:
 
 CocoIndex also provides asynchronous versions of APIs for blocking operations, including:
 
-*`start_async()` and `wait_async()`, e.g.
+* `start_async()` and `wait_async()`, e.g.
 
 ```python
 my_updater = cocoindex.FlowLiveUpdater(demo_flow)
@@ -347,7 +345,7 @@ CocoIndex also provides asynchronous versions of APIs for blocking operations, i
 print(my_updater.update_stats())
 ```
 
-*`next_status_updates_async()`, e.g.
+* `next_status_updates_async()`, e.g.
 
 ```python
 while True:
@@ -356,7 +354,7 @@ CocoIndex also provides asynchronous versions of APIs for blocking operations, i
 ...
 ```
 
-*Async context manager, e.g.
+* Async context manager, e.g.
 
 ```python
 async with cocoindex.FlowLiveUpdater(demo_flow) as my_updater:
@@ -376,8 +374,8 @@ CocoIndex allows you to run the transformations defined by the flow without upda
 The `cocoindex evaluate` subcommand runs the transformation and dumps flow outputs.
 It takes the following options:
 
-*`--output-dir` (optional): The directory to dump the result to. If not provided, it will use `eval_{flow_name}_{timestamp}`.
-*`--no-cache` (optional): By default, we use already-cached intermediate data if available.
+* `--output-dir` (optional): The directory to dump the result to. If not provided, it will use `eval_{flow_name}_{timestamp}`.
+* `--no-cache` (optional): By default, we use already-cached intermediate data if available.
 This flag will turn it off.
 Note that we only read existing cached data without updating the cache, even if it's turned on.
 
@@ -396,8 +394,8 @@ The `evaluate_and_dump()` method runs the transformation and dumps flow outputs
 
 It takes a `EvaluateAndDumpOptions` dataclass as input to configure, with the following fields:
 
-*`output_dir` (type: `str`, required): The directory to dump the result to.
-*`use_cache` (type: `bool`, default: `True`): Use already-cached intermediate data if available.
+* `output_dir` (type: `str`, required): The directory to dump the result to.
+* `use_cache` (type: `bool`, default: `True`): Use already-cached intermediate data if available.
 Note that we only read existing cached data without updating the cache, even if it's turned on.
 In this tutorial, we will build a codebase index. [CocoIndex](https://github.com/cocoindex-io/cocoindex) provides built-in support for codebase chunking, with native Tree-sitter support. It works with large codebases, and can be updated in near real-time with incremental processing - only reprocess what's changed.
 
 ## Use Cases
+
 A wide range of applications can be built with an effective codebase index that is always up-to-date.
 
 - Semantic code context for AI coding agents like Claude, Codex, Gemini CLI.
@@ -45,13 +47,16 @@ The flow is composed of the following steps:
 We use the `SplitRecursively` function to split the file into chunks. `SplitRecursively` is a CocoIndex building block, with native integration with Tree-sitter. You need to pass in the language to the `language` parameter if you are processing code.
 
 ```python
@@ -100,11 +105,13 @@ with data_scope["files"].row() as file: