Commit d204ed9

example: add example for BAML (#1256)
* example: add example for BAML
* chore: update `pyproject.toml` and `README`
* fix: `pyproject`
1 parent bc6b3ce commit d204ed9

13 files changed (+243, -0 lines)

README.md

Lines changed: 1 addition & 0 deletions
@@ -202,6 +202,7 @@ It defines an index flow like this:
 | [Custom Output Files](examples/custom_output_files) | Convert markdown files to HTML files and save them to a local directory, using *CocoIndex Custom Targets* |
 | [Patient intake form extraction](examples/patient_intake_extraction) | Use LLM to extract structured data from patient intake forms with different formats |
 | [HackerNews Trending Topics](examples/hn_trending_topics) | Extract trending topics from HackerNews threads and comments, using *CocoIndex Custom Source* and LLM |
+| [Patient Intake Form Extraction with BAML](examples/patient_intake_extraction_baml) | Extract structured data from patient intake forms using BAML |
 
 More coming and stay tuned 👀!

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
# Postgres database address for cocoindex
COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex

GEMINI_API_KEY=
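
As a rough illustration of what these two settings carry, the following minimal sketch parses `.env`-style text with the Python standard library only, reports which required keys are still empty, and splits the database URL into its parts. The helpers `parse_env` and `missing_keys` are hypothetical names, not part of CocoIndex or BAML; the sample text mirrors the four lines added in this hunk.

```python
from urllib.parse import urlsplit


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values: dict[str, str], required: list[str]) -> list[str]:
    """Return the required keys that are absent or set to an empty string."""
    return [key for key in required if not values.get(key)]


# Hypothetical sample: the same content as the file added in this hunk.
EXAMPLE = """\
# Postgres database address for cocoindex
COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex

GEMINI_API_KEY=
"""

env = parse_env(EXAMPLE)
db = urlsplit(env["COCOINDEX_DATABASE_URL"])

# The API key is empty until the user fills it in.
print(missing_keys(env, ["COCOINDEX_DATABASE_URL", "GEMINI_API_KEY"]))
# User, host, and database name encoded in the URL.
print(db.username, db.hostname, db.path.lstrip("/"))
```

A check like this could run before starting the flow, so that an empty `GEMINI_API_KEY` is reported early instead of surfacing as a failed LLM call.
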
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
# BAML generated files
baml_client/
.baml/

# Environment files
.env

# Python
__pycache__/
*.pyc
*.pyo
*.pyd
.Python
*.so
*.egg-info/
dist/
build/

# CocoIndex
.cocoindex/
Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
# Extract structured data from patient intake forms with BAML

[![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex)
We appreciate a star ⭐ at [CocoIndex GitHub](https://github.com/cocoindex-io/cocoindex) if this is helpful.

This example shows how to use [BAML](https://boundaryml.com/) to extract structured data from patient intake PDFs. BAML provides type-safe structured data extraction with native PDF support.

- **BAML Schema** (`baml_src/patient.baml`) - Defines the data structure and the extraction function.
- **CocoIndex Flow** (`main.py`) - Wraps the BAML function in a custom function and provides the flow that processes files incrementally.

## Prerequisites

1. [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have it installed.

2. Install dependencies:

   ```sh
   pip install -U cocoindex baml-py
   ```

3. **Generate BAML client code** (required step!)

   ```sh
   baml generate
   ```

   This generates the `baml_client/` directory with Python code to call your BAML functions.

4. Create a `.env` file. You can copy it from `.env.example` first:

   ```sh
   cp .env.example .env
   ```

   Then edit the file to fill in your `GEMINI_API_KEY`.

## Run

Update index:

```sh
cocoindex update main
```

## CocoInsight

I used CocoInsight (free beta now) to troubleshoot the index generation and understand the data lineage of the pipeline. It connects to your local CocoIndex server with zero pipeline data retention. Run the following command to start CocoInsight:

```sh
cocoindex server -ci main
```

Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight).
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
// BAML Generator Configuration

generator python_client {
  output_type python/pydantic
  output_dir "../"
  version "0.212.0"
}
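
Assuming `output_dir` is resolved relative to the `baml_src/` directory that holds the `.baml` sources (the README in this commit names `baml_src/patient.baml` and says `baml generate` produces `baml_client/`), a value of `"../"` would place the generated client next to `baml_src/` and not inside it. A minimal sketch of that path arithmetic follows; `generated_client_dir` is a hypothetical helper for illustration, not a BAML API.

```python
import posixpath


def generated_client_dir(baml_src: str, output_dir: str) -> str:
    """Join output_dir onto baml_src, append the client folder, and normalize."""
    return posixpath.normpath(posixpath.join(baml_src, output_dir, "baml_client"))


# Assumed layout: the example lives under the path listed in the top-level README.
print(generated_client_dir("examples/patient_intake_extraction_baml/baml_src", "../"))
# -> examples/patient_intake_extraction_baml/baml_client
```

Under that assumption the generated package sits beside `main.py`, which matches the `baml_client/` entry in the ignore file added by this commit.
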
Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
// BAML Schema for Patient Intake Form Extraction

class Contact {
  name string
  phone string
  relationship string
}

class Address {
  street string
  city string
  state string
  zip_code string
}

class Pharmacy {
  name string
  phone string
  address Address
}

class Insurance {
  provider string
  policy_number string
  group_number string?
  policyholder_name string
  relationship_to_patient string
}

class Condition {
  name string
  diagnosed bool
}

class Medication {
  name string
  dosage string
}

class Allergy {
  name string
}

class Surgery {
  name string
  date string
}

class Patient {
  name string
  dob string
  gender string
  address Address
  phone string
  email string
  preferred_contact_method string
  emergency_contact Contact
  insurance Insurance?
  reason_for_visit string
  symptoms_duration string
  past_conditions Condition[]
  current_medications Medication[]
  allergies Allergy[]
  surgeries Surgery[]
  occupation string?
  pharmacy Pharmacy?
  consent_given bool
  consent_date string?
}

function ExtractPatientInfo(intake_form: pdf) -> Patient {
  client Gemini
  prompt #"
    Extract all patient information from the following intake form document.
    Please be thorough and extract all available information accurately.

    {{ intake_form }}

    Fill in with "N/A" for required fields if the information is not available.

    {{ ctx.output_format }}
  "#
}

client<llm> Gemini {
  provider google-ai
  options {
    model gemini-2.5-flash
    api_key env.GEMINI_API_KEY
  }
}
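
Because the prompt asks the model to write "N/A" into required fields it cannot fill, downstream code may want to tell real values from placeholders. The sketch below shows one possible way to do that on a plain dictionary shaped like a subset of `Patient`. The function `blank_placeholders` and the sample record are hypothetical, and the generated pydantic models are deliberately not used so that the snippet runs with the standard library alone.

```python
from typing import Any

PLACEHOLDER = "N/A"  # the literal the prompt asks the model to emit


def blank_placeholders(value: Any) -> Any:
    """Recursively replace "N/A" strings with None inside dicts and lists."""
    if isinstance(value, dict):
        return {key: blank_placeholders(item) for key, item in value.items()}
    if isinstance(value, list):
        return [blank_placeholders(item) for item in value]
    if isinstance(value, str) and value.strip() == PLACEHOLDER:
        return None
    return value


# Made-up record using a few field names from the Patient schema.
record = {
    "name": "Jane Doe",
    "email": "N/A",
    "address": {"street": "1 Main St", "city": "N/A", "state": "CA", "zip_code": "N/A"},
    "allergies": [{"name": "penicillin"}, {"name": "N/A"}],
    "consent_given": True,
    "consent_date": None,
}

cleaned = blank_placeholders(record)
print(cleaned["email"], cleaned["address"]["city"], cleaned["allergies"][1]["name"])
```

Optional fields such as `consent_date` already arrive as `None` when missing, so after this pass both kinds of gap look the same to later steps.
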
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
## Note:
The example files here are purely artificial and are for testing purposes only.
Please do not use these examples for any other purpose.
