Cloud ingestion strategy with prepdocs as custom skillset for Azure AI Search Blob Indexer (#2819)
* Convert prepdocs to skills
* More Bicep to get funcs deployed with auth
* chore(functions): add missing prepdocslib dependencies to function requirements
* build(functions): vendor dependencies into .python_packages for flex consumption
* chore(functions): copy backend requirements as requirements.backend.txt for traceability
* chore(functions): overwrite function requirements with backend pins (backup original)
* chore(functions): remove requirements backup; always overwrite with backend pins
* Get function apps deployed
* Updates to function auth
* latest changes to get auth working
* Fix tests
* always upload local files
* update to storageMetadata extraction
* Got it working
* Working more on the docs
* Update
* Push latest for review
* Consolidate docs
* Clean up vectorization docs and refs
* More code cleanup
* Address Copilot feedback on tests
* More code cleanups
* Cleanup function test
* 100% diff coverage
* Update app/functions/document_extractor/function_app.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update app/backend/prepdocslib/page.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update app/functions/document_extractor/function_app.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Address feedback and tweak docs
* Apply suggestions from code review
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Adding diagram
---------
Co-authored-by: Matt Gotteiner <matthew.gotteiner@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
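
For context on what "custom skill" means in the commit title: an Azure AI Search indexer calls a custom Web API skill with a JSON body containing a `values` array, where each entry carries a `recordId` and a `data` object, and it expects a `values` array back with matching `recordId`s. The sketch below shows only that envelope handling in plain Python. It is not the code from this commit; `process_record`, `handle_skill_request`, and the `text`/`length` fields are hypothetical names chosen for illustration.

```python
from typing import Any, Callable


def process_record(data: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical per-record work: report the length of the input text."""
    text = data["text"]  # assumed input field name, not taken from the commit
    return {"length": len(text)}


def handle_skill_request(
    body: dict[str, Any],
    worker: Callable[[dict[str, Any]], dict[str, Any]] = process_record,
) -> dict[str, Any]:
    """Apply `worker` to each record and build the custom skill response envelope."""
    results = []
    for record in body.get("values", []):
        result: dict[str, Any] = {
            "recordId": record.get("recordId"),
            "data": {},
            "errors": [],
            "warnings": [],
        }
        try:
            result["data"] = worker(record.get("data", {}))
        except Exception as exc:
            # Report the failure on this record only, so the rest of the batch still succeeds.
            result["errors"].append({"message": str(exc)})
        results.append(result)
    return {"values": results}


if __name__ == "__main__":
    request = {
        "values": [
            {"recordId": "1", "data": {"text": "hello"}},
            {"recordId": "2", "data": {}},  # missing "text" triggers the error path
        ]
    }
    print(handle_skill_request(request))
```

Recording errors per record rather than raising keeps one bad document from failing the whole batch the indexer sent.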
* app/backend/prepdocslib/page.py: Data classes for pages, images, and chunks
+* app/backend/prepdocslib/parser.py: Base parser interface
+* app/backend/prepdocslib/pdfparser.py: Parses PDFs using Azure Document Intelligence or local parser
+* app/backend/prepdocslib/searchmanager.py: Manages Azure AI Search index creation and updates
+* app/backend/prepdocslib/servicesetup.py: Shared service setup helpers for OpenAI, embeddings, blob storage, etc.
+* app/backend/prepdocslib/strategy.py: Base strategy interface for document ingestion
+* app/backend/prepdocslib/textparser.py: Parses plain text and markdown files
+* app/backend/prepdocslib/textprocessor.py: Processes text chunks for cloud ingestion (merges figures, generates embeddings)
+* app/backend/prepdocslib/textsplitter.py: Splits text into chunks using different strategies
 * app/backend/app.py: The main entry point for the backend application.
+* app/functions: Azure Functions used for cloud ingestion custom skills (document extraction, figure processing, text processing). Each function bundles a synchronized copy of `prepdocslib`; run `python scripts/copy_prepdocslib.py` to refresh the local copies if you modify the library.
 * app/frontend: Contains the React frontend code, built with TypeScript, built with vite.
 * app/frontend/src/api: Contains the API client code for communicating with the backend.
 * app/frontend/src/components: Contains the React components for the frontend.
@@ -65,7 +88,7 @@ When adding a new developer setting, update:
 * app/backend/approaches/retrievethenread.py : Retrieve from overrides parameter
 * app/backend/app.py: Some settings may need to be sent down in the /config route.
 
-## When adding tests for a new feature:
+## When adding tests for a new feature
 
 All tests are in the `tests` folder and use the pytest framework.
 There are three styles of tests:
@@ -124,3 +147,37 @@ cd scripts && mypy . --config-file=../pyproject.toml
 
 Note that we do not currently enforce type hints in the tests folder, as it would require adding a lot of `# type: ignore` comments to the existing tests.
 We only enforce type hints in the main application code and scripts.
+
+## Python code style
+
+Do not use single underscores in front of "private" methods or variables in Python code. We do not follow that convention in this codebase, since this is an application and not a library.
+
+## Deploying the application
+
+To deploy the application, use the `azd` CLI tool. Make sure you have the latest version of the `azd` CLI installed. Then, run the following command from the root of the repository:
+
+```shell
+azd up
+```
+
+That command will BOTH provision the Azure resources AND deploy the application code.
+
+If you only changed the Bicep templates and want to re-provision the Azure resources, run:
+
+```shell
+azd provision
+```
+
+If you only changed the application code and want to re-deploy the code, run:
+
+```shell
+azd deploy
+```
+
+If you are using cloud ingestion and only want to deploy individual functions, run the necessary deploy commands, for example:
README.md: 1 addition & 1 deletion
@@ -60,7 +60,7 @@ The repo includes sample data so it's ready to try end to end. In this sample ap
 - Chat (multi-turn) and Q&A (single turn) interfaces
 - Renders citations and thought process for each answer
 - Includes settings directly in the UI to tweak the behavior and experiment with options
-- Integrates Azure AI Search for indexing and retrieval of documents, with support for [many document formats](/docs/data_ingestion.md#supported-document-formats) as well as [integrated vectorization](/docs/data_ingestion.md#overview-of-integrated-vectorization)
+- Integrates Azure AI Search for indexing and retrieval of documents, with support for [many document formats](/docs/data_ingestion.md#supported-document-formats) as well as [cloud data ingestion](/docs/data_ingestion.md#cloud-data-ingestion)
 - Optional usage of [multimodal models](/docs/multimodal.md) to reason over image-heavy documents
 - Optional addition of [speech input/output](/docs/deploy_features.md#enabling-speech-inputoutput) for accessibility
 - Optional automation of [user login and data access](/docs/login_and_acl.md) via Microsoft Entra