Commit 3395382

pamelafox, mattgotteiner, and Copilot authored
Cloud ingestion strategy with prepdocs as custom skillset for Azure AI Search Blob Indexer (#2819)
* Convert prepdocs to skills
* More Bicep to get funcs deployed with auth
* chore(functions): add missing prepdocslib dependencies to function requirements
* build(functions): vendor dependencies into .python_packages for flex consumption
* chore(functions): copy backend requirements as requirements.backend.txt for traceability
* chore(functions): overwrite function requirements with backend pins (backup original)
* chore(functions): remove requirements backup; always overwrite with backend pins
* Get function apps deployed
* Updates to function auth
* latest changes to get auth working
* Fix tests
* always upload local files
* update to storageMetadata extraction
* Got it working
* Working more on the docs
* Update
* Push latest for review
* Consolidate docs
* Clean up vectorization docs and refs
* More code cleanup
* Address Copilot feedback on tests
* More code cleanups
* Cleanup function test
* 100% diff coverage
* Update app/functions/document_extractor/function_app.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update app/backend/prepdocslib/page.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update app/functions/document_extractor/function_app.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Address feedback and tweak docs
* Apply suggestions from code review
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Adding diagram

---------

Co-authored-by: Matt Gotteiner <matthew.gotteiner@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
1 parent 4d933cc commit 3395382

59 files changed, +6676 -765 lines changed

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -148,6 +148,8 @@ npm-debug.log*
 node_modules
 static/
 
+app/functions/*/prepdocslib/
+
 data/**/*.md5
 
 .DS_Store

AGENTS.md

Lines changed: 58 additions & 1 deletion
@@ -17,7 +17,30 @@ If necessary, edit this file to ensure it accurately reflects the current state
 * app/backend/approaches/prompts/chat_query_rewrite.prompty: Prompt used to rewrite the query based off search history into a better search query
 * app/backend/approaches/prompts/chat_query_rewrite_tools.json: Tools used by the query rewriting prompt
 * app/backend/approaches/prompts/chat_answer_question.prompty: Prompt used by the Chat approach to actually answer the question based off sources
+* app/backend/prepdocslib: Contains the document ingestion library used by both local and cloud ingestion
+* app/backend/prepdocslib/blobmanager.py: Manages uploads to Azure Blob Storage
+* app/backend/prepdocslib/cloudingestionstrategy.py: Builds the Azure AI Search indexer and skillset for the cloud ingestion pipeline
+* app/backend/prepdocslib/csvparser.py: Parses CSV files
+* app/backend/prepdocslib/embeddings.py: Generates embeddings for text and images using Azure OpenAI
+* app/backend/prepdocslib/figureprocessor.py: Generates figure descriptions for both local ingestion and the cloud figure-processor skill
+* app/backend/prepdocslib/fileprocessor.py: Orchestrates parsing and chunking of individual files
+* app/backend/prepdocslib/filestrategy.py: Strategy for uploading and indexing files (local ingestion)
+* app/backend/prepdocslib/htmlparser.py: Parses HTML files
+* app/backend/prepdocslib/integratedvectorizerstrategy.py: Strategy using Azure AI Search integrated vectorization
+* app/backend/prepdocslib/jsonparser.py: Parses JSON files
+* app/backend/prepdocslib/listfilestrategy.py: Lists files from local filesystem or Azure Data Lake
+* app/backend/prepdocslib/mediadescriber.py: Interfaces for describing images (Azure OpenAI GPT-4o, Content Understanding)
+* app/backend/prepdocslib/page.py: Data classes for pages, images, and chunks
+* app/backend/prepdocslib/parser.py: Base parser interface
+* app/backend/prepdocslib/pdfparser.py: Parses PDFs using Azure Document Intelligence or local parser
+* app/backend/prepdocslib/searchmanager.py: Manages Azure AI Search index creation and updates
+* app/backend/prepdocslib/servicesetup.py: Shared service setup helpers for OpenAI, embeddings, blob storage, etc.
+* app/backend/prepdocslib/strategy.py: Base strategy interface for document ingestion
+* app/backend/prepdocslib/textparser.py: Parses plain text and markdown files
+* app/backend/prepdocslib/textprocessor.py: Processes text chunks for cloud ingestion (merges figures, generates embeddings)
+* app/backend/prepdocslib/textsplitter.py: Splits text into chunks using different strategies
 * app/backend/app.py: The main entry point for the backend application.
+* app/functions: Azure Functions used for cloud ingestion custom skills (document extraction, figure processing, text processing). Each function bundles a synchronized copy of `prepdocslib`; run `python scripts/copy_prepdocslib.py` to refresh the local copies if you modify the library.
 * app/frontend: Contains the React frontend code, built with TypeScript, built with vite.
 * app/frontend/src/api: Contains the API client code for communicating with the backend.
 * app/frontend/src/components: Contains the React components for the frontend.
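Note on the `app/functions` entry above: it says each function bundles a synchronized copy of `prepdocslib`, refreshed by `scripts/copy_prepdocslib.py`. That script's contents are not shown in this excerpt, so the following is only a rough sketch of what such a sync step could look like, not the repository's implementation. The helper name `sync_library` and its parameters are hypothetical.

```python
import shutil
from pathlib import Path


def sync_library(source: Path, functions_root: Path, name: str = "prepdocslib") -> list[Path]:
    """Copy `source` into every immediate subdirectory of `functions_root`.

    Hypothetical helper for illustration; the real scripts/copy_prepdocslib.py
    may select targets and handle files differently.
    """
    copied = []
    for function_dir in sorted(p for p in functions_root.iterdir() if p.is_dir()):
        target = function_dir / name
        # Remove any stale copy first so modules deleted upstream do not linger.
        if target.exists():
            shutil.rmtree(target)
        # Skip bytecode caches; everything else is copied verbatim.
        shutil.copytree(source, target, ignore=shutil.ignore_patterns("__pycache__"))
        copied.append(target)
    return copied
```

Under this assumption, a call such as `sync_library(Path("app/backend/prepdocslib"), Path("app/functions"))` would produce the per-function directories that the new `.gitignore` rule `app/functions/*/prepdocslib/` keeps out of version control.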
@@ -65,7 +88,7 @@ When adding a new developer setting, update:
 * app/backend/approaches/retrievethenread.py : Retrieve from overrides parameter
 * app/backend/app.py: Some settings may need to be sent down in the /config route.
 
-## When adding tests for a new feature:
+## When adding tests for a new feature
 
 All tests are in the `tests` folder and use the pytest framework.
 There are three styles of tests:
@@ -124,3 +147,37 @@ cd scripts && mypy . --config-file=../pyproject.toml
 
 Note that we do not currently enforce type hints in the tests folder, as it would require adding a lot of `# type: ignore` comments to the existing tests.
 We only enforce type hints in the main application code and scripts.
+
+## Python code style
+
+Do not use single underscores in front of "private" methods or variables in Python code. We do not follow that convention in this codebase, since this is an application and not a library.
+
+## Deploying the application
+
+To deploy the application, use the `azd` CLI tool. Make sure you have the latest version of the `azd` CLI installed. Then, run the following command from the root of the repository:
+
+```shell
+azd up
+```
+
+That command will BOTH provision the Azure resources AND deploy the application code.
+
+If you only changed the Bicep templates and want to re-provision the Azure resources, run:
+
+```shell
+azd provision
+```
+
+If you only changed the application code and want to re-deploy the code, run:
+
+```shell
+azd deploy
+```
+
+If you are using cloud ingestion and only want to deploy individual functions, run the necessary deploy commands, for example:
+
+```shell
+azd deploy document-extractor
+azd deploy figure-processor
+azd deploy text-processor
+```

README.md

Lines changed: 1 addition & 1 deletion
@@ -60,7 +60,7 @@ The repo includes sample data so it's ready to try end to end. In this sample ap
 - Chat (multi-turn) and Q&A (single turn) interfaces
 - Renders citations and thought process for each answer
 - Includes settings directly in the UI to tweak the behavior and experiment with options
-- Integrates Azure AI Search for indexing and retrieval of documents, with support for [many document formats](/docs/data_ingestion.md#supported-document-formats) as well as [integrated vectorization](/docs/data_ingestion.md#overview-of-integrated-vectorization)
+- Integrates Azure AI Search for indexing and retrieval of documents, with support for [many document formats](/docs/data_ingestion.md#supported-document-formats) as well as [cloud data ingestion](/docs/data_ingestion.md#cloud-data-ingestion)
 - Optional usage of [multimodal models](/docs/multimodal.md) to reason over image-heavy documents
 - Optional addition of [speech input/output](/docs/deploy_features.md#enabling-speech-inputoutput) for accessibility
 - Optional automation of [user login and data access](/docs/login_and_acl.md) via Microsoft Entra

app/backend/app.py

Lines changed: 17 additions & 12 deletions
@@ -467,6 +467,7 @@ async def setup_clients():
     USE_CHAT_HISTORY_BROWSER = os.getenv("USE_CHAT_HISTORY_BROWSER", "").lower() == "true"
     USE_CHAT_HISTORY_COSMOS = os.getenv("USE_CHAT_HISTORY_COSMOS", "").lower() == "true"
     USE_AGENTIC_RETRIEVAL = os.getenv("USE_AGENTIC_RETRIEVAL", "").lower() == "true"
+    USE_VECTORS = os.getenv("USE_VECTORS", "").lower() != "false"
 
     # WEBSITE_HOSTNAME is always set by App Service, RUNNING_IN_PRODUCTION is set in main.bicep
     RUNNING_ON_AZURE = os.getenv("WEBSITE_HOSTNAME") is not None or os.getenv("RUNNING_IN_PRODUCTION") is not None
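Note on the hunk above: the new `USE_VECTORS` line follows a different convention from the three flags before it. Those are enabled only by the literal string "true", whereas `USE_VECTORS` is enabled by anything except "false", so it stays on when the variable is unset. A minimal sketch of the two conventions follows; the helper names are illustrative and do not exist in `app.py`.

```python
import os


def enabled_only_if_true(value: str) -> bool:
    """Default-off flag: only "true" (case-insensitive) enables it."""
    return value.lower() == "true"


def enabled_unless_false(value: str) -> bool:
    """Default-on flag: anything except "false" (case-insensitive) enables it."""
    return value.lower() != "false"


if __name__ == "__main__":
    # Mirrors the os.getenv(NAME, "") pattern used in the diff.
    raw = os.getenv("USE_VECTORS", "")
    print("USE_VECTORS enabled:", enabled_unless_false(raw))
```

One consequence of the default-on form is that values such as "0" or "no" leave the flag enabled, since only "false" turns it off.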
@@ -582,7 +583,7 @@ async def setup_clients():
         current_app.config[CONFIG_USER_BLOB_MANAGER] = user_blob_manager
 
         # Set up ingester
-        file_processors = setup_file_processors(
+        file_processors, figure_processor = setup_file_processors(
             azure_credential=azure_credential,
             document_intelligence_service=os.getenv("AZURE_DOCUMENTINTELLIGENCE_SERVICE"),
             local_pdf_parser=os.getenv("USE_LOCAL_PDF_PARSER", "").lower() == "true",
@@ -594,18 +595,21 @@ async def setup_clients():
             openai_model=OPENAI_CHATGPT_MODEL,
             openai_deployment=AZURE_OPENAI_CHATGPT_DEPLOYMENT if OPENAI_HOST == OpenAIHost.AZURE else None,
         )
-        search_info = await setup_search_info(
+        search_info = setup_search_info(
             search_service=AZURE_SEARCH_SERVICE, index_name=AZURE_SEARCH_INDEX, azure_credential=azure_credential
         )
-        text_embeddings_service = setup_embeddings_service(
-            open_ai_client=openai_client,
-            openai_host=OPENAI_HOST,
-            emb_model_name=OPENAI_EMB_MODEL,
-            emb_model_dimensions=OPENAI_EMB_DIMENSIONS,
-            azure_openai_deployment=AZURE_OPENAI_EMB_DEPLOYMENT,
-            azure_openai_endpoint=azure_openai_endpoint,
-            disable_vectors=os.getenv("USE_VECTORS", "").lower() == "false",
-        )
+
+        text_embeddings_service = None
+        if USE_VECTORS:
+            text_embeddings_service = setup_embeddings_service(
+                open_ai_client=openai_client,
+                openai_host=OPENAI_HOST,
+                emb_model_name=OPENAI_EMB_MODEL,
+                emb_model_dimensions=OPENAI_EMB_DIMENSIONS,
+                azure_openai_deployment=AZURE_OPENAI_EMB_DEPLOYMENT,
+                azure_openai_endpoint=azure_openai_endpoint,
+            )
+
         image_embeddings_service = setup_image_embeddings_service(
             azure_credential=azure_credential,
             vision_endpoint=AZURE_VISION_ENDPOINT,
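Note on the hunk above: the `disable_vectors` argument is removed from the `setup_embeddings_service` call, and the caller now decides whether to build the service at all, leaving `text_embeddings_service` as `None` when `USE_VECTORS` is off. A minimal sketch of that caller-side guard follows, using hypothetical stand-ins: `StubEmbeddings`, `build_text_embeddings`, and `embed_if_enabled` are not names from the repository.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubEmbeddings:
    """Stand-in for an embeddings service; not a prepdocslib class."""

    model: str

    def embed(self, text: str) -> list[float]:
        # Placeholder vector derived from the text length, purely for illustration.
        return [float(len(text))]


def build_text_embeddings(use_vectors: bool, model: str) -> Optional[StubEmbeddings]:
    # The caller owns the on/off decision; the factory itself has no disable flag.
    if not use_vectors:
        return None
    return StubEmbeddings(model=model)


def embed_if_enabled(service: Optional[StubEmbeddings], text: str) -> Optional[list[float]]:
    # Once the service is optional, every consumer must handle the None case.
    return service.embed(text) if service is not None else None
```

The trade-off of this design is that the factory signature gets simpler, while anything receiving the service has to accept `None`, which a type checker such as mypy can enforce through the `Optional` annotation.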
@@ -618,6 +622,7 @@ async def setup_clients():
             image_embeddings=image_embeddings_service,
             search_field_name_embedding=AZURE_SEARCH_FIELD_NAME_EMBEDDING,
             blob_manager=user_blob_manager,
+            figure_processor=figure_processor,
         )
         current_app.config[CONFIG_INGESTER] = ingester
 
@@ -640,7 +645,7 @@ async def setup_clients():
         OPENAI_CHATGPT_MODEL not in Approach.GPT_REASONING_MODELS
         or Approach.GPT_REASONING_MODELS[OPENAI_CHATGPT_MODEL].streaming
     )
-    current_app.config[CONFIG_VECTOR_SEARCH_ENABLED] = os.getenv("USE_VECTORS", "").lower() != "false"
+    current_app.config[CONFIG_VECTOR_SEARCH_ENABLED] = bool(USE_VECTORS)
     current_app.config[CONFIG_USER_UPLOAD_ENABLED] = bool(USE_USER_UPLOAD)
     current_app.config[CONFIG_LANGUAGE_PICKER_ENABLED] = ENABLE_LANGUAGE_PICKER
     current_app.config[CONFIG_SPEECH_INPUT_ENABLED] = USE_SPEECH_INPUT_BROWSER
