From cb26aeb58a042bdb4cef09f960e155398c290077 Mon Sep 17 00:00:00 2001 From: Timna Brown <24630902+brown9804@users.noreply.github.com> Date: Mon, 19 May 2025 22:47:45 -0600 Subject: [PATCH 1/9] update format From d8ccf7aa0ac744444b92201a3f60d39668ebb590 Mon Sep 17 00:00:00 2001 From: "github-actions[bot]" Date: Tue, 20 May 2025 04:48:03 +0000 Subject: [PATCH 2/9] Fix Markdown syntax issues --- README.md | 32 ++++++++++++++++---------------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/README.md b/README.md index ebe2549..d3ff8d7 100644 --- a/README.md +++ b/README.md @@ -12,7 +12,6 @@ Last updated: 2025-05-20 ---------- -
List of References (Click to expand) @@ -30,21 +29,21 @@ Last updated: 2025-05-20 - [Where to start?](#where-to-start) - [Important Considerations for Production Environment](#important-considerations-for-production-environment) - [Overview](#overview) - - [Function App Hosting Options](#function-app-hosting-options) + - [Function App Hosting Options](#function-app-hosting-options) - [Step 1: Set Up Your Azure Environment](#step-1-set-up-your-azure-environment) - [Step 2: Set Up Azure Blob Storage for PDF Ingestion](#step-2-set-up-azure-blob-storage-for-pdf-ingestion) - [Step 3: Set Up Azure Cosmos DB](#step-3-set-up-azure-cosmos-db) - [Step 4: Set Up Azure Functions for Document Ingestion and Processing](#step-4-set-up-azure-functions-for-document-ingestion-and-processing) - - [Create a Function App](#create-a-function-app) - - [Configure/Validate the Environment variables](#configurevalidate-the-environment-variables) - - [Develop the Function](#develop-the-function) + - [Create a Function App](#create-a-function-app) + - [Configure/Validate the Environment variables](#configurevalidate-the-environment-variables) + - [Develop the Function](#develop-the-function) - [Step 5: Test the solution](#step-5-test-the-solution)
- > [!NOTE] > Limitations of this approach:
+> > - Requires significant manual effort to structure and format extracted data.
> - Limited in handling complex layouts and non-text elements like images and charts.
@@ -107,8 +106,8 @@ Last updated: 2025-05-20 - An `Azure subscription is required`. All other resources, including instructions for creating a Resource Group, are provided in this workshop. - `Contributor role assigned or any custom role that allows`: access to manage all resources, and the ability to deploy resources within subscription. - If you choose to use the Terraform approach, please ensure that: - - [Terraform is installed on your local machine](https://developer.hashicorp.com/terraform/tutorials/azure-get-started/install-cli#install-terraform). - - [Install the Azure CLI](https://learn.microsoft.com/en-us/cli/azure/install-azure-cli) to work with both Terraform and Azure commands. + - [Terraform is installed on your local machine](https://developer.hashicorp.com/terraform/tutorials/azure-get-started/install-cli#install-terraform). + - [Install the Azure CLI](https://learn.microsoft.com/en-us/cli/azure/install-azure-cli) to work with both Terraform and Azure commands. ## Where to start? @@ -125,6 +124,7 @@ This is an introductory workshop on Microsoft Fabric. Please follow as described ## Overview > Using Cosmos DB provides you with a flexible, scalable, and globally distributed database solution that can handle both structured and semi-structured data efficiently.
+> > - `Azure Blob Storage`: Store the PDF invoices.
> - `Azure Functions`: Trigger on new PDF uploads, extract data, and process it.
> - `Azure SQL Database or Cosmos DB`: Store the extracted data for querying and analytics.
@@ -211,7 +211,7 @@ This is an introductory workshop on Microsoft Fabric. Please follow as described ## Step 3: Set Up Azure Cosmos DB -> `Azure Cosmos DB` is a globally distributed,` multi-model database service provided by Microsoft Azure`. It is designed to offer high availability, scalability, and low-latency access to data for modern applications. Unlike traditional relational databases, Cosmos DB is a `NoSQL database, meaning it can handle unstructured, semi-structured, and structured data types`. `It supports multiple data models, including document, key-value, graph, and column-family, making it versatile for various use cases.`

+> `Azure Cosmos DB` is a globally distributed,`multi-model database service provided by Microsoft Azure`. It is designed to offer high availability, scalability, and low-latency access to data for modern applications. Unlike traditional relational databases, Cosmos DB is a `NoSQL database, meaning it can handle unstructured, semi-structured, and structured data types`. `It supports multiple data models, including document, key-value, graph, and column-family, making it versatile for various use cases.`

> An `Azure Cosmos DB container` is a `logical unit` within a Cosmos DB database where data is stored. `Containers are schema-agnostic, meaning they can store items with different structures. Each container is automatically partitioned to scale out across multiple servers, providing virtually unlimited throughput and storage`. Containers are the primary scalability unit in Cosmos DB, and they use a partition key to distribute data efficiently across partitions. 1. **Create a Cosmos DB Account**: @@ -320,7 +320,6 @@ This is an introductory workshop on Microsoft Fabric. Please follow as described image - 3. **Get Cosmos DB Account ID**: Run this command to get the ID of your Cosmos DB account. Record the value of the `id` property as it is required for the next step. ```powershell @@ -372,9 +371,9 @@ This is an introductory workshop on Microsoft Fabric. Please follow as described - Under `Settings`, go to `Environment variables`. And `+ Add` the following variables: - - `COSMOS_DB_ENDPOINT`: Your Cosmos DB account endpoint. - - `COSMOS_DB_KEY`: Your Cosmos DB account key. - - `contosostorageaidemo_STORAGE`: Your Storage Account connection string. +- `COSMOS_DB_ENDPOINT`: Your Cosmos DB account endpoint. +- `COSMOS_DB_KEY`: Your Cosmos DB account key. +- `contosostorageaidemo_STORAGE`: Your Storage Account connection string. image @@ -382,7 +381,7 @@ This is an introductory workshop on Microsoft Fabric. Please follow as described image - - Click on `Apply` to save your configuration. +- Click on `Apply` to save your configuration. ### Develop the Function @@ -448,9 +447,9 @@ This is an introductory workshop on Microsoft Fabric. Please follow as described > 3. **Data Extraction**: The extracted text is processed to extract invoice data. The `generate_id` function generates a unique ID for each invoice.
> 4. **Data Storage**: The processed invoice data is saved to Azure Cosmos DB in the `ContosoAIDemo` database and `Invoices` container. - > `pdfminer.six` is an open-source framework. It is a community-maintained fork of the original PDFMiner,` designed for extracting and analyzing text data from PDF documents`. The framework is built in a modular way, allowing each component to be easily replaced or extended for various purpose + > `pdfminer.six` is an open-source framework. It is a community-maintained fork of the original PDFMiner,`designed for extracting and analyzing text data from PDF documents`. The framework is built in a modular way, allowing each component to be easily replaced or extended for various purpose - - Update the `function_app.py`: +- Update the `function_app.py`: | Template Blob Trigger | Function Code updated | | --- | --- | @@ -595,6 +594,7 @@ azure-functions pdfminer.six azure-cosmos==4.3.0 ``` + - Since this function has already been tested, you can deploy your code to the function app in your subscription. If you want to test, you can use run your function locally for testing. - Click on the `Azure` icon. - Under `workspace`, click on the `Function App` icon. 
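
Reviewer note on PATCH 2/9: the README text touched above describes the function's flow (text extraction with `pdfminer.six`, a `generate_id` helper, then storage in the `Invoices` container), but the body of `function_app.py` is not part of this diff. The following is a minimal standard-library sketch of the ID-generation and field-parsing steps only. The hashing scheme inside `generate_id`, the `parse_invoice` helper, the regular expressions, and the field names are all assumptions made for illustration, not the repository's actual implementation.

```python
import hashlib
import re


def generate_id(text: str) -> str:
    """Hypothetical: derive a deterministic ID from the extracted invoice text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def parse_invoice(text: str) -> dict:
    """Hypothetical parser: pull a few labelled fields out of extracted text."""

    def find(pattern: str):
        # Return the first capture group, or None when the label is absent.
        match = re.search(pattern, text, flags=re.IGNORECASE)
        return match.group(1).strip() if match else None

    total = find(r"Total:\s*\$?([\d,]+\.?\d*)")
    return {
        "id": generate_id(text),
        "invoice_number": find(r"Invoice\s*(?:Number|No\.?|#)\s*:?\s*(\S+)"),
        "total": float(total.replace(",", "")) if total else None,
    }


# Assumed sample text, standing in for what a PDF text extractor might return.
sample = "Invoice Number: INV-1001\nCustomer: Contoso\nTotal: $1,250.50\n"
item = parse_invoice(sample)
```

In the deployed function, the input text would presumably come from `pdfminer.six` (for example `pdfminer.high_level.extract_text`) and the resulting dictionary would be written to the `Invoices` container with the Cosmos DB SDK (for example `upsert_item`). A deterministic ID is one possible design choice here: it would make re-uploads of the same PDF overwrite rather than duplicate an item.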
From b3dbcf31df0ce59c914d152435d6e0f81b327bef Mon Sep 17 00:00:00 2001 From: Timna Brown <24630902+brown9804@users.noreply.github.com> Date: Mon, 19 May 2025 22:50:28 -0600 Subject: [PATCH 3/9] cleaning format From 13bc38dd0d28fa0ac3f7a3508ab25c8914703092 Mon Sep 17 00:00:00 2001 From: Timna Brown <24630902+brown9804@users.noreply.github.com> Date: Mon, 19 May 2025 22:50:54 -0600 Subject: [PATCH 4/9] cleaning format From f75add2a2ead3431b84bb4543eaca68cf56780ff Mon Sep 17 00:00:00 2001 From: Timna Brown <24630902+brown9804@users.noreply.github.com> Date: Mon, 19 May 2025 22:51:09 -0600 Subject: [PATCH 5/9] format From 459193083eb8af3673f3f1c4dc173eee42d8d982 Mon Sep 17 00:00:00 2001 From: Timna Brown <24630902+brown9804@users.noreply.github.com> Date: Mon, 19 May 2025 22:51:23 -0600 Subject: [PATCH 6/9] format outputs --- terraform-infrastructure/outputs.tf | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/terraform-infrastructure/outputs.tf b/terraform-infrastructure/outputs.tf index d96564f..ce5bffd 100644 --- a/terraform-infrastructure/outputs.tf +++ b/terraform-infrastructure/outputs.tf @@ -43,8 +43,7 @@ output "key_vault_name" { value = azurerm_key_vault.keyvault.name } - output "cosmosdb_account_name" { description = "The name of the CosmosDB account." 
value = azurerm_cosmosdb_account.cosmosdb.name -} \ No newline at end of file +} From 0a052c9d69799615bfc3841ae0bb1785618c8299 Mon Sep 17 00:00:00 2001 From: Timna Brown <24630902+brown9804@users.noreply.github.com> Date: Mon, 19 May 2025 22:51:38 -0600 Subject: [PATCH 7/9] format provider --- terraform-infrastructure/provider.tf | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/terraform-infrastructure/provider.tf b/terraform-infrastructure/provider.tf index 2719636..71333b4 100644 --- a/terraform-infrastructure/provider.tf +++ b/terraform-infrastructure/provider.tf @@ -22,4 +22,4 @@ provider "azurerm" { } subscription_id = var.subscription_id # Use the subscription ID variable -} \ No newline at end of file +} From ffcb91803d05082b95134555eae08da85bf5052d Mon Sep 17 00:00:00 2001 From: Timna Brown <24630902+brown9804@users.noreply.github.com> Date: Mon, 19 May 2025 22:51:51 -0600 Subject: [PATCH 8/9] format tfvars From 95cd257b0ceb1c757b5883bc421470a6642140a4 Mon Sep 17 00:00:00 2001 From: Timna Brown <24630902+brown9804@users.noreply.github.com> Date: Mon, 19 May 2025 22:52:09 -0600 Subject: [PATCH 9/9] format tvars --- terraform-infrastructure/variables.tf | 1 - 1 file changed, 1 deletion(-) diff --git a/terraform-infrastructure/variables.tf b/terraform-infrastructure/variables.tf index 8e7eea4..7bf4738 100644 --- a/terraform-infrastructure/variables.tf +++ b/terraform-infrastructure/variables.tf @@ -13,7 +13,6 @@ variable "location" { type = string } - variable "storage_account_name" { description = "The name of the storage account" type = string