From 3a80d96c6d06756e21c142278ea979c321a0614b Mon Sep 17 00:00:00 2001
From: Viraj Agarwal
Date: Thu, 13 Nov 2025 11:18:12 +0530
Subject: [PATCH 1/7] DA-1253 update: rename FTS references to Search Vector Index in README and clarify usage instructions

---
 README.md | 94 +++++++++++++++++++------------------------------------
 1 file changed, 33 insertions(+), 61 deletions(-)

diff --git a/README.md b/README.md
index deba818..696ce54 100644
--- a/README.md
+++ b/README.md
@@ -4,11 +4,11 @@ This is a demo app built to chat with your custom PDFs using **Couchbase Vector
 
 ## Three Implementation Options
 
-### Option 1: Search Service (FTS) Vector Search (`chat_with_pdf_with_fts.py`)
+### Option 1: Search Vector Index (`chat_with_pdf_with_search_vector_index.py`)
 
-Uses **`CouchbaseSearchDocumentStore`** with Full Text Search (FTS) vector indexes, which offers:
+Uses **`CouchbaseSearchDocumentStore`** with Search vector indexe, which offers:
 
-- **Flexible vector search** with FTS capabilities
+- **Flexible vector search**
 - **Rich text search** combined with vector similarity
 - **Complex filtering** using FTS queries
 - **Compatible with Couchbase 7.6+**
@@ -23,9 +23,9 @@ Uses **`CouchbaseQueryDocumentStore`** with Hyperscale (BHIVe) vector index, whi
 - **SQL++ queries** for efficient vector retrieval
 - **Recommended for Couchbase 8.0+** for pure vector similarity search
 
-### Option 3: Composite Vector Index (`chat_with_pdf.py`)
+### Option 3: Composite Vector Index
 
-Uses **`CouchbaseQueryDocumentStore`** with Composite vector index, which offers:
+Use **`CouchbaseQueryDocumentStore`** with Composite vector index, which offers:
 
 - **Vector search with metadata filtering**
 - **Combines vector fields with scalar fields** for pre-filtering
@@ -33,6 +33,8 @@ Uses **`CouchbaseQueryDocumentStore`** with Composite vector index, which offers
 - **Best for filtered vector search** scenarios (e.g., filter by date, category, user_id)
 - **Recommended for Couchbase 8.0+** when you need to filter before vector search
 
+This demo doesn't use Composite Vector index, but you can easily do so by just removing `VECTOR` from [this line](./chat_with_pdf.py#L109) and keeping the rest same. To learn more about how Composite Vector Indexes are made, you can refer [here](https://docs.couchbase.com/cloud/vector-index/composite-vector-index.html).
+
 ## How does it work?
 
 You can upload your PDFs with custom data & ask questions about the data in the chat box.
@@ -68,25 +70,25 @@ The RAG pipeline utilizes Haystack, Couchbase Vector Search, and OpenAI models.
 
 6. **Run the Streamlit app**
    ```bash
-   # For Hyperscale/Composite Vector Index (default)
+   # For Hyperscale Vector Index (default)
    streamlit run chat_with_pdf.py
-   # OR for Search Service/FTS Vector Search
-   streamlit run chat_with_pdf_with_fts.py
+   # OR for Search Vector Index
+   streamlit run chat_with_pdf_with_search_vector_index.py
    ```
 
 7. **Upload a PDF** - everything else is automatic!
 
 The app automatically creates:
 - Scopes and collections
-- Vector indexes (after PDF upload for `chat_with_pdf.py`, or on startup for `chat_with_pdf_with_fts.py`)
+- Vector indexes (after PDF upload for `chat_with_pdf.py`, or on startup for `chat_with_pdf_with_search_vector_index.py`)
 
 ## Which Option Should You Choose?
 
 Couchbase Capella supports three types of vector indexes:
 
 - **Hyperscale Vector Index** (`chat_with_pdf.py`) - Best for RAG/chatbot applications with pure semantic search and billions of documents
-- **Composite Vector Index** (`chat_with_pdf.py`) - Best when you need to filter by metadata before vector search
-- **Search Vector Index** (`chat_with_pdf_with_fts.py`) - Best for hybrid searches combining keywords, geospatial, and semantic search
+- **Composite Vector Index** - Best when you need to filter by metadata before vector search
+- **Search Vector Index** (`chat_with_pdf_with_search_vector_index.py`) - Best for hybrid searches combining keywords, geospatial, and semantic search
 
 > **For this PDF chat demo, we recommend Hyperscale Vector Index** for optimal performance in RAG applications.
 
@@ -103,7 +105,7 @@ Learn more about choosing the right vector index in the [official Couchbase vect
 
 Copy the `secrets.example.toml` file in `.streamlit` folder and rename it to `secrets.toml` and replace the placeholders with the actual values for your environment
 
-**For Hyperscale or Composite Vector Index (`chat_with_pdf.py`):**
+**For Hyperscale Vector Index (`chat_with_pdf.py`):**
 ```
 DB_CONN_STR = ""
 DB_USERNAME = ""
@@ -114,11 +116,11 @@ DB_COLLECTION = ""
 OPENAI_API_KEY = ""
 ```
 
-**For Search Service / FTS (`chat_with_pdf_with_fts.py`):**
+**For Search Vector Index (`chat_with_pdf_with_search_vector_index.py`):**
 
 Add one additional environment variable to the above configuration:
 ```
-INDEX_NAME = ""
+INDEX_NAME = ""
 ```
 
 ### Automatic Resource Setup
@@ -132,12 +134,12 @@ The application automatically handles resource creation in the following order:
 
 **After PDF Upload (`chat_with_pdf.py`):**
 
-3. Automatically creates the Hyperscale/Composite vector index after documents are loaded
+3. Automatically creates the Hyperscale index after documents are loaded
 4. Falls back to creating the index on first query if needed
 
-**On Application Startup (`chat_with_pdf_with_fts.py`):**
+**On Application Startup (`chat_with_pdf_with_search_vector_index.py`):**
 
-3. Attempts to create the FTS index (can be created without documents)
+3. Attempts to create the Search Vector index (can be created without documents)
 
 **What You Need:**
 - Your Couchbase **bucket must exist** with the name "sample_bucket"
@@ -146,41 +148,11 @@ The application automatically handles resource creation in the following order:
 
 **Note**: For `chat_with_pdf.py`, the vector index is created automatically **after you upload your first PDF** because Hyperscale/Composite indexes require documents for training.
 
-### Manual Vector Index Creation (Optional)
-
-**The application now creates indexes automatically!** This section is only needed if:
-- You want to pre-create the index before uploading documents
-- Automatic creation fails in your environment
-- You prefer manual control over index configuration
-
-**For Hyperscale or Composite Vector Index (`chat_with_pdf.py`):**
-
-The app automatically creates the vector index after you upload your first PDF. However, you can manually create it if needed.
-
-This demo uses Couchbase's Vector Indexes (introduced in version 8.0). Choose between:
-
-- **Hyperscale Vector Index**: Optimized for pure vector search at scale. Perfect for RAG, chatbots, and scenarios needing fast vector similarity search on large datasets.
-
-- **Composite Vector Index**: Combines vector fields with scalar fields, allowing you to apply metadata filters before vector search (e.g., date, category, user_id).
-
-Learn more about these vector indexes [here](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html).
-
-**For Search Service / FTS (`chat_with_pdf_with_fts.py`):**
-
-The app attempts to create the FTS index on startup. If automatic creation fails, you can create it manually. See the FTS index creation section below for detailed instructions.
-
-### Key Components
-
-- Streamlit: Provides the web interface
-- Haystack: Orchestrates the RAG pipeline
-- Couchbase: Serves as the high-performance vector store
-- OpenAI: Supplies embeddings and the language model
-
 ## Manual Vector Index Creation (Optional)
 
 **⚠️ Manual creation is NOT required** - the app creates indexes automatically when you upload a PDF. This section is only for advanced users who want manual control.
 
-### Hyperscale or Composite Vector Index (for `chat_with_pdf.py`)
+### Hyperscale or Composite Vector Index
 
 You need to create a Hyperscale or Composite vector index on your collection **after** loading some documents (required for index training). Choose between BHIVe or Composite Index based on your use case. Whichever vector index (Hyperscale or Composite) you choose won't affect the functionality of this demo, though performance differences may occur.
 
@@ -201,7 +173,7 @@ WITH {
 
 **Option 2: Composite Vector Index**
 
-Composite indexes combine vector fields with other scalar fields. This is useful when you need to filter documents by metadata before performing vector search.
+[Composite indexes](https://docs.couchbase.com/cloud/vector-index/composite-vector-index.html) combine vector fields with other scalar fields. This is useful when you need to filter documents by metadata before performing vector search.
 
 Creating a Composite Index using SQL++:
 
@@ -232,13 +204,13 @@ SELECT * FROM system:indexes
 WHERE name LIKE 'idx_%_vector';
 ```
 
-### FTS Vector Index (for `chat_with_pdf_with_fts.py`)
+### Search Vector Vector Index (for `chat_with_pdf_with_search_vector_index.py`)
 
-**Automatic Creation**: The app attempts to create the FTS index automatically on startup using the `INDEX_NAME` from your configuration.
+**Automatic Creation**: The app attempts to create the Search Vector index automatically on startup using the `INDEX_NAME` from your configuration.
 
 **Manual Creation** (if automatic creation fails): Create a Full Text Search index with vector capabilities.
 
-**Creating an FTS Index with Vector Support**
+**Creating an Search Vector Index with Vector Support**
 
 If automatic creation fails, you can create the index using the Couchbase UI or by importing the provided index definition.
 
@@ -260,9 +232,9 @@ Using Couchbase Server:
 4. Paste the updated JSON in the Import screen
 5. Click on Create Index
 
-**FTS Index Definition**
+**Search Vector Index Definition**
 
-The `sampleSearchIndex.json` file contains a pre-configured FTS index with vector capabilities. Key features:
+The `sampleSearchIndex.json` file contains a pre-configured Search Vector index with vector capabilities. Key features:
 - **Index Name**: `sample-index` (customizable)
 - **Vector Field**: `embedding` with 1536 dimensions
 - **Similarity**: `dot_product` (optimized for OpenAI embeddings)
@@ -276,21 +248,21 @@ The `sampleSearchIndex.json` file contains a pre-configured FTS index with vecto
 streamlit run chat_with_pdf.py
 ```
 
-**For Search Service / FTS:**
+**For Search Vector Index:**
 ```
-streamlit run chat_with_pdf_with_fts.py
+streamlit run chat_with_pdf_with_search_vector_index.py
 ```
 
 ## Implementation Details
 
-### Hyperscale and Composite Vector Index Implementation (`chat_with_pdf.py`)
+### Hyperscale Vector Index Implementation (`chat_with_pdf.py`)
 
 This demo uses the following key components:
 
 1. **CouchbaseQueryDocumentStore**:
    - Configured with `QueryVectorSearchType.ANN` for fast approximate nearest neighbor search
    - Uses `QueryVectorSearchSimilarity.DOT` for dot product similarity (recommended for OpenAI embeddings)
-   - Supports both **Hyperscale (BHIVe)** and **Composite** indexes
+   - Supports both **Hyperscale** and **Composite** vector indexes
    - Leverages SQL++ for efficient vector retrieval
    - Same code works for both index types - just create the appropriate index
 
@@ -305,7 +277,7 @@ This demo uses the following key components:
 
 For more details on implementation, refer to the extensive code comments in `chat_with_pdf.py`.
 
-### Search Service / FTS Implementation (`chat_with_pdf_with_fts.py`)
+### Search Vector Index Implementation (`chat_with_pdf_with_search_vector_index.py`)
 
 This alternative implementation uses:
 
@@ -315,7 +287,7 @@ This alternative implementation uses:
    - Supports rich text search combined with vector similarity
 
 2. **CouchbaseSearchEmbeddingRetriever**:
-   - Leverages FTS vector search capabilities
+   - Leverages Search vector search capabilities
    - Retrieves top-k most similar documents using FTS queries
    - Supports complex filtering with FTS query syntax
 
@@ -323,7 +295,7 @@ This alternative implementation uses:
    - Same `text-embedding-ada-002` model with 1536 dimensions
    - Generates embeddings for both documents and queries
 
-For more details on FTS implementation, refer to the code comments in `chat_with_pdf_with_fts.py`.
+For more details on FTS implementation, refer to the code comments in `chat_with_pdf_with_search_vector_index.py`.
 
 ## Additional Resources
 

From 95eb315654cfeb4c32844e97072dc74b7f65dd43 Mon Sep 17 00:00:00 2001
From: Viraj Agarwal <91372648+VirajAgarwal1@users.noreply.github.com>
Date: Thu, 13 Nov 2025 11:24:08 +0530
Subject: [PATCH 2/7] Update README.md

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 696ce54..e446764 100644
--- a/README.md
+++ b/README.md
@@ -6,7 +6,7 @@ This is a demo app built to chat with your custom PDFs using **Couchbase Vector
 
 ### Option 1: Search Vector Index (`chat_with_pdf_with_search_vector_index.py`)
 
-Uses **`CouchbaseSearchDocumentStore`** with Search vector indexe, which offers:
+Uses **`CouchbaseSearchDocumentStore`** with Search vector indexes, which offers:
 
 - **Flexible vector search**
 - **Rich text search** combined with vector similarity

From e3c7be6d91f92bc64fd6fa8e895aa5ff94a96ac3 Mon Sep 17 00:00:00 2001
From: Viraj Agarwal <91372648+VirajAgarwal1@users.noreply.github.com>
Date: Thu, 13 Nov 2025 11:24:38 +0530
Subject: [PATCH 3/7] DA-1253 update: README.md

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index e446764..46b29eb 100644
--- a/README.md
+++ b/README.md
@@ -25,7 +25,7 @@ Uses **`CouchbaseQueryDocumentStore`** with Hyperscale (BHIVe) vector index, whi
 
 ### Option 3: Composite Vector Index
 
-Use **`CouchbaseQueryDocumentStore`** with Composite vector index, which offers:
+Uses **`CouchbaseQueryDocumentStore`** with Composite vector index, which offers:
 
 - **Vector search with metadata filtering**
 - **Combines vector fields with scalar fields** for pre-filtering

From 33c37095717daaaa40c7c2f00e1a05dadcaaa39b Mon Sep 17 00:00:00 2001
From: Viraj Agarwal <91372648+VirajAgarwal1@users.noreply.github.com>
Date: Thu, 13 Nov 2025 11:24:58 +0530
Subject: [PATCH 4/7] DA-1253 update: README.md

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 46b29eb..b6e4da9 100644
--- a/README.md
+++ b/README.md
@@ -204,7 +204,7 @@ SELECT * FROM system:indexes
 WHERE name LIKE 'idx_%_vector';
 ```
 
-### Search Vector Vector Index (for `chat_with_pdf_with_search_vector_index.py`)
+### Search Vector Index (for `chat_with_pdf_with_search_vector_index.py`)
 
 **Automatic Creation**: The app attempts to create the Search Vector index automatically on startup using the `INDEX_NAME` from your configuration.
 

From 3968916fbae2923b64423826a203dc0ac41f9a2b Mon Sep 17 00:00:00 2001
From: Viraj Agarwal <91372648+VirajAgarwal1@users.noreply.github.com>
Date: Thu, 13 Nov 2025 11:25:11 +0530
Subject: [PATCH 5/7] DA-1253 update: README.md

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index b6e4da9..466f151 100644
--- a/README.md
+++ b/README.md
@@ -210,7 +210,7 @@ WHERE name LIKE 'idx_%_vector';
 
 **Manual Creation** (if automatic creation fails): Create a Full Text Search index with vector capabilities.
 
-**Creating an Search Vector Index with Vector Support**
+**Creating a Search Vector Index with Vector Support**
 
 If automatic creation fails, you can create the index using the Couchbase UI or by importing the provided index definition.
 
From fe30c390eb4394508e4c9ea836200df2987f7b48 Mon Sep 17 00:00:00 2001
From: Viraj Agarwal
Date: Thu, 13 Nov 2025 11:26:13 +0530
Subject: [PATCH 6/7] DA-1253 update: README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 466f151..ea87778 100644
--- a/README.md
+++ b/README.md
@@ -287,7 +287,7 @@ This alternative implementation uses:
    - Supports rich text search combined with vector similarity
 
 2. **CouchbaseSearchEmbeddingRetriever**:
-   - Leverages Search vector search capabilities
+   - Leverages Search vector index capabilities
    - Retrieves top-k most similar documents using FTS queries
    - Supports complex filtering with FTS query syntax
 

From aef9f49fd333d291d0e828f7e7c17ec7770def92 Mon Sep 17 00:00:00 2001
From: Viraj Agarwal
Date: Thu, 13 Nov 2025 12:35:01 +0530
Subject: [PATCH 7/7] DA-1253 update: upgrade model from gpt-4o to gpt-5

---
 chat_with_pdf.py | 2 +-
 chat_with_pdf_with_search_vector_index.py | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/chat_with_pdf.py b/chat_with_pdf.py
index b99da04..6390eff 100644
--- a/chat_with_pdf.py
+++ b/chat_with_pdf.py
@@ -280,7 +280,7 @@ def get_document_store():
         "llm",
         OpenAIGenerator(
             api_key=OPENAI_API_KEY,
-            model="gpt-4o",
+            model="gpt-5",
         ),
     )
     rag_pipeline.add_component("answer_builder", AnswerBuilder())
diff --git a/chat_with_pdf_with_search_vector_index.py b/chat_with_pdf_with_search_vector_index.py
index 2477e47..195e8e0 100644
--- a/chat_with_pdf_with_search_vector_index.py
+++ b/chat_with_pdf_with_search_vector_index.py
@@ -239,7 +239,7 @@ def get_document_store():
         "llm",
         OpenAIGenerator(
             api_key=OPENAI_API_KEY,
-            model="gpt-4o",
+            model="gpt-5",
         ),
     )
     rag_pipeline.add_component("answer_builder", AnswerBuilder())
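
Note on the README text added in patch 1: it says a Composite Vector Index can be obtained by removing `VECTOR` from the index-creation line in `chat_with_pdf.py`. That line is not part of this series, so the following is only a minimal sketch of the difference, assuming the statement has the Couchbase 8.0 SQL++ shape `CREATE VECTOR INDEX ... ON keyspace(field VECTOR) WITH {...}`. The function name, the index and keyspace names, and the `WITH` options are illustrative assumptions, not taken from the repository.

```python
import json


def build_vector_index_ddl(index_name, keyspace, field="embedding",
                           composite=False, dimension=1536,
                           similarity="DOT", description="IVF,SQ8"):
    """Return a SQL++ statement string for a vector index (hypothetical helper).

    Assumption: a Hyperscale index is declared with CREATE VECTOR INDEX and a
    Composite index with plain CREATE INDEX; in both cases the embedding field
    keeps its own VECTOR marker inside the key list.
    """
    # Dropping the VECTOR keyword after CREATE is the only difference here.
    kind = "INDEX" if composite else "VECTOR INDEX"
    # Option names and values are assumed defaults for illustration only.
    options = {
        "dimension": dimension,
        "similarity": similarity,
        "description": description,
    }
    return (
        f"CREATE {kind} `{index_name}` "
        f"ON {keyspace}(`{field}` VECTOR) "
        f"WITH {json.dumps(options)}"
    )


if __name__ == "__main__":
    target = "`sample_bucket`.`my_scope`.`my_collection`"  # hypothetical keyspace
    print(build_vector_index_ddl("idx_demo_vector", target))
    print(build_vector_index_ddl("idx_demo_vector", target, composite=True))
```

The helper only builds a string; running the statement against a cluster, and confirming the option names against the Couchbase vector index documentation, is left to the reader.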