diff --git a/haystack/fts/RAG_with_Couchbase_Capella_and_OpenAI.ipynb b/haystack/fts/RAG_with_Couchbase_Capella_and_OpenAI.ipynb deleted file mode 100644 index d31b650..0000000 --- a/haystack/fts/RAG_with_Couchbase_Capella_and_OpenAI.ipynb +++ /dev/null @@ -1,566 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# TMDB Movie Dataset RAG Pipeline with Couchbase and OpenAI\n", - "\n", - "This notebook demonstrates how to build a Retrieval Augmented Generation (RAG) system using:\n", - "- The TMDB 5000 Movies dataset containing movie metadata and plot overviews\n", - "- Couchbase Capella as the vector store with FTS (Full Text Search)\n", - "- Haystack framework for the RAG pipeline\n", - "- OpenAI for embeddings and text generation\n", - "\n", - "The system allows users to ask questions about movies and get AI-generated answers based on the indexed movie overviews." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Installing Necessary Libraries\n", - "\n", - "To build our RAG system, we need a set of libraries. The libraries we install handle everything from connecting to databases to performing AI tasks. Each library has a specific role: Couchbase libraries manage database operations, Haystack handles AI model integrations and pipeline management, and we will use the OpenAI SDK for generating embeddings and calling OpenAI's language models." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%pip install datasets haystack-ai couchbase-haystack openai pandas" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Importing Necessary Libraries\n", - "\n", - "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, Haystack components for the RAG pipeline, embedding generation, and dataset loading."
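Before moving on, it can be worth confirming that the install step actually succeeded. The optional snippet below is not part of the original notebook; it simply prints the installed version of each dependency, using the same pip package names as the install command above.

```python
# Optional sanity check: verify each pip package from the install cell is present.
from importlib.metadata import version, PackageNotFoundError

for pkg in ["datasets", "haystack-ai", "couchbase-haystack", "openai", "pandas"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} is missing; re-run the %pip install cell above")
```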
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import getpass\n", - "import base64\n", - "import logging\n", - "import sys\n", - "import time\n", - "import pandas as pd\n", - "from datetime import timedelta\n", - "\n", - "from couchbase.auth import PasswordAuthenticator\n", - "from couchbase.cluster import Cluster\n", - "from couchbase.exceptions import CouchbaseException\n", - "from couchbase.options import ClusterOptions\n", - "from datasets import load_dataset\n", - "from haystack import Pipeline, GeneratedAnswer\n", - "from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder\n", - "from haystack.components.preprocessors import DocumentCleaner\n", - "from haystack.components.writers import DocumentWriter\n", - "from haystack.components.builders.answer_builder import AnswerBuilder\n", - "from haystack.components.builders.prompt_builder import PromptBuilder\n", - "from haystack.components.generators import OpenAIGenerator\n", - "from haystack.utils import Secret\n", - "from haystack.dataclasses import Document\n", - "\n", - "from couchbase_haystack import (\n", - " CouchbaseSearchDocumentStore,\n", - " CouchbasePasswordAuthenticator,\n", - " CouchbaseClusterOptions,\n", - " CouchbaseSearchEmbeddingRetriever,\n", - ")\n", - "from couchbase.options import KnownConfigProfiles\n", - "\n", - "# Configure logging\n", - "logger = logging.getLogger(__name__)\n", - "logger.setLevel(logging.DEBUG)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Prerequisites\n", - "\n", - "## Create and Deploy Your Operational cluster on Capella\n", - "\n", - "To get started with Couchbase Capella, create an account and use it to deploy an operational cluster.\n", - "\n", - "To know more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n", - "\n", - "\n", - "### Couchbase Capella Configuration\n", - "\n", - "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met:\n", - "\n", - "* Have a multi-node Capella cluster running the Data, Query, Index, and Search services.\n", - "* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the travel-sample bucket (Read and Write) used in the application.\n", - "* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running.\n", - "\n", - "### OpenAI Models Setup\n", - "\n", - "In order to create the RAG application, we need an embedding model to ingest the documents for Vector Search and a large language model (LLM) for generating the responses based on the context. \n", - "\n", - "For this implementation, we'll use OpenAI's models which provide state-of-the-art performance for both embeddings and text generation:\n", - "\n", - "**Embedding Model**: We'll use OpenAI's `text-embedding-3-large` model, which provides high-quality embeddings with 3,072 dimensions for semantic search capabilities.\n", - "\n", - "**Large Language Model**: We'll use OpenAI's `gpt-4o` model for generating responses based on the retrieved context. 
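As a quick standalone taste of that generator, the hedged sketch below calls `gpt-4o` directly through Haystack, outside of any pipeline; it assumes the `OPENAI_API_KEY` variable collected in the credentials cell further down.

```python
from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret

# One-off generator call to show the raw interface before it is wired into a pipeline.
llm_probe = OpenAIGenerator(api_key=Secret.from_token(OPENAI_API_KEY), model="gpt-4o")
reply = llm_probe.run(prompt="In one sentence, what is Retrieval Augmented Generation?")
print(reply["replies"][0])
```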
This model offers excellent reasoning capabilities and can handle complex queries effectively.\n", - "\n", - "**Prerequisites for OpenAI Integration**:\n", - "* Create an OpenAI account at [platform.openai.com](https://platform.openai.com)\n", - "* Generate an API key from your OpenAI dashboard\n", - "* Ensure you have sufficient credits or a valid payment method set up\n", - "* Set up your API key as an environment variable or input it securely in the notebook\n", - "\n", - "For more details about OpenAI's models and pricing, please refer to the [OpenAI documentation](https://platform.openai.com/docs/models)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Configure Couchbase and OpenAI Credentials\n", - "\n", - "Enter your Couchbase and OpenAI credentials:\n", - "\n", - "**OPENAI_API_KEY** is your OpenAI API key which can be obtained from your OpenAI dashboard at [platform.openai.com](https://platform.openai.com/api-keys).\n", - "\n", - "**CB_INDEX_NAME** is the name of the FTS search index we will use for vector search operations." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "CB_CONNECTION_STRING = input(\"Couchbase Cluster URL (default: localhost): \") or \"localhost\"\n", - "CB_USERNAME = input(\"Couchbase Username (default: admin): \") or \"admin\"\n", - "CB_PASSWORD = input(\"Couchbase password (default: Password@12345): \") or \"Password@12345\"\n", - "CB_BUCKET_NAME = input(\"Couchbase Bucket: \")\n", - "CB_SCOPE_NAME = input(\"Couchbase Scope: \")\n", - "CB_COLLECTION_NAME = input(\"Couchbase Collection: \")\n", - "CB_INDEX_NAME = input(\"Vector Search Index: \")\n", - "OPENAI_API_KEY = input(\"OpenAI API Key: \")\n", - "\n", - "# Check if the variables are correctly loaded\n", - "if not all([CB_CONNECTION_STRING, CB_USERNAME, CB_PASSWORD, CB_BUCKET_NAME, CB_SCOPE_NAME, CB_COLLECTION_NAME, CB_INDEX_NAME, OPENAI_API_KEY]):\n", - " raise ValueError(\"All configuration variables must be provided.\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from couchbase.cluster import Cluster \n", - "from couchbase.options import ClusterOptions\n", - "from couchbase.auth import PasswordAuthenticator\n", - "from couchbase.management.buckets import CreateBucketSettings\n", - "from couchbase.management.collections import CollectionSpec\n", - "from couchbase.management.search import SearchIndex\n", - "import json\n", - "\n", - "# Connect to Couchbase cluster\n", - "cluster = Cluster(CB_CONNECTION_STRING, ClusterOptions(\n", - " PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)))\n", - "\n", - "# Create bucket if it does not exist\n", - "bucket_manager = cluster.buckets()\n", - "try:\n", - " bucket_manager.get_bucket(CB_BUCKET_NAME)\n", - " print(f\"Bucket '{CB_BUCKET_NAME}' already exists.\")\n", - "except Exception as e:\n", - " print(f\"Bucket '{CB_BUCKET_NAME}' does not exist.
Creating bucket...\")\n", - " bucket_settings = CreateBucketSettings(name=CB_BUCKET_NAME, ram_quota_mb=500)\n", - " bucket_manager.create_bucket(bucket_settings)\n", - " print(f\"Bucket '{CB_BUCKET_NAME}' created successfully.\")\n", - "\n", - "# Create scope and collection if they do not exist\n", - "collection_manager = cluster.bucket(CB_BUCKET_NAME).collections()\n", - "scopes = collection_manager.get_all_scopes()\n", - "scope_exists = any(scope.name == CB_SCOPE_NAME for scope in scopes)\n", - "\n", - "if scope_exists:\n", - " print(f\"Scope '{CB_SCOPE_NAME}' already exists.\")\n", - "else:\n", - " print(f\"Scope '{CB_SCOPE_NAME}' does not exist. Creating scope...\")\n", - " collection_manager.create_scope(CB_SCOPE_NAME)\n", - " print(f\"Scope '{CB_SCOPE_NAME}' created successfully.\")\n", - "\n", - "collections = [collection.name for scope in scopes if scope.name == CB_SCOPE_NAME for collection in scope.collections]\n", - "collection_exists = CB_COLLECTION_NAME in collections\n", - "\n", - "if collection_exists:\n", - " print(f\"Collection '{CB_COLLECTION_NAME}' already exists in scope '{CB_SCOPE_NAME}'.\")\n", - "else:\n", - " print(f\"Collection '{CB_COLLECTION_NAME}' does not exist in scope '{CB_SCOPE_NAME}'. Creating collection...\")\n", - " collection_manager.create_collection(collection_name=CB_COLLECTION_NAME, scope_name=CB_SCOPE_NAME)\n", - " print(f\"Collection '{CB_COLLECTION_NAME}' created successfully.\")\n", - "\n", - "# Create search index from search_index.json file at scope level\n", - "with open('fts_index.json', 'r') as search_file:\n", - " search_index_definition = SearchIndex.from_json(json.load(search_file))\n", - " \n", - " # Update search index definition with user inputs\n", - " search_index_definition.name = CB_INDEX_NAME\n", - " search_index_definition.source_name = CB_BUCKET_NAME\n", - " \n", - " # Update types mapping\n", - " old_type_key = next(iter(search_index_definition.params['mapping']['types'].keys()))\n", - " type_obj = search_index_definition.params['mapping']['types'].pop(old_type_key)\n", - " search_index_definition.params['mapping']['types'][f\"{CB_SCOPE_NAME}.{CB_COLLECTION_NAME}\"] = type_obj\n", - " \n", - " search_index_name = search_index_definition.name\n", - " \n", - " # Get scope-level search manager\n", - " scope_search_manager = cluster.bucket(CB_BUCKET_NAME).scope(CB_SCOPE_NAME).search_indexes()\n", - " \n", - " try:\n", - " # Check if index exists at scope level\n", - " existing_index = scope_search_manager.get_index(search_index_name)\n", - " print(f\"Search index '{search_index_name}' already exists at scope level.\")\n", - " except Exception as e:\n", - " print(f\"Search index '{search_index_name}' does not exist at scope level. 
Creating search index from fts_index.json...\")\n", - " with open('fts_index.json', 'r') as search_file:\n", - " search_index_definition = SearchIndex.from_json(json.load(search_file))\n", - " scope_search_manager.upsert_index(search_index_definition)\n", - " print(f\"Search index '{search_index_name}' created successfully at scope level.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Load and Process Movie Dataset\n", - "\n", - "Load the TMDB movie dataset and prepare documents for indexing:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Load TMDB dataset\n", - "print(\"Loading TMDB dataset...\")\n", - "dataset = load_dataset(\"AiresPucrs/tmdb-5000-movies\")\n", - "movies_df = pd.DataFrame(dataset['train'])\n", - "print(f\"Total movies found: {len(movies_df)}\")\n", - "\n", - "# Create documents from movie data\n", - "docs_data = []\n", - "for _, row in movies_df.iterrows():\n", - " if pd.isna(row['overview']):\n", - " continue\n", - " \n", - " try:\n", - " docs_data.append({\n", - " 'id': str(row[\"id\"]),\n", - " 'content': f\"Title: {row['title']}\\nGenres: {', '.join([genre['name'] for genre in eval(row['genres'])])}\\nOverview: {row['overview']}\",\n", - " 'metadata': {\n", - " 'title': row['title'],\n", - " 'genres': row['genres'],\n", - " 'original_language': row['original_language'],\n", - " 'popularity': float(row['popularity']),\n", - " 'release_date': row['release_date'],\n", - " 'vote_average': float(row['vote_average']),\n", - " 'vote_count': int(row['vote_count']),\n", - " 'budget': int(row['budget']),\n", - " 'revenue': int(row['revenue'])\n", - " }\n", - " })\n", - " except Exception as e:\n", - " logger.error(f\"Error processing movie {row['title']}: {e}\")\n", - "\n", - "print(f\"Created {len(docs_data)} documents with valid overviews\")\n", - "documents = [Document(id=doc['id'], content=doc['content'], meta=doc['metadata']) \n", - " for doc in docs_data]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Initialize Document Store\n", - "\n", - "Set up the Couchbase document store for storing movie data and embeddings:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Initialize document store\n", - "document_store = CouchbaseSearchDocumentStore(\n", - " cluster_connection_string=Secret.from_token(CB_CONNECTION_STRING),\n", - " authenticator=CouchbasePasswordAuthenticator(\n", - " username=Secret.from_token(CB_USERNAME),\n", - " password=Secret.from_token(CB_PASSWORD)\n", - " ),\n", - " cluster_options=CouchbaseClusterOptions(\n", - " profile=KnownConfigProfiles.WanDevelopment,\n", - " ),\n", - " bucket=CB_BUCKET_NAME,\n", - " scope=CB_SCOPE_NAME,\n", - " collection=CB_COLLECTION_NAME,\n", - " vector_search_index=CB_INDEX_NAME,\n", - ")\n", - "\n", - "print(\"Couchbase document store initialized successfully.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Initialize Embedder for Document Embedding\n", - "\n", - "Configure the document embedder using OpenAI's `text-embedding-3-large` model.
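A quick way to see what this model returns is to embed a single probe sentence and check the vector length; the sketch below uses the text variant of the same embedding model and assumes a valid `OPENAI_API_KEY`.

```python
from haystack.components.embedders import OpenAITextEmbedder
from haystack.utils import Secret

# Probe the embedding model once to confirm the vector size the search index must match.
probe = OpenAITextEmbedder(
    api_key=Secret.from_token(OPENAI_API_KEY),
    model="text-embedding-3-large",
)
vector = probe.run(text="A short probe sentence.")["embedding"]
print(len(vector))  # text-embedding-3-large produces 3072-dimensional vectors
```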
This component will generate embeddings for each movie overview to enable semantic search.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "embedder = OpenAIDocumentEmbedder(\n", - " api_key=Secret.from_token(OPENAI_API_KEY),\n", - " model=\"text-embedding-3-large\",\n", - ")\n", - "\n", - "rag_embedder = OpenAITextEmbedder(\n", - " api_key=Secret.from_token(OPENAI_API_KEY),\n", - " model=\"text-embedding-3-large\",\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Initialize LLM Generator\n", - "Configure the LLM generator using OpenAI's `gpt-4o` model. This component will generate natural language responses based on the retrieved documents.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "llm = OpenAIGenerator(\n", - " api_key=Secret.from_token(OPENAI_API_KEY),\n", - " model=\"gpt-4o\",\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Create Indexing Pipeline\n", - "Build the pipeline for processing and indexing movie documents:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Create indexing pipeline\n", - "index_pipeline = Pipeline()\n", - "index_pipeline.add_component(\"cleaner\", DocumentCleaner())\n", - "index_pipeline.add_component(\"embedder\", embedder)\n", - "index_pipeline.add_component(\"writer\", DocumentWriter(document_store=document_store))\n", - "\n", - "# Connect indexing components\n", - "index_pipeline.connect(\"cleaner.documents\", \"embedder.documents\")\n", - "index_pipeline.connect(\"embedder.documents\", \"writer.documents\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Run Indexing Pipeline\n", - "\n", - "Execute the pipeline for processing and indexing movie documents:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Run indexing pipeline\n", - "\n", - "if documents:\n", - " # Process documents in batches for better performance\n", - " batch_size = 100\n", - " total_docs = len(documents)\n", - " \n", - " for i in range(0, total_docs, batch_size):\n", - " batch = documents[i:i + batch_size]\n", - " result = index_pipeline.run({\"cleaner\": {\"documents\": batch}})\n", - " print(f\"Processed batch {i//batch_size + 1}: {len(batch)} documents\")\n", - " \n", - " print(f\"\\nSuccessfully processed {total_docs} documents\")\n", - " print(f\"Sample document metadata: {documents[0].meta}\")\n", - "else:\n", - " print(\"No documents created.
Skipping indexing.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Create RAG Pipeline\n", - "\n", - "Set up the Retrieval Augmented Generation pipeline for answering questions about movies:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Define RAG prompt template\n", - "prompt_template = \"\"\"\n", - "Given these documents, answer the question.\\nDocuments:\n", - "{% for doc in documents %}\n", - " {{ doc.content }}\n", - "{% endfor %}\n", - "\n", - "\\nQuestion: {{question}}\n", - "\\nAnswer:\n", - "\"\"\"\n", - "\n", - "# Create RAG pipeline\n", - "rag_pipeline = Pipeline()\n", - "\n", - "# Add components\n", - "rag_pipeline.add_component(\n", - " \"query_embedder\",\n", - " rag_embedder,\n", - ")\n", - "rag_pipeline.add_component(\"retriever\", CouchbaseSearchEmbeddingRetriever(document_store=document_store))\n", - "rag_pipeline.add_component(\"prompt_builder\", PromptBuilder(template=prompt_template))\n", - "rag_pipeline.add_component(\"llm\",llm)\n", - "rag_pipeline.add_component(\"answer_builder\", AnswerBuilder())\n", - "\n", - "# Connect RAG components\n", - "rag_pipeline.connect(\"query_embedder\", \"retriever.query_embedding\")\n", - "rag_pipeline.connect(\"retriever.documents\", \"prompt_builder.documents\")\n", - "rag_pipeline.connect(\"prompt_builder.prompt\", \"llm.prompt\")\n", - "rag_pipeline.connect(\"llm.replies\", \"answer_builder.replies\")\n", - "rag_pipeline.connect(\"llm.meta\", \"answer_builder.meta\")\n", - "rag_pipeline.connect(\"retriever\", \"answer_builder.documents\")\n", - "\n", - "print(\"RAG pipeline created successfully.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Ask Questions About Movies\n", - "\n", - "Use the RAG pipeline to ask questions about movies and get AI-generated answers:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Example question\n", - "question = \"Who does Savva want to save from the vicious hyenas?\"\n", - "\n", - "# Run the RAG pipeline\n", - "result = rag_pipeline.run(\n", - " {\n", - " \"query_embedder\": {\"text\": question},\n", - " \"retriever\": {\"top_k\": 5},\n", - " \"prompt_builder\": {\"question\": question},\n", - " \"answer_builder\": {\"query\": question},\n", - " },\n", - " include_outputs_from={\"retriever\", \"query_embedder\"}\n", - ")\n", - "\n", - "# Get the generated answer\n", - "answer: GeneratedAnswer = result[\"answer_builder\"][\"answers\"][0]\n", - "\n", - "# Print retrieved documents\n", - "print(\"=== Retrieved Documents ===\")\n", - "retrieved_docs = result[\"retriever\"][\"documents\"]\n", - "for idx, doc in enumerate(retrieved_docs, start=1):\n", - " print(f\"Id: {doc.id} Title: {doc.meta['title']}\")\n", - "\n", - "# Print final results\n", - "print(\"\\n=== Final Answer ===\")\n", - "print(f\"Question: {answer.query}\")\n", - "print(f\"Answer: {answer.data}\")\n", - "print(\"\\nSources:\")\n", - "for doc in answer.documents:\n", - " print(f\"-> {doc.meta['title']}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Conclusion\n", - "\n", - "In this tutorial, we built a Retrieval-Augmented Generation (RAG) system using Couchbase Capella, OpenAI, and Haystack with the BBC News dataset. 
This demonstrates how to combine vector search capabilities with large language models to answer natural-language questions grounded in the documents stored in our database.\n", - "\n", - "The key components include:\n", - "- **Couchbase Capella** for vector storage and FTS-based retrieval\n", - "- **Haystack** for pipeline orchestration and component management \n", - "- **OpenAI** for embeddings (`text-embedding-3-large`) and text generation (`gpt-4o`)\n", - "\n", - "This approach enables AI applications to access and reason over information that extends beyond the LLM's training data, making responses more accurate and relevant for real-world use cases." - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "haystack", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.4" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/haystack/fts/frontmatter.md b/haystack/fts/frontmatter.md deleted file mode 100644 index d29ba8f..0000000 --- a/haystack/fts/frontmatter.md +++ /dev/null @@ -1,22 +0,0 @@ ---- -# frontmatter -path: "/tutorial-openai-haystack-rag-with-fts" -title: "Retrieval-Augmented Generation (RAG) with OpenAI, Haystack and Couchbase Search Vector Index" -short_title: "RAG with OpenAI, Haystack and Couchbase Search Vector Index" -description: - - Learn how to build a semantic search engine using Couchbase's Search Vector Index. - - This tutorial demonstrates how to integrate Couchbase's vector search capabilities with the embeddings generated by OpenAI Services. - - You will understand how to perform Retrieval-Augmented Generation (RAG) using Haystack, Couchbase and OpenAI services. -content_type: tutorial -filter: sdk -technology: - - vector search -tags: - - OpenAI - - Artificial Intelligence - - Haystack - - FTS -sdk_language: - - python -length: 60 Mins ---- diff --git a/haystack/gsi/frontmatter.md b/haystack/gsi/frontmatter.md deleted file mode 100644 index 1da0623..0000000 --- a/haystack/gsi/frontmatter.md +++ /dev/null @@ -1,22 +0,0 @@ ---- -# frontmatter -path: "/tutorial-openai-haystack-rag-with-gsi" -title: "RAG with OpenAI, Haystack and Couchbase Hyperscale and Composite Vector Indexes" -short_title: "RAG with OpenAI, Haystack and Couchbase CVI and HVI" -description: - - Learn how to build a semantic search engine using Couchbase's Hyperscale and Composite Vector Indexes. - - This tutorial demonstrates how to integrate Couchbase's GSI vector search capabilities with OpenAI embeddings. - - You will understand how to perform Retrieval-Augmented Generation (RAG) using Haystack, Couchbase and OpenAI services.
-content_type: tutorial -filter: sdk -technology: - - vector search -tags: - - OpenAI - - Artificial Intelligence - - Haystack - - GSI -sdk_language: - - python -length: 60 Mins ---- diff --git a/haystack/gsi/RAG_with_Couchbase_Capella_and_OpenAI.ipynb b/haystack/query_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb similarity index 81% rename from haystack/gsi/RAG_with_Couchbase_Capella_and_OpenAI.ipynb rename to haystack/query_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb index d07c49b..629a5ca 100644 --- a/haystack/gsi/RAG_with_Couchbase_Capella_and_OpenAI.ipynb +++ b/haystack/query_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb @@ -6,17 +6,17 @@ "source": [ "# Introduction\n", "\n", - "In this guide, we will walk you through building a Retrieval Augmented Generation (RAG) application using Couchbase Capella as the database, [gpt-4o](https://platform.openai.com/docs/models/gpt-4o) model as the large language model provided by OpenAI. We will use the [text-embedding-3-large](https://platform.openai.com/docs/guides/embeddings/embedding-models) model for generating embeddings.\n", + "In this guide, we will walk you through building a Retrieval Augmented Generation (RAG) application with Haystack orchestrating OpenAI models and Couchbase Capella. We will use the [gpt-4o](https://platform.openai.com/docs/models/gpt-4o) model for response generation and the [text-embedding-3-large](https://platform.openai.com/docs/guides/embeddings/embedding-models) model for generating embeddings.\n", "\n", "This notebook demonstrates how to build a RAG system using:\n", "- The [BBC News dataset](https://huggingface.co/datasets/RealTimeData/bbc_news_alltime) containing news articles\n", - "- Couchbase Capella as the vector store with GSI (Global Secondary Index) for vector search\n", + "- Couchbase Capella Hyperscale and Composite Vector Indexes for vector search\n", "- Haystack framework for the RAG pipeline\n", "- OpenAI for embeddings and text generation\n", "\n", - "We leverage Couchbase's Global Secondary Index (GSI) vector search capabilities to create and manage vector indexes, enabling efficient semantic search capabilities. GSI provides high-performance vector search with support for both Hyperscale Vector Indexes and Composite Vector Indexes, designed to scale to billions of vectors with low memory footprint and optimized concurrent operations.\n", + "We leverage Couchbase's Hyperscale and Composite Vector Indexes to enable efficient semantic search at scale. Hyperscale indexes prioritize high-throughput vector similarity across billions of vectors with a compact on-disk footprint, while Composite indexes blend scalar predicates with a vector column to narrow candidate sets before similarity search. For a deeper dive into how these indexes work, see the [overview of Capella vector indexes](https://docs.couchbase.com/cloud/vector-index/vectors-and-indexes-overview.html).\n", "\n", - "Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial will equip you with the knowledge to create a fully functional RAG system using OpenAI Services and Haystack with Couchbase's advanced GSI vector search." + "Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. 
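To make that concrete, semantic retrieval ranks documents by geometric closeness between embedding vectors rather than by shared keywords. The toy example below uses made-up three-dimensional vectors (real embeddings have thousands of dimensions) to show the cosine score that this kind of ranking rests on.

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 for identical directions, near 0.0 for unrelated ones.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

football = np.array([0.9, 0.1, 0.2])   # stand-in for an embedding of "football match"
soccer = np.array([0.8, 0.2, 0.25])    # "soccer game" lands nearby in vector space
stocks = np.array([0.1, 0.9, 0.3])     # "stock market" lands far away

print(cosine_similarity(football, soccer))  # high score
print(cosine_similarity(football, stocks))  # low score
```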
This tutorial shows how to combine OpenAI Services and Haystack with Couchbase's Hyperscale and Composite Vector Indexes to deliver a production-ready RAG workflow." ] }, { @@ -139,7 +139,7 @@ "\n", "**OPENAI_API_KEY** is your OpenAI API key which can be obtained from your OpenAI dashboard at [platform.openai.com](https://platform.openai.com/api-keys).\n", "\n", - "**INDEX_NAME** is the name of the GSI vector index we will create for vector search operations." + "**INDEX_NAME** is the name of the Hyperscale or Composite Vector Index we will create for vector search operations." ] }, { @@ -285,7 +285,7 @@ " news_dataset = load_dataset('RealTimeData/bbc_news_alltime', '2024-12', split=\"train\")\n", " print(f\"Loaded the BBC News dataset with {len(news_dataset)} rows\")\n", "except Exception as e:\n", - " raise ValueError(f\"Error loading TREC dataset: {str(e)}\")" + " raise ValueError(f\"Error loading BBC News dataset: {str(e)}\")" ] }, { @@ -394,8 +394,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Setting Up the Couchbase GSI Document Store\n", - "The Couchbase GSI document store is set up to store the documents from the dataset using Couchbase's Global Secondary Index vector search capabilities. This document store is optimized for high-performance vector similarity search operations and can scale to billions of vectors using Haystack's Couchbase integration." + "# Setting Up the Couchbase Vector Document Store\n", + "The Couchbase document store configuration enables both Hyperscale and Composite Vector Indexes. This stores documents from the dataset while keeping embeddings ready for high-performance semantic search, and it scales to billions of vectors through Haystack's Couchbase integration." ] }, { @@ -405,7 +405,7 @@ "outputs": [], "source": [ "try:\n", - " # Create the Couchbase GSI document store\n", + " # Create the Couchbase vector document store\n", " document_store = CouchbaseQueryDocumentStore(\n", " cluster_connection_string=Secret.from_token(CB_CONNECTION_STRING),\n", " authenticator=CouchbasePasswordAuthenticator(\n", @@ -421,9 +421,9 @@ " search_type=QueryVectorSearchType.ANN,\n", " similarity=QueryVectorSearchSimilarity.L2\n", " )\n", - " print(\"Successfully created GSI document store\")\n", + " print(\"Successfully created Couchbase vector document store\")\n", "except Exception as e:\n", - " raise ValueError(f\"Failed to create GSI document store: {str(e)}\")" + " raise ValueError(f\"Failed to create Couchbase vector document store: {str(e)}\")" ] }, { @@ -478,7 +478,7 @@ "\n", "In this section, we'll create an indexing pipeline to process our documents. The pipeline will:\n", "\n", - "1. Split the documents into smaller chunks using the DocumentSplitter\n", + "1. Clean and normalize the document chunks using the DocumentCleaner\n", "2. Generate embeddings for each chunk using our document embedder\n", "3. Store these chunks with their embeddings in our Couchbase document store\n", "\n", @@ -523,8 +523,8 @@ "source": [ "# Run the indexing pipeline\n", "if haystack_documents:\n", - " result = indexing_pipeline.run({\"cleaner\": {\"documents\": haystack_documents}})\n", - " print(f\"Indexed {len(result['writer']['documents_written'])} document chunks\")\n", + " result = indexing_pipeline.run({\"cleaner\": {\"documents\": haystack_documents[:1200]}})\n", + " print(f\"Indexed {result['writer']['documents_written']} document chunks\")\n", "else:\n", - " print(\"No documents created.
Skipping indexing.\")\n" ] @@ -628,7 +628,7 @@ "\n", "This demonstrates how our system combines the power of vector search with language model capabilities to provide accurate, contextual answers based on the information in our database.\n", "\n", - "**Note:** By default, without any GSI vector index, Couchbase uses linear brute force search which compares the query vector against every document in the collection. This works for small datasets but can become slow as the dataset grows." + "**Note:** By default, without any Hyperscale or Composite Vector Index, Couchbase falls back to linear brute-force search that compares the query vector against every document in the collection. This works for small datasets but can become slow as the dataset grows." ] }, { @@ -639,7 +639,7 @@ "source": [ "# Sample query from the dataset\n", "\n", - "query = \"Who will Daniel Dubois fight in Saudi Arabia on 22 February?\"\n", + "query = \"What is latest news on the death of Charles Breslin?\"\n", "\n", "try:\n", " # Perform the semantic search using the RAG pipeline\n", @@ -670,7 +670,7 @@ " for doc in answer.documents:\n", " print(f\"-> {doc.meta['title']}\")\n", " # Display search results\n", - " print(f\"\\nOptimized GSI Vector Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", + " print(f\"\\nLinear Vector Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", " #print(result[\"generator\"][\"replies\"][0])\n", "\n", "except Exception as e:\n", @@ -681,39 +681,32 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Create GSI Vector Index (Optimized Search)\n", + "# Create Hyperscale or Composite Vector Indexes\n", "\n", - "While the above RAG system works effectively, we can significantly improve query performance by leveraging Couchbase's advanced GSI vector search capabilities.\n", - "\n", - "Couchbase offers three types of vector indexes, but for GSI-based vector search we focus on two main types:\n", - "\n", - "In this section, we'll set up the Couchbase vector store using GSI (Global Secondary Index) for high-performance vector search.\n", - "\n", - "GSI vector search supports two main index types:\n", + "While the above RAG system works effectively, you can significantly improve query performance by enabling Couchbase Capella's Hyperscale or Composite Vector Indexes.\n", "\n", "## Hyperscale Vector Indexes\n", "- Specifically designed for vector searches\n", - "- Perform vector similarity and semantic searches faster than the other types of indexes\n", - "- Designed to scale to billions of vectors\n", - "- Most of the index resides in a highly optimized format on disk\n", - "- High accuracy even for vectors with a large number of dimensions\n", - "- Supports concurrent searches and inserts for datasets that are constantly changing\n", + "- Perform vector similarity and semantic searches faster than other index types\n", + "- Scale to billions of vectors while keeping most of the structure in an optimized on-disk format\n", + "- Maintain high accuracy even for vectors with a large number of dimensions\n", + "- Support concurrent searches and inserts for constantly changing datasets\n", "\n", - "Use this type of index when you want to primarily query vector values with a low memory footprint. In general, Hyperscale Vector indexes are the best choice for most applications that use vector searches.\n", + "Use this type of index when you primarily query vector values and need low-latency similarity search at scale. 
In general, Hyperscale Vector Indexes are the best starting point for most vector search workloads.\n", "\n", "## Composite Vector Indexes\n", - "- Combines a standard Global Secondary index (GSI) with a single vector column\n", - "- Designed for searches using a single vector value along with standard scalar values that filter out large portions of the dataset. The scalar attributes in a query reduce the number of vectors the Couchbase Server has to compare when performing a vector search to find similar vectors.\n", - "- Consume a moderate amount of memory and can index billions of documents.\n", - "- Work well for cases where your queries are highly selective — returning a small number of results from a large dataset\n", + "- Combine scalar filters with a single vector column in the same index definition\n", + "- Designed for searches that apply one vector value alongside scalar attributes that remove large portions of the dataset before similarity scoring\n", + "- Consume a moderate amount of memory and can index tens of millions to billions of documents\n", + "- Excel when your queries must return a small, highly targeted result set\n", "\n", - "Use Composite Vector indexes when you want to perform searches of documents using both scalars and a vector where the scalar values filter out large portions of the dataset.\n", + "Use Composite Vector Indexes when you want to perform searches that blend scalar predicates and vector similarity so that the scalar filters tighten the candidate set.\n", "\n", - "For more details, see the [Couchbase Vector Index documentation](https://docs.couchbase.com/server/current/vector-index/use-vector-indexes.html).\n", + "For an in-depth comparison and tuning guidance, review the [Couchbase vector index documentation](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html) and the [overview of Capella vector indexes](https://docs.couchbase.com/cloud/vector-index/vectors-and-indexes-overview.html).\n", "\n", "## Understanding Index Configuration (Couchbase 8.0 Feature)\n", "\n", - "The index_description parameter controls how Couchbase optimizes vector storage and search performance through centroids and quantization:\n", + "The `index_description` parameter controls how Couchbase optimizes vector storage and search performance through centroids and quantization:\n", "\n", "Format: `'IVF[<centroids>],{PQ|SQ}<settings>'`\n", "\n", "**Centroids:**\n", "- Controls how the dataset is subdivided for faster searches\n", "- More centroids = faster search, slower training \n", "- Fewer centroids = slower search, faster training\n", - "- If omitted (like IVF,SQ8), Couchbase auto-selects based on dataset size\n", + "- If omitted (like `IVF,SQ8`), Couchbase auto-selects based on dataset size\n", "\n", "**Quantization Options:**\n", - "- SQ (Scalar Quantization): SQ4, SQ6, SQ8 (4, 6, or 8 bits per dimension)\n", - "- PQ (Product Quantization): PQx (e.g., PQ32x8)\n", + "- SQ (Scalar Quantization): `SQ4`, `SQ6`, `SQ8` (4, 6, or 8 bits per dimension)\n", + "- PQ (Product Quantization): `PQ<subquantizers>x<bits>` (e.g., `PQ32x8`)\n", "- Higher values = better accuracy, larger index size\n", "\n", "**Common Examples:**\n", - "- IVF,SQ8 - Auto centroids, 8-bit scalar quantization (good default)\n", - "- IVF1000,SQ6 - 1000 centroids, 6-bit scalar quantization \n", - "- IVF,PQ32x8 - Auto centroids, 32 subquantizers with 8 bits\n", + "- `IVF,SQ8` – Auto centroids, 8-bit scalar quantization (good default)\n", + "- `IVF1000,SQ6` – 1000 centroids, 6-bit scalar quantization \n", + "- `IVF,PQ32x8` – Auto
centroids, 32 subquantizers with 8 bits\n", "\n", "For detailed configuration options, see the [Quantization & Centroid Settings](https://docs.couchbase.com/server/current/vector-index/hyperscale-vector-index.html#algo_settings).\n", "\n", - "In the code below, we demonstrate creating a BHIVE index for optimal performance. This method takes an index type (BHIVE or COMPOSITE) and description parameter for optimization settings. Alternatively, GSI indexes can be created manually from the Couchbase UI. " + "In the code below, we demonstrate creating a Hyperscale index for optimal performance. You can adapt the same flow to create a COMPOSITE index by replacing the index type and options." ] }, { @@ -744,40 +737,39 @@ "metadata": {}, "outputs": [], "source": [ - "# Create a BHIVE (Hyperscale Vector Index) for optimized vector search\n", + "# Create a Hyperscale Vector Index for optimized vector search\n", "try:\n", - " bhive_index_name = f\"{INDEX_NAME}_bhive\"\n", + " hyperscale_index_name = f\"{INDEX_NAME}_hyperscale\"\n", "\n", - " # Use the cluster connection to create the BHIVE index\n", + " # Use the cluster connection to create the Hyperscale index\n", " scope = cluster.bucket(CB_BUCKET_NAME).scope(SCOPE_NAME)\n", " \n", " options = {\n", " \"dimension\": 3072, # text-embedding-3-large dimension\n", - " \"description\": \"IVF1024,PQ32x8\",\n", " \"similarity\": \"L2\",\n", " }\n", " \n", " scope.query(\n", " f\"\"\"\n", - " CREATE INDEX {bhive_index_name}\n", + " CREATE VECTOR INDEX {hyperscale_index_name}\n", " ON {COLLECTION_NAME} (embedding VECTOR)\n", - " USING GSI WITH {json.dumps(options)}\n", + " WITH {json.dumps(options)}\n", " \"\"\",\n", " QueryOptions(\n", " timeout=timedelta(seconds=300)\n", " )).execute()\n", - " print(f\"Successfully created BHIVE index: {bhive_index_name}\")\n", + " print(f\"Successfully created Hyperscale index: {hyperscale_index_name}\")\n", "except Exception as e:\n", - " print(f\"BHIVE index may already exist or error occurred: {str(e)}\")\n" + " print(f\"Hyperscale index may already exist or error occurred: {str(e)}\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# Testing Optimized GSI Vector Search\n", + "# Testing Optimized Hyperscale Vector Search\n", "\n", - "The example below shows running the same RAG query, but now using the BHIVE GSI index we created above. You'll notice improved performance as the index efficiently retrieves data." + "The example below runs the same RAG query, but now uses the Hyperscale index created above. You'll notice improved performance as the index efficiently retrieves data. If you create a Composite index, the workflow is identical — Haystack automatically routes queries through the scalar filters before performing the vector similarity search." 
] }, { @@ -786,12 +778,12 @@ "metadata": {}, "outputs": [], "source": [ - "# Test the optimized GSI vector search with BHIVE index\n", - "query = \"Who will Daniel Dubois fight in Saudi Arabia on 22 February?\"\n", + "# Test the optimized Hyperscale vector search\n", + "query = \"What is latest news on the death of Charles Breslin?\"\n", "\n", "try:\n", - " # The RAG pipeline will automatically use the optimized GSI index\n", - " # Perform the semantic search with GSI optimization\n", + " # The RAG pipeline will automatically use the optimized Hyperscale index\n", + " # Perform the semantic search with Hyperscale optimization\n", " start_time = time.time()\n", " result = rag_pipeline.run({\n", " \"query_embedder\": {\"text\": query},\n", @@ -819,7 +811,7 @@ " for doc in answer.documents:\n", " print(f\"-> {doc.meta['title']}\")\n", " # Display search results\n", - " print(f\"\\nOptimized GSI Vector Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", + " print(f\"\\nOptimized Hyperscale Vector Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", " #print(result[\"generator\"][\"replies\"][0])\n", "\n", "except Exception as e:\n", @@ -831,16 +823,15 @@ "metadata": {}, "source": [ "# Conclusion\n", - "In this tutorial, we've built a Retrieval Augmented Generation (RAG) system using Couchbase Capella's GSI vector search, OpenAI, and Haystack. We used the BBC News dataset, which contains real-time news articles, to demonstrate how RAG can be used to answer questions about current events and provide up-to-date information that extends beyond the LLM's training data.\n", + "In this tutorial, we've built a Retrieval Augmented Generation (RAG) system using Haystack with OpenAI models and Couchbase Capella's Hyperscale and Composite Vector Indexes. Using the BBC News dataset, we demonstrated how modern vector indexes make it possible to answer up-to-date questions that extend beyond an LLM's original training data.\n", "\n", "The key components of our RAG system include:\n", "\n", - "1. **Couchbase Capella GSI Vector Search** as the high-performance vector database for storing and retrieving document embeddings\n", + "1. **Couchbase Capella Hyperscale & Composite Vector Indexes** for high-performance storage and retrieval of document embeddings\n", "2. **Haystack** as the framework for building modular RAG pipelines with flexible component connections\n", "3. **OpenAI Services** for generating embeddings (`text-embedding-3-large`) and LLM responses (`gpt-4o`)\n", - "4. **GSI Vector Indexes** (BHIVE/Composite) for optimized vector search performance\n", "\n", - "This approach allows us to enhance the capabilities of large language models by grounding their responses in specific, up-to-date information from our knowledge base, while leveraging Couchbase's advanced GSI vector search for optimal performance and scalability. Haystack's modular pipeline approach provides flexibility and extensibility for building complex RAG applications.\n" + "This approach grounds LLM responses in specific, current information from our knowledge base while taking advantage of Couchbase's advanced vector index options for performance and scale. 
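One practical consequence of building on Haystack worth noting here: pipelines are declarative objects that can be serialized and reloaded, so the RAG graph can be versioned alongside the index definitions. The sketch below assumes an `OPENAI_API_KEY` variable; note that token-based secrets (`Secret.from_token`, as used above) intentionally refuse to serialize, so a dump-friendly pipeline needs environment-variable secrets instead.

```python
import os
from haystack import Pipeline
from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret

# Environment-variable secrets serialize cleanly, unlike Secret.from_token.
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
portable = Pipeline()
portable.add_component(
    "llm",
    OpenAIGenerator(api_key=Secret.from_env_var("OPENAI_API_KEY"), model="gpt-4o"),
)

yaml_definition = portable.dumps()          # YAML you can commit or ship
restored = Pipeline.loads(yaml_definition)  # rebuilds an equivalent pipeline
```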
Haystack's modular pipeline model keeps the solution extensible as you layer in additional data sources or services.\n" ] } ], @@ -860,7 +851,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.13.2" + "version": "3.13.9" } }, "nbformat": 4, diff --git a/haystack/query_based/frontmatter.md b/haystack/query_based/frontmatter.md new file mode 100644 index 0000000..f2c67f2 --- /dev/null +++ b/haystack/query_based/frontmatter.md @@ -0,0 +1,26 @@ +--- +# frontmatter +path: "/tutorial-openai-haystack-rag-with-hyperscale-or-composite-vector-index" +alt_paths: + - "/tutorial-openai-haystack-rag-with-hyperscale-vector-index" + - "/tutorial-openai-haystack-rag-with-composite-vector-index" +title: "RAG with OpenAI, Haystack, and Couchbase Hyperscale & Composite Vector Indexes" +short_title: "RAG with OpenAI, Haystack, and Hyperscale & Composite Indexes" +description: + - Learn how to build a semantic search engine using Couchbase Hyperscale and Composite Vector Indexes. + - This tutorial demonstrates how Haystack integrates Couchbase Hyperscale and Composite Vector Indexes with embeddings generated by OpenAI services. + - Perform Retrieval-Augmented Generation (RAG) using Haystack with Couchbase and OpenAI services while comparing the two index types. +content_type: tutorial +filter: sdk +technology: + - vector search +tags: + - OpenAI + - Artificial Intelligence + - Haystack + - Hyperscale Vector Index + - Composite Vector Index +sdk_language: + - python +length: 60 Mins +--- diff --git a/haystack/fts/requirements.txt b/haystack/query_based/requirements.txt similarity index 100% rename from haystack/fts/requirements.txt rename to haystack/query_based/requirements.txt diff --git a/haystack/search_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb b/haystack/search_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb new file mode 100644 index 0000000..afb2063 --- /dev/null +++ b/haystack/search_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb @@ -0,0 +1,928 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# BBC News Dataset RAG Pipeline with Haystack, Couchbase Search Vector Index, and OpenAI\n", + "\n", + "This notebook demonstrates how to build a Retrieval Augmented Generation (RAG) system using:\n", + "- The BBC News dataset containing real-time news articles\n", + "- Couchbase Capella Search Vector Index for low-latency vector retrieval\n", + "- Haystack framework for the RAG pipeline\n", + "- OpenAI for embeddings and text generation\n", + "\n", + "The system allows users to ask questions about current events and get AI-generated answers based on the latest news articles." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Installing Necessary Libraries\n", + "\n", + "To build our RAG system, we need a set of libraries. The libraries we install handle everything from connecting to databases to performing AI tasks. Each library has a specific role: Couchbase libraries manage database operations, Haystack handles AI model integrations and pipeline management, and we will use the OpenAI SDK for generating embeddings and calling OpenAI's language models." 
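Once the install step below has completed and an API key has been collected (as in the earlier notebooks, where it is stored in an `OPENAI_API_KEY` variable), a quick hedged check that the OpenAI SDK can authenticate looks like this:

```python
from openai import OpenAI

# Listing models is a cheap way to confirm the key works before building the pipeline.
client = OpenAI(api_key=OPENAI_API_KEY)
first_model = client.models.list().data[0].id
print(f"OpenAI reachable; example model: {first_model}")
```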
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Successfully installed MarkupSafe-3.0.3 aiohappyeyeballs-2.6.1 aiohttp-3.13.2 aiosignal-1.4.0 annotated-types-0.7.0 anyio-4.11.0 attrs-25.4.0 backoff-2.2.1 backports-datetime-fromisoformat-2.0.3 certifi-2025.11.12 charset_normalizer-3.4.4 click-8.3.1 couchbase-4.5.0 couchbase-haystack-2.1.0 datasets-4.4.1 dill-0.4.0 distro-1.9.0 docstring-parser-0.17.0 filelock-3.20.0 filetype-1.2.0 frozenlist-1.8.0 fsspec-2025.10.0 h11-0.16.0 haystack-ai-2.20.0 haystack-experimental-0.14.2 hf-xet-1.2.0 httpcore-1.0.9 httpx-0.28.1 huggingface-hub-1.1.4 idna-3.11 jinja2-3.1.6 jiter-0.12.0 jsonschema-4.25.1 jsonschema-specifications-2025.9.1 lazy-imports-1.1.0 markdown-it-py-4.0.0 mdurl-0.1.2 more-itertools-10.8.0 multidict-6.7.0 multiprocess-0.70.18 networkx-3.5 numpy-2.3.5 openai-2.8.0 pandas-2.3.3 posthog-7.0.1 propcache-0.4.1 pyarrow-22.0.0 pydantic-2.12.4 pydantic-core-2.41.5 pytz-2025.2 pyyaml-6.0.3 referencing-0.37.0 requests-2.32.5 rich-14.2.0 rpds-py-0.29.0 shellingham-1.5.4 sniffio-1.3.1 tenacity-9.1.2 tqdm-4.67.1 typer-slim-0.20.0 typing-extensions-4.15.0 typing-inspection-0.4.2 tzdata-2025.2 urllib3-2.5.0 xxhash-3.6.0 yarl-1.22.0\n", + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], + "source": [ + "%pip install datasets haystack-ai couchbase-haystack openai pandas" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Importing Necessary Libraries\n", + "\n", + "The script starts by importing a series of
libraries required for the tutorial: logging, data handling with pandas, Couchbase connections, Haystack components for the RAG pipeline, embedding generation, and dataset loading."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
+      "  from .autonotebook import tqdm as notebook_tqdm\n"
+     ]
+    }
+   ],
+   "source": [
+    "import getpass\n",
+    "import logging\n",
+    "import pandas as pd\n",
+    "\n",
+    "from couchbase.auth import PasswordAuthenticator\n",
+    "from couchbase.cluster import Cluster\n",
+    "from couchbase.options import ClusterOptions\n",
+    "from datasets import load_dataset\n",
+    "from haystack import Pipeline, GeneratedAnswer\n",
+    "from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder\n",
+    "from haystack.components.preprocessors import DocumentCleaner\n",
+    "from haystack.components.writers import DocumentWriter\n",
+    "from haystack.components.builders.answer_builder import AnswerBuilder\n",
+    "from haystack.components.builders.prompt_builder import PromptBuilder\n",
+    "from haystack.components.generators import OpenAIGenerator\n",
+    "from haystack.utils import Secret\n",
+    "from haystack.dataclasses import Document\n",
+    "\n",
+    "from couchbase_haystack import (\n",
+    "    CouchbaseSearchDocumentStore,\n",
+    "    CouchbasePasswordAuthenticator,\n",
+    "    CouchbaseClusterOptions,\n",
+    "    CouchbaseSearchEmbeddingRetriever,\n",
+    ")\n",
+    "from couchbase.options import KnownConfigProfiles\n",
+    "\n",
+    "# Configure logging\n",
+    "logger = logging.getLogger(__name__)\n",
+    "logger.setLevel(logging.DEBUG)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Prerequisites\n",
+    "\n",
+    "## Create and Deploy Your Operational Cluster on Capella\n",
+    "\n",
+    "To get started with Couchbase Capella, create an account and use it to deploy an operational cluster.\n",
+    "\n",
+    "To learn more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n",
+    "\n",
+    "### Couchbase Capella Configuration\n",
+    "\n",
+    "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met:\n",
+    "\n",
+    "* Have a multi-node Capella cluster running the Data, Query, Index, and Search services.\n",
+    "* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) with Read and Write access to the bucket used in this application.\n",
+    "* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the cluster from the IP on which the application is running.\n",
+    "\n",
+    "### OpenAI Models Setup\n",
+    "\n",
+    "In order to create the RAG application, we need an embedding model to ingest the documents for Vector Search and a large language model (LLM) for generating responses based on the context.\n",
\n", + "\n", + "For this implementation, we'll use OpenAI's models which provide state-of-the-art performance for both embeddings and text generation:\n", + "\n", + "**Embedding Model**: We'll use OpenAI's `text-embedding-3-large` model, which provides high-quality embeddings with 3,072 dimensions for semantic search capabilities.\n", + "\n", + "**Large Language Model**: We'll use OpenAI's `gpt-4o` model for generating responses based on the retrieved context. This model offers excellent reasoning capabilities and can handle complex queries effectively.\n", + "\n", + "**Prerequisites for OpenAI Integration**:\n", + "* Create an OpenAI account at [platform.openai.com](https://platform.openai.com)\n", + "* Generate an API key from your OpenAI dashboard\n", + "* Ensure you have sufficient credits or a valid payment method set up\n", + "* Set up your API key as an environment variable or input it securely in the notebook\n", + "\n", + "For more details about OpenAI's models and pricing, please refer to the [OpenAI documentation](https://platform.openai.com/docs/models)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Configure Couchbase Credentials\n", + "\n", + "Enter your Couchbase and OpenAI credentials:\n", + "\n", + "**OPENAI_API_KEY** is your OpenAI API key which can be obtained from your OpenAI dashboard at [platform.openai.com](https://platform.openai.com/api-keys).\n", + "\n", + "**INDEX_NAME** is the name of the Search Vector Index used for vector search operations." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "CB_CONNECTION_STRING = input(\"Couchbase Cluster URL (default: localhost): \") or \"localhost\"\n", + "CB_USERNAME = input(\"Couchbase Username (default: admin): \") or \"admin\"\n", + "CB_PASSWORD = input(\"Couchbase password (default: Password@12345): \") or \"Password@12345\"\n", + "CB_BUCKET_NAME = input(\"Couchbase Bucket: \")\n", + "CB_SCOPE_NAME = input(\"Couchbase Scope: \")\n", + "CB_COLLECTION_NAME = input(\"Couchbase Collection: \")\n", + "CB_INDEX_NAME = input(\"Vector Search Index: \")\n", + "OPENAI_API_KEY = input(\"OpenAI API Key: \")\n", + "\n", + "# Check if the variables are correctly loaded\n", + "if not all([CB_CONNECTION_STRING, CB_USERNAME, CB_PASSWORD, CB_BUCKET_NAME, CB_SCOPE_NAME, CB_COLLECTION_NAME, CB_INDEX_NAME, OPENAI_API_KEY]):\n", + " raise ValueError(\"All configuration variables must be provided.\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Bucket 'b' already exists.\n", + "Scope 's' already exists.\n", + "Collection 'c' already exists in scope 's'.\n", + "Search Vector Index 'vector_search' already exists at scope level.\n" + ] + } + ], + "source": [ + "from couchbase.cluster import Cluster \n", + "from couchbase.options import ClusterOptions\n", + "from couchbase.auth import PasswordAuthenticator\n", + "from couchbase.management.buckets import CreateBucketSettings\n", + "from couchbase.management.collections import CollectionSpec\n", + "from couchbase.management.search import SearchIndex\n", + "import json\n", + "\n", + "# Connect to Couchbase cluster\n", + "cluster = Cluster(CB_CONNECTION_STRING, ClusterOptions(\n", + " PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)))\n", + "\n", + "# Create bucket if it does not exist\n", + "bucket_manager = cluster.buckets()\n", + "try:\n", + " bucket_manager.get_bucket(CB_BUCKET_NAME)\n", 
+ " print(f\"Bucket '{CB_BUCKET_NAME}' already exists.\")\n", + "except Exception as e:\n", + " print(f\"Bucket '{CB_BUCKET_NAME}' does not exist. Creating bucket...\")\n", + " bucket_settings = CreateBucketSettings(name=CB_BUCKET_NAME, ram_quota_mb=500)\n", + " bucket_manager.create_bucket(bucket_settings)\n", + " print(f\"Bucket '{CB_BUCKET_NAME}' created successfully.\")\n", + "\n", + "# Create scope and collection if they do not exist\n", + "collection_manager = cluster.bucket(CB_BUCKET_NAME).collections()\n", + "scopes = collection_manager.get_all_scopes()\n", + "scope_exists = any(scope.name == CB_SCOPE_NAME for scope in scopes)\n", + "\n", + "if scope_exists:\n", + " print(f\"Scope '{CB_SCOPE_NAME}' already exists.\")\n", + "else:\n", + " print(f\"Scope '{CB_SCOPE_NAME}' does not exist. Creating scope...\")\n", + " collection_manager.create_scope(CB_SCOPE_NAME)\n", + " print(f\"Scope '{CB_SCOPE_NAME}' created successfully.\")\n", + "\n", + "collections = [collection.name for scope in scopes if scope.name == CB_SCOPE_NAME for collection in scope.collections]\n", + "collection_exists = CB_COLLECTION_NAME in collections\n", + "\n", + "if collection_exists:\n", + " print(f\"Collection '{CB_COLLECTION_NAME}' already exists in scope '{CB_SCOPE_NAME}'.\")\n", + "else:\n", + " print(f\"Collection '{CB_COLLECTION_NAME}' does not exist in scope '{CB_SCOPE_NAME}'. Creating collection...\")\n", + " collection_manager.create_collection(collection_name=CB_COLLECTION_NAME, scope_name=CB_SCOPE_NAME)\n", + " print(f\"Collection '{CB_COLLECTION_NAME}' created successfully.\")\n", + "\n", + "# Create Search Vector Index from search_vector_index.json file at scope level\n", + "with open('search_vector_index.json', 'r') as search_file:\n", + " search_index_definition = SearchIndex.from_json(json.load(search_file))\n", + " \n", + " # Update search index definition with user inputs\n", + " search_index_definition.name = CB_INDEX_NAME\n", + " search_index_definition.source_name = CB_BUCKET_NAME\n", + " \n", + " # Update types mapping\n", + " old_type_key = next(iter(search_index_definition.params['mapping']['types'].keys()))\n", + " type_obj = search_index_definition.params['mapping']['types'].pop(old_type_key)\n", + " search_index_definition.params['mapping']['types'][f\"{CB_SCOPE_NAME}.{CB_COLLECTION_NAME}\"] = type_obj\n", + " \n", + " search_index_name = search_index_definition.name\n", + " \n", + " # Get scope-level search manager\n", + " scope_search_manager = cluster.bucket(CB_BUCKET_NAME).scope(CB_SCOPE_NAME).search_indexes()\n", + " \n", + " try:\n", + " # Check if index exists at scope level\n", + " existing_index = scope_search_manager.get_index(search_index_name)\n", + " print(f\"Search Vector Index '{search_index_name}' already exists at scope level.\")\n", + " except Exception as e:\n", + " print(f\"Search Vector Index '{search_index_name}' does not exist at scope level. 
+    "    scope_search_manager.upsert_index(search_index_definition)\n",
+    "    print(f\"Search Vector Index '{search_index_name}' created successfully at scope level.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Load and Process Movie Dataset\n",
+    "\n",
+    "Load the TMDB movie dataset and prepare documents for indexing:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Loading TMDB dataset...\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Generating train split: 100%|██████████| 4803/4803 [00:00<00:00, 123144.70 examples/s]\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Total movies found: 4803\n",
+      "Created 4800 documents with valid overviews\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Load TMDB dataset\n",
+    "print(\"Loading TMDB dataset...\")\n",
+    "dataset = load_dataset(\"AiresPucrs/tmdb-5000-movies\")\n",
+    "movies_df = pd.DataFrame(dataset['train'])\n",
+    "print(f\"Total movies found: {len(movies_df)}\")\n",
+    "\n",
+    "# Create documents from movie data\n",
+    "docs_data = []\n",
+    "for _, row in movies_df.iterrows():\n",
+    "    if pd.isna(row['overview']):\n",
+    "        continue\n",
+    "\n",
+    "    try:\n",
+    "        # 'genres' is stored as a JSON-encoded string, so parse it safely\n",
+    "        genres = ', '.join(genre['name'] for genre in json.loads(row['genres']))\n",
+    "        docs_data.append({\n",
+    "            'id': str(row[\"id\"]),\n",
+    "            'content': f\"Title: {row['title']}\\nGenres: {genres}\\nOverview: {row['overview']}\",\n",
+    "            'metadata': {\n",
+    "                'title': row['title'],\n",
+    "                'genres': row['genres'],\n",
+    "                'original_language': row['original_language'],\n",
+    "                'popularity': float(row['popularity']),\n",
+    "                'release_date': row['release_date'],\n",
+    "                'vote_average': float(row['vote_average']),\n",
+    "                'vote_count': int(row['vote_count']),\n",
+    "                'budget': int(row['budget']),\n",
+    "                'revenue': int(row['revenue'])\n",
+    "            }\n",
+    "        })\n",
+    "    except Exception as e:\n",
+    "        logger.error(f\"Error processing movie {row['title']}: {e}\")\n",
+    "\n",
+    "print(f\"Created {len(docs_data)} documents with valid overviews\")\n",
+    "documents = [Document(id=doc['id'], content=doc['content'], meta=doc['metadata'])\n",
+    "             for doc in docs_data]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Initialize Document Store\n",
+    "\n",
+    "Set up the Couchbase document store for storing movie data and embeddings:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Couchbase document store initialized successfully.\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Initialize document store\n",
+    "document_store = CouchbaseSearchDocumentStore(\n",
+    "    cluster_connection_string=Secret.from_token(CB_CONNECTION_STRING),\n",
+    "    authenticator=CouchbasePasswordAuthenticator(\n",
+    "        username=Secret.from_token(CB_USERNAME),\n",
+    "        password=Secret.from_token(CB_PASSWORD)\n",
+    "    ),\n",
+    "    cluster_options=CouchbaseClusterOptions(\n",
+    "        profile=KnownConfigProfiles.WanDevelopment,\n",
+    "    ),\n",
+    "    bucket=CB_BUCKET_NAME,\n",
+    "    scope=CB_SCOPE_NAME,\n",
+    "    collection=CB_COLLECTION_NAME,\n",
+    "    vector_search_index=CB_INDEX_NAME,\n",
+    ")\n",
+    "\n",
+    "print(\"Couchbase document store initialized successfully.\")"
+   ]
+  },
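+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As an optional connectivity check (not part of the original flow), you can ask the store how many documents the collection currently holds; a freshly created collection should report 0. This sketch assumes the standard Haystack `count_documents()` document store API:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional: confirm the store is reachable before indexing anything\n",
+    "print(f\"Documents currently in the store: {document_store.count_documents()}\")"
+   ]
+  },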
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Initialize Embedders\n",
+    "\n",
+    "Configure the embedders using OpenAI's `text-embedding-3-large` model. The `OpenAIDocumentEmbedder` generates an embedding for each movie overview at indexing time, and the `OpenAITextEmbedder` embeds user queries at retrieval time so they can be matched against the indexed vectors."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "embedder = OpenAIDocumentEmbedder(\n",
+    "    api_key=Secret.from_token(OPENAI_API_KEY),\n",
+    "    model=\"text-embedding-3-large\",\n",
+    ")\n",
+    "\n",
+    "rag_embedder = OpenAITextEmbedder(\n",
+    "    api_key=Secret.from_token(OPENAI_API_KEY),\n",
+    "    model=\"text-embedding-3-large\",\n",
+    ")\n"
+   ]
+  },
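+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Optionally, before embedding thousands of documents, you can sanity-check the API key and confirm that the embedding dimensionality matches the 3,072 `dims` configured in the Search Vector Index. This quick check is an illustrative addition, not a required step:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional sanity check: embed a short string and verify the vector size\n",
+    "check = rag_embedder.run(text=\"connectivity check\")\n",
+    "print(len(check[\"embedding\"]))  # expected: 3072 for text-embedding-3-large\n"
+   ]
+  },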
pipeline\n", + "\n", + "if documents:\n", + " # Process documents in batches for better performance\n", + " batch_size = 100\n", + " total_docs = len(documents[:200])\n", + " \n", + " for i in range(0, total_docs, batch_size):\n", + " batch = documents[i:i + batch_size]\n", + " result = index_pipeline.run({\"cleaner\": {\"documents\": batch}})\n", + " print(f\"Processed batch {i//batch_size + 1}: {len(batch)} documents\")\n", + " \n", + " print(f\"\\nSuccessfully processed {total_docs} documents\")\n", + " print(f\"Sample document metadata: {documents[0].meta}\")\n", + "else:\n", + " print(\"No documents created. Skipping indexing.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Create RAG Pipeline\n", + "\n", + "Set up the Retrieval Augmented Generation pipeline for answering questions about movies:" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "PromptBuilder has 2 prompt variables, but `required_variables` is not set. By default, all prompt variables are treated as optional, which may lead to unintended behavior in multi-branch pipelines. To avoid unexpected execution, ensure that variables intended to be required are explicitly set in `required_variables`.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "RAG pipeline created successfully.\n" + ] + } + ], + "source": [ + "# Define RAG prompt template\n", + "prompt_template = \"\"\"\n", + "Given these documents, answer the question.\\nDocuments:\n", + "{% for doc in documents %}\n", + " {{ doc.content }}\n", + "{% endfor %}\n", + "\n", + "\\nQuestion: {{question}}\n", + "\\nAnswer:\n", + "\"\"\"\n", + "\n", + "# Create RAG pipeline\n", + "rag_pipeline = Pipeline()\n", + "\n", + "# Add components\n", + "rag_pipeline.add_component(\n", + " \"query_embedder\",\n", + " rag_embedder,\n", + ")\n", + "rag_pipeline.add_component(\"retriever\", CouchbaseSearchEmbeddingRetriever(document_store=document_store))\n", + "rag_pipeline.add_component(\"prompt_builder\", PromptBuilder(template=prompt_template))\n", + "rag_pipeline.add_component(\"llm\",llm)\n", + "rag_pipeline.add_component(\"answer_builder\", AnswerBuilder())\n", + "\n", + "# Connect RAG components\n", + "rag_pipeline.connect(\"query_embedder\", \"retriever.query_embedding\")\n", + "rag_pipeline.connect(\"retriever.documents\", \"prompt_builder.documents\")\n", + "rag_pipeline.connect(\"prompt_builder.prompt\", \"llm.prompt\")\n", + "rag_pipeline.connect(\"llm.replies\", \"answer_builder.replies\")\n", + "rag_pipeline.connect(\"llm.meta\", \"answer_builder.meta\")\n", + "rag_pipeline.connect(\"retriever\", \"answer_builder.documents\")\n", + "\n", + "print(\"RAG pipeline created successfully.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Ask Questions About Movies\n", + "\n", + "Use the RAG pipeline to ask questions about movies and get AI-generated answers:" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "=== Retrieved Documents ===\n", + "Id: 006b97c08110cb1b9b58e03943c91fa9412cfe7a2a22830ba5b9e3eb0c342344 Title: Run Lola Run\n", + "Id: 33543dab4c048c9467d632f319e02bca94da6f178250c14d26eabfb30911a823 Title: Mambo Italiano\n", + "Id: 94c55246e02c290767531f6359b5f44145191e3f2d62a3a64ed4718a666be9f2 Title: Good bye, Lenin!\n", + "Id: 
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Ask Questions About Movies\n",
+    "\n",
+    "Use the RAG pipeline to ask questions about movies and get AI-generated answers:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 23,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "=== Retrieved Documents ===\n",
+      "Id: 006b97c08110cb1b9b58e03943c91fa9412cfe7a2a22830ba5b9e3eb0c342344 Title: Run Lola Run\n",
+      "Id: 33543dab4c048c9467d632f319e02bca94da6f178250c14d26eabfb30911a823 Title: Mambo Italiano\n",
+      "Id: 94c55246e02c290767531f6359b5f44145191e3f2d62a3a64ed4718a666be9f2 Title: Good bye, Lenin!\n",
+      "Id: 00b4d1f455e45fbffa39f72be6de635bdcdb6b8a04289ba4aea41061700b9096 Title: Mean Streets\n",
+      "Id: 9241f819303fe61a25e05469856c01a8843d53a6ce7cec340bf0def848ddb470 Title: Magnolia\n",
+      "\n",
+      "=== Final Answer ===\n",
+      "Question: Why did Manni call Lola?\n",
+      "Answer: Manni called Lola because he lost 100,000 DM in a subway train that belongs to a very bad guy, and he needs her help to raise the money within 20 minutes to prevent him from having to rob a store to get the money.\n",
+      "\n",
+      "Sources:\n",
+      "-> Run Lola Run\n",
+      "-> Mambo Italiano\n",
+      "-> Good bye, Lenin!\n",
+      "-> Mean Streets\n",
+      "-> Magnolia\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Example question\n",
+    "question = \"Why did Manni call Lola?\"\n",
+    "\n",
+    "# Run the RAG pipeline\n",
+    "result = rag_pipeline.run(\n",
+    "    {\n",
+    "        \"query_embedder\": {\"text\": question},\n",
+    "        \"retriever\": {\"top_k\": 5},\n",
+    "        \"prompt_builder\": {\"question\": question},\n",
+    "        \"answer_builder\": {\"query\": question},\n",
+    "    },\n",
+    "    include_outputs_from={\"retriever\", \"query_embedder\"}\n",
+    ")\n",
+    "\n",
+    "# Get the generated answer\n",
+    "answer: GeneratedAnswer = result[\"answer_builder\"][\"answers\"][0]\n",
+    "\n",
+    "# Print retrieved documents\n",
+    "print(\"=== Retrieved Documents ===\")\n",
+    "retrieved_docs = result[\"retriever\"][\"documents\"]\n",
+    "for doc in retrieved_docs:\n",
+    "    print(f\"Id: {doc.id} Title: {doc.meta['title']}\")\n",
+    "\n",
+    "# Print final results\n",
+    "print(\"\\n=== Final Answer ===\")\n",
+    "print(f\"Question: {answer.query}\")\n",
+    "print(f\"Answer: {answer.data}\")\n",
+    "print(\"\\nSources:\")\n",
+    "for doc in answer.documents:\n",
+    "    print(f\"-> {doc.meta['title']}\")"
+   ]
+  },
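+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The same pipeline can be reused for any question about the indexed movies. As an illustrative sketch (the answer will vary with the retrieved documents):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Ask another question with the same RAG pipeline (illustrative example)\n",
+    "question = \"Which of these movies are crime comedies?\"\n",
+    "\n",
+    "result = rag_pipeline.run(\n",
+    "    {\n",
+    "        \"query_embedder\": {\"text\": question},\n",
+    "        \"retriever\": {\"top_k\": 3},\n",
+    "        \"prompt_builder\": {\"question\": question},\n",
+    "        \"answer_builder\": {\"query\": question},\n",
+    "    }\n",
+    ")\n",
+    "\n",
+    "print(result[\"answer_builder\"][\"answers\"][0].data)"
+   ]
+  },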
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Conclusion\n",
+    "\n",
+    "In this tutorial, we built a Retrieval-Augmented Generation (RAG) system using Couchbase Capella, OpenAI, and Haystack with the TMDB movie dataset. This demonstrates how to combine a Couchbase Search Vector Index with large language models to answer natural-language questions over a domain-specific document collection.\n",
+    "\n",
+    "The key components include:\n",
+    "- **Couchbase Capella Search Vector Index** for vector storage and retrieval\n",
+    "- **Haystack** for pipeline orchestration and component management\n",
+    "- **OpenAI** for embeddings (`text-embedding-3-large`) and text generation (`gpt-4o`)\n",
+    "\n",
+    "This approach enables AI applications to access and reason over information that extends beyond the LLM's training data, making responses more accurate and relevant for real-world use cases."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": ".venv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.13.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/haystack/search_based/frontmatter.md b/haystack/search_based/frontmatter.md
new file mode 100644
index 0000000..5225174
--- /dev/null
+++ b/haystack/search_based/frontmatter.md
@@ -0,0 +1,22 @@
+---
+# frontmatter
+path: "/tutorial-openai-haystack-rag-with-search-vector-index"
+title: "RAG with OpenAI, Haystack, and Couchbase Search Vector Index"
+short_title: "RAG with OpenAI, Haystack, and Search Vector Index"
+description:
+  - Learn how to build a semantic search engine using the Couchbase Search Vector Index.
+  - This tutorial demonstrates how Haystack integrates Couchbase Search Vector Index with embeddings generated by OpenAI services.
+  - Perform Retrieval-Augmented Generation (RAG) using Haystack with Couchbase and OpenAI services.
+content_type: tutorial
+filter: sdk
+technology:
+  - vector search
+tags:
+  - OpenAI
+  - Artificial Intelligence
+  - Haystack
+  - Search Vector Index
+sdk_language:
+  - python
+length: 60 Mins
+---
diff --git a/haystack/gsi/requirements.txt b/haystack/search_based/requirements.txt
similarity index 100%
rename from haystack/gsi/requirements.txt
rename to haystack/search_based/requirements.txt
diff --git a/haystack/fts/fts_index.json b/haystack/search_based/search_vector_index.json
similarity index 100%
rename from haystack/fts/fts_index.json
rename to haystack/search_based/search_vector_index.json