From 0e7e211e815fd9e99f5cbbc2199d84e019e5269b Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 18 Nov 2025 07:27:45 +0000 Subject: [PATCH] Update vector search tutorial terminology to Couchbase 8.0 standards Rename folder structure: - fts/ -> search_based/ - gsi/ -> query_based/ Update terminology throughout notebooks: - FTS -> Search Vector Index - GSI -> Hyperscale & Composite Vector Indexes Add documentation links to Choose the Right Vector Index guide. Update README.md with new folder structure and terminology. --- README.md | 18 +- .../gsi/RAG_with_Couchbase_and_Bedrock.ipynb | 1077 ---------- awsbedrock/{gsi => query_based}/.env.sample | 0 .../RAG_with_Couchbase_and_Bedrock.ipynb | 1077 ++++++++++ .../{gsi => query_based}/frontmatter.md | 0 awsbedrock/{fts => search_based}/.env.sample | 0 .../RAG_with_Couchbase_and_Bedrock.ipynb | 32 +- .../{fts => search_based}/frontmatter.md | 0 .../RAG_with_Couchbase_and_AzureOpenAI.ipynb | 947 --------- .../RAG_with_Couchbase_and_AzureOpenAI.ipynb | 1103 ----------- azure/{gsi => query_based}/.env.sample | 0 .../RAG_with_Couchbase_and_AzureOpenAI.ipynb | 1103 +++++++++++ azure/{gsi => query_based}/frontmatter.md | 0 .../RAG_with_Couchbase_and_AzureOpenAI.ipynb | 947 +++++++++ azure/{fts => search_based}/azure_index.json | 0 azure/{fts => search_based}/frontmatter.md | 0 ...h_Couchbase_and_Claude(by_Anthropic).ipynb | 1049 ---------- ...h_Couchbase_and_Claude(by_Anthropic).ipynb | 1089 ---------- claudeai/{gsi => query_based}/.env.sample | 0 ...h_Couchbase_and_Claude(by_Anthropic).ipynb | 1089 ++++++++++ claudeai/{gsi => query_based}/frontmatter.md | 0 claudeai/{fts => search_based}/.env.sample | 0 ...h_Couchbase_and_Claude(by_Anthropic).ipynb | 1049 ++++++++++ .../{fts => search_based}/claude_index.json | 0 claudeai/{fts => search_based}/frontmatter.md | 0 .../fts/RAG_with_Couchbase_and_Cohere.ipynb | 1019 ---------- .../gsi/RAG_with_Couchbase_and_Cohere.ipynb | 1059 ---------- cohere/{gsi => query_based}/.env.sample | 0 .../RAG_with_Couchbase_and_Cohere.ipynb | 1059 ++++++++++ cohere/{gsi => query_based}/frontmatter.md | 0 cohere/{fts => search_based}/.env.sample | 0 .../RAG_with_Couchbase_and_Cohere.ipynb | 1019 ++++++++++ .../{fts => search_based}/cohere_index.json | 0 cohere/{fts => search_based}/frontmatter.md | 0 .../fts/CouchbaseStorage_Demo.ipynb | 1212 ------------ .../gsi/CouchbaseStorage_Demo.ipynb | 1747 ----------------- .../{fts => query_based}/.env.sample | 0 .../query_based/CouchbaseStorage_Demo.ipynb | 1747 +++++++++++++++++ .../{gsi => query_based}/frontmatter.md | 0 .../{gsi => search_based}/.env.sample | 0 .../search_based/CouchbaseStorage_Demo.ipynb | 1212 ++++++++++++ .../{fts => search_based}/crew_index.json | 0 .../{fts => search_based}/frontmatter.md | 0 .../fts/RAG_with_Couchbase_and_CrewAI.ipynb | 1464 -------------- .../gsi/RAG_with_Couchbase_and_CrewAI.ipynb | 1578 --------------- crewai/{gsi => query_based}/.env.sample | 0 .../RAG_with_Couchbase_and_CrewAI.ipynb | 1578 +++++++++++++++ crewai/{gsi => query_based}/frontmatter.md | 0 crewai/{fts => search_based}/.env.sample | 0 .../RAG_with_Couchbase_and_CrewAI.ipynb | 1464 ++++++++++++++ crewai/{fts => search_based}/crew_index.json | 0 crewai/{fts => search_based}/frontmatter.md | 0 ...AG_with_Couchbase_Capella_and_OpenAI.ipynb | 566 ------ ...AG_with_Couchbase_Capella_and_OpenAI.ipynb | 10 +- haystack/{gsi => query_based}/frontmatter.md | 0 .../{fts => query_based}/requirements.txt | 0 ...AG_with_Couchbase_Capella_and_OpenAI.ipynb | 566 ++++++ haystack/{fts => 
search_based}/frontmatter.md | 0 .../{gsi => search_based}/requirements.txt | 0 .../search_vector_index.json} | 0 huggingface/{fts => query_based}/.env.sample | 0 .../{gsi => query_based}/frontmatter.md | 0 .../{gsi => query_based}/hugging_face.ipynb | 74 +- huggingface/{gsi => search_based}/.env.sample | 0 .../{fts => search_based}/frontmatter.md | 0 .../{fts => search_based}/hugging_face.ipynb | 10 +- .../huggingface_index.json | 0 .../fts/RAG_with_Couchbase_and_Jina_AI.ipynb | 1110 ----------- jinaai/{gsi => query_based}/.env.sample | 0 .../RAG_with_Couchbase_and_Jina_AI.ipynb | 82 +- jinaai/{gsi => query_based}/frontmatter.md | 0 jinaai/{fts => search_based}/.env.sample | 0 .../RAG_with_Couchbase_and_Jina_AI.ipynb | 1110 +++++++++++ jinaai/{fts => search_based}/frontmatter.md | 0 jinaai/{fts => search_based}/jina_index.json | 0 ...AG_with_Couchbase_Capella_and_OpenAI.ipynb | 10 +- lamaindex/{gsi => query_based}/frontmatter.md | 0 ...AG_with_Couchbase_Capella_and_OpenAI.ipynb | 4 +- .../{fts => search_based}/frontmatter.md | 0 .../search_vector_index.json} | 0 mistralai/gsi/mistralai.ipynb | 800 -------- mistralai/{fts => query_based}/.env.sample | 0 mistralai/{gsi => query_based}/frontmatter.md | 0 mistralai/query_based/mistralai.ipynb | 800 ++++++++ mistralai/{gsi => search_based}/.env.sample | 0 .../{fts => search_based}/frontmatter.md | 0 .../{fts => search_based}/mistralai.ipynb | 16 +- .../mistralai_index.json | 0 ...th_Couchbase_and_Openrouter_Deepseek.ipynb | 1089 ---------- .../{gsi => query_based}/.env.sample | 0 ...th_Couchbase_and_Openrouter_Deepseek.ipynb | 1089 ++++++++++ .../{gsi => query_based}/frontmatter.md | 0 .../{fts => search_based}/.env.sample | 0 ...th_Couchbase_and_Openrouter_Deepseek.ipynb | 52 +- .../{fts => search_based}/deepseek_index.json | 0 .../{fts => search_based}/frontmatter.md | 0 .../RAG_with_Couchbase_and_PydanticAI.ipynb | 899 --------- pydantic_ai/{gsi => query_based}/.env.sample | 0 .../RAG_with_Couchbase_and_PydanticAI.ipynb | 110 +- .../{gsi => query_based}/frontmatter.md | 0 pydantic_ai/{fts => search_based}/.env.sample | 0 .../RAG_with_Couchbase_and_PydanticAI.ipynb | 899 +++++++++ .../{fts => search_based}/frontmatter.md | 0 .../pydantic_ai_index.json | 0 .../RAG_with_Couchbase_and_SmolAgents.ipynb | 1002 ---------- .../RAG_with_Couchbase_and_SmolAgents.ipynb | 1038 ---------- smolagents/{gsi => query_based}/.env.sample | 0 .../RAG_with_Couchbase_and_SmolAgents.ipynb | 1038 ++++++++++ .../{gsi => query_based}/frontmatter.md | 0 smolagents/{fts => search_based}/.env.sample | 0 .../RAG_with_Couchbase_and_SmolAgents.ipynb | 1002 ++++++++++ .../{fts => search_based}/frontmatter.md | 0 .../smolagents_index.json | 0 113 files changed, 20058 insertions(+), 20056 deletions(-) delete mode 100644 awsbedrock/gsi/RAG_with_Couchbase_and_Bedrock.ipynb rename awsbedrock/{gsi => query_based}/.env.sample (100%) create mode 100644 awsbedrock/query_based/RAG_with_Couchbase_and_Bedrock.ipynb rename awsbedrock/{gsi => query_based}/frontmatter.md (100%) rename awsbedrock/{fts => search_based}/.env.sample (100%) rename awsbedrock/{fts => search_based}/RAG_with_Couchbase_and_Bedrock.ipynb (90%) rename awsbedrock/{fts => search_based}/frontmatter.md (100%) delete mode 100644 azure/fts/RAG_with_Couchbase_and_AzureOpenAI.ipynb delete mode 100644 azure/gsi/RAG_with_Couchbase_and_AzureOpenAI.ipynb rename azure/{gsi => query_based}/.env.sample (100%) create mode 100644 azure/query_based/RAG_with_Couchbase_and_AzureOpenAI.ipynb rename azure/{gsi => 
query_based}/frontmatter.md (100%) create mode 100644 azure/search_based/RAG_with_Couchbase_and_AzureOpenAI.ipynb rename azure/{fts => search_based}/azure_index.json (100%) rename azure/{fts => search_based}/frontmatter.md (100%) delete mode 100644 claudeai/fts/RAG_with_Couchbase_and_Claude(by_Anthropic).ipynb delete mode 100644 claudeai/gsi/RAG_with_Couchbase_and_Claude(by_Anthropic).ipynb rename claudeai/{gsi => query_based}/.env.sample (100%) create mode 100644 claudeai/query_based/RAG_with_Couchbase_and_Claude(by_Anthropic).ipynb rename claudeai/{gsi => query_based}/frontmatter.md (100%) rename claudeai/{fts => search_based}/.env.sample (100%) create mode 100644 claudeai/search_based/RAG_with_Couchbase_and_Claude(by_Anthropic).ipynb rename claudeai/{fts => search_based}/claude_index.json (100%) rename claudeai/{fts => search_based}/frontmatter.md (100%) delete mode 100644 cohere/fts/RAG_with_Couchbase_and_Cohere.ipynb delete mode 100644 cohere/gsi/RAG_with_Couchbase_and_Cohere.ipynb rename cohere/{gsi => query_based}/.env.sample (100%) create mode 100644 cohere/query_based/RAG_with_Couchbase_and_Cohere.ipynb rename cohere/{gsi => query_based}/frontmatter.md (100%) rename cohere/{fts => search_based}/.env.sample (100%) create mode 100644 cohere/search_based/RAG_with_Couchbase_and_Cohere.ipynb rename cohere/{fts => search_based}/cohere_index.json (100%) rename cohere/{fts => search_based}/frontmatter.md (100%) delete mode 100644 crewai-short-term-memory/fts/CouchbaseStorage_Demo.ipynb delete mode 100644 crewai-short-term-memory/gsi/CouchbaseStorage_Demo.ipynb rename crewai-short-term-memory/{fts => query_based}/.env.sample (100%) create mode 100644 crewai-short-term-memory/query_based/CouchbaseStorage_Demo.ipynb rename crewai-short-term-memory/{gsi => query_based}/frontmatter.md (100%) rename crewai-short-term-memory/{gsi => search_based}/.env.sample (100%) create mode 100644 crewai-short-term-memory/search_based/CouchbaseStorage_Demo.ipynb rename crewai-short-term-memory/{fts => search_based}/crew_index.json (100%) rename crewai-short-term-memory/{fts => search_based}/frontmatter.md (100%) delete mode 100644 crewai/fts/RAG_with_Couchbase_and_CrewAI.ipynb delete mode 100644 crewai/gsi/RAG_with_Couchbase_and_CrewAI.ipynb rename crewai/{gsi => query_based}/.env.sample (100%) create mode 100644 crewai/query_based/RAG_with_Couchbase_and_CrewAI.ipynb rename crewai/{gsi => query_based}/frontmatter.md (100%) rename crewai/{fts => search_based}/.env.sample (100%) create mode 100644 crewai/search_based/RAG_with_Couchbase_and_CrewAI.ipynb rename crewai/{fts => search_based}/crew_index.json (100%) rename crewai/{fts => search_based}/frontmatter.md (100%) delete mode 100644 haystack/fts/RAG_with_Couchbase_Capella_and_OpenAI.ipynb rename haystack/{gsi => query_based}/RAG_with_Couchbase_Capella_and_OpenAI.ipynb (98%) rename haystack/{gsi => query_based}/frontmatter.md (100%) rename haystack/{fts => query_based}/requirements.txt (100%) create mode 100644 haystack/search_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb rename haystack/{fts => search_based}/frontmatter.md (100%) rename haystack/{gsi => search_based}/requirements.txt (100%) rename haystack/{fts/fts_index.json => search_based/search_vector_index.json} (100%) rename huggingface/{fts => query_based}/.env.sample (100%) rename huggingface/{gsi => query_based}/frontmatter.md (100%) rename huggingface/{gsi => query_based}/hugging_face.ipynb (90%) rename huggingface/{gsi => search_based}/.env.sample (100%) rename huggingface/{fts => 
search_based}/frontmatter.md (100%) rename huggingface/{fts => search_based}/hugging_face.ipynb (93%) rename huggingface/{fts => search_based}/huggingface_index.json (100%) delete mode 100644 jinaai/fts/RAG_with_Couchbase_and_Jina_AI.ipynb rename jinaai/{gsi => query_based}/.env.sample (100%) rename jinaai/{gsi => query_based}/RAG_with_Couchbase_and_Jina_AI.ipynb (93%) rename jinaai/{gsi => query_based}/frontmatter.md (100%) rename jinaai/{fts => search_based}/.env.sample (100%) create mode 100644 jinaai/search_based/RAG_with_Couchbase_and_Jina_AI.ipynb rename jinaai/{fts => search_based}/frontmatter.md (100%) rename jinaai/{fts => search_based}/jina_index.json (100%) rename lamaindex/{gsi => query_based}/RAG_with_Couchbase_Capella_and_OpenAI.ipynb (98%) rename lamaindex/{gsi => query_based}/frontmatter.md (100%) rename lamaindex/{fts => search_based}/RAG_with_Couchbase_Capella_and_OpenAI.ipynb (99%) rename lamaindex/{fts => search_based}/frontmatter.md (100%) rename lamaindex/{fts/fts_index.json => search_based/search_vector_index.json} (100%) delete mode 100644 mistralai/gsi/mistralai.ipynb rename mistralai/{fts => query_based}/.env.sample (100%) rename mistralai/{gsi => query_based}/frontmatter.md (100%) create mode 100644 mistralai/query_based/mistralai.ipynb rename mistralai/{gsi => search_based}/.env.sample (100%) rename mistralai/{fts => search_based}/frontmatter.md (100%) rename mistralai/{fts => search_based}/mistralai.ipynb (92%) rename mistralai/{fts => search_based}/mistralai_index.json (100%) delete mode 100644 openrouter-deepseek/gsi/RAG_with_Couchbase_and_Openrouter_Deepseek.ipynb rename openrouter-deepseek/{gsi => query_based}/.env.sample (100%) create mode 100644 openrouter-deepseek/query_based/RAG_with_Couchbase_and_Openrouter_Deepseek.ipynb rename openrouter-deepseek/{gsi => query_based}/frontmatter.md (100%) rename openrouter-deepseek/{fts => search_based}/.env.sample (100%) rename openrouter-deepseek/{fts => search_based}/RAG_with_Couchbase_and_Openrouter_Deepseek.ipynb (89%) rename openrouter-deepseek/{fts => search_based}/deepseek_index.json (100%) rename openrouter-deepseek/{fts => search_based}/frontmatter.md (100%) delete mode 100644 pydantic_ai/fts/RAG_with_Couchbase_and_PydanticAI.ipynb rename pydantic_ai/{gsi => query_based}/.env.sample (100%) rename pydantic_ai/{gsi => query_based}/RAG_with_Couchbase_and_PydanticAI.ipynb (94%) rename pydantic_ai/{gsi => query_based}/frontmatter.md (100%) rename pydantic_ai/{fts => search_based}/.env.sample (100%) create mode 100644 pydantic_ai/search_based/RAG_with_Couchbase_and_PydanticAI.ipynb rename pydantic_ai/{fts => search_based}/frontmatter.md (100%) rename pydantic_ai/{fts => search_based}/pydantic_ai_index.json (100%) delete mode 100644 smolagents/fts/RAG_with_Couchbase_and_SmolAgents.ipynb delete mode 100644 smolagents/gsi/RAG_with_Couchbase_and_SmolAgents.ipynb rename smolagents/{gsi => query_based}/.env.sample (100%) create mode 100644 smolagents/query_based/RAG_with_Couchbase_and_SmolAgents.ipynb rename smolagents/{gsi => query_based}/frontmatter.md (100%) rename smolagents/{fts => search_based}/.env.sample (100%) create mode 100644 smolagents/search_based/RAG_with_Couchbase_and_SmolAgents.ipynb rename smolagents/{fts => search_based}/frontmatter.md (100%) rename smolagents/{fts => search_based}/smolagents_index.json (100%) diff --git a/README.md b/README.md index f152a8de..cd108036 100644 --- a/README.md +++ b/README.md @@ -3,8 +3,10 @@ This repository demonstrates how to build a powerful semantic search engine 
using Couchbase as the backend database, combined with various AI-powered embedding and language model providers such as OpenAI, Azure OpenAI, Anthropic (Claude), Cohere, Hugging Face, Jina AI, Mistral AI, and Voyage AI. Each example provides two distinct approaches: -- **FTS (Full Text Search)**: Uses Couchbase's vector search capabilities with pre-created search indices -- **GSI (Global Secondary Index)**: Leverages Couchbase's native SQL++ queries with vector similarity functions +- **Search Vector Index**: Uses Couchbase's Search-based vector search capabilities with pre-created search indices (in `search_based/` directories) +- **Hyperscale & Composite Vector Indexes**: Leverages Couchbase's native SQL++ queries with vector similarity functions (in `query_based/` directories) + +For guidance on choosing the right vector index for your use case, see the [Couchbase documentation](https://docs.couchbase.com/server/current/vector-search/choose-the-right-vector-index.html). Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it essential for applications that require intelligent information retrieval. @@ -33,11 +35,11 @@ Semantic search goes beyond simple keyword matching by understanding the context ### 2. Choose Your Approach: -#### For FTS (Full Text Search) Examples: -Use the provided `{model}_index.json` index definition file in each model's `fts/` directory to create a new vector search index in your Couchbase cluster. +#### For Search Vector Index Examples: +Use the provided `{model}_index.json` index definition file in each model's `search_based/` directory to create a new vector search index in your Couchbase cluster. -#### For GSI (Global Secondary Index) Examples: -No additional setup required. GSI index will be created in each model's example. +#### For Hyperscale & Composite Vector Index Examples: +No additional setup required. Vector indexes will be created in each model's example in the `query_based/` directory. ### 3. Run the notebook file @@ -75,9 +77,9 @@ Each notebook implements a semantic search function that performs similarity sea The system implements caching functionality using `CouchbaseCache` to improve performance for repeated queries. -## Couchbase Vector Search Index (FTS Approach Only) +## Couchbase Search Vector Index (Search-based Approach Only) -For FTS examples, you'll need to create a vector search index using the provided JSON configuration files. For more information on creating a vector search index, please follow the [instructions](https://docs.couchbase.com/cloud/vector-search/create-vector-search-index-ui.html). The following is an example for Azure OpenAI Model. +For Search Vector Index examples, you'll need to create a vector search index using the provided JSON configuration files. For more information on creating a vector search index, please follow the [instructions](https://docs.couchbase.com/cloud/vector-search/create-vector-search-index-ui.html). The following is an example for Azure OpenAI Model. 
```json { diff --git a/awsbedrock/gsi/RAG_with_Couchbase_and_Bedrock.ipynb b/awsbedrock/gsi/RAG_with_Couchbase_and_Bedrock.ipynb deleted file mode 100644 index f67fe449..00000000 --- a/awsbedrock/gsi/RAG_with_Couchbase_and_Bedrock.ipynb +++ /dev/null @@ -1,1077 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Introduction\n", - "\n", - "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database and [Amazon Bedrock](https://aws.amazon.com/bedrock/) as both the embedding and language model provider. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system using GSI (Global Secondary Index) from scratch. Alternatively, if you want to perform semantic search using the FTS index, please take a look at [this](https://developer.couchbase.com/tutorial-aws-bedrock-couchbase-rag-with-fts/)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# How to run this tutorial\n", - "\n", - "This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/awsbedrock/gsi/RAG_with_Couchbase_and_Bedrock.ipynb).\n", - "\n", - "You can either download the notebook file and run it on [Google Colab](https://colab.research.google.com/) or run it on your system by setting up the Python environment." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Before you start\n", - "\n", - "## Get Credentials for AWS Bedrock\n", - "* Please follow the [instructions](https://docs.aws.amazon.com/bedrock/latest/userguide/getting-started.html) to set up AWS Bedrock and generate credentials.\n", - "* Ensure you have the necessary IAM permissions to access Bedrock services.\n", - "\n", - "## Create and Deploy Your Free Tier Operational cluster on Capella\n", - "\n", - "To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. This account provides you with an environment where you can explore and learn about Capella with no time constraint.\n", - "\n", - "To know more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n", - "\n", - "Note: To run this tutorial, you will need Capella with Couchbase Server version 8.0 or above, as GSI search is supported only from version 8.0.\n", - "\n", - "### Couchbase Capella Configuration\n", - "\n", - "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met.\n", - "\n", - "* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the bucket (Read and Write) used in the application.\n", - "* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running."
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Setting the Stage: Installing Necessary Libraries\n", - "\n", - "To build our semantic search engine, we need a robust set of tools. The libraries we install handle everything from connecting to databases to performing complex machine learning tasks." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.3.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.2\u001b[0m\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n", - "Note: you may need to restart the kernel to use updated packages.\n" - ] - } - ], - "source": [ - "%pip install --quiet datasets==3.5.0 langchain-couchbase==0.5.0 langchain-aws boto3==1.37.35 python-dotenv==1.1.0\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Importing Necessary Libraries\n", - "\n", - "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, embedding generation, and dataset loading." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import getpass\n", - "import json\n", - "import logging\n", - "import os\n", - "import time\n", - "from datetime import timedelta\n", - "\n", - "import boto3\n", - "from couchbase.auth import PasswordAuthenticator\n", - "from couchbase.cluster import Cluster\n", - "from couchbase.exceptions import (CouchbaseException,\n", - " InternalServerFailureException)\n", - "from couchbase.management.buckets import CreateBucketSettings\n", - "from couchbase.options import ClusterOptions\n", - "from datasets import load_dataset\n", - "from dotenv import load_dotenv\n", - "from langchain_aws import BedrockEmbeddings, ChatBedrock\n", - "from langchain_core.globals import set_llm_cache\n", - "from langchain_core.output_parsers import StrOutputParser\n", - "from langchain_core.prompts.chat import ChatPromptTemplate\n", - "from langchain_core.runnables import RunnablePassthrough\n", - "from langchain_couchbase.cache import CouchbaseCache\n", - "from langchain_couchbase.vectorstores import CouchbaseQueryVectorStore\n", - "from langchain_couchbase.vectorstores import DistanceStrategy\n", - "from tqdm import tqdm" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Setup Logging\n", - "\n", - "Logging is configured to track the progress of the script and capture any errors or warnings." - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [], - "source": [ - "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', force=True)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Loading Sensitive Information\n", - "In this section, we prompt the user to input essential configuration settings needed. These settings include sensitive information like AWS credentials, database credentials, and specific configuration names. 
Instead of hardcoding these details into the script, we request the user to provide them at runtime, ensuring flexibility and security.\n", - "\n", - "The project includes an `.env.sample` file that lists all the environment variables. To get started:\n", - "\n", - "1. Create a `.env` file in the same directory as this notebook\n", - "2. Copy the contents from `.env.sample` to your `.env` file\n", - "3. Fill in the required credentials\n", - "\n", - "The script also validates that all required inputs are provided, raising an error if any crucial information is missing. This approach ensures that your integration is both secure and correctly configured without hardcoding sensitive information, enhancing the overall security and maintainability of your code." - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "# Load environment variables from .env file if it exists\n", - "load_dotenv(override=True)\n", - "\n", - "# AWS Credentials\n", - "AWS_ACCESS_KEY_ID = os.getenv('AWS_ACCESS_KEY_ID') or input('Enter your AWS Access Key ID: ')\n", - "AWS_SECRET_ACCESS_KEY = os.getenv('AWS_SECRET_ACCESS_KEY') or getpass.getpass('Enter your AWS Secret Access Key: ')\n", - "AWS_REGION = os.getenv('AWS_REGION') or input('Enter your AWS region (default: us-east-1): ') or 'us-east-1'\n", - "\n", - "# Couchbase Settings\n", - "CB_HOST = os.getenv('CB_HOST') or input('Enter your Couchbase host (default: couchbase://localhost): ') or 'couchbase://localhost'\n", - "CB_USERNAME = os.getenv('CB_USERNAME') or input('Enter your Couchbase username (default: Administrator): ') or 'Administrator'\n", - "CB_PASSWORD = os.getenv('CB_PASSWORD') or getpass.getpass('Enter your Couchbase password (default: password): ') or 'password'\n", - "CB_BUCKET_NAME = os.getenv('CB_BUCKET_NAME') or input('Enter your Couchbase bucket name (default: query-vector-search-testing): ') or 'query-vector-search-testing'\n", - "SCOPE_NAME = os.getenv('SCOPE_NAME') or input('Enter your scope name (default: shared): ') or 'shared'\n", - "COLLECTION_NAME = os.getenv('COLLECTION_NAME') or input('Enter your collection name (default: bedrock): ') or 'bedrock'\n", - "CACHE_COLLECTION = os.getenv('CACHE_COLLECTION') or input('Enter your cache collection name (default: cache): ') or 'cache'\n", - "\n", - "# Check if required credentials are set\n", - "for cred_name, cred_value in {\n", - " 'AWS_ACCESS_KEY_ID': AWS_ACCESS_KEY_ID,\n", - " 'AWS_SECRET_ACCESS_KEY': AWS_SECRET_ACCESS_KEY, \n", - " 'CB_HOST': CB_HOST,\n", - " 'CB_USERNAME': CB_USERNAME,\n", - " 'CB_PASSWORD': CB_PASSWORD,\n", - " 'CB_BUCKET_NAME': CB_BUCKET_NAME\n", - "}.items():\n", - " if not cred_value:\n", - " raise ValueError(f\"{cred_name} is not set\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Connecting to the Couchbase Cluster\n", - "Connecting to a Couchbase cluster is the foundation of our project. Couchbase will serve as our primary data store, handling all the storage and retrieval operations required for our semantic search engine. By establishing this connection, we enable our application to interact with the database, allowing us to perform operations such as storing embeddings, querying data, and managing collections. 
This connection is the gateway through which all data will flow, so ensuring it's set up correctly is paramount.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-02 12:21:07,348 - INFO - Successfully connected to Couchbase\n" - ] - } - ], - "source": [ - "try:\n", - " auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)\n", - " options = ClusterOptions(auth)\n", - " cluster = Cluster(CB_HOST, options)\n", - " cluster.wait_until_ready(timedelta(seconds=5))\n", - " logging.info(\"Successfully connected to Couchbase\")\n", - "except Exception as e:\n", - " raise ConnectionError(f\"Failed to connect to Couchbase: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Setting Up Collections in Couchbase\n", - "\n", - "The setup_collection() function handles creating and configuring the hierarchical data organization in Couchbase:\n", - "\n", - "1. Bucket Creation:\n", - " - Checks if specified bucket exists, creates it if not\n", - " - Sets bucket properties like RAM quota (1024MB) and replication (disabled)\n", - " - Note: You will not be able to create a bucket on Capella\n", - "\n", - "2. Scope Management: \n", - " - Verifies if requested scope exists within bucket\n", - " - Creates new scope if needed (unless it's the default \"_default\" scope)\n", - "\n", - "3. Collection Setup:\n", - " - Checks for collection existence within scope\n", - " - Creates collection if it doesn't exist\n", - " - Waits 2 seconds for collection to be ready\n", - "\n", - "Additional Tasks:\n", - "- Clears any existing documents for clean state\n", - "- Implements comprehensive error handling and logging\n", - "\n", - "The function is called twice to set up:\n", - "1. Main collection for vector embeddings\n", - "2. Cache collection for storing results\n" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-08-29 13:03:42,591 - INFO - Bucket 'query-vector-search-testing' does not exist. Creating it...\n", - "2025-08-29 13:03:44,657 - INFO - Bucket 'query-vector-search-testing' created successfully.\n", - "2025-08-29 13:03:44,663 - INFO - Scope 'shared' does not exist. Creating it...\n", - "2025-08-29 13:03:44,704 - INFO - Scope 'shared' created successfully.\n", - "2025-08-29 13:03:44,714 - INFO - Collection 'bedrock' does not exist. Creating it...\n", - "2025-08-29 13:03:44,770 - INFO - Collection 'bedrock' created successfully.\n", - "2025-08-29 13:03:46,953 - INFO - All documents cleared from the collection.\n", - "2025-08-29 13:03:46,954 - INFO - Bucket 'query-vector-search-testing' exists.\n", - "2025-08-29 13:03:46,969 - INFO - Collection 'cache' does not exist. 
Creating it...\n", - "2025-08-29 13:03:47,025 - INFO - Collection 'cache' created successfully.\n", - "2025-08-29 13:03:49,183 - INFO - All documents cleared from the collection.\n" - ] - }, - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "def setup_collection(cluster, bucket_name, scope_name, collection_name):\n", - " try:\n", - " # Check if bucket exists, create if it doesn't\n", - " try:\n", - " bucket = cluster.bucket(bucket_name)\n", - " logging.info(f\"Bucket '{bucket_name}' exists.\")\n", - " except Exception as e:\n", - " logging.info(f\"Bucket '{bucket_name}' does not exist. Creating it...\")\n", - " bucket_settings = CreateBucketSettings(\n", - " name=bucket_name,\n", - " bucket_type='couchbase',\n", - " ram_quota_mb=1024,\n", - " flush_enabled=True,\n", - " num_replicas=0\n", - " )\n", - " cluster.buckets().create_bucket(bucket_settings)\n", - " time.sleep(2) # Wait for bucket creation to complete and become available\n", - " bucket = cluster.bucket(bucket_name)\n", - " logging.info(f\"Bucket '{bucket_name}' created successfully.\")\n", - "\n", - " bucket_manager = bucket.collections()\n", - "\n", - " # Check if scope exists, create if it doesn't\n", - " scopes = bucket_manager.get_all_scopes()\n", - " scope_exists = any(scope.name == scope_name for scope in scopes)\n", - " \n", - " if not scope_exists and scope_name != \"_default\":\n", - " logging.info(f\"Scope '{scope_name}' does not exist. Creating it...\")\n", - " bucket_manager.create_scope(scope_name)\n", - " logging.info(f\"Scope '{scope_name}' created successfully.\")\n", - "\n", - " # Check if collection exists, create if it doesn't\n", - " collections = bucket_manager.get_all_scopes()\n", - " collection_exists = any(\n", - " scope.name == scope_name and collection_name in [col.name for col in scope.collections]\n", - " for scope in collections\n", - " )\n", - "\n", - " if not collection_exists:\n", - " logging.info(f\"Collection '{collection_name}' does not exist. Creating it...\")\n", - " bucket_manager.create_collection(scope_name, collection_name)\n", - " logging.info(f\"Collection '{collection_name}' created successfully.\")\n", - " else:\n", - " logging.info(f\"Collection '{collection_name}' already exists. Skipping creation.\")\n", - "\n", - " # Wait for collection to be ready\n", - " collection = bucket.scope(scope_name).collection(collection_name)\n", - " time.sleep(2) # Give the collection time to be ready for queries\n", - "\n", - " # Clear all documents in the collection\n", - " try:\n", - " query = f\"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`\"\n", - " cluster.query(query).execute()\n", - " logging.info(\"All documents cleared from the collection.\")\n", - " except Exception as e:\n", - " logging.warning(f\"Error while clearing documents: {str(e)}. The collection might be empty.\")\n", - "\n", - " return collection\n", - " except Exception as e:\n", - " raise RuntimeError(f\"Error setting up collection: {str(e)}\")\n", - " \n", - "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME)\n", - "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, CACHE_COLLECTION)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Creating Amazon Bedrock Client and Embeddings\n", - "\n", - "Embeddings are at the heart of semantic search. They are numerical representations of text that capture the semantic meaning of the words and phrases. 
We'll use Amazon Bedrock's Titan embedding model to generate them.\n", - "\n", - "## Using Amazon Bedrock's Titan Model\n", - "\n", - "Language models are AI systems that are trained to understand and generate human language. We'll be using Amazon Bedrock's Titan model to process user queries and generate meaningful responses. The Titan model family includes both embedding models for converting text into vector representations and text generation models for producing human-like responses.\n", - "\n", - "Key features of Amazon Bedrock's Titan models:\n", - "- Titan Embeddings model for embedding vector generation\n", - "- Titan Text model for natural language understanding and generation\n", - "- Seamless integration with AWS infrastructure\n", - "- Enterprise-grade security and scalability" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-02 12:21:15,663 - INFO - Successfully created Bedrock embeddings client\n" - ] - } - ], - "source": [ - "try:\n", - " bedrock_client = boto3.client(\n", - " service_name='bedrock-runtime',\n", - " region_name=AWS_REGION,\n", - " aws_access_key_id=AWS_ACCESS_KEY_ID,\n", - " aws_secret_access_key=AWS_SECRET_ACCESS_KEY\n", - " )\n", - " \n", - " embeddings = BedrockEmbeddings(\n", - " client=bedrock_client,\n", - " model_id=\"amazon.titan-embed-text-v2:0\"\n", - " )\n", - " logging.info(\"Successfully created Bedrock embeddings client\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Error creating Bedrock embeddings client: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Setting Up the Couchbase Query Vector Store\n", - "A vector store is where we'll keep our embeddings. The query vector store is specifically designed to handle embeddings and perform similarity searches. When a user inputs a query, the query is converted into an embedding and compared against the embeddings stored in the vector store. This allows the engine to find documents that are semantically similar to the query, even if they don't contain the exact same words. By setting up the vector store in Couchbase, we create a powerful tool that enables us to understand and retrieve information based on the meaning and context of the query, rather than just the specific words used.\n", - "\n", - "The vector store requires a distance metric to determine how similarity between vectors is calculated. This is crucial for accurate semantic search results as different distance metrics can yield different similarity rankings. Some of the supported distance strategies are dot, l2, euclidean, cosine, l2_squared, and euclidean_squared. In our implementation we will use cosine, which is particularly effective for text embeddings."
- ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-02 12:22:15,979 - INFO - Successfully created vector store\n" - ] - } - ], - "source": [ - "try:\n", - " vector_store = CouchbaseQueryVectorStore(\n", - " cluster=cluster,\n", - " bucket_name=CB_BUCKET_NAME,\n", - " scope_name=SCOPE_NAME,\n", - " collection_name=COLLECTION_NAME,\n", - " embedding = embeddings,\n", - " distance_metric=DistanceStrategy.COSINE\n", - " )\n", - " logging.info(\"Successfully created vector store\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Failed to create vector store: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Load the BBC News Dataset\n", - "To build a search engine, we need data to search through. We use the BBC News dataset from RealTimeData, which provides real-world news articles. This dataset contains news articles from BBC covering various topics and time periods. Loading the dataset is a crucial step because it provides the raw material that our search engine will work with. The quality and diversity of the news articles make it an excellent choice for testing and refining our search engine, ensuring it can handle real-world news content effectively.\n", - "\n", - "The BBC News dataset allows us to work with authentic news articles, enabling us to build and test a search engine that can effectively process and retrieve relevant news content. The dataset is loaded using the Hugging Face datasets library, specifically accessing the \"RealTimeData/bbc_news_alltime\" dataset with the \"2024-12\" version." - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-02 12:21:31,880 - INFO - Successfully loaded the BBC News dataset with 2687 rows.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Loaded the BBC News dataset with 2687 rows\n" - ] - } - ], - "source": [ - "try:\n", - " news_dataset = load_dataset(\n", - " \"RealTimeData/bbc_news_alltime\", \"2024-12\", split=\"train\"\n", - " )\n", - " print(f\"Loaded the BBC News dataset with {len(news_dataset)} rows\")\n", - " logging.info(f\"Successfully loaded the BBC News dataset with {len(news_dataset)} rows.\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Error loading the BBC News dataset: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Cleaning up the Data\n", - "We will use the content of the news articles for our RAG system.\n", - "\n", - "The dataset contains a few duplicate records. We are removing them to avoid duplicate results in the retrieval stage of our RAG system." 
- ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "We have 1749 unique articles in our database.\n" - ] - } - ], - "source": [ - "news_articles = news_dataset[\"content\"]\n", - "unique_articles = set()\n", - "for article in news_articles:\n", - " if article:\n", - " unique_articles.add(article)\n", - "unique_news_articles = list(unique_articles)\n", - "print(f\"We have {len(unique_news_articles)} unique articles in our database.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Saving Data to the Vector Store\n", - "To efficiently handle the large number of articles, we process them in batches of 50 articles at a time. This batch processing approach helps manage memory usage and provides better control over the ingestion process.\n", - "\n", - "We first filter out any articles that exceed 50,000 characters to avoid potential issues with token limits. Then, using the vector store's add_texts method, we add the filtered articles to our vector database. The batch_size parameter controls how many articles are processed in each iteration.\n", - "\n", - "This approach offers several benefits:\n", - "1. Memory Efficiency: Processing in smaller batches prevents memory overload\n", - "2. Error Handling: If an error occurs, only the current batch is affected\n", - "3. Progress Tracking: Easier to monitor and track the ingestion progress\n", - "4. Resource Management: Better control over CPU and network resource utilization\n", - "\n", - "We use a conservative batch size of 50 to ensure reliable operation.\n", - "The optimal batch size depends on many factors including:\n", - "- Document sizes being inserted\n", - "- Available system resources\n", - "- Network conditions\n", - "- Concurrent workload\n", - "\n", - "Consider measuring performance with your specific workload before adjusting.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 47, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-08-20 14:05:53,302 - INFO - Document ingestion completed successfully.\n" - ] - } - ], - "source": [ - "batch_size = 50\n", - "\n", - "# Automatic Batch Processing\n", - "articles = [article for article in unique_news_articles if article and len(article) <= 50000]\n", - "\n", - "try:\n", - " vector_store.add_texts(\n", - " texts=articles,\n", - " batch_size=batch_size\n", - " )\n", - " logging.info(\"Document ingestion completed successfully.\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Failed to save documents to vector store: {str(e)}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Setting Up a Couchbase Cache\n", - "To further optimize our system, we set up a Couchbase-based cache. A cache is a temporary storage layer that holds data that is frequently accessed, speeding up operations by reducing the need to repeatedly retrieve the same information from the database. In our setup, the cache will help us accelerate repetitive tasks, such as looking up similar documents. By implementing a cache, we enhance the overall performance of our search engine, ensuring that it can handle high query volumes and deliver results quickly.\n", - "\n", - "Caching is particularly valuable in scenarios where users may submit similar queries multiple times or where certain pieces of information are frequently requested. 
By storing these in a cache, we can significantly reduce the time it takes to respond to these queries, improving the user experience.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-02 12:22:20,978 - INFO - Successfully created cache\n" - ] - } - ], - "source": [ - "try:\n", - " cache = CouchbaseCache(\n", - " cluster=cluster,\n", - " bucket_name=CB_BUCKET_NAME,\n", - " scope_name=SCOPE_NAME,\n", - " collection_name=CACHE_COLLECTION,\n", - " )\n", - " logging.info(\"Successfully created cache\")\n", - " set_llm_cache(cache)\n", - "except Exception as e:\n", - " raise ValueError(f\"Failed to create cache: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Using Amazon Bedrock's Titan Text Express v1 Model\n", - "\n", - "Amazon Bedrock's Titan Text Express v1 is a state-of-the-art foundation model designed for fast and efficient text generation tasks. This model excels at:\n", - "\n", - "- Text generation and completion\n", - "- Question answering \n", - "- Summarization\n", - "- Content rewriting\n", - "- Analysis and extraction\n", - "\n", - "Key features of Titan Text Express v1:\n", - "\n", - "- Optimized for low-latency responses while maintaining high quality output\n", - "- Supports up to 8K tokens context window\n", - "- Built-in content filtering and safety controls\n", - "- Cost-effective compared to larger models\n", - "- Seamlessly integrates with AWS services\n", - "\n", - "The model uses a temperature parameter (0-1) to control randomness in responses:\n", - "- Lower values (e.g. 0) produce more focused, deterministic outputs\n", - "- Higher values introduce more creativity and variation\n", - "\n", - "We'll be using this model through Amazon Bedrock's API to process user queries and generate contextually relevant responses based on our vector database content." - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-02 12:22:24,513 - INFO - Successfully created Bedrock LLM client\n" - ] - } - ], - "source": [ - "try:\n", - " llm = ChatBedrock(\n", - " client=bedrock_client,\n", - " model_id=\"amazon.titan-text-express-v1\",\n", - " model_kwargs={\"temperature\": 0}\n", - " )\n", - " logging.info(\"Successfully created Bedrock LLM client\")\n", - "except Exception as e:\n", - " logging.error(f\"Error creating Bedrock LLM client: {str(e)}. Please check your AWS credentials and Bedrock access.\")\n", - " raise" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Perform Semantic Search\n", - "Semantic search in Couchbase involves converting queries and documents into vector representations using an embeddings model. These vectors capture the semantic meaning of the text and are stored directly in Couchbase. When a query is made, Couchbase performs a similarity search by comparing the query vector against the stored document vectors. The similarity metric used for this comparison is configurable, allowing flexibility in how the relevance of documents is determined. Common metrics include cosine similarity, Euclidean distance, or dot product, but other metrics can be implemented based on specific use cases. 
Different embedding models like BERT, Word2Vec, or GloVe can also be used depending on the application's needs, with the vectors generated by these models stored and searched within Couchbase itself.\n", - "\n", - "In the provided code, the search process begins by recording the start time, followed by executing the `similarity_search_with_score` method of the `CouchbaseQueryVectorStore`. This method searches Couchbase for the most relevant documents based on the vector similarity to the query. The search results include the document content and the distance that reflects how closely each document aligns with the query in the defined semantic space. The time taken to perform this search is then calculated and logged, and the results are displayed, showing the most relevant documents along with their similarity scores. This approach leverages Couchbase as both a storage and retrieval engine for vector data, enabling efficient and scalable semantic searches. The integration of vector storage and search capabilities within Couchbase allows for sophisticated semantic search operations without relying on external services for vector storage or comparison." - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-02 12:23:51,477 - INFO - Semantic search completed in 1.29 seconds\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Semantic Search Results (completed in 1.29 seconds):\n", - "--------------------------------------------------------------------------------\n", - "Distance: 0.3512, Text: Luke Littler has risen from 164th to fourth in the rankings in a year\n", - "\n", - "A tearful Luke Littler hit a tournament record 140.91 set average as he started his bid for the PDC World Championship title with a dramatic 3-1 win over Ryan Meikle. The 17-year-old made headlines around the world when he reached the tournament final in January, where he lost to Luke Humphries. Starting this campaign on Saturday, Littler was millimetres away from a nine-darter when he missed double 12 as he blew Meikle away in the fourth and final set of the second-round match. Littler was overcome with emotion at the end, cutting short his on-stage interview. \"It was probably the toughest game I've ever played. I had to fight until the end,\" he said later in a news conference. \"As soon as the question came on stage and then boom, the tears came. It was just a bit too much to speak on stage. \"It is the worst game I have played. I have never felt anything like that tonight.\" Admitting to nerves during the match, he told Sky Sports: \"Yes, probably the biggest time it's hit me. Coming into it I was fine, but as soon as [referee] George Noble said 'game on', I couldn't throw them.\" Littler started slowly against Meikle, who had two darts for the opening set, but he took the lead by twice hitting double 20. Meikle did not look overawed against his fellow Englishman and levelled, but Littler won the third set and exploded into life in the fourth. The tournament favourite hit four maximum 180s as he clinched three straight legs in 11, 10 and 11 darts for a record set average, and 100.85 overall. Meanwhile, two seeds crashed out on Saturday night – five-time world champion Raymond van Barneveld lost to Welshman Nick Kenny, while England's Ryan Joyce beat Danny Noppert. 
Australian Damon Heta was another to narrowly miss out on a nine-darter, just failing on double 12 when throwing for the match in a 3-1 win over Connor Scutt. Ninth seed Heta hit four 100-plus checkouts to come from a set down against Scutt in a match in which both men averaged more than 97.\n", - "\n", - "Littler was hugged by his parents after victory over Meikle\n", - "\n", - "Littler returned to Alexandra Palace to a boisterous reception from more than 3,000 spectators and delivered an astonishing display in the fourth set. He was on for a nine-darter after his opening two throws in both of the first two legs and completed the set in 32 darts - the minimum possible is 27. The teenager will next play after Christmas against European Championship winner Ritchie Edhouse, the 29th seed, or Ian White, and is seeded to meet Humphries in the semi-finals. Having entered last year's event ranked 164th, Littler is up to fourth in the world and will go to number two if he reaches the final again this time. He has won 10 titles in his debut professional year, including the Premier League and Grand Slam of Darts. After reaching the World Championship final as a debutant aged just 16, Littler's life has been transformed and interest in darts has rocketed. Google say he was the most searched-for athlete online in the UK during 2024. This Christmas, more than 100,000 children are expected to be opening Littler-branded magnetic dartboards as presents. His impact has helped double the number of junior academies and has prompted plans to expand the World Championship. Littler was named BBC Young Sports Personality of the Year on Tuesday and was runner-up to athlete Keely Hodgkinson for the main award.\n", - "\n", - "Nick Kenny will play world champion Luke Humphries in round three after Christmas\n", - "\n", - "Barneveld was shocked 3-1 by world number 76 Kenny, who was in tears after a famous victory. Kenny, 32, will face Humphries in round three after defeating the Dutchman, who won the BDO world title four times and the PDC crown in 2007. Van Barneveld, ranked 32nd, became the sixth seed to exit in the second round. His compatriot Noppert, the 13th seed, was stunned 3-1 by Joyce, who will face Ryan Searle or Matt Campbell next, with the winner of that tie potentially meeting Littler in the last 16. Elsewhere, 15th seed Chris Dobey booked his place in the third round with a 3-1 win over Alexander Merkx. Englishman Dobey concluded an afternoon session which started with a trio of 3-0 scorelines. Northern Ireland's Brendan Dolan beat Lok Yin Lee to set up a meeting with three-time champion Michael van Gerwen after Christmas. In the final two first-round matches of the 2025 competition, Wales' Rhys Griffin beat Karel Sedlacek of the Czech Republic before Asia number one Alexis Toylo cruised past Richard Veenstra.\n", - "--------------------------------------------------------------------------------\n", - "Distance: 0.4124, Text: The Littler effect - how darts hit the bullseye\n", - "\n", - "Teenager Luke Littler began his bid to win the 2025 PDC World Darts Championship with a second-round win against Ryan Meikle. Here we assess Littler's impact after a remarkable rise which saw him named BBC Young Sports Personality of the Year and runner-up in the main award to athlete Keely Hodgkinson.\n", - "\n", - "One year ago, he was barely a household name in his own home. Now he is a sporting phenomenon. 
After emerging from obscurity aged 16 to reach the World Championship final, the life of Luke Littler and the sport he loves has been transformed. Viewing figures, ticket sales and social media interest have rocketed. Darts has hit the bullseye. This Christmas more than 100,000 children are expected to be opening Littler-branded magnetic dartboards as presents. His impact has helped double the number of junior academies, prompted plans to expand the World Championship and generated interest in darts from Saudi Arabian backers.\n", - "\n", - "Just months after taking his GCSE exams and ranked 164th in the world, Littler beat former champions Raymond van Barneveld and Rob Cross en route to the PDC World Championship final in January, before his run ended with a 7-4 loss to Luke Humphries. With his nickname 'The Nuke' on his purple and yellow shirt and the Alexandra Palace crowd belting out his walk-on song, Pitbull's tune Greenlight, he became an instant hit. Electric on the stage, calm off it. The down-to-earth teenager celebrated with a kebab and computer games. \"We've been watching his progress since he was about seven. He was on our radar, but we never anticipated what would happen. The next thing we know 'Littlermania' is spreading everywhere,\" PDC president Barry Hearn told BBC Sport. A peak TV audience of 3.7 million people watched the final - easily Sky's biggest figure for a non-football sporting event. The teenager from Warrington in Cheshire was too young to legally drive or drink alcohol, but earned £200,000 for finishing second - part of £1m prize money in his first year as a professional - and an invitation to the elite Premier League competition. He turned 17 later in January but was he too young for the demanding event over 17 Thursday nights in 17 locations? He ended up winning the whole thing, and hit a nine-dart finish against Humphries in the final. From Bahrain to Wolverhampton, Littler claimed 10 titles in 2024 and is now eyeing the World Championship.\n", - "\n", - "As he progressed at the Ally Pally, the Manchester United fan was sent a good luck message by the club's former midfielder and ex-England captain David Beckham. In 12 months, Littler's Instagram followers have risen from 4,000 to 1.3m. Commercial backers include a clothing range, cereal firm and train company and he will appear in a reboot of the TV darts show Bullseye. Google say he was the most searched-for athlete online in the UK during 2024. On the back of his success, Littler darts, boards, cabinets, shirts are being snapped up in big numbers. \"This Christmas the junior magnetic dartboard is selling out, we're talking over 100,000. They're 20 quid and a great introduction for young children,\" said Garry Plummer, the boss of sponsors Target Darts, who first signed a deal with Littler's family when he was aged 12. \"All the toy shops want it, they all want him - 17, clean, doesn't drink, wonderful.\"\n", - "\n", - "\n", - "... 
(output truncated for brevity)\n" - ] - } - ], - "source": [ - "query = \"What were Luke Littler's key achievements and records in his recent PDC World Championship match?\"\n", - "\n", - "try:\n", - " # Perform the semantic search\n", - " start_time = time.time()\n", - " search_results = vector_store.similarity_search_with_score(query, k=10)\n", - " search_elapsed_time = time.time() - start_time\n", - "\n", - " logging.info(f\"Semantic search completed in {search_elapsed_time:.2f} seconds\")\n", - "\n", - " # Display search results\n", - " print(f\"\\nSemantic Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", - " print(\"-\" * 80)\n", - "\n", - " for doc, score in search_results:\n", - " print(f\"Distance: {score:.4f}, Text: {doc.page_content}\")\n", - " print(\"-\" * 80)\n", - "\n", - "except CouchbaseException as e:\n", - " raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n", - "except Exception as e:\n", - " raise RuntimeError(f\"Unexpected error: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Optimizing Vector Search with Global Secondary Index (GSI)\n", - "\n", - "While the above semantic search using similarity_search_with_score works effectively, we can significantly improve query performance by leveraging Global Secondary Index (GSI) in Couchbase.\n", - "\n", - "Couchbase offers three types of vector indexes, but for GSI-based vector search we focus on two main types:\n", - "\n", - "Hyperscale Vector Indexes (BHIVE)\n", - "- Best for pure vector searches - content discovery, recommendations, semantic search\n", - "- High performance with low memory footprint - designed to scale to billions of vectors\n", - "- Optimized for concurrent operations - supports simultaneous searches and inserts\n", - "- Use when: You primarily perform vector-only queries without complex scalar filtering\n", - "- Ideal for: Large-scale semantic search, recommendation systems, content discovery\n", - "\n", - "Composite Vector Indexes \n", - "- Best for filtered vector searches - combines vector search with scalar value filtering\n", - "- Efficient pre-filtering - scalar attributes reduce the vector comparison scope\n", - "- Use when: Your queries combine vector similarity with scalar filters that eliminate large portions of data\n", - "- Ideal for: Compliance-based filtering, user-specific searches, time-bounded queries\n", - "\n", - "Choosing the Right Index Type\n", - "- Start with Hyperscale Vector Index for pure vector searches and large datasets\n", - "- Use Composite Vector Index when scalar filters significantly reduce your search space\n", - "- Consider your dataset size: Hyperscale scales to billions, Composite works well for tens of millions to billions\n", - "\n", - "For more details, see the [Couchbase Vector Index documentation](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html).\n", - "\n", - "\n", - "## Understanding Index Configuration (Couchbase 8.0 Feature)\n", - "\n", - "The index_description parameter controls how Couchbase optimizes vector storage and search performance through centroids and quantization:\n", - "\n", - "Format: `'IVF[<centroids>],{PQ|SQ}'`\n", - "\n", - "Centroids (IVF - Inverted File):\n", - "- Controls how the dataset is subdivided for faster searches\n", - "- More centroids = faster search, slower training \n", - "- Fewer centroids = slower search, faster training\n", - "- If omitted (like IVF,SQ8), Couchbase auto-selects based on dataset size\n", - "\n", - "Quantization
Options:\n", - "- SQ (Scalar Quantization): SQ4, SQ6, SQ8 (4, 6, or 8 bits per dimension)\n", - "- PQ (Product Quantization): PQx (e.g., PQ32x8)\n", - "- Higher values = better accuracy, larger index size\n", - "\n", - "Common Examples:\n", - "- IVF,SQ8 - Auto centroids, 8-bit scalar quantization (good default)\n", - "- IVF1000,SQ6 - 1000 centroids, 6-bit scalar quantization \n", - "- IVF,PQ32x8 - Auto centroids, 32 subquantizers with 8 bits\n", - "\n", - "For detailed configuration options, see the [Quantization & Centroid Settings](https://docs.couchbase.com/cloud/vector-index/hyperscale-vector-index.html#algo_settings).\n", - "\n", - "In the code below, we demonstrate creating a BHIVE index. This method takes an index type (BHIVE or COMPOSITE) and description parameter for optimization settings. Alternatively, GSI indexes can be created manually from the Couchbase UI." - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "metadata": {}, - "outputs": [], - "source": [ - "from langchain_couchbase.vectorstores import IndexType\n", - "vector_store.create_index(index_type=IndexType.BHIVE, index_name=\"bedrock_bhive_index\",index_description=\"IVF,SQ8\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The example below shows running the same similarity search, but now using the BHIVE GSI index we created above. You'll notice improved performance as the index efficiently retrieves data.\n", - "\n", - "**Important**: When using Composite indexes, scalar filters take precedence over vector similarity, which can improve performance for filtered searches but may miss some semantically relevant results that don't match the scalar criteria.\n", - "\n", - "Note: In GSI vector search, the distance represents the vector distance between the query and document embeddings. Lower distance indicate higher similarity, while higher distance indicate lower similarity." - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-02 12:24:54,503 - INFO - Semantic search completed in 0.36 seconds\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Semantic Search Results (completed in 0.36 seconds):\n", - "--------------------------------------------------------------------------------\n", - "Distance: 0.3512, Text: Luke Littler has risen from 164th to fourth in the rankings in a year\n", - "\n", - "A tearful Luke Littler hit a tournament record 140.91 set average as he started his bid for the PDC World Championship title with a dramatic 3-1 win over Ryan Meikle. The 17-year-old made headlines around the world when he reached the tournament final in January, where he lost to Luke Humphries. Starting this campaign on Saturday, Littler was millimetres away from a nine-darter when he missed double 12 as he blew Meikle away in the fourth and final set of the second-round match. Littler was overcome with emotion at the end, cutting short his on-stage interview. \"It was probably the toughest game I've ever played. I had to fight until the end,\" he said later in a news conference. \"As soon as the question came on stage and then boom, the tears came. It was just a bit too much to speak on stage. \"It is the worst game I have played. I have never felt anything like that tonight.\" Admitting to nerves during the match, he told Sky Sports: \"Yes, probably the biggest time it's hit me. 
Coming into it I was fine, but as soon as [referee] George Noble said 'game on', I couldn't throw them.\" Littler started slowly against Meikle, who had two darts for the opening set, but he took the lead by twice hitting double 20. Meikle did not look overawed against his fellow Englishman and levelled, but Littler won the third set and exploded into life in the fourth. The tournament favourite hit four maximum 180s as he clinched three straight legs in 11, 10 and 11 darts for a record set average, and 100.85 overall. Meanwhile, two seeds crashed out on Saturday night – five-time world champion Raymond van Barneveld lost to Welshman Nick Kenny, while England's Ryan Joyce beat Danny Noppert. Australian Damon Heta was another to narrowly miss out on a nine-darter, just failing on double 12 when throwing for the match in a 3-1 win over Connor Scutt. Ninth seed Heta hit four 100-plus checkouts to come from a set down against Scutt in a match in which both men averaged more than 97.\n", - "\n", - "Littler was hugged by his parents after victory over Meikle\n", - "\n", - "Littler returned to Alexandra Palace to a boisterous reception from more than 3,000 spectators and delivered an astonishing display in the fourth set. He was on for a nine-darter after his opening two throws in both of the first two legs and completed the set in 32 darts - the minimum possible is 27. The teenager will next play after Christmas against European Championship winner Ritchie Edhouse, the 29th seed, or Ian White, and is seeded to meet Humphries in the semi-finals. Having entered last year's event ranked 164th, Littler is up to fourth in the world and will go to number two if he reaches the final again this time. He has won 10 titles in his debut professional year, including the Premier League and Grand Slam of Darts. After reaching the World Championship final as a debutant aged just 16, Littler's life has been transformed and interest in darts has rocketed. Google say he was the most searched-for athlete online in the UK during 2024. This Christmas, more than 100,000 children are expected to be opening Littler-branded magnetic dartboards as presents. His impact has helped double the number of junior academies and has prompted plans to expand the World Championship. Littler was named BBC Young Sports Personality of the Year on Tuesday and was runner-up to athlete Keely Hodgkinson for the main award.\n", - "\n", - "Nick Kenny will play world champion Luke Humphries in round three after Christmas\n", - "\n", - "Barneveld was shocked 3-1 by world number 76 Kenny, who was in tears after a famous victory. Kenny, 32, will face Humphries in round three after defeating the Dutchman, who won the BDO world title four times and the PDC crown in 2007. Van Barneveld, ranked 32nd, became the sixth seed to exit in the second round. His compatriot Noppert, the 13th seed, was stunned 3-1 by Joyce, who will face Ryan Searle or Matt Campbell next, with the winner of that tie potentially meeting Littler in the last 16. Elsewhere, 15th seed Chris Dobey booked his place in the third round with a 3-1 win over Alexander Merkx. Englishman Dobey concluded an afternoon session which started with a trio of 3-0 scorelines. Northern Ireland's Brendan Dolan beat Lok Yin Lee to set up a meeting with three-time champion Michael van Gerwen after Christmas. 
In the final two first-round matches of the 2025 competition, Wales' Rhys Griffin beat Karel Sedlacek of the Czech Republic before Asia number one Alexis Toylo cruised past Richard Veenstra.\n", - "--------------------------------------------------------------------------------\n", - "Distance: 0.4124, Text: The Littler effect - how darts hit the bullseye\n", - "\n", - "Teenager Luke Littler began his bid to win the 2025 PDC World Darts Championship with a second-round win against Ryan Meikle. Here we assess Littler's impact after a remarkable rise which saw him named BBC Young Sports Personality of the Year and runner-up in the main award to athlete Keely Hodgkinson.\n", - "\n", - "One year ago, he was barely a household name in his own home. Now he is a sporting phenomenon. After emerging from obscurity aged 16 to reach the World Championship final, the life of Luke Littler and the sport he loves has been transformed. Viewing figures, ticket sales and social media interest have rocketed. Darts has hit the bullseye. This Christmas more than 100,000 children are expected to be opening Littler-branded magnetic dartboards as presents. His impact has helped double the number of junior academies, prompted plans to expand the World Championship and generated interest in darts from Saudi Arabian backers.\n", - "\n", - "Just months after taking his GCSE exams and ranked 164th in the world, Littler beat former champions Raymond van Barneveld and Rob Cross en route to the PDC World Championship final in January, before his run ended with a 7-4 loss to Luke Humphries. With his nickname 'The Nuke' on his purple and yellow shirt and the Alexandra Palace crowd belting out his walk-on song, Pitbull's tune Greenlight, he became an instant hit. Electric on the stage, calm off it. The down-to-earth teenager celebrated with a kebab and computer games. \"We've been watching his progress since he was about seven. He was on our radar, but we never anticipated what would happen. The next thing we know 'Littlermania' is spreading everywhere,\" PDC president Barry Hearn told BBC Sport. A peak TV audience of 3.7 million people watched the final - easily Sky's biggest figure for a non-football sporting event. The teenager from Warrington in Cheshire was too young to legally drive or drink alcohol, but earned £200,000 for finishing second - part of £1m prize money in his first year as a professional - and an invitation to the elite Premier League competition. He turned 17 later in January but was he too young for the demanding event over 17 Thursday nights in 17 locations? He ended up winning the whole thing, and hit a nine-dart finish against Humphries in the final. From Bahrain to Wolverhampton, Littler claimed 10 titles in 2024 and is now eyeing the World Championship.\n", - "\n", - "As he progressed at the Ally Pally, the Manchester United fan was sent a good luck message by the club's former midfielder and ex-England captain David Beckham. In 12 months, Littler's Instagram followers have risen from 4,000 to 1.3m. Commercial backers include a clothing range, cereal firm and train company and he will appear in a reboot of the TV darts show Bullseye. Google say he was the most searched-for athlete online in the UK during 2024. On the back of his success, Littler darts, boards, cabinets, shirts are being snapped up in big numbers. \"This Christmas the junior magnetic dartboard is selling out, we're talking over 100,000. 
They're 20 quid and a great introduction for young children,\" said Garry Plummer, the boss of sponsors Target Darts, who first signed a deal with Littler's family when he was aged 12. \"All the toy shops want it, they all want him - 17, clean, doesn't drink, wonderful.\"\n", - "\n", - "\n", - "... (output truncated for brevity)\n" - ] - } - ], - "source": [ - "\n", - "query = \"What were Luke Littler's key achievements and records in his recent PDC World Championship match?\"\n", - "\n", - "try:\n", - " # Perform the semantic search\n", - " start_time = time.time()\n", - " search_results = vector_store.similarity_search_with_score(query, k=10)\n", - " search_elapsed_time = time.time() - start_time\n", - "\n", - " logging.info(f\"Semantic search completed in {search_elapsed_time:.2f} seconds\")\n", - "\n", - " # Display search results\n", - " print(f\"\\nSemantic Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", - " print(\"-\" * 80)\n", - "\n", - " for doc, score in search_results:\n", - " print(f\"Distance: {score:.4f}, Text: {doc.page_content}\")\n", - " print(\"-\" * 80)\n", - "\n", - "except CouchbaseException as e:\n", - " raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n", - "except Exception as e:\n", - " raise RuntimeError(f\"Unexpected error: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Note: To create a COMPOSITE index, the below code can be used.\n", - "Choose based on your specific use case and query patterns. For this tutorial's news search scenario, either index type would work, but BHIVE might be more efficient for pure semantic search across news articles." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from langchain_couchbase.vectorstores import IndexType\n", - "vector_store.create_index(index_type=IndexType.COMPOSITE, index_name=\"bedrock_composite_index\", index_description=\"IVF,SQ8\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Retrieval-Augmented Generation (RAG) with Couchbase and LangChain\n", - "Couchbase and LangChain can be seamlessly integrated to create RAG (Retrieval-Augmented Generation) chains, enhancing the process of generating contextually relevant responses. In this setup, Couchbase serves as the vector store, where embeddings of documents are stored. When a query is made, LangChain retrieves the most relevant documents from Couchbase by comparing the query’s embedding with the stored document embeddings. These documents, which provide contextual information, are then passed to a generative language model within LangChain.\n", - "\n", - "The language model, equipped with the context from the retrieved documents, generates a response that is both informed and contextually accurate. This integration allows the RAG chain to leverage Couchbase’s efficient storage and retrieval capabilities, while LangChain handles the generation of responses based on the context provided by the retrieved documents. Together, they create a powerful system that can deliver highly relevant and accurate answers by combining the strengths of both retrieval and generation." 
- ] - }, - { - "cell_type": "code", - "execution_count": 22, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-02 12:25:08,521 - INFO - Successfully created RAG chain\n" - ] - } - ], - "source": [ - "# Create RAG prompt template\n", - "rag_prompt = ChatPromptTemplate.from_messages([\n", - " (\"system\", \"You are a helpful assistant that answers questions based on the provided context.\"),\n", - " (\"human\", \"Context: {context}\\n\\nQuestion: {question}\")\n", - "])\n", - "\n", - "# Create RAG chain\n", - "rag_chain = (\n", - " {\"context\": vector_store.as_retriever(), \"question\": RunnablePassthrough()}\n", - " | rag_prompt\n", - " | llm\n", - " | StrOutputParser()\n", - ")\n", - "logging.info(\"Successfully created RAG chain\")" - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "RAG Response: \n", - "Luke Littler hit a tournament record 140.91 set average as he started his bid for the PDC World Championship title with a dramatic 3-1 win over Ryan Meikle. The 17-year-old made headlines around the world when he reached the tournament final in January, where he lost to Luke Humphries. Starting this campaign on Saturday, Littler was millimetres away from a nine-darter when he missed double 12 as he blew Meikle away in the fourth and final set of the second-round match. Littler was overcome with emotion at the end\n", - "RAG response generated in 0.41 seconds\n" - ] - } - ], - "source": [ - "start_time = time.time()\n", - "# Turn off excessive Logging \n", - "logging.basicConfig(level=logging.WARNING, format='%(asctime)s - %(levelname)s - %(message)s', force=True)\n", - "\n", - "try:\n", - " rag_response = rag_chain.invoke(query)\n", - " rag_elapsed_time = time.time() - start_time\n", - " print(f\"RAG Response: {rag_response}\")\n", - " print(f\"RAG response generated in {rag_elapsed_time:.2f} seconds\")\n", - "except InternalServerFailureException as e:\n", - " if \"query request rejected\" in str(e):\n", - " print(\"Error: Search request was rejected due to rate limiting. Please try again later.\")\n", - " else:\n", - " print(f\"Internal server error occurred: {str(e)}\")\n", - "except Exception as e:\n", - " print(f\"Unexpected error occurred: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Using Couchbase as a caching mechanism\n", - "Couchbase can be effectively used as a caching mechanism for RAG (Retrieval-Augmented Generation) responses by storing and retrieving precomputed results for specific queries. This approach enhances the system's efficiency and speed, particularly when dealing with repeated or similar queries. When a query is first processed, the RAG chain retrieves relevant documents, generates a response using the language model, and then stores this response in Couchbase, with the query serving as the key.\n", - "\n", - "For subsequent requests with the same query, the system checks Couchbase first. If a cached response is found, it is retrieved directly from Couchbase, bypassing the need to re-run the entire RAG process. This significantly reduces response time because the computationally expensive steps of document retrieval and response generation are skipped. 
Couchbase's role in this setup is to provide a fast and scalable storage solution for caching these responses, ensuring that frequently asked queries can be answered more quickly and efficiently.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Query 1: What happened in the match between Fulham and Liverpool?\n", - "Response: The match between Fulham and Liverpool ended in a 2-2 draw.\n", - "Time taken: 2.30 seconds\n", - "\n", - "Query 2: What were Luke Littler's key achievements and records in his recent PDC World Championship match?\n", - "Response: \n", - "Luke Littler hit a tournament record 140.91 set average as he started his bid for the PDC World Championship title with a dramatic 3-1 win over Ryan Meikle. The 17-year-old made headlines around the world when he reached the tournament final in January, where he lost to Luke Humphries. Starting this campaign on Saturday, Littler was millimetres away from a nine-darter when he missed double 12 as he blew Meikle away in the fourth and final set of the second-round match. Littler was overcome with emotion at the end\n", - "Time taken: 0.40 seconds\n", - "\n", - "Query 3: What happened in the match between Fulham and Liverpool?\n", - "Response: The match between Fulham and Liverpool ended in a 2-2 draw.\n", - "Time taken: 0.36 seconds\n" - ] - } - ], - "source": [ - "try:\n", - " queries = [\n", - " \"What happened in the match between Fulham and Liverpool?\",\n", - " \"What were Luke Littler's key achievements and records in his recent PDC World Championship match?\",\n", - " \"What happened in the match between Fulham and Liverpool?\", # Repeated query\n", - " ]\n", - "\n", - " for i, query in enumerate(queries, 1):\n", - " print(f\"\\nQuery {i}: {query}\")\n", - " start_time = time.time()\n", - "\n", - " response = rag_chain.invoke(query)\n", - " elapsed_time = time.time() - start_time\n", - " print(f\"Response: {response}\")\n", - " print(f\"Time taken: {elapsed_time:.2f} seconds\")\n", - "\n", - "except InternalServerFailureException as e:\n", - " if \"query request rejected\" in str(e):\n", - " print(\"Error: Search request was rejected due to rate limiting. Please try again later.\")\n", - " else:\n", - " print(f\"Internal server error occurred: {str(e)}\")\n", - "except Exception as e:\n", - " print(f\"Unexpected error occurred: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Conclusion\n", - "By following these steps, you'll have a fully functional semantic search engine that leverages the strengths of Couchbase and AWS Bedrock. This guide is designed not just to show you how to build the system, but also to explain why each step is necessary, giving you a deeper understanding of the principles behind semantic search and of how GSI-backed vector indexes can make querying more efficient and significantly improve your RAG performance. Whether you're a newcomer to software development or an experienced developer looking to expand your skills, this guide will provide you with the knowledge and tools you need to create a powerful, AI-driven search engine." 
- ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python (jupyter_env)", - "language": "python", - "name": "jupyter_env" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.16" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/awsbedrock/gsi/.env.sample b/awsbedrock/query_based/.env.sample similarity index 100% rename from awsbedrock/gsi/.env.sample rename to awsbedrock/query_based/.env.sample diff --git a/awsbedrock/query_based/RAG_with_Couchbase_and_Bedrock.ipynb b/awsbedrock/query_based/RAG_with_Couchbase_and_Bedrock.ipynb new file mode 100644 index 00000000..995f07af --- /dev/null +++ b/awsbedrock/query_based/RAG_with_Couchbase_and_Bedrock.ipynb @@ -0,0 +1,1077 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Introduction\n", + "\n", + "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database and [Amazon Bedrock](https://aws.amazon.com/bedrock/) as both the embedding and language model provider. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system using Couchbase Hyperscale and Composite Vector Indexes from scratch. For guidance on choosing the right vector index for your use case, see the [Couchbase documentation](https://docs.couchbase.com/server/current/vector-search/choose-the-right-vector-index.html). Alternatively, if you want to perform semantic search using a Couchbase Search Vector Index, please take a look at [this tutorial](https://developer.couchbase.com/tutorial-aws-bedrock-couchbase-rag-with-search-vector-index/)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# How to run this tutorial\n", + "\n", + "This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/awsbedrock/query_based/RAG_with_Couchbase_and_Bedrock.ipynb).\n", + "\n", + "You can either download the notebook file and run it on [Google Colab](https://colab.research.google.com/) or run it on your system by setting up the Python environment." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Before you start\n", + "\n", + "## Get Credentials for AWS Bedrock\n", + "* Please follow the [instructions](https://docs.aws.amazon.com/bedrock/latest/userguide/getting-started.html) to set up AWS Bedrock and generate credentials.\n", + "* Ensure you have the necessary IAM permissions to access Bedrock services.\n", + "\n", + "## Create and Deploy Your Free Tier Operational Cluster on Capella\n", + "\n", + "To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. 
This account provides you with an environment where you can explore and learn about Capella with no time constraint.\n", + "\n", + "To know more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n", + "\n", + "Note: To run this tutorial, you will need Capella with Couchbase Server version 8.0 or above, as GSI-based vector search is supported only from version 8.0.\n", + "\n", + "### Couchbase Capella Configuration\n", + "\n", + "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met.\n", + "\n", + "* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the bucket (Read and Write) used in the application.\n", + "* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Setting the Stage: Installing Necessary Libraries\n", + "\n", + "To build our semantic search engine, we need a robust set of tools. The libraries we install handle everything from connecting to databases to performing complex machine learning tasks." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.3.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.2\u001b[0m\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n", + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], + "source": [ + "%pip install --quiet datasets==3.5.0 langchain-couchbase==0.5.0 langchain-aws boto3==1.37.35 python-dotenv==1.1.0\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Importing Necessary Libraries\n", + "\n", + "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, embedding generation, and dataset loading." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import getpass\n", + "import json\n", + "import logging\n", + "import os\n", + "import time\n", + "from datetime import timedelta\n", + "\n", + "import boto3\n", + "from couchbase.auth import PasswordAuthenticator\n", + "from couchbase.cluster import Cluster\n", + "from couchbase.exceptions import (CouchbaseException,\n", + " InternalServerFailureException)\n", + "from couchbase.management.buckets import CreateBucketSettings\n", + "from couchbase.options import ClusterOptions\n", + "from datasets import load_dataset\n", + "from dotenv import load_dotenv\n", + "from langchain_aws import BedrockEmbeddings, ChatBedrock\n", + "from langchain_core.globals import set_llm_cache\n", + "from langchain_core.output_parsers import StrOutputParser\n", + "from langchain_core.prompts.chat import ChatPromptTemplate\n", + "from langchain_core.runnables import RunnablePassthrough\n", + "from langchain_couchbase.cache import CouchbaseCache\n", + "from langchain_couchbase.vectorstores import CouchbaseQueryVectorStore\n", + "from langchain_couchbase.vectorstores import DistanceStrategy\n", + "from tqdm import tqdm" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Setup Logging\n", + "\n", + "Logging is configured to track the progress of the script and capture any errors or warnings." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', force=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Loading Sensitive Information\n", + "In this section, we prompt the user to input essential configuration settings needed. These settings include sensitive information like AWS credentials, database credentials, and specific configuration names. Instead of hardcoding these details into the script, we request the user to provide them at runtime, ensuring flexibility and security.\n", + "\n", + "The project includes an `.env.sample` file that lists all the environment variables. To get started:\n", + "\n", + "1. Create a `.env` file in the same directory as this notebook\n", + "2. Copy the contents from `.env.sample` to your `.env` file\n", + "3. Fill in the required credentials\n", + "\n", + "The script also validates that all required inputs are provided, raising an error if any crucial information is missing. This approach ensures that your integration is both secure and correctly configured without hardcoding sensitive information, enhancing the overall security and maintainability of your code." 
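+ "\n", + "\n", + "For illustration only, a filled-in `.env` could look like the following sketch (every value is a placeholder, not a real credential; the variable names mirror the ones read in the next cell):\n", + "\n", + "```\n", + "AWS_ACCESS_KEY_ID=your-access-key-id\n", + "AWS_SECRET_ACCESS_KEY=your-secret-access-key\n", + "AWS_REGION=us-east-1\n", + "CB_HOST=couchbases://your-cluster.cloud.couchbase.com\n", + "CB_USERNAME=Administrator\n", + "CB_PASSWORD=your-password\n", + "CB_BUCKET_NAME=query-vector-search-testing\n", + "SCOPE_NAME=shared\n", + "COLLECTION_NAME=bedrock\n", + "CACHE_COLLECTION=cache\n", + "```"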
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# Load environment variables from .env file if it exists\n", + "load_dotenv(override=True)\n", + "\n", + "# AWS Credentials\n", + "AWS_ACCESS_KEY_ID = os.getenv('AWS_ACCESS_KEY_ID') or input('Enter your AWS Access Key ID: ')\n", + "AWS_SECRET_ACCESS_KEY = os.getenv('AWS_SECRET_ACCESS_KEY') or getpass.getpass('Enter your AWS Secret Access Key: ')\n", + "AWS_REGION = os.getenv('AWS_REGION') or input('Enter your AWS region (default: us-east-1): ') or 'us-east-1'\n", + "\n", + "# Couchbase Settings\n", + "CB_HOST = os.getenv('CB_HOST') or input('Enter your Couchbase host (default: couchbase://localhost): ') or 'couchbase://localhost'\n", + "CB_USERNAME = os.getenv('CB_USERNAME') or input('Enter your Couchbase username (default: Administrator): ') or 'Administrator'\n", + "CB_PASSWORD = os.getenv('CB_PASSWORD') or getpass.getpass('Enter your Couchbase password (default: password): ') or 'password'\n", + "CB_BUCKET_NAME = os.getenv('CB_BUCKET_NAME') or input('Enter your Couchbase bucket name (default: query-vector-search-testing): ') or 'query-vector-search-testing'\n", + "SCOPE_NAME = os.getenv('SCOPE_NAME') or input('Enter your scope name (default: shared): ') or 'shared'\n", + "COLLECTION_NAME = os.getenv('COLLECTION_NAME') or input('Enter your collection name (default: bedrock): ') or 'bedrock'\n", + "CACHE_COLLECTION = os.getenv('CACHE_COLLECTION') or input('Enter your cache collection name (default: cache): ') or 'cache'\n", + "\n", + "# Check if required credentials are set\n", + "for cred_name, cred_value in {\n", + " 'AWS_ACCESS_KEY_ID': AWS_ACCESS_KEY_ID,\n", + " 'AWS_SECRET_ACCESS_KEY': AWS_SECRET_ACCESS_KEY, \n", + " 'CB_HOST': CB_HOST,\n", + " 'CB_USERNAME': CB_USERNAME,\n", + " 'CB_PASSWORD': CB_PASSWORD,\n", + " 'CB_BUCKET_NAME': CB_BUCKET_NAME\n", + "}.items():\n", + " if not cred_value:\n", + " raise ValueError(f\"{cred_name} is not set\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Connecting to the Couchbase Cluster\n", + "Connecting to a Couchbase cluster is the foundation of our project. Couchbase will serve as our primary data store, handling all the storage and retrieval operations required for our semantic search engine. By establishing this connection, we enable our application to interact with the database, allowing us to perform operations such as storing embeddings, querying data, and managing collections. This connection is the gateway through which all data will flow, so ensuring it's set up correctly is paramount.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-02 12:21:07,348 - INFO - Successfully connected to Couchbase\n" + ] + } + ], + "source": [ + "try:\n", + " auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)\n", + " options = ClusterOptions(auth)\n", + " cluster = Cluster(CB_HOST, options)\n", + " cluster.wait_until_ready(timedelta(seconds=5))\n", + " logging.info(\"Successfully connected to Couchbase\")\n", + "except Exception as e:\n", + " raise ConnectionError(f\"Failed to connect to Couchbase: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setting Up Collections in Couchbase\n", + "\n", + "The setup_collection() function handles creating and configuring the hierarchical data organization in Couchbase:\n", + "\n", + "1. 
Bucket Creation:\n", + " - Checks if specified bucket exists, creates it if not\n", + " - Sets bucket properties like RAM quota (1024MB) and replication (disabled)\n", + " - Note: You will not be able to create a bucket on Capella\n", + "\n", + "2. Scope Management: \n", + " - Verifies if requested scope exists within bucket\n", + " - Creates new scope if needed (unless it's the default \"_default\" scope)\n", + "\n", + "3. Collection Setup:\n", + " - Checks for collection existence within scope\n", + " - Creates collection if it doesn't exist\n", + " - Waits 2 seconds for collection to be ready\n", + "\n", + "Additional Tasks:\n", + "- Clears any existing documents for clean state\n", + "- Implements comprehensive error handling and logging\n", + "\n", + "The function is called twice to set up:\n", + "1. Main collection for vector embeddings\n", + "2. Cache collection for storing results\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-08-29 13:03:42,591 - INFO - Bucket 'query-vector-search-testing' does not exist. Creating it...\n", + "2025-08-29 13:03:44,657 - INFO - Bucket 'query-vector-search-testing' created successfully.\n", + "2025-08-29 13:03:44,663 - INFO - Scope 'shared' does not exist. Creating it...\n", + "2025-08-29 13:03:44,704 - INFO - Scope 'shared' created successfully.\n", + "2025-08-29 13:03:44,714 - INFO - Collection 'bedrock' does not exist. Creating it...\n", + "2025-08-29 13:03:44,770 - INFO - Collection 'bedrock' created successfully.\n", + "2025-08-29 13:03:46,953 - INFO - All documents cleared from the collection.\n", + "2025-08-29 13:03:46,954 - INFO - Bucket 'query-vector-search-testing' exists.\n", + "2025-08-29 13:03:46,969 - INFO - Collection 'cache' does not exist. Creating it...\n", + "2025-08-29 13:03:47,025 - INFO - Collection 'cache' created successfully.\n", + "2025-08-29 13:03:49,183 - INFO - All documents cleared from the collection.\n" + ] + }, + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def setup_collection(cluster, bucket_name, scope_name, collection_name):\n", + " try:\n", + " # Check if bucket exists, create if it doesn't\n", + " try:\n", + " bucket = cluster.bucket(bucket_name)\n", + " logging.info(f\"Bucket '{bucket_name}' exists.\")\n", + " except Exception as e:\n", + " logging.info(f\"Bucket '{bucket_name}' does not exist. Creating it...\")\n", + " bucket_settings = CreateBucketSettings(\n", + " name=bucket_name,\n", + " bucket_type='couchbase',\n", + " ram_quota_mb=1024,\n", + " flush_enabled=True,\n", + " num_replicas=0\n", + " )\n", + " cluster.buckets().create_bucket(bucket_settings)\n", + " time.sleep(2) # Wait for bucket creation to complete and become available\n", + " bucket = cluster.bucket(bucket_name)\n", + " logging.info(f\"Bucket '{bucket_name}' created successfully.\")\n", + "\n", + " bucket_manager = bucket.collections()\n", + "\n", + " # Check if scope exists, create if it doesn't\n", + " scopes = bucket_manager.get_all_scopes()\n", + " scope_exists = any(scope.name == scope_name for scope in scopes)\n", + " \n", + " if not scope_exists and scope_name != \"_default\":\n", + " logging.info(f\"Scope '{scope_name}' does not exist. 
Creating it...\")\n", + " bucket_manager.create_scope(scope_name)\n", + " logging.info(f\"Scope '{scope_name}' created successfully.\")\n", + "\n", + " # Check if collection exists, create if it doesn't\n", + " collections = bucket_manager.get_all_scopes()\n", + " collection_exists = any(\n", + " scope.name == scope_name and collection_name in [col.name for col in scope.collections]\n", + " for scope in collections\n", + " )\n", + "\n", + " if not collection_exists:\n", + " logging.info(f\"Collection '{collection_name}' does not exist. Creating it...\")\n", + " bucket_manager.create_collection(scope_name, collection_name)\n", + " logging.info(f\"Collection '{collection_name}' created successfully.\")\n", + " else:\n", + " logging.info(f\"Collection '{collection_name}' already exists. Skipping creation.\")\n", + "\n", + " # Wait for collection to be ready\n", + " collection = bucket.scope(scope_name).collection(collection_name)\n", + " time.sleep(2) # Give the collection time to be ready for queries\n", + "\n", + " # Clear all documents in the collection\n", + " try:\n", + " query = f\"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`\"\n", + " cluster.query(query).execute()\n", + " logging.info(\"All documents cleared from the collection.\")\n", + " except Exception as e:\n", + " logging.warning(f\"Error while clearing documents: {str(e)}. The collection might be empty.\")\n", + "\n", + " return collection\n", + " except Exception as e:\n", + " raise RuntimeError(f\"Error setting up collection: {str(e)}\")\n", + " \n", + "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME)\n", + "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, CACHE_COLLECTION)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Creating Amazon Bedrock Client and Embeddings\n", + "\n", + "Embeddings are at the heart of semantic search. They are numerical representations of text that capture the semantic meaning of the words and phrases. We'll use Amazon Bedrock's Titan embedding model for embeddings.\n", + "\n", + "## Using Amazon Bedrock's Titan Model\n", + "\n", + "Language models are AI systems that are trained to understand and generate human language. We'll be using Amazon Bedrock's Titan model to process user queries and generate meaningful responses. 
The Titan model family includes both embedding models for converting text into vector representations and text generation models for producing human-like responses.\n", + "\n", + "Key features of Amazon Bedrock's Titan models:\n", + "- Titan Embeddings model for embedding vector generation\n", + "- Titan Text model for natural language understanding and generation\n", + "- Seamless integration with AWS infrastructure\n", + "- Enterprise-grade security and scalability" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-02 12:21:15,663 - INFO - Successfully created Bedrock embeddings client\n" + ] + } + ], + "source": [ + "try:\n", + " bedrock_client = boto3.client(\n", + " service_name='bedrock-runtime',\n", + " region_name=AWS_REGION,\n", + " aws_access_key_id=AWS_ACCESS_KEY_ID,\n", + " aws_secret_access_key=AWS_SECRET_ACCESS_KEY\n", + " )\n", + " \n", + " embeddings = BedrockEmbeddings(\n", + " client=bedrock_client,\n", + " model_id=\"amazon.titan-embed-text-v2:0\"\n", + " )\n", + " logging.info(\"Successfully created Bedrock embeddings client\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Error creating Bedrock embeddings client: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Setting Up the Couchbase Query Vector Store\n", + "A vector store is where we'll keep our embeddings. The query vector store is specifically designed to handle embeddings and perform similarity searches. When a user inputs a query, it is converted into an embedding and compared against the embeddings stored in the vector store. This allows the engine to find documents that are semantically similar to the query, even if they don't contain the exact same words. By setting up the vector store in Couchbase, we create a powerful tool that enables us to understand and retrieve information based on the meaning and context of the query, rather than just the specific words used.\n", + "\n", + "The vector store requires a distance metric to determine how similarity between vectors is calculated. This is crucial for accurate semantic search results, as different distance metrics can yield different similarity rankings. Some of the supported distance strategies are dot, l2, euclidean, cosine, l2_squared, and euclidean_squared. In our implementation, we will use cosine, which is particularly effective for text embeddings." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-02 12:22:15,979 - INFO - Successfully created vector store\n" + ] + } + ], + "source": [ + "try:\n", + " vector_store = CouchbaseQueryVectorStore(\n", + " cluster=cluster,\n", + " bucket_name=CB_BUCKET_NAME,\n", + " scope_name=SCOPE_NAME,\n", + " collection_name=COLLECTION_NAME,\n", + " embedding=embeddings,\n", + " distance_metric=DistanceStrategy.COSINE\n", + " )\n", + " logging.info(\"Successfully created vector store\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Failed to create vector store: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Load the BBC News Dataset\n", + "To build a search engine, we need data to search through. We use the BBC News dataset from RealTimeData, which provides real-world news articles. This dataset contains news articles from BBC covering various topics and time periods. 
Loading the dataset is a crucial step because it provides the raw material that our search engine will work with. The quality and diversity of the news articles make it an excellent choice for testing and refining our search engine, ensuring it can handle real-world news content effectively.\n", + "\n", + "The BBC News dataset allows us to work with authentic news articles, enabling us to build and test a search engine that can effectively process and retrieve relevant news content. The dataset is loaded using the Hugging Face datasets library, specifically accessing the \"RealTimeData/bbc_news_alltime\" dataset with the \"2024-12\" version." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-02 12:21:31,880 - INFO - Successfully loaded the BBC News dataset with 2687 rows.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loaded the BBC News dataset with 2687 rows\n" + ] + } + ], + "source": [ + "try:\n", + " news_dataset = load_dataset(\n", + " \"RealTimeData/bbc_news_alltime\", \"2024-12\", split=\"train\"\n", + " )\n", + " print(f\"Loaded the BBC News dataset with {len(news_dataset)} rows\")\n", + " logging.info(f\"Successfully loaded the BBC News dataset with {len(news_dataset)} rows.\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Error loading the BBC News dataset: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Cleaning up the Data\n", + "We will use the content of the news articles for our RAG system.\n", + "\n", + "The dataset contains a few duplicate records. We are removing them to avoid duplicate results in the retrieval stage of our RAG system." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "We have 1749 unique articles in our database.\n" + ] + } + ], + "source": [ + "news_articles = news_dataset[\"content\"]\n", + "unique_articles = set()\n", + "for article in news_articles:\n", + " if article:\n", + " unique_articles.add(article)\n", + "unique_news_articles = list(unique_articles)\n", + "print(f\"We have {len(unique_news_articles)} unique articles in our database.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Saving Data to the Vector Store\n", + "To efficiently handle the large number of articles, we process them in batches of 50 articles at a time. This batch processing approach helps manage memory usage and provides better control over the ingestion process.\n", + "\n", + "We first filter out any articles that exceed 50,000 characters to avoid potential issues with token limits. Then, using the vector store's add_texts method, we add the filtered articles to our vector database. The batch_size parameter controls how many articles are processed in each iteration.\n", + "\n", + "This approach offers several benefits:\n", + "1. Memory Efficiency: Processing in smaller batches prevents memory overload\n", + "2. Error Handling: If an error occurs, only the current batch is affected\n", + "3. Progress Tracking: Easier to monitor and track the ingestion progress\n", + "4. 
Resource Management: Better control over CPU and network resource utilization\n", + "\n", + "We use a conservative batch size of 50 to ensure reliable operation.\n", + "The optimal batch size depends on many factors including:\n", + "- Document sizes being inserted\n", + "- Available system resources\n", + "- Network conditions\n", + "- Concurrent workload\n", + "\n", + "Consider measuring performance with your specific workload before adjusting.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-08-20 14:05:53,302 - INFO - Document ingestion completed successfully.\n" + ] + } + ], + "source": [ + "batch_size = 50\n", + "\n", + "# Automatic Batch Processing\n", + "articles = [article for article in unique_news_articles if article and len(article) <= 50000]\n", + "\n", + "try:\n", + " vector_store.add_texts(\n", + " texts=articles,\n", + " batch_size=batch_size\n", + " )\n", + " logging.info(\"Document ingestion completed successfully.\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Failed to save documents to vector store: {str(e)}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Setting Up a Couchbase Cache\n", + "To further optimize our system, we set up a Couchbase-based cache. A cache is a temporary storage layer that holds data that is frequently accessed, speeding up operations by reducing the need to repeatedly retrieve the same information from the database. In our setup, the cache will help us accelerate repetitive tasks, such as looking up similar documents. By implementing a cache, we enhance the overall performance of our search engine, ensuring that it can handle high query volumes and deliver results quickly.\n", + "\n", + "Caching is particularly valuable in scenarios where users may submit similar queries multiple times or where certain pieces of information are frequently requested. By storing these in a cache, we can significantly reduce the time it takes to respond to these queries, improving the user experience.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-02 12:22:20,978 - INFO - Successfully created cache\n" + ] + } + ], + "source": [ + "try:\n", + " cache = CouchbaseCache(\n", + " cluster=cluster,\n", + " bucket_name=CB_BUCKET_NAME,\n", + " scope_name=SCOPE_NAME,\n", + " collection_name=CACHE_COLLECTION,\n", + " )\n", + " logging.info(\"Successfully created cache\")\n", + " set_llm_cache(cache)\n", + "except Exception as e:\n", + " raise ValueError(f\"Failed to create cache: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Using Amazon Bedrock's Titan Text Express v1 Model\n", + "\n", + "Amazon Bedrock's Titan Text Express v1 is a state-of-the-art foundation model designed for fast and efficient text generation tasks. 
This model excels at:\n", + "\n", + "- Text generation and completion\n", + "- Question answering \n", + "- Summarization\n", + "- Content rewriting\n", + "- Analysis and extraction\n", + "\n", + "Key features of Titan Text Express v1:\n", + "\n", + "- Optimized for low-latency responses while maintaining high quality output\n", + "- Supports up to 8K tokens context window\n", + "- Built-in content filtering and safety controls\n", + "- Cost-effective compared to larger models\n", + "- Seamlessly integrates with AWS services\n", + "\n", + "The model uses a temperature parameter (0-1) to control randomness in responses:\n", + "- Lower values (e.g. 0) produce more focused, deterministic outputs\n", + "- Higher values introduce more creativity and variation\n", + "\n", + "We'll be using this model through Amazon Bedrock's API to process user queries and generate contextually relevant responses based on our vector database content." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-02 12:22:24,513 - INFO - Successfully created Bedrock LLM client\n" + ] + } + ], + "source": [ + "try:\n", + " llm = ChatBedrock(\n", + " client=bedrock_client,\n", + " model_id=\"amazon.titan-text-express-v1\",\n", + " model_kwargs={\"temperature\": 0}\n", + " )\n", + " logging.info(\"Successfully created Bedrock LLM client\")\n", + "except Exception as e:\n", + " logging.error(f\"Error creating Bedrock LLM client: {str(e)}. Please check your AWS credentials and Bedrock access.\")\n", + " raise" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Perform Semantic Search\n", + "Semantic search in Couchbase involves converting queries and documents into vector representations using an embeddings model. These vectors capture the semantic meaning of the text and are stored directly in Couchbase. When a query is made, Couchbase performs a similarity search by comparing the query vector against the stored document vectors. The similarity metric used for this comparison is configurable, allowing flexibility in how the relevance of documents is determined. Common metrics include cosine similarity, Euclidean distance, or dot product, but other metrics can be implemented based on specific use cases. Different embedding models like BERT, Word2Vec, or GloVe can also be used depending on the application's needs, with the vectors generated by these models stored and searched within Couchbase itself.\n", + "\n", + "In the provided code, the search process begins by recording the start time, followed by executing the `similarity_search_with_score` method of the `CouchbaseQueryVectorStore`. This method searches Couchbase for the most relevant documents based on the vector similarity to the query. The search results include the document content and the distance that reflects how closely each document aligns with the query in the defined semantic space. The time taken to perform this search is then calculated and logged, and the results are displayed, showing the most relevant documents along with their similarity scores. This approach leverages Couchbase as both a storage and retrieval engine for vector data, enabling efficient and scalable semantic searches. The integration of vector storage and search capabilities within Couchbase allows for sophisticated semantic search operations without relying on external services for vector storage or comparison." 
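+ "\n", + "\n", + "As a small sketch to make the scores easier to read (this relies on the usual cosine-distance convention, an assumption rather than a documented guarantee of the langchain-couchbase API): because the store above is configured with `DistanceStrategy.COSINE`, the score returned by `similarity_search_with_score` is a distance, so a rough similarity can be derived by subtracting it from 1:\n", + "\n", + "```python\n", + "# Hypothetical helper: converts a cosine distance into an approximate\n", + "# similarity score, where values closer to 1 mean a closer match.\n", + "def cosine_similarity_from_distance(distance: float) -> float:\n", + "    return 1.0 - distance\n", + "```"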
+ ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-02 12:23:51,477 - INFO - Semantic search completed in 1.29 seconds\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Semantic Search Results (completed in 1.29 seconds):\n", + "--------------------------------------------------------------------------------\n", + "Distance: 0.3512, Text: Luke Littler has risen from 164th to fourth in the rankings in a year\n", + "\n", + "A tearful Luke Littler hit a tournament record 140.91 set average as he started his bid for the PDC World Championship title with a dramatic 3-1 win over Ryan Meikle. The 17-year-old made headlines around the world when he reached the tournament final in January, where he lost to Luke Humphries. Starting this campaign on Saturday, Littler was millimetres away from a nine-darter when he missed double 12 as he blew Meikle away in the fourth and final set of the second-round match. Littler was overcome with emotion at the end, cutting short his on-stage interview. \"It was probably the toughest game I've ever played. I had to fight until the end,\" he said later in a news conference. \"As soon as the question came on stage and then boom, the tears came. It was just a bit too much to speak on stage. \"It is the worst game I have played. I have never felt anything like that tonight.\" Admitting to nerves during the match, he told Sky Sports: \"Yes, probably the biggest time it's hit me. Coming into it I was fine, but as soon as [referee] George Noble said 'game on', I couldn't throw them.\" Littler started slowly against Meikle, who had two darts for the opening set, but he took the lead by twice hitting double 20. Meikle did not look overawed against his fellow Englishman and levelled, but Littler won the third set and exploded into life in the fourth. The tournament favourite hit four maximum 180s as he clinched three straight legs in 11, 10 and 11 darts for a record set average, and 100.85 overall. Meanwhile, two seeds crashed out on Saturday night \u2013 five-time world champion Raymond van Barneveld lost to Welshman Nick Kenny, while England's Ryan Joyce beat Danny Noppert. Australian Damon Heta was another to narrowly miss out on a nine-darter, just failing on double 12 when throwing for the match in a 3-1 win over Connor Scutt. Ninth seed Heta hit four 100-plus checkouts to come from a set down against Scutt in a match in which both men averaged more than 97.\n", + "\n", + "Littler was hugged by his parents after victory over Meikle\n", + "\n", + "Littler returned to Alexandra Palace to a boisterous reception from more than 3,000 spectators and delivered an astonishing display in the fourth set. He was on for a nine-darter after his opening two throws in both of the first two legs and completed the set in 32 darts - the minimum possible is 27. The teenager will next play after Christmas against European Championship winner Ritchie Edhouse, the 29th seed, or Ian White, and is seeded to meet Humphries in the semi-finals. Having entered last year's event ranked 164th, Littler is up to fourth in the world and will go to number two if he reaches the final again this time. He has won 10 titles in his debut professional year, including the Premier League and Grand Slam of Darts. After reaching the World Championship final as a debutant aged just 16, Littler's life has been transformed and interest in darts has rocketed. 
Google say he was the most searched-for athlete online in the UK during 2024. This Christmas, more than 100,000 children are expected to be opening Littler-branded magnetic dartboards as presents. His impact has helped double the number of junior academies and has prompted plans to expand the World Championship. Littler was named BBC Young Sports Personality of the Year on Tuesday and was runner-up to athlete Keely Hodgkinson for the main award.\n", + "\n", + "Nick Kenny will play world champion Luke Humphries in round three after Christmas\n", + "\n", + "Barneveld was shocked 3-1 by world number 76 Kenny, who was in tears after a famous victory. Kenny, 32, will face Humphries in round three after defeating the Dutchman, who won the BDO world title four times and the PDC crown in 2007. Van Barneveld, ranked 32nd, became the sixth seed to exit in the second round. His compatriot Noppert, the 13th seed, was stunned 3-1 by Joyce, who will face Ryan Searle or Matt Campbell next, with the winner of that tie potentially meeting Littler in the last 16. Elsewhere, 15th seed Chris Dobey booked his place in the third round with a 3-1 win over Alexander Merkx. Englishman Dobey concluded an afternoon session which started with a trio of 3-0 scorelines. Northern Ireland's Brendan Dolan beat Lok Yin Lee to set up a meeting with three-time champion Michael van Gerwen after Christmas. In the final two first-round matches of the 2025 competition, Wales' Rhys Griffin beat Karel Sedlacek of the Czech Republic before Asia number one Alexis Toylo cruised past Richard Veenstra.\n", + "--------------------------------------------------------------------------------\n", + "Distance: 0.4124, Text: The Littler effect - how darts hit the bullseye\n", + "\n", + "Teenager Luke Littler began his bid to win the 2025 PDC World Darts Championship with a second-round win against Ryan Meikle. Here we assess Littler's impact after a remarkable rise which saw him named BBC Young Sports Personality of the Year and runner-up in the main award to athlete Keely Hodgkinson.\n", + "\n", + "One year ago, he was barely a household name in his own home. Now he is a sporting phenomenon. After emerging from obscurity aged 16 to reach the World Championship final, the life of Luke Littler and the sport he loves has been transformed. Viewing figures, ticket sales and social media interest have rocketed. Darts has hit the bullseye. This Christmas more than 100,000 children are expected to be opening Littler-branded magnetic dartboards as presents. His impact has helped double the number of junior academies, prompted plans to expand the World Championship and generated interest in darts from Saudi Arabian backers.\n", + "\n", + "Just months after taking his GCSE exams and ranked 164th in the world, Littler beat former champions Raymond van Barneveld and Rob Cross en route to the PDC World Championship final in January, before his run ended with a 7-4 loss to Luke Humphries. With his nickname 'The Nuke' on his purple and yellow shirt and the Alexandra Palace crowd belting out his walk-on song, Pitbull's tune Greenlight, he became an instant hit. Electric on the stage, calm off it. The down-to-earth teenager celebrated with a kebab and computer games. \"We've been watching his progress since he was about seven. He was on our radar, but we never anticipated what would happen. The next thing we know 'Littlermania' is spreading everywhere,\" PDC president Barry Hearn told BBC Sport. 
A peak TV audience of 3.7 million people watched the final - easily Sky's biggest figure for a non-football sporting event. The teenager from Warrington in Cheshire was too young to legally drive or drink alcohol, but earned \u00a3200,000 for finishing second - part of \u00a31m prize money in his first year as a professional - and an invitation to the elite Premier League competition. He turned 17 later in January but was he too young for the demanding event over 17 Thursday nights in 17 locations? He ended up winning the whole thing, and hit a nine-dart finish against Humphries in the final. From Bahrain to Wolverhampton, Littler claimed 10 titles in 2024 and is now eyeing the World Championship.\n", + "\n", + "As he progressed at the Ally Pally, the Manchester United fan was sent a good luck message by the club's former midfielder and ex-England captain David Beckham. In 12 months, Littler's Instagram followers have risen from 4,000 to 1.3m. Commercial backers include a clothing range, cereal firm and train company and he will appear in a reboot of the TV darts show Bullseye. Google say he was the most searched-for athlete online in the UK during 2024. On the back of his success, Littler darts, boards, cabinets, shirts are being snapped up in big numbers. \"This Christmas the junior magnetic dartboard is selling out, we're talking over 100,000. They're 20 quid and a great introduction for young children,\" said Garry Plummer, the boss of sponsors Target Darts, who first signed a deal with Littler's family when he was aged 12. \"All the toy shops want it, they all want him - 17, clean, doesn't drink, wonderful.\"\n", + "\n", + "\n", + "... (output truncated for brevity)\n" + ] + } + ], + "source": [ + "query = \"What were Luke Littler's key achievements and records in his recent PDC World Championship match?\"\n", + "\n", + "try:\n", + " # Perform the semantic search\n", + " start_time = time.time()\n", + " search_results = vector_store.similarity_search_with_score(query, k=10)\n", + " search_elapsed_time = time.time() - start_time\n", + "\n", + " logging.info(f\"Semantic search completed in {search_elapsed_time:.2f} seconds\")\n", + "\n", + " # Display search results\n", + " print(f\"\\nSemantic Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", + " print(\"-\" * 80)\n", + "\n", + " for doc, score in search_results:\n", + " print(f\"Distance: {score:.4f}, Text: {doc.page_content}\")\n", + " print(\"-\" * 80)\n", + "\n", + "except CouchbaseException as e:\n", + " raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n", + "except Exception as e:\n", + " raise RuntimeError(f\"Unexpected error: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Optimizing Vector Search with Hyperscale and Composite Vector Indexes\n", + "\n", + "While the above semantic search using `similarity_search_with_score` works effectively, we can significantly improve query performance by leveraging Hyperscale and Composite Vector Indexes in Couchbase.\n", + "\n", + "Couchbase offers three types of vector indexes, but for query-based vector search we focus on two main types:\n", + "\n", + "Hyperscale Vector Indexes (BHIVE)\n", + "- Best for pure vector searches - content discovery, recommendations, semantic search\n", + "- High performance with low memory footprint - designed to scale to billions of vectors\n", + "- Optimized for concurrent operations - supports simultaneous searches and inserts\n", + "- Use when: You primarily perform 
vector-only queries without complex scalar filtering\n", + "- Ideal for: Large-scale semantic search, recommendation systems, content discovery\n", + "\n", + "Composite Vector Indexes\n", + "- Best for filtered vector searches - combines vector search with scalar value filtering\n", + "- Efficient pre-filtering - scalar attributes reduce the vector comparison scope\n", + "- Use when: Your queries combine vector similarity with scalar filters that eliminate large portions of data\n", + "- Ideal for: Compliance-based filtering, user-specific searches, time-bounded queries\n", + "\n", + "Choosing the Right Index Type\n", + "- Start with Hyperscale Vector Index for pure vector searches and large datasets\n", + "- Use Composite Vector Index when scalar filters significantly reduce your search space\n", + "- Consider your dataset size: Hyperscale scales to billions, Composite works well for tens of millions to billions\n", + "\n", + "For more details, see the [Couchbase Vector Index documentation](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html).\n", + "\n", + "\n", + "## Understanding Index Configuration (Couchbase 8.0 Feature)\n", + "\n", + "The index_description parameter controls how Couchbase optimizes vector storage and search performance through centroids and quantization:\n", + "\n", + "Format: `'IVF[<centroids>],{SQ<bits>|PQ<subquantizers>x<bits>}'`\n", + "\n", + "Centroids (IVF - Inverted File):\n", + "- Controls how the dataset is subdivided for faster searches\n", + "- More centroids = faster search, slower training\n", + "- Fewer centroids = slower search, faster training\n", + "- If omitted (like IVF,SQ8), Couchbase auto-selects based on dataset size\n", + "\n", + "Quantization Options:\n", + "- SQ (Scalar Quantization): SQ4, SQ6, SQ8 (4, 6, or 8 bits per dimension)\n", + "- PQ (Product Quantization): PQ<subquantizers>x<bits> (e.g., PQ32x8)\n", + "- Higher values = better accuracy, larger index size\n", + "\n", + "Common Examples:\n", + "- IVF,SQ8 - Auto centroids, 8-bit scalar quantization (good default)\n", + "- IVF1000,SQ6 - 1000 centroids, 6-bit scalar quantization\n", + "- IVF,PQ32x8 - Auto centroids, 32 subquantizers with 8 bits\n", + "\n", + "For detailed configuration options, see the [Quantization & Centroid Settings](https://docs.couchbase.com/cloud/vector-index/hyperscale-vector-index.html#algo_settings).\n", + "\n", + "In the code below, we demonstrate creating a BHIVE index. The create_index method takes an index type (BHIVE or COMPOSITE) and a description parameter for optimization settings. Alternatively, Hyperscale and Composite Vector Indexes can be created manually from the Couchbase UI." + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_couchbase.vectorstores import IndexType\n", + "vector_store.create_index(index_type=IndexType.BHIVE, index_name=\"bedrock_bhive_index\", index_description=\"IVF,SQ8\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The example below shows running the same similarity search, but now using the BHIVE (Hyperscale) index we created above. You'll notice improved performance as the index efficiently retrieves data.\n", + "\n", + "**Important**: When using Composite indexes, scalar filters take precedence over vector similarity, which can improve performance for filtered searches but may miss some semantically relevant results that don't match the scalar criteria.\n", + "\n", + "Note: In query-based vector search, the distance represents the vector distance between the query and document embeddings. A lower distance indicates higher similarity, while a higher distance indicates lower similarity.\n",
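+ "\n",
+ "If you are curious what a query-based vector search looks like at the SQL++ level, the sketch below illustrates the kind of statement a Hyperscale index serves. Treat it as an assumption-laden sketch rather than a required tutorial step: the keyspace and the `embedding`/`text` field names are placeholders, and it assumes the Couchbase `cluster` connection and the Bedrock `embeddings` object created earlier in this notebook (adjust the names to your setup). See the linked documentation for the authoritative query syntax.\n",
+ "\n",
+ "```python\n",
+ "# Hypothetical sketch: the keyspace and field names below are placeholders.\n",
+ "from couchbase.options import QueryOptions\n",
+ "\n",
+ "# Embed the question with the same embedding model used to load the documents.\n",
+ "query_vector = embeddings.embed_query(\"Luke Littler PDC World Championship\")\n",
+ "\n",
+ "sql = \"\"\"\n",
+ "SELECT d.text, APPROX_VECTOR_DISTANCE(d.embedding, $qvec) AS distance\n",
+ "FROM `vector-search-testing`.`shared`.`bedrock` AS d\n",
+ "ORDER BY distance\n",
+ "LIMIT 10\n",
+ "\"\"\"\n",
+ "\n",
+ "for row in cluster.query(sql, QueryOptions(named_parameters={\"qvec\": query_vector})):\n",
+ "    print(f\"{row['distance']:.4f}  {row['text'][:80]}\")\n",
+ "```"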
+ ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-02 12:24:54,503 - INFO - Semantic search completed in 0.36 seconds\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Semantic Search Results (completed in 0.36 seconds):\n", + "--------------------------------------------------------------------------------\n", + "Distance: 0.3512, Text: Luke Littler has risen from 164th to fourth in the rankings in a year\n", + "\n", + "A tearful Luke Littler hit a tournament record 140.91 set average as he started his bid for the PDC World Championship title with a dramatic 3-1 win over Ryan Meikle. The 17-year-old made headlines around the world when he reached the tournament final in January, where he lost to Luke Humphries. Starting this campaign on Saturday, Littler was millimetres away from a nine-darter when he missed double 12 as he blew Meikle away in the fourth and final set of the second-round match. Littler was overcome with emotion at the end, cutting short his on-stage interview. \"It was probably the toughest game I've ever played. I had to fight until the end,\" he said later in a news conference. \"As soon as the question came on stage and then boom, the tears came. It was just a bit too much to speak on stage. \"It is the worst game I have played. I have never felt anything like that tonight.\" Admitting to nerves during the match, he told Sky Sports: \"Yes, probably the biggest time it's hit me. Coming into it I was fine, but as soon as [referee] George Noble said 'game on', I couldn't throw them.\" Littler started slowly against Meikle, who had two darts for the opening set, but he took the lead by twice hitting double 20. Meikle did not look overawed against his fellow Englishman and levelled, but Littler won the third set and exploded into life in the fourth. The tournament favourite hit four maximum 180s as he clinched three straight legs in 11, 10 and 11 darts for a record set average, and 100.85 overall. Meanwhile, two seeds crashed out on Saturday night \u2013 five-time world champion Raymond van Barneveld lost to Welshman Nick Kenny, while England's Ryan Joyce beat Danny Noppert. Australian Damon Heta was another to narrowly miss out on a nine-darter, just failing on double 12 when throwing for the match in a 3-1 win over Connor Scutt. Ninth seed Heta hit four 100-plus checkouts to come from a set down against Scutt in a match in which both men averaged more than 97.\n", + "\n", + "Littler was hugged by his parents after victory over Meikle\n", + "\n", + "Littler returned to Alexandra Palace to a boisterous reception from more than 3,000 spectators and delivered an astonishing display in the fourth set. He was on for a nine-darter after his opening two throws in both of the first two legs and completed the set in 32 darts - the minimum possible is 27. The teenager will next play after Christmas against European Championship winner Ritchie Edhouse, the 29th seed, or Ian White, and is seeded to meet Humphries in the semi-finals. Having entered last year's event ranked 164th, Littler is up to fourth in the world and will go to number two if he reaches the final again this time. He has won 10 titles in his debut professional year, including the Premier League and Grand Slam of Darts. 
After reaching the World Championship final as a debutant aged just 16, Littler's life has been transformed and interest in darts has rocketed. Google say he was the most searched-for athlete online in the UK during 2024. This Christmas, more than 100,000 children are expected to be opening Littler-branded magnetic dartboards as presents. His impact has helped double the number of junior academies and has prompted plans to expand the World Championship. Littler was named BBC Young Sports Personality of the Year on Tuesday and was runner-up to athlete Keely Hodgkinson for the main award.\n", + "\n", + "Nick Kenny will play world champion Luke Humphries in round three after Christmas\n", + "\n", + "Barneveld was shocked 3-1 by world number 76 Kenny, who was in tears after a famous victory. Kenny, 32, will face Humphries in round three after defeating the Dutchman, who won the BDO world title four times and the PDC crown in 2007. Van Barneveld, ranked 32nd, became the sixth seed to exit in the second round. His compatriot Noppert, the 13th seed, was stunned 3-1 by Joyce, who will face Ryan Searle or Matt Campbell next, with the winner of that tie potentially meeting Littler in the last 16. Elsewhere, 15th seed Chris Dobey booked his place in the third round with a 3-1 win over Alexander Merkx. Englishman Dobey concluded an afternoon session which started with a trio of 3-0 scorelines. Northern Ireland's Brendan Dolan beat Lok Yin Lee to set up a meeting with three-time champion Michael van Gerwen after Christmas. In the final two first-round matches of the 2025 competition, Wales' Rhys Griffin beat Karel Sedlacek of the Czech Republic before Asia number one Alexis Toylo cruised past Richard Veenstra.\n", + "--------------------------------------------------------------------------------\n", + "Distance: 0.4124, Text: The Littler effect - how darts hit the bullseye\n", + "\n", + "Teenager Luke Littler began his bid to win the 2025 PDC World Darts Championship with a second-round win against Ryan Meikle. Here we assess Littler's impact after a remarkable rise which saw him named BBC Young Sports Personality of the Year and runner-up in the main award to athlete Keely Hodgkinson.\n", + "\n", + "One year ago, he was barely a household name in his own home. Now he is a sporting phenomenon. After emerging from obscurity aged 16 to reach the World Championship final, the life of Luke Littler and the sport he loves has been transformed. Viewing figures, ticket sales and social media interest have rocketed. Darts has hit the bullseye. This Christmas more than 100,000 children are expected to be opening Littler-branded magnetic dartboards as presents. His impact has helped double the number of junior academies, prompted plans to expand the World Championship and generated interest in darts from Saudi Arabian backers.\n", + "\n", + "Just months after taking his GCSE exams and ranked 164th in the world, Littler beat former champions Raymond van Barneveld and Rob Cross en route to the PDC World Championship final in January, before his run ended with a 7-4 loss to Luke Humphries. With his nickname 'The Nuke' on his purple and yellow shirt and the Alexandra Palace crowd belting out his walk-on song, Pitbull's tune Greenlight, he became an instant hit. Electric on the stage, calm off it. The down-to-earth teenager celebrated with a kebab and computer games. \"We've been watching his progress since he was about seven. He was on our radar, but we never anticipated what would happen. 
The next thing we know 'Littlermania' is spreading everywhere,\" PDC president Barry Hearn told BBC Sport. A peak TV audience of 3.7 million people watched the final - easily Sky's biggest figure for a non-football sporting event. The teenager from Warrington in Cheshire was too young to legally drive or drink alcohol, but earned \u00a3200,000 for finishing second - part of \u00a31m prize money in his first year as a professional - and an invitation to the elite Premier League competition. He turned 17 later in January but was he too young for the demanding event over 17 Thursday nights in 17 locations? He ended up winning the whole thing, and hit a nine-dart finish against Humphries in the final. From Bahrain to Wolverhampton, Littler claimed 10 titles in 2024 and is now eyeing the World Championship.\n", + "\n", + "As he progressed at the Ally Pally, the Manchester United fan was sent a good luck message by the club's former midfielder and ex-England captain David Beckham. In 12 months, Littler's Instagram followers have risen from 4,000 to 1.3m. Commercial backers include a clothing range, cereal firm and train company and he will appear in a reboot of the TV darts show Bullseye. Google say he was the most searched-for athlete online in the UK during 2024. On the back of his success, Littler darts, boards, cabinets, shirts are being snapped up in big numbers. \"This Christmas the junior magnetic dartboard is selling out, we're talking over 100,000. They're 20 quid and a great introduction for young children,\" said Garry Plummer, the boss of sponsors Target Darts, who first signed a deal with Littler's family when he was aged 12. \"All the toy shops want it, they all want him - 17, clean, doesn't drink, wonderful.\"\n", + "\n", + "\n", + "... (output truncated for brevity)\n" + ] + } + ], + "source": [ + "\n", + "query = \"What were Luke Littler's key achievements and records in his recent PDC World Championship match?\"\n", + "\n", + "try:\n", + " # Perform the semantic search\n", + " start_time = time.time()\n", + " search_results = vector_store.similarity_search_with_score(query, k=10)\n", + " search_elapsed_time = time.time() - start_time\n", + "\n", + " logging.info(f\"Semantic search completed in {search_elapsed_time:.2f} seconds\")\n", + "\n", + " # Display search results\n", + " print(f\"\\nSemantic Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", + " print(\"-\" * 80)\n", + "\n", + " for doc, score in search_results:\n", + " print(f\"Distance: {score:.4f}, Text: {doc.page_content}\")\n", + " print(\"-\" * 80)\n", + "\n", + "except CouchbaseException as e:\n", + " raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n", + "except Exception as e:\n", + " raise RuntimeError(f\"Unexpected error: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note: To create a COMPOSITE index, use the code in the next cell.\n", + "Choose based on your specific use case and query patterns. For this tutorial's news search scenario, either index type would work, but BHIVE might be more efficient for pure semantic search across news articles.\n",
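+ "\n",
+ "Once a Composite index exists, its strength is combining a scalar predicate with vector ordering in a single SQL++ statement, so the scalar filter can prune documents before the vector comparison. The sketch below is purely illustrative and mirrors the assumptions of the earlier sketch: the keyspace, the `embedding`/`text` fields, and the scalar `type` field are hypothetical placeholders, as are the `cluster` and `embeddings` names.\n",
+ "\n",
+ "```python\n",
+ "# Hypothetical sketch: a filtered vector query of the shape a Composite index serves.\n",
+ "from couchbase.options import QueryOptions\n",
+ "\n",
+ "query_vector = embeddings.embed_query(\"dramatic darts matches\")\n",
+ "\n",
+ "filtered_sql = \"\"\"\n",
+ "SELECT d.text, APPROX_VECTOR_DISTANCE(d.embedding, $qvec) AS distance\n",
+ "FROM `vector-search-testing`.`shared`.`bedrock` AS d\n",
+ "WHERE d.type = 'sport'  -- scalar filter that a Composite index can apply first\n",
+ "ORDER BY distance\n",
+ "LIMIT 5\n",
+ "\"\"\"\n",
+ "\n",
+ "for row in cluster.query(filtered_sql, QueryOptions(named_parameters={\"qvec\": query_vector})):\n",
+ "    print(f\"{row['distance']:.4f}  {row['text'][:80]}\")\n",
+ "```"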
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_couchbase.vectorstores import IndexType\n", + "vector_store.create_index(index_type=IndexType.COMPOSITE, index_name=\"bedrock_composite_index\", index_description=\"IVF,SQ8\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Retrieval-Augmented Generation (RAG) with Couchbase and LangChain\n", + "Couchbase and LangChain can be seamlessly integrated to create RAG (Retrieval-Augmented Generation) chains, enhancing the process of generating contextually relevant responses. In this setup, Couchbase serves as the vector store, where embeddings of documents are stored. When a query is made, LangChain retrieves the most relevant documents from Couchbase by comparing the query\u2019s embedding with the stored document embeddings. These documents, which provide contextual information, are then passed to a generative language model within LangChain.\n", + "\n", + "The language model, equipped with the context from the retrieved documents, generates a response that is both informed and contextually accurate. This integration allows the RAG chain to leverage Couchbase\u2019s efficient storage and retrieval capabilities, while LangChain handles the generation of responses based on the context provided by the retrieved documents. Together, they create a powerful system that can deliver highly relevant and accurate answers by combining the strengths of both retrieval and generation." + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-02 12:25:08,521 - INFO - Successfully created RAG chain\n" + ] + } + ], + "source": [ + "# Create RAG prompt template\n", + "rag_prompt = ChatPromptTemplate.from_messages([\n", + " (\"system\", \"You are a helpful assistant that answers questions based on the provided context.\"),\n", + " (\"human\", \"Context: {context}\\n\\nQuestion: {question}\")\n", + "])\n", + "\n", + "# Create RAG chain\n", + "rag_chain = (\n", + " {\"context\": vector_store.as_retriever(), \"question\": RunnablePassthrough()}\n", + " | rag_prompt\n", + " | llm\n", + " | StrOutputParser()\n", + ")\n", + "logging.info(\"Successfully created RAG chain\")" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "RAG Response: \n", + "Luke Littler hit a tournament record 140.91 set average as he started his bid for the PDC World Championship title with a dramatic 3-1 win over Ryan Meikle. The 17-year-old made headlines around the world when he reached the tournament final in January, where he lost to Luke Humphries. Starting this campaign on Saturday, Littler was millimetres away from a nine-darter when he missed double 12 as he blew Meikle away in the fourth and final set of the second-round match. 
Littler was overcome with emotion at the end\n", + "RAG response generated in 0.41 seconds\n" + ] + } + ], + "source": [ + "start_time = time.time()\n", + "# Turn off excessive Logging \n", + "logging.basicConfig(level=logging.WARNING, format='%(asctime)s - %(levelname)s - %(message)s', force=True)\n", + "\n", + "try:\n", + " rag_response = rag_chain.invoke(query)\n", + " rag_elapsed_time = time.time() - start_time\n", + " print(f\"RAG Response: {rag_response}\")\n", + " print(f\"RAG response generated in {rag_elapsed_time:.2f} seconds\")\n", + "except InternalServerFailureException as e:\n", + " if \"query request rejected\" in str(e):\n", + " print(\"Error: Search request was rejected due to rate limiting. Please try again later.\")\n", + " else:\n", + " print(f\"Internal server error occurred: {str(e)}\")\n", + "except Exception as e:\n", + " print(f\"Unexpected error occurred: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Using Couchbase as a caching mechanism\n", + "Couchbase can be effectively used as a caching mechanism for RAG (Retrieval-Augmented Generation) responses by storing and retrieving precomputed results for specific queries. This approach enhances the system's efficiency and speed, particularly when dealing with repeated or similar queries. When a query is first processed, the RAG chain retrieves relevant documents, generates a response using the language model, and then stores this response in Couchbase, with the query serving as the key.\n", + "\n", + "For subsequent requests with the same query, the system checks Couchbase first. If a cached response is found, it is retrieved directly from Couchbase, bypassing the need to re-run the entire RAG process. This significantly reduces response time because the computationally expensive steps of document retrieval and response generation are skipped. Couchbase's role in this setup is to provide a fast and scalable storage solution for caching these responses, ensuring that frequently asked queries can be answered more quickly and efficiently.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Query 1: What happened in the match between Fullham and Liverpool?\n", + "Response: The match between Fullham and Liverpool ended in a 2-2 draw.\n", + "Time taken: 2.30 seconds\n", + "\n", + "Query 2: What were Luke Littler's key achievements and records in his recent PDC World Championship match?\n", + "Response: \n", + "Luke Littler hit a tournament record 140.91 set average as he started his bid for the PDC World Championship title with a dramatic 3-1 win over Ryan Meikle. The 17-year-old made headlines around the world when he reached the tournament final in January, where he lost to Luke Humphries. Starting this campaign on Saturday, Littler was millimetres away from a nine-darter when he missed double 12 as he blew Meikle away in the fourth and final set of the second-round match. 
Littler was overcome with emotion at the end\n", + "Time taken: 0.40 seconds\n", + "\n", + "Query 3: What happened in the match between Fullham and Liverpool?\n", + "Response: The match between Fullham and Liverpool ended in a 2-2 draw.\n", + "Time taken: 0.36 seconds\n" + ] + } + ], + "source": [ + "try:\n", + " queries = [\n", + " \"What happened in the match between Fullham and Liverpool?\",\n", + " \"What were Luke Littler's key achievements and records in his recent PDC World Championship match?\",\n", + " \"What happened in the match between Fullham and Liverpool?\", # Repeated query\n", + " ]\n", + "\n", + " for i, query in enumerate(queries, 1):\n", + " print(f\"\\nQuery {i}: {query}\")\n", + " start_time = time.time()\n", + "\n", + " response = rag_chain.invoke(query)\n", + " elapsed_time = time.time() - start_time\n", + " print(f\"Response: {response}\")\n", + " print(f\"Time taken: {elapsed_time:.2f} seconds\")\n", + "\n", + "except InternalServerFailureException as e:\n", + " if \"query request rejected\" in str(e):\n", + " print(\"Error: Search request was rejected due to rate limiting. Please try again later.\")\n", + " else:\n", + " print(f\"Internal server error occurred: {str(e)}\")\n", + "except Exception as e:\n", + " print(f\"Unexpected error occurred: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusion\n", + "By following these steps, you'll have a fully functional semantic search engine that leverages the strengths of Couchbase and AWS Bedrock. This guide is designed not just to show you how to build the system, but also to explain why each step is necessary, giving you a deeper understanding of the principles behind semantic search and of how Hyperscale and Composite Vector Indexes make querying your data more efficient, which can significantly improve your RAG performance. Whether you're a newcomer to software development or an experienced developer looking to expand your skills, this guide will provide you with the knowledge and tools you need to create a powerful, AI-driven search engine." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python (jupyter_env)", + "language": "python", + "name": "jupyter_env" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.16" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} \ No newline at end of file diff --git a/awsbedrock/gsi/frontmatter.md b/awsbedrock/query_based/frontmatter.md similarity index 100% rename from awsbedrock/gsi/frontmatter.md rename to awsbedrock/query_based/frontmatter.md diff --git a/awsbedrock/fts/.env.sample b/awsbedrock/search_based/.env.sample similarity index 100% rename from awsbedrock/fts/.env.sample rename to awsbedrock/search_based/.env.sample diff --git a/awsbedrock/fts/RAG_with_Couchbase_and_Bedrock.ipynb b/awsbedrock/search_based/RAG_with_Couchbase_and_Bedrock.ipynb similarity index 90% rename from awsbedrock/fts/RAG_with_Couchbase_and_Bedrock.ipynb rename to awsbedrock/search_based/RAG_with_Couchbase_and_Bedrock.ipynb index c48aad5b..c93a5969 100644 --- a/awsbedrock/fts/RAG_with_Couchbase_and_Bedrock.ipynb +++ b/awsbedrock/search_based/RAG_with_Couchbase_and_Bedrock.ipynb @@ -6,7 +6,7 @@ "source": [ "# Introduction\n", "\n", - "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database and [Amazon Bedrock](https://aws.amazon.com/bedrock/) as both the embedding and language model provider. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system using the FTS service from scratch. Alternatively if you want to perform semantic search using the GSI index, please take a look at [this.](https://developer.couchbase.com/tutorial-aws-bedrock-couchbase-rag-with-global-secondary-index/)" + "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database and [Amazon Bedrock](https://aws.amazon.com/bedrock/) as both the embedding and language model provider. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system using Couchbase Search Vector Index from scratch. For guidance on choosing the right vector index for your use case, see the [Couchbase documentation](https://docs.couchbase.com/server/current/vector-search/choose-the-right-vector-index.html). 
Alternatively if you want to perform semantic search using Couchbase Hyperscale or Composite Vector Indexes, please take a look at [this.](https://developer.couchbase.com/tutorial-aws-bedrock-couchbase-rag-with-hyperscale-or-composite-vector-index/)" ] }, { @@ -796,7 +796,7 @@ "--------------------------------------------------------------------------------\n", "Score: 0.6488, Text: Luke Littler has risen from 164th to fourth in the rankings in a year\n", "\n", - "A tearful Luke Littler hit a tournament record 140.91 set average as he started his bid for the PDC World Championship title with a dramatic 3-1 win over Ryan Meikle. The 17-year-old made headlines around the world when he reached the tournament final in January, where he lost to Luke Humphries. Starting this campaign on Saturday, Littler was millimetres away from a nine-darter when he missed double 12 as he blew Meikle away in the fourth and final set of the second-round match. Littler was overcome with emotion at the end, cutting short his on-stage interview. \"It was probably the toughest game I've ever played. I had to fight until the end,\" he said later in a news conference. \"As soon as the question came on stage and then boom, the tears came. It was just a bit too much to speak on stage. \"It is the worst game I have played. I have never felt anything like that tonight.\" Admitting to nerves during the match, he told Sky Sports: \"Yes, probably the biggest time it's hit me. Coming into it I was fine, but as soon as [referee] George Noble said 'game on', I couldn't throw them.\" Littler started slowly against Meikle, who had two darts for the opening set, but he took the lead by twice hitting double 20. Meikle did not look overawed against his fellow Englishman and levelled, but Littler won the third set and exploded into life in the fourth. The tournament favourite hit four maximum 180s as he clinched three straight legs in 11, 10 and 11 darts for a record set average, and 100.85 overall. Meanwhile, two seeds crashed out on Saturday night – five-time world champion Raymond van Barneveld lost to Welshman Nick Kenny, while England's Ryan Joyce beat Danny Noppert. Australian Damon Heta was another to narrowly miss out on a nine-darter, just failing on double 12 when throwing for the match in a 3-1 win over Connor Scutt. Ninth seed Heta hit four 100-plus checkouts to come from a set down against Scutt in a match in which both men averaged more than 97.\n", + "A tearful Luke Littler hit a tournament record 140.91 set average as he started his bid for the PDC World Championship title with a dramatic 3-1 win over Ryan Meikle. The 17-year-old made headlines around the world when he reached the tournament final in January, where he lost to Luke Humphries. Starting this campaign on Saturday, Littler was millimetres away from a nine-darter when he missed double 12 as he blew Meikle away in the fourth and final set of the second-round match. Littler was overcome with emotion at the end, cutting short his on-stage interview. \"It was probably the toughest game I've ever played. I had to fight until the end,\" he said later in a news conference. \"As soon as the question came on stage and then boom, the tears came. It was just a bit too much to speak on stage. \"It is the worst game I have played. I have never felt anything like that tonight.\" Admitting to nerves during the match, he told Sky Sports: \"Yes, probably the biggest time it's hit me. 
Coming into it I was fine, but as soon as [referee] George Noble said 'game on', I couldn't throw them.\" Littler started slowly against Meikle, who had two darts for the opening set, but he took the lead by twice hitting double 20. Meikle did not look overawed against his fellow Englishman and levelled, but Littler won the third set and exploded into life in the fourth. The tournament favourite hit four maximum 180s as he clinched three straight legs in 11, 10 and 11 darts for a record set average, and 100.85 overall. Meanwhile, two seeds crashed out on Saturday night \u2013 five-time world champion Raymond van Barneveld lost to Welshman Nick Kenny, while England's Ryan Joyce beat Danny Noppert. Australian Damon Heta was another to narrowly miss out on a nine-darter, just failing on double 12 when throwing for the match in a 3-1 win over Connor Scutt. Ninth seed Heta hit four 100-plus checkouts to come from a set down against Scutt in a match in which both men averaged more than 97.\n", "\n", "Littler was hugged by his parents after victory over Meikle\n", "\n", @@ -812,16 +812,16 @@ "\n", "One year ago, he was barely a household name in his own home. Now he is a sporting phenomenon. After emerging from obscurity aged 16 to reach the World Championship final, the life of Luke Littler and the sport he loves has been transformed. Viewing figures, ticket sales and social media interest have rocketed. Darts has hit the bullseye. This Christmas more than 100,000 children are expected to be opening Littler-branded magnetic dartboards as presents. His impact has helped double the number of junior academies, prompted plans to expand the World Championship and generated interest in darts from Saudi Arabian backers.\n", "\n", - "Just months after taking his GCSE exams and ranked 164th in the world, Littler beat former champions Raymond van Barneveld and Rob Cross en route to the PDC World Championship final in January, before his run ended with a 7-4 loss to Luke Humphries. With his nickname 'The Nuke' on his purple and yellow shirt and the Alexandra Palace crowd belting out his walk-on song, Pitbull's tune Greenlight, he became an instant hit. Electric on the stage, calm off it. The down-to-earth teenager celebrated with a kebab and computer games. \"We've been watching his progress since he was about seven. He was on our radar, but we never anticipated what would happen. The next thing we know 'Littlermania' is spreading everywhere,\" PDC president Barry Hearn told BBC Sport. A peak TV audience of 3.7 million people watched the final - easily Sky's biggest figure for a non-football sporting event. The teenager from Warrington in Cheshire was too young to legally drive or drink alcohol, but earned £200,000 for finishing second - part of £1m prize money in his first year as a professional - and an invitation to the elite Premier League competition. He turned 17 later in January but was he too young for the demanding event over 17 Thursday nights in 17 locations? He ended up winning the whole thing, and hit a nine-dart finish against Humphries in the final. From Bahrain to Wolverhampton, Littler claimed 10 titles in 2024 and is now eyeing the World Championship.\n", + "Just months after taking his GCSE exams and ranked 164th in the world, Littler beat former champions Raymond van Barneveld and Rob Cross en route to the PDC World Championship final in January, before his run ended with a 7-4 loss to Luke Humphries. 
With his nickname 'The Nuke' on his purple and yellow shirt and the Alexandra Palace crowd belting out his walk-on song, Pitbull's tune Greenlight, he became an instant hit. Electric on the stage, calm off it. The down-to-earth teenager celebrated with a kebab and computer games. \"We've been watching his progress since he was about seven. He was on our radar, but we never anticipated what would happen. The next thing we know 'Littlermania' is spreading everywhere,\" PDC president Barry Hearn told BBC Sport. A peak TV audience of 3.7 million people watched the final - easily Sky's biggest figure for a non-football sporting event. The teenager from Warrington in Cheshire was too young to legally drive or drink alcohol, but earned \u00a3200,000 for finishing second - part of \u00a31m prize money in his first year as a professional - and an invitation to the elite Premier League competition. He turned 17 later in January but was he too young for the demanding event over 17 Thursday nights in 17 locations? He ended up winning the whole thing, and hit a nine-dart finish against Humphries in the final. From Bahrain to Wolverhampton, Littler claimed 10 titles in 2024 and is now eyeing the World Championship.\n", "\n", "As he progressed at the Ally Pally, the Manchester United fan was sent a good luck message by the club's former midfielder and ex-England captain David Beckham. In 12 months, Littler's Instagram followers have risen from 4,000 to 1.3m. Commercial backers include a clothing range, cereal firm and train company and he will appear in a reboot of the TV darts show Bullseye. Google say he was the most searched-for athlete online in the UK during 2024. On the back of his success, Littler darts, boards, cabinets, shirts are being snapped up in big numbers. \"This Christmas the junior magnetic dartboard is selling out, we're talking over 100,000. They're 20 quid and a great introduction for young children,\" said Garry Plummer, the boss of sponsors Target Darts, who first signed a deal with Littler's family when he was aged 12. \"All the toy shops want it, they all want him - 17, clean, doesn't drink, wonderful.\"\n", "\n", "Littler beat Luke Humphries to win the Premier League title in May\n", "\n", - "The number of academies for children under the age of 16 has doubled in the last year, says Junior Darts Corporation chairman Steve Brown. There are 115 dedicated groups offering youngsters equipment, tournaments and a place to develop, with bases including Australia, Bulgaria, Greece, Norway, USA and Mongolia. \"We've seen so many inquiries from around the world, it's been such a boom. It took us 14 years to get 1,600 members and within 12 months we have over 3,000, and waiting lists,\" said Brown. \"When I played darts as a child, I was quite embarrassed to tell my friends what my hobby was. All these kids playing darts now are pretty popular at school. It's a bit rock 'n roll and recognised as a cool thing to do.\" Plans are being hatched to extend the World Championship by four days and increase the number of players from 96 to 128. That will boost the number of tickets available by 25,000 to 115,000 but Hearn reckons he could sell three times as many. He says Saudi Arabia wants to host a tournament, which is likely to happen if no-alcohol regulations are relaxed. \"They will change their rules in the next 12 months probably for certain areas having alcohol, and we'll take darts there and have a party in Saudi,\" he said. 
\"When I got involved in darts, the total prize money was something like £300,000 for the year. This year it will go to £20m. I expect in five years' time, we'll be playing for £40m.\"\n", + "The number of academies for children under the age of 16 has doubled in the last year, says Junior Darts Corporation chairman Steve Brown. There are 115 dedicated groups offering youngsters equipment, tournaments and a place to develop, with bases including Australia, Bulgaria, Greece, Norway, USA and Mongolia. \"We've seen so many inquiries from around the world, it's been such a boom. It took us 14 years to get 1,600 members and within 12 months we have over 3,000, and waiting lists,\" said Brown. \"When I played darts as a child, I was quite embarrassed to tell my friends what my hobby was. All these kids playing darts now are pretty popular at school. It's a bit rock 'n roll and recognised as a cool thing to do.\" Plans are being hatched to extend the World Championship by four days and increase the number of players from 96 to 128. That will boost the number of tickets available by 25,000 to 115,000 but Hearn reckons he could sell three times as many. He says Saudi Arabia wants to host a tournament, which is likely to happen if no-alcohol regulations are relaxed. \"They will change their rules in the next 12 months probably for certain areas having alcohol, and we'll take darts there and have a party in Saudi,\" he said. \"When I got involved in darts, the total prize money was something like \u00a3300,000 for the year. This year it will go to \u00a320m. I expect in five years' time, we'll be playing for \u00a340m.\"\n", "\n", - "Former electrician Cross charged to the 2018 world title in his first full season, while Adrian Lewis and Michael van Gerwen were multiple victors in their 20s and 16-time champion Phil ‘The Power’ Taylor is widely considered the greatest of all time. Littler is currently fourth in the world rankings, although that is based on a two-year Order of Merit. There have been suggestions from others the spotlight on the teenager means world number one Humphries, 29, has been denied the coverage he deserves, but no darts player has made a mark at such a young age as Littler. \"Luke Humphries is another fabulous player who is going to be around for years. Sport is a very brutal world. It is about winning and claiming the high ground. There will be envy around,\" Hearn said. \"Luke Littler is the next Tiger Woods for darts so they better get used to it, and the only way to compete is to get better.\" World number 38 Martin Lukeman was awestruck as he described facing a peak Littler after being crushed 16-3 in the Grand Slam final, with the teenager winning 15 consecutive legs. \"I can't compete with that, it was like Godly. He was relentless, he is so good it's ridiculous,\" he said. Lukeman can still see the benefits he brings, adding: \"What he's done for the sport is brilliant. If it wasn't for him, our wages wouldn't be going up. There's more sponsors, more money coming in, all good.\" Hearn feels future competition may come from players even younger than Littler. \"I watched a 10-year-old a few months ago who averaged 104.89 and checked out a 4-3 win with a 136 finish. They smell the money, the fame and put the hard work in,\" he said. How much better Littler can get is guesswork, although Plummer believes he wants to reach new heights. \"He never says 'how good was I?' 
But I think he wants to break records and beat Phil Taylor's 16 World Championships and 16 World Matchplay titles,\" he said. \"He's young enough to do it.\" A version of this article was originally published on 29 November.\n", - "• None Know a lot about Littler? Take our quiz\n", + "Former electrician Cross charged to the 2018 world title in his first full season, while Adrian Lewis and Michael van Gerwen were multiple victors in their 20s and 16-time champion Phil \u2018The Power\u2019 Taylor is widely considered the greatest of all time. Littler is currently fourth in the world rankings, although that is based on a two-year Order of Merit. There have been suggestions from others the spotlight on the teenager means world number one Humphries, 29, has been denied the coverage he deserves, but no darts player has made a mark at such a young age as Littler. \"Luke Humphries is another fabulous player who is going to be around for years. Sport is a very brutal world. It is about winning and claiming the high ground. There will be envy around,\" Hearn said. \"Luke Littler is the next Tiger Woods for darts so they better get used to it, and the only way to compete is to get better.\" World number 38 Martin Lukeman was awestruck as he described facing a peak Littler after being crushed 16-3 in the Grand Slam final, with the teenager winning 15 consecutive legs. \"I can't compete with that, it was like Godly. He was relentless, he is so good it's ridiculous,\" he said. Lukeman can still see the benefits he brings, adding: \"What he's done for the sport is brilliant. If it wasn't for him, our wages wouldn't be going up. There's more sponsors, more money coming in, all good.\" Hearn feels future competition may come from players even younger than Littler. \"I watched a 10-year-old a few months ago who averaged 104.89 and checked out a 4-3 win with a 136 finish. They smell the money, the fame and put the hard work in,\" he said. How much better Littler can get is guesswork, although Plummer believes he wants to reach new heights. \"He never says 'how good was I?' But I think he wants to break records and beat Phil Taylor's 16 World Championships and 16 World Matchplay titles,\" he said. \"He's young enough to do it.\" A version of this article was originally published on 29 November.\n", + "\u2022 None Know a lot about Littler? Take our quiz\n", "--------------------------------------------------------------------------------\n", "Score: 0.5683, Text: Luke Littler is one of six contenders for the 2024 BBC Sports Personality of the Year award.\n", "\n", @@ -833,11 +833,11 @@ "\n", "Darts player Luke Littler has been named BBC Young Sports Personality of the Year 2024. The 17-year-old has enjoyed a breakthrough year after finishing runner-up at the 2024 PDC World Darts Championship in January. The Englishman, who has won 10 senior titles on the Professional Darts Corporation tour this year, is the first darts player to claim the award. \"It shows how well I have done this year, not only for myself, but I have changed the sport of darts,\" Littler told BBC One. \"I know the amount of academies that have been brought up in different locations, tickets selling out at Ally Pally in hours and the Premier League selling out - it just shows how much I have changed it.\"\n", "\n", - "He was presented with the trophy by Harry Aikines-Aryeetey - a former sprinter who won the award in 2005 - and ex-rugby union player Jodie Ounsley, both of whom are stars of the BBC television show Gladiators. 
Skateboarder Sky Brown, 16, and Para-swimmer William Ellard, 18, were also shortlisted for the award. Littler became a household name at the start of 2024 by reaching the World Championship final aged just 16 years and 347 days. That achievement was just the start of a trophy-laden year, with Littler winning the Premier League Darts, Grand Slam and World Series of Darts Finals among his haul of titles. Littler has gone from 164th to fourth in the world rankings and earned more than £1m in prize money in 2024. The judging panel for Young Sports Personality of the Year included Paralympic gold medallist Sammi Kinghorn, Olympic silver medal-winning BMX freestyler Keiran Reilly, television presenter Qasa Alom and Radio 1 DJ Jeremiah Asiamah, as well as representatives from the Youth Sport Trust, Blue Peter and BBC Sport.\n", + "He was presented with the trophy by Harry Aikines-Aryeetey - a former sprinter who won the award in 2005 - and ex-rugby union player Jodie Ounsley, both of whom are stars of the BBC television show Gladiators. Skateboarder Sky Brown, 16, and Para-swimmer William Ellard, 18, were also shortlisted for the award. Littler became a household name at the start of 2024 by reaching the World Championship final aged just 16 years and 347 days. That achievement was just the start of a trophy-laden year, with Littler winning the Premier League Darts, Grand Slam and World Series of Darts Finals among his haul of titles. Littler has gone from 164th to fourth in the world rankings and earned more than \u00a31m in prize money in 2024. The judging panel for Young Sports Personality of the Year included Paralympic gold medallist Sammi Kinghorn, Olympic silver medal-winning BMX freestyler Keiran Reilly, television presenter Qasa Alom and Radio 1 DJ Jeremiah Asiamah, as well as representatives from the Youth Sport Trust, Blue Peter and BBC Sport.\n", "--------------------------------------------------------------------------------\n", "Score: 0.5177, Text: Wright is the 17th seed at the World Championship\n", "\n", - "Two-time champion Peter Wright won his opening game at the PDC World Championship, while Ryan Meikle edged out Fallon Sherrock to set up a match against teenage prodigy Luke Littler. Scotland's Wright, the 2020 and 2022 winner, has been out of form this year, but overcame Wesley Plaisier 3-1 in the second round at Alexandra Palace in London. \"It was this crowd that got me through, they wanted me to win. I thank you all,\" said Wright. Meikle came from a set down to claim a 3-2 victory in his first-round match against Sherrock, who was the first woman to win matches at the tournament five years ago. The 28-year-old will now play on Saturday against Littler, who was named BBC Young Sports Personality of the Year and runner-up in the main award to athlete Keely Hodgkinson on Tuesday night. Littler, 17, will be competing on the Ally Pally stage for the first time since his rise to stardom when finishing runner-up in January's world final to Luke Humphries. Earlier on Tuesday, World Grand Prix champion Mike de Decker – the 24th seed - suffered a surprise defeat to Luke Woodhouse in the second round. He is the second seed to exit following 16th seed James Wade's defeat on Monday to Jermaine Wattimena, who meets Wright in round three. 
Kevin Doets recovered from a set down to win 3-1 against Noa-Lynn van Leuven, who was making history as the first transgender woman to compete in the tournament.\n", + "Two-time champion Peter Wright won his opening game at the PDC World Championship, while Ryan Meikle edged out Fallon Sherrock to set up a match against teenage prodigy Luke Littler. Scotland's Wright, the 2020 and 2022 winner, has been out of form this year, but overcame Wesley Plaisier 3-1 in the second round at Alexandra Palace in London. \"It was this crowd that got me through, they wanted me to win. I thank you all,\" said Wright. Meikle came from a set down to claim a 3-2 victory in his first-round match against Sherrock, who was the first woman to win matches at the tournament five years ago. The 28-year-old will now play on Saturday against Littler, who was named BBC Young Sports Personality of the Year and runner-up in the main award to athlete Keely Hodgkinson on Tuesday night. Littler, 17, will be competing on the Ally Pally stage for the first time since his rise to stardom when finishing runner-up in January's world final to Luke Humphries. Earlier on Tuesday, World Grand Prix champion Mike de Decker \u2013 the 24th seed - suffered a surprise defeat to Luke Woodhouse in the second round. He is the second seed to exit following 16th seed James Wade's defeat on Monday to Jermaine Wattimena, who meets Wright in round three. Kevin Doets recovered from a set down to win 3-1 against Noa-Lynn van Leuven, who was making history as the first transgender woman to compete in the tournament.\n", "\n", "Sherrock drew level at 2-2 but lost the final set to Meikle\n", "\n", @@ -849,13 +849,13 @@ "\n", "Dart sensation Luke Littler has said he \"can't quite believe\" he has trended higher than the prime minister and the King in Google's most searched for lists in 2024. The 17-year-old star was an unknown when he came to prominence as the youngest player to reach the World Darts Championship final in January. He has subsequently risen to fourth in the world rankings and his fame has led him to lie behind only Catherine, Princess of Wales, and US president elect Donald Trump as Google's most searched for person in the UK in 2024. He has also taken the top slot as the most searched for athlete on the search engine, which he said was \"a proud moment\" in what had been \"an amazing year\".\n", "\n", - "A peak TV audience of 3.7m watched the then-16-year-old's appearance in the final. He lost by seven sets to four to world number one Luke Humphries, but earned £200,000 as the runner-up. He beat Michael van Gerwen later in the same month to win the Bahrain Darts Masters and secure his first Professional Darts Corporation (PDC) senior title. The event also saw he become the youngest person to make a nine-dart finish on live television, which is considered one of the sport's highest achievements and sees a player score the required 501 in the lowest number of darts possible.\n", + "A peak TV audience of 3.7m watched the then-16-year-old's appearance in the final. He lost by seven sets to four to world number one Luke Humphries, but earned \u00a3200,000 as the runner-up. He beat Michael van Gerwen later in the same month to win the Bahrain Darts Masters and secure his first Professional Darts Corporation (PDC) senior title. 
The event also saw he become the youngest person to make a nine-dart finish on live television, which is considered one of the sport's highest achievements and sees a player score the required 501 in the lowest number of darts possible.\n", "\n", "Luke Littler said the award was a \"huge honour\"\n", "\n", "In May, Littler won the 2024 Premier League Darts, his first major PDC title, and in November, Littler won the Grand Slam of Darts for his first major ranking title. The corporation's statistics showed that after each win, there was increased interest in Littler online, with even his first round exit on his World Grand Prix debut in October appearing in searches.\n", "\n", - "Littler, who plays under the nickname of The Nuke, said it had been \"an amazing year for me personally, and for the sport of darts as a whole\". \"To be recognised in two Year in Search lists is a huge honour,\" he said. \"I can't quite believe I'm trending higher than both the prime minister and the King in the 'People' category—and in a year of such great sporting achievements, it's a proud moment for me to be the top trending athlete in 2024.\"\n", + "Littler, who plays under the nickname of The Nuke, said it had been \"an amazing year for me personally, and for the sport of darts as a whole\". \"To be recognised in two Year in Search lists is a huge honour,\" he said. \"I can't quite believe I'm trending higher than both the prime minister and the King in the 'People' category\u2014and in a year of such great sporting achievements, it's a proud moment for me to be the top trending athlete in 2024.\"\n", "\n", "Google's most searched people in UK in 2024 Google's most searched for athletes in UK in 2024\n", "\n", @@ -887,15 +887,15 @@ "--------------------------------------------------------------------------------\n", "Score: 0.2988, Text: Christian Kist was sealing his first televised nine-darter\n", "\n", - "Christian Kist hit a nine-darter but lost his PDC World Championship first-round match to Madars Razma. The Dutchman became the first player to seal a perfect leg in the tournament since Michael Smith did so on the way to beating Michael van Gerwen in the 2023 final. Kist, the 2012 BDO world champion at Lakeside, collects £60,000 for the feat, with the same amount being awarded by sponsors to a charity and to one spectator inside Alexandra Palace in London. The 38-year-old's brilliant finish sealed the opening set, but his Latvian opponent bounced back to win 3-1. Darts is one of the few sports that can measure perfection; snooker has the 147 maximum break, golf has the hole-in-one, darts has the nine-dart finish. Kist scored two maximum 180s to leave a 141 checkout which he completed with a double 12, to the delight of more than 3,000 spectators. The English 12th seed, who has been troubled by wrist and back injuries, could next play Andrew Gilding in the third round - which begins on 27 December - should Gilding beat the winner of Martin Lukeman's match against qualifier Nitin Kumar. Aspinall faces a tough task to reach the last four again, with 2018 champion Rob Cross and 2024 runner-up Luke Littler both in his side of the draw.\n", + "Christian Kist hit a nine-darter but lost his PDC World Championship first-round match to Madars Razma. The Dutchman became the first player to seal a perfect leg in the tournament since Michael Smith did so on the way to beating Michael van Gerwen in the 2023 final. 
Kist, the 2012 BDO world champion at Lakeside, collects \u00a360,000 for the feat, with the same amount being awarded by sponsors to a charity and to one spectator inside Alexandra Palace in London. The 38-year-old's brilliant finish sealed the opening set, but his Latvian opponent bounced back to win 3-1. Darts is one of the few sports that can measure perfection; snooker has the 147 maximum break, golf has the hole-in-one, darts has the nine-dart finish. Kist scored two maximum 180s to leave a 141 checkout which he completed with a double 12, to the delight of more than 3,000 spectators. The English 12th seed, who has been troubled by wrist and back injuries, could next play Andrew Gilding in the third round - which begins on 27 December - should Gilding beat the winner of Martin Lukeman's match against qualifier Nitin Kumar. Aspinall faces a tough task to reach the last four again, with 2018 champion Rob Cross and 2024 runner-up Luke Littler both in his side of the draw.\n", "\n", - "Kist - who was knocked out of last year's tournament by teenager Littler - will still earn a bigger cheque than he would have got for a routine run to the quarter-finals. His nine-darter was the 15th in the history of the championship and first since the greatest leg in darts history when Smith struck, moments after Van Gerwen just missed his attempt. Darts fan Kris, a railway worker from Sutton in south London, was the random spectator picked out to receive £60,000, with Prostate Cancer UK getting the same sum from tournament sponsors Paddy Power. \"I'm speechless to be honest. I didn't expect it to happen to me,\" Kris said. \"This was a birthday present so it makes it even better. My grandad got me tickets. It was just a normal day - I came here after work.\" Kist said: \"Hitting the double 12 felt amazing. It was a lovely moment for everyone and I hope Kris enjoys the money. Maybe I will go on vacation next month.\" Earlier, Jim Williams was favourite against Paolo Nebrida but lost 3-2 in an epic lasting more than an hour. The Filipino took a surprise 2-1 lead and Williams only went ahead for the first time in the opening leg of the deciding set. The Welshman looked on course for victory but missed five match darts. UK Open semi-finalist Ricky Evans set up a second-round match against Dave Chisnall, checking out on 109 to edge past Gordon Mathers 3-2.\n", + "Kist - who was knocked out of last year's tournament by teenager Littler - will still earn a bigger cheque than he would have got for a routine run to the quarter-finals. His nine-darter was the 15th in the history of the championship and first since the greatest leg in darts history when Smith struck, moments after Van Gerwen just missed his attempt. Darts fan Kris, a railway worker from Sutton in south London, was the random spectator picked out to receive \u00a360,000, with Prostate Cancer UK getting the same sum from tournament sponsors Paddy Power. \"I'm speechless to be honest. I didn't expect it to happen to me,\" Kris said. \"This was a birthday present so it makes it even better. My grandad got me tickets. It was just a normal day - I came here after work.\" Kist said: \"Hitting the double 12 felt amazing. It was a lovely moment for everyone and I hope Kris enjoys the money. Maybe I will go on vacation next month.\" Earlier, Jim Williams was favourite against Paolo Nebrida but lost 3-2 in an epic lasting more than an hour. 
The Filipino took a surprise 2-1 lead and Williams only went ahead for the first time in the opening leg of the deciding set. The Welshman looked on course for victory but missed five match darts. UK Open semi-finalist Ricky Evans set up a second-round match against Dave Chisnall, checking out on 109 to edge past Gordon Mathers 3-2.\n", "--------------------------------------------------------------------------------\n", "Score: 0.2941, Text: Gary Anderson was the fifth seed to be beaten on Sunday\n", "\n", "Two-time champion Gary Anderson has been dumped out of the PDC World Championship on his 54th birthday by Jeffrey de Graaf. The Scot, winner in 2015 and 2016, lost 3-0 to the Swede in a second-round shock at Alexandra Palace in London. \"Gary didn't really show up as he usually does. I'm very happy with the win,\" said De Graaf, 34, who had a 75% checkout success and began with an 11-dart finish. \"It's a dream come true for me. He's been my idol since I was 14 years old.\" Anderson, ranked 14th, became the 11th seed to be knocked out from the 24 who have played so far, and the fifth to fall on Sunday.\n", "\n", - "He came into the competition with the year's highest overall three-dart average of 99.66 but hit just three of his 20 checkout attempts to lose his opening match of the tournament for the first time. De Graaf will now meet Filipino qualifier Paolo Nebrida after he stunned England's Ross Smith, the 19th seed, in straight sets. Ritchie Edhouse, Dirk van Duijvenbode and Martin Schindler were the other seeds beaten on day eight. England's Callan Rydz, who hit a record first-round average of 107.06 on Thursday, followed up with a 3-0 win over 23rd seed Schindler on Sunday. The German missed double 12 for a nine-darter in the first set – the third player to do so in 24 hours after Luke Littler and Damon Heta – and ended up losing the leg. Rydz next meets Belgian Dimitri van den Bergh, who hit six 180s and averaged 96 in a 3-0 win over Irishman Dylan Slevin.\n", + "He came into the competition with the year's highest overall three-dart average of 99.66 but hit just three of his 20 checkout attempts to lose his opening match of the tournament for the first time. De Graaf will now meet Filipino qualifier Paolo Nebrida after he stunned England's Ross Smith, the 19th seed, in straight sets. Ritchie Edhouse, Dirk van Duijvenbode and Martin Schindler were the other seeds beaten on day eight. England's Callan Rydz, who hit a record first-round average of 107.06 on Thursday, followed up with a 3-0 win over 23rd seed Schindler on Sunday. The German missed double 12 for a nine-darter in the first set \u2013 the third player to do so in 24 hours after Luke Littler and Damon Heta \u2013 and ended up losing the leg. Rydz next meets Belgian Dimitri van den Bergh, who hit six 180s and averaged 96 in a 3-0 win over Irishman Dylan Slevin.\n", "\n", "England's Joe Cullen abruptly left his post-match news conference and accused the media of not showing him respect after his 3-0 win over Dutchman Wessel Nijman. Nijman, who has previously served a ban for breaching betting and anti-corruption rules, had been billed as favourite beforehand to beat 23rd seed Cullen. \"Honestly, the media attention that Wessel's got, again this is not a reflection on him,\" Cullen said. \"He seems like a fantastic kid, he's been caught up in a few things beforehand, but he's served his time and he's held his hands up, like a lot haven't. 
\"I think the way I've been treated probably with the media and things like that - I know you guys have no control over the bookies - I've been shown no respect, so I won't be showing any respect to any of you guys tonight. \"I'm going to go home. Cheers.\" Ian 'Diamond' White beat European champion and 29th seed Edhouse 3-1 and will face teenage star Littler in the next round. White, born in the same Cheshire town as the 17-year-old, acknowledged he would need to up his game in round three. Asked if he knew who was waiting for him, White joked: \"Yeah, Runcorn's number two. I'm from Runcorn and I'm number one.\" Ryan Searle started Sunday afternoon's action off with a 10-dart leg and went on to beat Matt Campbell 3-0, while Latvian Madars Razma defeated 25th seed Van Duijvenbode 3-1. Seventh seed Jonny Clayton and 2018 champion Rob Cross are among the players in action on Monday as the second round concludes. The third round will start on Friday after a three-day break for Christmas.\n", "--------------------------------------------------------------------------------\n" @@ -932,9 +932,9 @@ "metadata": {}, "source": [ "# Retrieval-Augmented Generation (RAG) with Couchbase and LangChain\n", - "Couchbase and LangChain can be seamlessly integrated to create RAG (Retrieval-Augmented Generation) chains, enhancing the process of generating contextually relevant responses. In this setup, Couchbase serves as the vector store, where embeddings of documents are stored. When a query is made, LangChain retrieves the most relevant documents from Couchbase by comparing the query’s embedding with the stored document embeddings. These documents, which provide contextual information, are then passed to a generative language model within LangChain.\n", + "Couchbase and LangChain can be seamlessly integrated to create RAG (Retrieval-Augmented Generation) chains, enhancing the process of generating contextually relevant responses. In this setup, Couchbase serves as the vector store, where embeddings of documents are stored. When a query is made, LangChain retrieves the most relevant documents from Couchbase by comparing the query\u2019s embedding with the stored document embeddings. These documents, which provide contextual information, are then passed to a generative language model within LangChain.\n", "\n", - "The language model, equipped with the context from the retrieved documents, generates a response that is both informed and contextually accurate. This integration allows the RAG chain to leverage Couchbase’s efficient storage and retrieval capabilities, while LangChain handles the generation of responses based on the context provided by the retrieved documents. Together, they create a powerful system that can deliver highly relevant and accurate answers by combining the strengths of both retrieval and generation." + "The language model, equipped with the context from the retrieved documents, generates a response that is both informed and contextually accurate. This integration allows the RAG chain to leverage Couchbase\u2019s efficient storage and retrieval capabilities, while LangChain handles the generation of responses based on the context provided by the retrieved documents. Together, they create a powerful system that can deliver highly relevant and accurate answers by combining the strengths of both retrieval and generation." 
] }, { @@ -1092,4 +1092,4 @@ }, "nbformat": 4, "nbformat_minor": 4 -} +} \ No newline at end of file diff --git a/awsbedrock/fts/frontmatter.md b/awsbedrock/search_based/frontmatter.md similarity index 100% rename from awsbedrock/fts/frontmatter.md rename to awsbedrock/search_based/frontmatter.md diff --git a/azure/fts/RAG_with_Couchbase_and_AzureOpenAI.ipynb b/azure/fts/RAG_with_Couchbase_and_AzureOpenAI.ipynb deleted file mode 100644 index 6a9937fd..00000000 --- a/azure/fts/RAG_with_Couchbase_and_AzureOpenAI.ipynb +++ /dev/null @@ -1,947 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "kNdImxzypDlm" - }, - "source": [ - "# Introduction\n", - "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database and [AzureOpenAI](https://azure.microsoft.com/) as the AI-powered embedding and language model provider. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system using the FTS service from scratch. Alternatively, if you want to perform semantic search using the GSI index, please take a look at [this.](https://developer.couchbase.com/tutorial-azure-openai-couchbase-rag-with-global-secondary-index/)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# How to run this tutorial\n", - "\n", - "This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/azure/fts/RAG_with_Couchbase_and_AzureOpenAI.ipynb).\n", - "\n", - "You can either download the notebook file and run it on [Google Colab](https://colab.research.google.com/) or run it on your system by setting up the Python environment." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Before you start\n", - "\n", - "## Get Credentials for Azure OpenAI\n", - "\n", - "Please follow the [instructions](https://learn.microsoft.com/en-us/azure/ai-services/openai/reference) to generate the Azure OpenAI credentials.\n", - "\n", - "## Create and Deploy Your Free Tier Operational cluster on Capella\n", - "\n", - "To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. This account provides you with an environment where you can explore and learn about Capella with no time constraint.\n", - "\n", - "To know more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n", - "\n", - "### Couchbase Capella Configuration\n", - "\n", - "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met.\n", - "\n", - "* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the travel-sample bucket (Read and Write) used in the application.\n", - "* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running."
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NH2o6pqa69oG" - }, - "source": [ - "# Setting the Stage: Installing Necessary Libraries\n", - "To build our semantic search engine, we need a robust set of tools. The libraries we install handle everything from connecting to databases to performing complex machine learning tasks. Each library has a specific role: Couchbase libraries manage database operations, LangChain handles AI model integrations, and AzureOpenAI provides advanced AI models for generating embeddings and understanding natural language. By setting up these libraries, we ensure our environment is equipped to handle the data-intensive and computationally complex tasks required for semantic search." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": { - "id": "DYhPj0Ta8l_A" - }, - "outputs": [], - "source": [ - "!pip install datasets==3.5.0 langchain-couchbase==0.3.0 langchain-openai==0.3.13" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "1pp7GtNg8mB9" - }, - "source": [ - "# Importing Necessary Libraries\n", - "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, embedding generation, and dataset loading. These libraries provide essential functions for working with data, managing database connections, and processing machine learning models." - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": { - "id": "8GzS6tfL8mFP" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/Users/aayush.tyagi/Documents/AI/vector-search-cookbook/.venv/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", - " from .autonotebook import tqdm as notebook_tqdm\n" - ] - } - ], - "source": [ - "import getpass\n", - "import json\n", - "import logging\n", - "import sys\n", - "import time\n", - "from datetime import timedelta\n", - "from uuid import uuid4\n", - "\n", - "from couchbase.auth import PasswordAuthenticator\n", - "from couchbase.cluster import Cluster\n", - "from couchbase.exceptions import (\n", - " CouchbaseException,\n", - " InternalServerFailureException,\n", - " QueryIndexAlreadyExistsException,\n", - ")\n", - "from couchbase.management.search import SearchIndex\n", - "from couchbase.options import ClusterOptions\n", - "from datasets import load_dataset\n", - "from langchain_core.documents import Document\n", - "from langchain_core.globals import set_llm_cache\n", - "from langchain_core.output_parsers import StrOutputParser\n", - "from langchain_core.prompts import ChatPromptTemplate\n", - "from langchain_core.runnables import RunnablePassthrough\n", - "from langchain_couchbase.cache import CouchbaseCache\n", - "from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore\n", - "from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings\n", - "from tqdm import tqdm" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pBnMp5vb8mIb" - }, - "source": [ - "# Setup Logging\n", - "Logging is configured to track the progress of the script and capture any errors or warnings. This is crucial for debugging and understanding the flow of execution. 
The logging output includes timestamps, log levels (e.g., INFO, ERROR), and messages that describe what is happening in the script.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": { - "id": "Yv8kWcuf8mLx" - }, - "outputs": [], - "source": [ - "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', force=True)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "K9G5a0en8mPA" - }, - "source": [ - "# Loading Sensitive Information\n", - "In this section, we prompt the user to input essential configuration settings needed. These settings include sensitive information like API keys, database credentials, and specific configuration names. Instead of hardcoding these details into the script, we request the user to provide them at runtime, ensuring flexibility and security.\n", - "\n", - "The script also validates that all required inputs are provided, raising an error if any crucial information is missing. This approach ensures that your integration is both secure and correctly configured without hardcoding sensitive information, enhancing the overall security and maintainability of your code." - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": { - "id": "PFGyHll18mSe" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Enter your Azure OpenAI Key: ··········\n", - "Enter your Azure OpenAI Endpoint: https://first-couchbase-instance.openai.azure.com/\n", - "Enter your Azure OpenAI Embedding Deployment: text-embedding-ada-002\n", - "Enter your Azure OpenAI Chat Deployment: gpt-4o\n", - "Enter your Couchbase host (default: couchbase://localhost): couchbases://cb.hlcup4o4jmjr55yf.cloud.couchbase.com\n", - "Enter your Couchbase username (default: Administrator): vector-search-rag-demos\n", - "Enter your Couchbase password (default: password): ··········\n", - "Enter your Couchbase bucket name (default: vector-search-testing): \n", - "Enter your index name (default: vector_search_azure): \n", - "Enter your scope name (default: shared): \n", - "Enter your collection name (default: azure): \n", - "Enter your cache collection name (default: cache): \n" - ] - } - ], - "source": [ - "AZURE_OPENAI_KEY = getpass.getpass('Enter your Azure OpenAI Key: ')\n", - "AZURE_OPENAI_ENDPOINT = input('Enter your Azure OpenAI Endpoint: ')\n", - "AZURE_OPENAI_EMBEDDING_DEPLOYMENT = input('Enter your Azure OpenAI Embedding Deployment: ')\n", - "AZURE_OPENAI_CHAT_DEPLOYMENT = input('Enter your Azure OpenAI Chat Deployment: ')\n", - "\n", - "CB_HOST = input('Enter your Couchbase host (default: couchbase://localhost): ') or 'couchbase://localhost'\n", - "CB_USERNAME = input('Enter your Couchbase username (default: Administrator): ') or 'Administrator'\n", - "CB_PASSWORD = getpass.getpass('Enter your Couchbase password (default: password): ') or 'password'\n", - "CB_BUCKET_NAME = input('Enter your Couchbase bucket name (default: vector-search-testing): ') or 'vector-search-testing'\n", - "INDEX_NAME = input('Enter your index name (default: vector_search_azure): ') or 'vector_search_azure'\n", - "SCOPE_NAME = input('Enter your scope name (default: shared): ') or 'shared'\n", - "COLLECTION_NAME = input('Enter your collection name (default: azure): ') or 'azure'\n", - "CACHE_COLLECTION = input('Enter your cache collection name (default: cache): ') or 'cache'\n", - "\n", - "# Check if the variables are correctly loaded\n", - "if not all([AZURE_OPENAI_KEY, AZURE_OPENAI_ENDPOINT, 
AZURE_OPENAI_EMBEDDING_DEPLOYMENT, AZURE_OPENAI_CHAT_DEPLOYMENT]):\n", - " raise ValueError(\"Missing required Azure OpenAI variables\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "qtGrYzUY8mV3" - }, - "source": [ - "# Connecting to the Couchbase Cluster\n", - "Connecting to a Couchbase cluster is the foundation of our project. Couchbase will serve as our primary data store, handling all the storage and retrieval operations required for our semantic search engine. By establishing this connection, we enable our application to interact with the database, allowing us to perform operations such as storing embeddings, querying data, and managing collections. This connection is the gateway through which all data will flow, so ensuring it's set up correctly is paramount.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": { - "id": "Zb3kK-7W8mZK" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2024-09-06 07:29:16,632 - INFO - Successfully connected to Couchbase\n" - ] - } - ], - "source": [ - "try:\n", - " auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)\n", - " options = ClusterOptions(auth)\n", - " cluster = Cluster(CB_HOST, options)\n", - " cluster.wait_until_ready(timedelta(seconds=5))\n", - " logging.info(\"Successfully connected to Couchbase\")\n", - "except Exception as e:\n", - " raise ConnectionError(f\"Failed to connect to Couchbase: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "C_Gpy32N8mcZ" - }, - "source": [ - "# Setting Up Collections in Couchbase\n", - "In Couchbase, data is organized in buckets, which can be further divided into scopes and collections. Think of a collection as a table in a traditional SQL database. Before we can store any data, we need to ensure that our collections exist. If they don't, we must create them. This step is important because it prepares the database to handle the specific types of data our application will process. By setting up collections, we define the structure of our data storage, which is essential for efficient data retrieval and management.\n", - "\n", - "Moreover, setting up collections allows us to isolate different types of data within the same bucket, providing a more organized and scalable data structure. This is particularly useful when dealing with large datasets, as it ensures that related data is stored together, making it easier to manage and query." 
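As a quick illustration of the bucket → scope → collection hierarchy described above, here is a sketch using this notebook's default names (separate from the full setup cell that follows):

```python
# Sketch only: addressing the hierarchy with the Python SDK, using the
# notebook's default names. `cluster` is the connected Cluster from above.
bucket = cluster.bucket("vector-search-testing")  # top-level data container
scope = bucket.scope("shared")                    # namespace inside the bucket
collection = scope.collection("azure")            # "table"-like unit holding documents

# The same keyspace written out in a SQL++ query:
rows = cluster.query(
    "SELECT META().id FROM `vector-search-testing`.`shared`.`azure` LIMIT 1"
)
```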
- ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": { - "id": "ACZcwUnG8mf2" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2024-09-06 07:29:17,029 - INFO - Collection 'azure' already exists. Skipping creation.\n", - "2024-09-06 07:29:17,095 - INFO - Primary index present or created successfully.\n", - "2024-09-06 07:29:17,775 - INFO - All documents cleared from the collection.\n", - "2024-09-06 07:29:17,841 - INFO - Collection 'cache' already exists. Skipping creation.\n", - "2024-09-06 07:29:17,907 - INFO - Primary index present or created successfully.\n", - "2024-09-06 07:29:17,973 - INFO - All documents cleared from the collection.\n" - ] - }, - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "def setup_collection(cluster, bucket_name, scope_name, collection_name):\n", - " try:\n", - " bucket = cluster.bucket(bucket_name)\n", - " bucket_manager = bucket.collections()\n", - "\n", - " # Check if collection exists, create if it doesn't\n", - " collections = bucket_manager.get_all_scopes()\n", - " collection_exists = any(\n", - " scope.name == scope_name and collection_name in [col.name for col in scope.collections]\n", - " for scope in collections\n", - " )\n", - "\n", - " if not collection_exists:\n", - " logging.info(f\"Collection '{collection_name}' does not exist. Creating it...\")\n", - " bucket_manager.create_collection(scope_name, collection_name)\n", - " logging.info(f\"Collection '{collection_name}' created successfully.\")\n", - " else:\n", - " logging.info(f\"Collection '{collection_name}' already exists. Skipping creation.\")\n", - "\n", - " collection = bucket.scope(scope_name).collection(collection_name)\n", - " time.sleep(2) # Give the collection time to be ready for queries\n", - "\n", - " # Ensure primary index exists\n", - " try:\n", - " cluster.query(f\"CREATE PRIMARY INDEX IF NOT EXISTS ON `{bucket_name}`.`{scope_name}`.`{collection_name}`\").execute()\n", - " logging.info(\"Primary index present or created successfully.\")\n", - " except Exception as e:\n", - " logging.warning(f\"Error creating primary index: {str(e)}\")\n", - "\n", - " # Clear all documents in the collection\n", - " try:\n", - " query = f\"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`\"\n", - " cluster.query(query).execute()\n", - " logging.info(\"All documents cleared from the collection.\")\n", - " except Exception as e:\n", - " logging.warning(f\"Error while clearing documents: {str(e)}. The collection might be empty.\")\n", - "\n", - " return collection\n", - " except Exception as e:\n", - " raise RuntimeError(f\"Error setting up collection: {str(e)}\")\n", - "\n", - "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME)\n", - "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, CACHE_COLLECTION)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NMJ7RRYp8mjV" - }, - "source": [ - "# Loading Couchbase Vector Search Index\n", - "\n", - "Semantic search requires an efficient way to retrieve relevant documents based on a user's query. This is where the Couchbase **Vector Search Index** comes into play. In this step, we load the Vector Search Index definition from a JSON file, which specifies how the index should be structured. 
This includes the fields to be indexed, the dimensions of the vectors, and other parameters that determine how the search engine processes queries based on vector similarity.\n", - "\n", - "For more information on creating a vector search index, please follow the [instructions](https://docs.couchbase.com/cloud/vector-search/create-vector-search-index-ui.html).\n" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": { - "id": "y7xiCrOc8mmj" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Upload your index definition file\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Saving azure_index.json to azure_index.json\n" - ] - } - ], - "source": [ - "# If you are running this script locally (not in Google Colab), uncomment the following line\n", - "# and provide the path to your index definition file.\n", - "\n", - "# index_definition_path = '/path_to_your_index_file/azure_index.json' # Local setup: specify your file path here\n", - "\n", - "# If you are running in Google Colab, use the following code to upload the index definition file\n", - "from google.colab import files\n", - "print(\"Upload your index definition file\")\n", - "uploaded = files.upload()\n", - "index_definition_path = list(uploaded.keys())[0]\n", - "\n", - "try:\n", - " with open(index_definition_path, 'r') as file:\n", - " index_definition = json.load(file)\n", - "except Exception as e:\n", - " raise ValueError(f\"Error loading index definition from {index_definition_path}: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "v_ddPQ_Y8mpm" - }, - "source": [ - "# Creating or Updating Search Indexes\n", - "\n", - "With the index definition loaded, the next step is to create or update the **Vector Search Index** in Couchbase. This step is crucial because it optimizes our database for vector similarity search operations, allowing us to perform searches based on the semantic content of documents rather than just keywords. By creating or updating a Vector Search Index, we enable our search engine to handle complex queries that involve finding semantically similar documents using vector embeddings, which is essential for a robust semantic search engine." - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": { - "id": "bHEpUu1l8msx" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2024-09-06 07:30:01,070 - INFO - Index 'vector_search_azure' found\n", - "2024-09-06 07:30:01,373 - INFO - Index 'vector_search_azure' already exists. 
Skipping creation/update.\n" - ] - } - ], - "source": [ - "try:\n", - " scope_index_manager = cluster.bucket(CB_BUCKET_NAME).scope(SCOPE_NAME).search_indexes()\n", - "\n", - " # Check if index already exists\n", - " existing_indexes = scope_index_manager.get_all_indexes()\n", - " index_name = index_definition[\"name\"]\n", - "\n", - " if index_name in [index.name for index in existing_indexes]:\n", - " logging.info(f\"Index '{index_name}' found\")\n", - " else:\n", - " logging.info(f\"Creating new index '{index_name}'...\")\n", - "\n", - " # Create SearchIndex object from JSON definition\n", - " search_index = SearchIndex.from_json(index_definition)\n", - "\n", - " # Upsert the index (create if not exists, update if exists)\n", - " scope_index_manager.upsert_index(search_index)\n", - " logging.info(f\"Index '{index_name}' successfully created/updated.\")\n", - "\n", - "except QueryIndexAlreadyExistsException:\n", - " logging.info(f\"Index '{index_name}' already exists. Skipping creation/update.\")\n", - "\n", - "except InternalServerFailureException as e:\n", - " error_message = str(e)\n", - " logging.error(f\"InternalServerFailureException raised: {error_message}\")\n", - "\n", - " try:\n", - " # Accessing the response_body attribute from the context\n", - " error_context = e.context\n", - " response_body = error_context.response_body\n", - " if response_body:\n", - " error_details = json.loads(response_body)\n", - " error_message = error_details.get('error', '')\n", - "\n", - " if \"collection: 'azure' doesn't belong to scope: 'shared'\" in error_message:\n", - " raise ValueError(\"Collection 'azure' does not belong to scope 'shared'. Please check the collection and scope names.\")\n", - "\n", - " except ValueError as ve:\n", - " logging.error(str(ve))\n", - " raise\n", - "\n", - " except Exception as json_error:\n", - " logging.error(f\"Failed to parse the error message: {json_error}\")\n", - " raise RuntimeError(f\"Internal server error while creating/updating search index: {error_message}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "QRV4k06L8mwS" - }, - "source": [ - "# Load the TREC Dataset\n", - "To build a search engine, we need data to search through. We use the TREC dataset, a well-known benchmark in the field of information retrieval. This dataset contains a wide variety of text data that we'll use to train our search engine. Loading the dataset is a crucial step because it provides the raw material that our search engine will work with. The quality and diversity of the data in the TREC dataset make it an excellent choice for testing and refining our search engine, ensuring that it can handle a wide range of queries effectively.\n", - "\n", - "The TREC dataset's rich content allows us to simulate real-world scenarios where users ask complex questions, enabling us to fine-tune our search engine's ability to understand and respond to various types of queries." 
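One point of reference before loading the data: the contents of `azure_index.json` are not reproduced in this diff, but a Search vector index definition of the kind loaded above generally has the following shape. This is shown as a Python dict and is illustrative only; the field names, dimensions, and similarity metric are assumptions that must match the embedding model and keyspace actually used.

```python
# Illustrative shape of a vector Search index definition like azure_index.json.
# Assumptions: 1536-dim embeddings (text-embedding-ada-002), an `embedding`
# vector field and a `text` field, and the shared.azure scope/collection.
index_definition = {
    "name": "vector_search_azure",
    "type": "fulltext-index",
    "sourceType": "gocbcore",
    "sourceName": "vector-search-testing",
    "params": {
        "doc_config": {"mode": "scope.collection.type_field"},
        "mapping": {
            "types": {
                "shared.azure": {  # the scope.collection this index covers
                    "enabled": True,
                    "properties": {
                        "embedding": {  # the vector field written by the vector store
                            "fields": [{
                                "name": "embedding",
                                "type": "vector",
                                "dims": 1536,
                                "similarity": "dot_product",
                                "index": True,
                            }]
                        },
                        "text": {  # the raw text, stored so it can be returned
                            "fields": [{"name": "text", "type": "text", "store": True, "index": True}]
                        },
                    },
                }
            }
        },
    },
}
```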
- ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": { - "id": "TRfRslF_8mzo" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: \n", - "The secret `HF_TOKEN` does not exist in your Colab secrets.\n", - "To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n", - "You will be able to reuse this secret in all of your notebooks.\n", - "Please note that authentication is recommended but still optional to access public models or datasets.\n", - " warnings.warn(\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "The repository for trec contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/trec.\n", - "You can avoid this prompt in future by passing the argument `trust_remote_code=True`.\n", - "\n", - "Do you wish to run the custom code? [y/N] y\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2024-09-06 07:30:12,308 - INFO - Successfully loaded TREC dataset with 1000 samples\n" - ] - } - ], - "source": [ - "try:\n", - " trec = load_dataset('trec', split='train[:1000]')\n", - " logging.info(f\"Successfully loaded TREC dataset with {len(trec)} samples\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Error loading TREC dataset: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "7FvxRsg38m3G" - }, - "source": [ - "# Creating AzureOpenAI Embeddings\n", - "Embeddings are at the heart of semantic search. They are numerical representations of text that capture the semantic meaning of the words and phrases. Unlike traditional keyword-based search, which looks for exact matches, embeddings allow our search engine to understand the context and nuances of language, enabling it to retrieve documents that are semantically similar to the query, even if they don't contain the exact keywords. By creating embeddings using AzureOpenAI, we equip our search engine with the ability to understand and process natural language in a way that's much closer to how humans understand language. This step transforms our raw text data into a format that the search engine can use to find and rank relevant documents.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": { - "id": "_75ZyCRh8m6m" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2024-09-06 07:30:13,014 - INFO - Successfully created AzureOpenAIEmbeddings\n" - ] - } - ], - "source": [ - "try:\n", - " embeddings = AzureOpenAIEmbeddings(\n", - " deployment=AZURE_OPENAI_EMBEDDING_DEPLOYMENT,\n", - " openai_api_key=AZURE_OPENAI_KEY,\n", - " azure_endpoint=AZURE_OPENAI_ENDPOINT\n", - " )\n", - " logging.info(\"Successfully created AzureOpenAIEmbeddings\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Error creating AzureOpenAIEmbeddings: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "8IwZMUnF8m-N" - }, - "source": [ - "# Setting Up the Couchbase Vector Store\n", - "The vector store is set up to manage the embeddings created in the previous step. The vector store is essentially a database optimized for storing and retrieving high-dimensional vectors. 
In this case, the vector store is built on top of Couchbase, allowing the script to store the embeddings in a way that can be efficiently searched." - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": { - "id": "DwIJQjYT9RV_" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2024-09-06 07:30:14,043 - INFO - Successfully created vector store\n" - ] - } - ], - "source": [ - "try:\n", - " vector_store = CouchbaseSearchVectorStore(\n", - " cluster=cluster,\n", - " bucket_name=CB_BUCKET_NAME,\n", - " scope_name=SCOPE_NAME,\n", - " collection_name=COLLECTION_NAME,\n", - " embedding=embeddings,\n", - " index_name=INDEX_NAME,\n", - " )\n", - " logging.info(\"Successfully created vector store\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Failed to create vector store: {str(e)}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "C6DJVz7A9RZA" - }, - "source": [ - "# Saving Data to the Vector Store\n", - "With the vector store set up, the next step is to populate it with data. We save the TREC dataset to the vector store in batches. This method is efficient and ensures that our search engine can handle large datasets without running into performance issues. By saving the data in this way, we prepare our search engine to quickly and accurately respond to user queries. This step is essential for making the dataset searchable, transforming raw data into a format that can be easily queried by our search engine.\n", - "\n", - "Batch processing is particularly important when dealing with large datasets, as it prevents memory overload and ensures that the data is stored in a structured and retrievable manner. This approach not only optimizes performance but also ensures the scalability of our system." - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": { - "id": "_6opqqvx9Rb_" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Processing Batches: 100%|██████████| 20/20 [00:37<00:00, 1.87s/it]\n" - ] - } - ], - "source": [ - "try:\n", - " batch_size = 50\n", - " logging.disable(sys.maxsize) # Disable logging to prevent tqdm output\n", - " for i in tqdm(range(0, len(trec['text']), batch_size), desc=\"Processing Batches\"):\n", - " batch = trec['text'][i:i + batch_size]\n", - " documents = [Document(page_content=text) for text in batch]\n", - " uuids = [str(uuid4()) for _ in range(len(documents))]\n", - " vector_store.add_documents(documents=documents, ids=uuids)\n", - " logging.disable(logging.NOTSET) # Re-enable logging\n", - "except Exception as e:\n", - " raise RuntimeError(f\"Failed to save documents to vector store: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "8Pn8-dQw9RfQ" - }, - "source": [ - "# Setting Up a Couchbase Cache\n", - "To further optimize our system, we set up a Couchbase-based cache. A cache is a temporary storage layer that holds data that is frequently accessed, speeding up operations by reducing the need to repeatedly retrieve the same information from the database. In our setup, the cache will help us accelerate repetitive tasks, such as looking up similar documents. By implementing a cache, we enhance the overall performance of our search engine, ensuring that it can handle high query volumes and deliver results quickly.\n", - "\n", - "Caching is particularly valuable in scenarios where users may submit similar queries multiple times or where certain pieces of information are frequently requested. 
By storing these in a cache, we can significantly reduce the time it takes to respond to these queries, improving the user experience.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": { - "id": "V2y7dyjf9Rid" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2024-09-06 07:30:52,165 - INFO - Successfully created cache\n" - ] - } - ], - "source": [ - "try:\n", - " cache = CouchbaseCache(\n", - " cluster=cluster,\n", - " bucket_name=CB_BUCKET_NAME,\n", - " scope_name=SCOPE_NAME,\n", - " collection_name=CACHE_COLLECTION,\n", - " )\n", - " logging.info(\"Successfully created cache\")\n", - " set_llm_cache(cache)\n", - "except Exception as e:\n", - " raise ValueError(f\"Failed to create cache: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "uehAx36o9Rlm" - }, - "source": [ - "# Using the AzureChatOpenAI Language Model (LLM)\n", - "Language models are AI systems that are trained to understand and generate human language. We'll be using the `AzureChatOpenAI` language model to process user queries and generate meaningful responses. This model is a key component of our semantic search engine, allowing it to go beyond simple keyword matching and truly understand the intent behind a query. By creating this language model, we equip our search engine with the ability to interpret complex queries, understand the nuances of language, and provide more accurate and contextually relevant responses.\n", - "\n", - "The language model's ability to understand context and generate coherent responses is what makes our search engine truly intelligent. It can not only find the right information but also present it in a way that is useful and understandable to the user.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": { - "id": "yRAfBRLH9RpO" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2024-09-06 07:30:52,298 - INFO - Successfully created Azure OpenAI Chat model\n" - ] - } - ], - "source": [ - "try:\n", - " llm = AzureChatOpenAI(\n", - " deployment_name=AZURE_OPENAI_CHAT_DEPLOYMENT,\n", - " openai_api_key=AZURE_OPENAI_KEY,\n", - " azure_endpoint=AZURE_OPENAI_ENDPOINT,\n", - " openai_api_version=\"2024-07-01-preview\"\n", - " )\n", - " logging.info(\"Successfully created Azure OpenAI Chat model\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Error creating Azure OpenAI Chat model: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "k_XDfCx19UvG" - }, - "source": [ - "# Perform Semantic Search\n", - "Semantic search in Couchbase involves converting queries and documents into vector representations using an embeddings model. These vectors capture the semantic meaning of the text and are stored directly in Couchbase. When a query is made, Couchbase performs a similarity search by comparing the query vector against the stored document vectors. The similarity metric used for this comparison is configurable, allowing flexibility in how the relevance of documents is determined. Common metrics include cosine similarity, Euclidean distance, or dot product, but other metrics can be implemented based on specific use cases. 
Different embedding models like BERT, Word2Vec, or GloVe can also be used depending on the application's needs, with the vectors generated by these models stored and searched within Couchbase itself.\n", - "\n", - "In the provided code, the search process begins by recording the start time, followed by executing the similarity_search_with_score method of the CouchbaseSearchVectorStore. This method searches Couchbase for the most relevant documents based on the vector similarity to the query. The search results include the document content and a similarity score that reflects how closely each document aligns with the query in the defined semantic space. The time taken to perform this search is then calculated and logged, and the results are displayed, showing the most relevant documents along with their similarity scores. This approach leverages Couchbase as both a storage and retrieval engine for vector data, enabling efficient and scalable semantic searches. The integration of vector storage and search capabilities within Couchbase allows for sophisticated semantic search operations without relying on external services for vector storage or comparison." - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": { - "id": "Pk-oFbnC9Uym" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2024-09-06 07:30:52,532 - INFO - HTTP Request: POST https://first-couchbase-instance.openai.azure.com//openai/deployments/text-embedding-ada-002/embeddings?api-version=2023-05-15 \"HTTP/1.1 200 OK\"\n", - "2024-09-06 07:30:52,839 - INFO - Semantic search completed in 0.53 seconds\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Semantic Search Results (completed in 0.53 seconds):\n", - "Distance: 0.9178, Text: Why did the world enter a global depression in 1929 ?\n", - "Distance: 0.8714, Text: When was `` the Great Depression '' ?\n", - "Distance: 0.8113, Text: What crop failure caused the Irish Famine ?\n", - "Distance: 0.7984, Text: What historical event happened in Dogtown in 1899 ?\n", - "Distance: 0.7917, Text: What caused the Lynmouth floods ?\n", - "Distance: 0.7915, Text: When was the first Wall Street Journal published ?\n", - "Distance: 0.7911, Text: When did the Dow first reach ?\n", - "Distance: 0.7885, Text: What were popular songs and types of songs in the 1920s ?\n", - "Distance: 0.7857, Text: When did World War I start ?\n", - "Distance: 0.7842, Text: What caused Harry Houdini 's death ?\n" - ] - } - ], - "source": [ - "query = \"What caused the 1929 Great Depression?\"\n", - "\n", - "try:\n", - " # Perform the semantic search\n", - " start_time = time.time()\n", - " search_results = vector_store.similarity_search_with_score(query, k=10)\n", - " search_elapsed_time = time.time() - start_time\n", - "\n", - " logging.info(f\"Semantic search completed in {search_elapsed_time:.2f} seconds\")\n", - "\n", - " # Display search results\n", - " print(f\"\\nSemantic Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", - " for doc, score in search_results:\n", - " print(f\"Distance: {score:.4f}, Text: {doc.page_content}\")\n", - "\n", - "except CouchbaseException as e:\n", - " raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n", - "except Exception as e:\n", - " raise RuntimeError(f\"Unexpected error: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "sS0FebHI9U1l" - }, - "source": [ - "# Retrieval-Augmented Generation (RAG) with Couchbase and 
LangChain\n", - "Couchbase and LangChain can be seamlessly integrated to create RAG (Retrieval-Augmented Generation) chains, enhancing the process of generating contextually relevant responses. In this setup, Couchbase serves as the vector store, where embeddings of documents are stored. When a query is made, LangChain retrieves the most relevant documents from Couchbase by comparing the query’s embedding with the stored document embeddings. These documents, which provide contextual information, are then passed to a generative language model within LangChain.\n", - "\n", - "The language model, equipped with the context from the retrieved documents, generates a response that is both informed and contextually accurate. This integration allows the RAG chain to leverage Couchbase’s efficient storage and retrieval capabilities, while LangChain handles the generation of responses based on the context provided by the retrieved documents. Together, they create a powerful system that can deliver highly relevant and accurate answers by combining the strengths of both retrieval and generation." - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": { - "id": "ZGUXQQmv9ge4" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2024-09-06 07:30:52,860 - INFO - Successfully created RAG chain\n" - ] - } - ], - "source": [ - "template = \"\"\"You are a helpful bot. If you cannot answer based on the context provided, respond with a generic answer. Answer the question as truthfully as possible using the context below:\n", - " {context}\n", - " Question: {question}\"\"\"\n", - "prompt = ChatPromptTemplate.from_template(template)\n", - "rag_chain = (\n", - " {\"context\": vector_store.as_retriever(), \"question\": RunnablePassthrough()}\n", - " | prompt\n", - " | llm\n", - " | StrOutputParser()\n", - ")\n", - "logging.info(\"Successfully created RAG chain\")" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": { - "id": "Mia7XxM9978M" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "RAG Response: The 1929 Great Depression was caused by a combination of factors, including the stock market crash of October 1929, bank failures, reduction in consumer spending and investment, and poor economic policies.\n", - "RAG response generated in 2.32 seconds\n" - ] - } - ], - "source": [ - "# Get responses\n", - "logging.disable(sys.maxsize) # Disable logging to prevent tqdm output\n", - "start_time = time.time()\n", - "rag_response = rag_chain.invoke(query)\n", - "rag_elapsed_time = time.time() - start_time\n", - "\n", - "print(f\"RAG Response: {rag_response}\")\n", - "print(f\"RAG response generated in {rag_elapsed_time:.2f} seconds\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "aIdayPzw9glT" - }, - "source": [ - "# Using Couchbase as a caching mechanism\n", - "Couchbase can be effectively used as a caching mechanism for RAG (Retrieval-Augmented Generation) responses by storing and retrieving precomputed results for specific queries. This approach enhances the system's efficiency and speed, particularly when dealing with repeated or similar queries. When a query is first processed, the RAG chain retrieves relevant documents, generates a response using the language model, and then stores this response in Couchbase, with the query serving as the key.\n", - "\n", - "For subsequent requests with the same query, the system checks Couchbase first. 
If a cached response is found, it is retrieved directly from Couchbase, bypassing the need to re-run the entire RAG process. This significantly reduces response time because the computationally expensive steps of document retrieval and response generation are skipped. Couchbase's role in this setup is to provide a fast and scalable storage solution for caching these responses, ensuring that frequently asked queries can be answered more quickly and efficiently.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "metadata": { - "id": "0xM2G3ef-GS2" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Query 1: Why do heavier objects travel downhill faster?\n", - "Response: Heavier objects travel downhill faster primarily due to the force of gravity acting on them. Gravity accelerates all objects at the same rate, but heavier objects may encounter less air resistance relative to their weight, allowing them to maintain higher speeds as they descend. Additionally, factors such as surface friction and the distribution of mass can influence the speed at which an object travels downhill.\n", - "Time taken: 61.73 seconds\n", - "\n", - "Query 2: What is the capital of France?\n", - "Response: The capital of France is Paris.\n", - "Time taken: 60.63 seconds\n", - "\n", - "Query 3: What caused the 1929 Great Depression?\n", - "Response: The 1929 Great Depression was caused by a combination of factors, including the stock market crash of October 1929, bank failures, reduction in consumer spending and investment, and poor economic policies.\n", - "Time taken: 1.49 seconds\n", - "\n", - "Query 4: Why do heavier objects travel downhill faster?\n", - "Response: Heavier objects travel downhill faster primarily due to the force of gravity acting on them. Gravity accelerates all objects at the same rate, but heavier objects may encounter less air resistance relative to their weight, allowing them to maintain higher speeds as they descend. Additionally, factors such as surface friction and the distribution of mass can influence the speed at which an object travels downhill.\n", - "Time taken: 0.60 seconds\n" - ] - } - ], - "source": [ - "try:\n", - " queries = [\n", - " \"Why do heavier objects travel downhill faster?\",\n", - " \"What is the capital of France?\",\n", - " \"What caused the 1929 Great Depression?\", # Repeated query\n", - " \"Why do heavier objects travel downhill faster?\", # Repeated query\n", - " ]\n", - "\n", - " for i, query in enumerate(queries, 1):\n", - " print(f\"\\nQuery {i}: {query}\")\n", - " start_time = time.time()\n", - " response = rag_chain.invoke(query)\n", - " elapsed_time = time.time() - start_time\n", - " print(f\"Response: {response}\")\n", - " print(f\"Time taken: {elapsed_time:.2f} seconds\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Error generating RAG response: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "yJQ5P8E29go1" - }, - "source": [ - "By following these steps, you'll have a fully functional semantic search engine that leverages the strengths of Couchbase and AzureOpenAI. This guide is designed not just to show you how to build the system, but also to explain why each step is necessary, giving you a deeper understanding of the principles behind semantic search and how to implement it effectively. 
Whether you're a newcomer to software development or an experienced developer looking to expand your skills, this guide will provide you with the knowledge and tools you need to create a powerful, AI-driven search engine." - ] - } - ], - "metadata": { - "colab": { - "provenance": [], - "toc_visible": true - }, - "kernelspec": { - "display_name": "Python 3", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.13.0" - } - }, - "nbformat": 4, - "nbformat_minor": 0 -} diff --git a/azure/gsi/RAG_with_Couchbase_and_AzureOpenAI.ipynb b/azure/gsi/RAG_with_Couchbase_and_AzureOpenAI.ipynb deleted file mode 100644 index 9d37da37..00000000 --- a/azure/gsi/RAG_with_Couchbase_and_AzureOpenAI.ipynb +++ /dev/null @@ -1,1103 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "kNdImxzypDlm" - }, - "source": [ - "# Introduction\n", - "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database and [AzureOpenAI](https://azure.microsoft.com/) as the AI-powered embedding and language model provider. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system using GSI (Global Secondary Index) from scratch. Alternatively, if you want to perform semantic search using the FTS index, please take a look at [this.](https://developer.couchbase.com/tutorial-azure-openai-couchbase-rag-with-fts/)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# How to run this tutorial\n", - "\n", - "This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/azure/gsi/RAG_with_Couchbase_and_AzureOpenAI.ipynb).\n", - "\n", - "You can either download the notebook file and run it on [Google Colab](https://colab.research.google.com/) or run it on your system by setting up the Python environment." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Before you start\n", - "\n", - "## Get Credentials for Azure OpenAI\n", - "\n", - "Please follow the [instructions](https://learn.microsoft.com/en-us/azure/ai-services/openai/reference) to generate the Azure OpenAI credentials.\n", - "\n", - "## Create and Deploy Your Free Tier Operational cluster on Capella\n", - "\n", - "To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. 
This account provides you with an environment where you can explore and learn about Capella with no time constraint.\n", - "\n", - "To know more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n", - "\n", - "Note: To run this tutorial, you will need Capella with Couchbase Server version 8.0 or above, as GSI vector search is supported only from version 8.0.\n", - "\n", - "### Couchbase Capella Configuration\n", - "\n", - "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met.\n", - "\n", - "* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the travel-sample bucket (Read and Write) used in the application.\n", - "* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NH2o6pqa69oG" - }, - "source": [ - "# Setting the Stage: Installing Necessary Libraries\n", - "To build our semantic search engine, we need a robust set of tools. The libraries we install handle everything from connecting to databases to performing complex machine learning tasks. Each library has a specific role: Couchbase libraries manage database operations, LangChain handles AI model integrations, and AzureOpenAI provides advanced AI models for generating embeddings and understanding natural language. By setting up these libraries, we ensure our environment is equipped to handle the data-intensive and computationally complex tasks required for semantic search." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "DYhPj0Ta8l_A" - }, - "outputs": [], - "source": [ - "!pip install --quiet datasets==3.5.0 langchain-couchbase==0.5.0 langchain-openai==0.3.32" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "1pp7GtNg8mB9" - }, - "source": [ - "# Importing Necessary Libraries\n", - "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, embedding generation, and dataset loading. These libraries provide essential functions for working with data, managing database connections, and processing machine learning models."
- ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": { - "id": "8GzS6tfL8mFP" - }, - "outputs": [], - "source": [ - "import getpass\n", - "import json\n", - "import logging\n", - "import sys\n", - "import os\n", - "import time\n", - "from datetime import timedelta\n", - "from uuid import uuid4\n", - "\n", - "from couchbase.auth import PasswordAuthenticator\n", - "from couchbase.cluster import Cluster\n", - "from couchbase.exceptions import (\n", - " CouchbaseException,\n", - " InternalServerFailureException,\n", - " QueryIndexAlreadyExistsException,\n", - ")\n", - "from couchbase.management.buckets import CreateBucketSettings\n", - "from couchbase.options import ClusterOptions\n", - "from datasets import load_dataset\n", - "from langchain_core.documents import Document\n", - "from langchain_core.globals import set_llm_cache\n", - "from langchain_core.output_parsers import StrOutputParser\n", - "from langchain_core.prompts.chat import ChatPromptTemplate\n", - "from langchain_core.runnables import RunnablePassthrough\n", - "from langchain_couchbase.cache import CouchbaseCache\n", - "from langchain_couchbase.vectorstores import CouchbaseQueryVectorStore\n", - "from langchain_couchbase.vectorstores import DistanceStrategy\n", - "from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings\n", - "from langchain_couchbase.vectorstores import IndexType\n", - "from tqdm import tqdm" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pBnMp5vb8mIb" - }, - "source": [ - "# Setup Logging\n", - "Logging is configured to track the progress of the script and capture any errors or warnings. This is crucial for debugging and understanding the flow of execution. The logging output includes timestamps, log levels (e.g., INFO, ERROR), and messages that describe what is happening in the script.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": { - "id": "Yv8kWcuf8mLx" - }, - "outputs": [], - "source": [ - "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', force=True)\n", - "\n", - "# Suppress verbose HTTP request logging\n", - "logging.getLogger(\"httpx\").setLevel(logging.WARNING)\n", - "logging.getLogger(\"openai\").setLevel(logging.WARNING)\n", - "logging.getLogger(\"urllib3\").setLevel(logging.WARNING)\n", - "logging.getLogger(\"azure\").setLevel(logging.WARNING)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "K9G5a0en8mPA" - }, - "source": [ - "# Loading Sensitive Information\n", - "In this section, we prompt the user to input essential configuration settings needed. These settings include sensitive information like API keys, database credentials, and specific configuration names. Instead of hardcoding these details into the script, we request the user to provide them at runtime, ensuring flexibility and security.\n", - "\n", - "The script also validates that all required inputs are provided, raising an error if any crucial information is missing. This approach ensures that your integration is both secure and correctly configured without hardcoding sensitive information, enhancing the overall security and maintainability of your code."
- ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": { - "id": "PFGyHll18mSe" - }, - "outputs": [], - "source": [ - "AZURE_OPENAI_KEY = os.getenv('AZURE_OPENAI_KEY') or getpass.getpass('Enter your Azure OpenAI Key: ')\n", - "AZURE_OPENAI_ENDPOINT = os.getenv('AZURE_OPENAI_ENDPOINT') or input('Enter your Azure OpenAI Endpoint: ')\n", - "AZURE_OPENAI_EMBEDDING_DEPLOYMENT = os.getenv('AZURE_OPENAI_EMBEDDING_DEPLOYMENT') or input('Enter your Azure OpenAI Embedding Deployment: ')\n", - "AZURE_OPENAI_CHAT_DEPLOYMENT = os.getenv('AZURE_OPENAI_CHAT_DEPLOYMENT') or input('Enter your Azure OpenAI Chat Deployment: ')\n", - "\n", - "CB_HOST = os.getenv('CB_HOST') or input('Enter your Couchbase host (default: couchbase://localhost): ') or 'couchbase://localhost'\n", - "CB_USERNAME = os.getenv('CB_USERNAME') or input('Enter your Couchbase username (default: Administrator): ') or 'Administrator'\n", - "CB_PASSWORD = os.getenv('CB_PASSWORD') or getpass.getpass('Enter your Couchbase password (default: password): ') or 'password'\n", - "CB_BUCKET_NAME = os.getenv('CB_BUCKET_NAME') or input('Enter your Couchbase bucket name (default: query-vector-search-testing): ') or 'query-vector-search-testing'\n", - "SCOPE_NAME = os.getenv('SCOPE_NAME') or input('Enter your scope name (default: shared): ') or 'shared'\n", - "COLLECTION_NAME = os.getenv('COLLECTION_NAME') or input('Enter your collection name (default: azure): ') or 'azure'\n", - "CACHE_COLLECTION = os.getenv('CACHE_COLLECTION') or input('Enter your cache collection name (default: cache): ') or 'cache'\n", - "\n", - "# Check if the variables are correctly loaded\n", - "if not all([AZURE_OPENAI_KEY, AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_EMBEDDING_DEPLOYMENT, AZURE_OPENAI_CHAT_DEPLOYMENT]):\n", - " raise ValueError(\"Missing required Azure OpenAI variables\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "qtGrYzUY8mV3" - }, - "source": [ - "# Connecting to the Couchbase Cluster\n", - "Connecting to a Couchbase cluster is the foundation of our project. Couchbase will serve as our primary data store, handling all the storage and retrieval operations required for our semantic search engine. By establishing this connection, we enable our application to interact with the database, allowing us to perform operations such as storing embeddings, querying data, and managing collections. This connection is the gateway through which all data will flow, so ensuring it's set up correctly is paramount.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": { - "id": "Zb3kK-7W8mZK" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-22 12:23:15,245 - INFO - Successfully connected to Couchbase\n" - ] - } - ], - "source": [ - "try:\n", - " auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)\n", - " options = ClusterOptions(auth)\n", - " cluster = Cluster(CB_HOST, options)\n", - " cluster.wait_until_ready(timedelta(seconds=5))\n", - " logging.info(\"Successfully connected to Couchbase\")\n", - "except Exception as e:\n", - " raise ConnectionError(f\"Failed to connect to Couchbase: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "C_Gpy32N8mcZ" - }, - "source": [ - "## Setting Up Collections in Couchbase\n", - "\n", - "The setup_collection() function handles creating and configuring the hierarchical data organization in Couchbase:\n", - "\n", - "1. 
Bucket Creation:\n", - " - Checks if specified bucket exists, creates it if not\n", - " - Sets bucket properties like RAM quota (1024MB) and replication (disabled)\n", - " - Note: You will not be able to create a bucket on Capella\n", - "\n", - "2. Scope Management: \n", - " - Verifies if requested scope exists within bucket\n", - " - Creates new scope if needed (unless it's the default \"_default\" scope)\n", - "\n", - "3. Collection Setup:\n", - " - Checks for collection existence within scope\n", - " - Creates collection if it doesn't exist\n", - " - Waits 2 seconds for collection to be ready\n", - "\n", - "Additional Tasks:\n", - "- Clears any existing documents for clean state\n", - "- Implements comprehensive error handling and logging\n", - "\n", - "The function is called twice to set up:\n", - "1. Main collection for vector embeddings\n", - "2. Cache collection for storing results" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": { - "id": "ACZcwUnG8mf2" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-22 12:23:20,911 - INFO - Bucket 'query-vector-search-testing' exists.\n", - "2025-09-22 12:23:20,927 - INFO - Collection 'azure' already exists. Skipping creation.\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-22 12:23:23,264 - INFO - All documents cleared from the collection.\n", - "2025-09-22 12:23:23,265 - INFO - Bucket 'query-vector-search-testing' exists.\n", - "2025-09-22 12:23:23,280 - INFO - Collection 'cache' already exists. Skipping creation.\n", - "2025-09-22 12:23:25,419 - INFO - All documents cleared from the collection.\n" - ] - }, - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 7, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "def setup_collection(cluster, bucket_name, scope_name, collection_name):\n", - " try:\n", - " # Check if bucket exists, create if it doesn't\n", - " try:\n", - " bucket = cluster.bucket(bucket_name)\n", - " logging.info(f\"Bucket '{bucket_name}' exists.\")\n", - " except Exception as e:\n", - " logging.info(f\"Bucket '{bucket_name}' does not exist. Creating it...\")\n", - " bucket_settings = CreateBucketSettings(\n", - " name=bucket_name,\n", - " bucket_type='couchbase',\n", - " ram_quota_mb=1024,\n", - " flush_enabled=True,\n", - " num_replicas=0\n", - " )\n", - " cluster.buckets().create_bucket(bucket_settings)\n", - " time.sleep(2) # Wait for bucket creation to complete and become available\n", - " bucket = cluster.bucket(bucket_name)\n", - " logging.info(f\"Bucket '{bucket_name}' created successfully.\")\n", - "\n", - " bucket_manager = bucket.collections()\n", - "\n", - " # Check if scope exists, create if it doesn't\n", - " scopes = bucket_manager.get_all_scopes()\n", - " scope_exists = any(scope.name == scope_name for scope in scopes)\n", - " \n", - " if not scope_exists and scope_name != \"_default\":\n", - " logging.info(f\"Scope '{scope_name}' does not exist. 
Creating it...\")\n", - " bucket_manager.create_scope(scope_name)\n", - " logging.info(f\"Scope '{scope_name}' created successfully.\")\n", - "\n", - " # Check if collection exists, create if it doesn't\n", - " collections = bucket_manager.get_all_scopes()\n", - " collection_exists = any(\n", - " scope.name == scope_name and collection_name in [col.name for col in scope.collections]\n", - " for scope in collections\n", - " )\n", - "\n", - " if not collection_exists:\n", - " logging.info(f\"Collection '{collection_name}' does not exist. Creating it...\")\n", - " bucket_manager.create_collection(scope_name, collection_name)\n", - " logging.info(f\"Collection '{collection_name}' created successfully.\")\n", - " else:\n", - " logging.info(f\"Collection '{collection_name}' already exists. Skipping creation.\")\n", - "\n", - " # Wait for collection to be ready\n", - " collection = bucket.scope(scope_name).collection(collection_name)\n", - " time.sleep(2) # Give the collection time to be ready for queries\n", - "\n", - " # Clear all documents in the collection\n", - " try:\n", - " query = f\"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`\"\n", - " cluster.query(query).execute()\n", - " logging.info(\"All documents cleared from the collection.\")\n", - " except Exception as e:\n", - " logging.warning(f\"Error while clearing documents: {str(e)}. The collection might be empty.\")\n", - "\n", - " return collection\n", - " except Exception as e:\n", - " raise RuntimeError(f\"Error setting up collection: {str(e)}\")\n", - " \n", - "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME)\n", - "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, CACHE_COLLECTION)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "QRV4k06L8mwS" - }, - "source": [ - "# Load the BBC News Dataset\n", - "To build a search engine, we need data to search through. We use the BBC News dataset from RealTimeData, which provides real-world news articles. This dataset contains news articles from BBC covering various topics and time periods. Loading the dataset is a crucial step because it provides the raw material that our search engine will work with. The quality and diversity of the news articles make it an excellent choice for testing and refining our search engine, ensuring it can handle real-world news content effectively.\n", - "\n", - "The BBC News dataset allows us to work with authentic news articles, enabling us to build and test a search engine that can effectively process and retrieve relevant news content. The dataset is loaded using the Hugging Face datasets library, specifically accessing the \"RealTimeData/bbc_news_alltime\" dataset with the \"2024-12\" version." 
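- "\n", - "After the next cell has loaded the dataset, you can peek at a single record to see the fields it exposes; a minimal sketch (the `content` field is the one used later in this tutorial):\n", - "\n", - "```python\n", - "sample = news_dataset[0]\n", - "print(sample.keys())\n", - "print(sample[\"content\"][:300])  # first 300 characters of the article body\n", - "```"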
- ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": { - "id": "TRfRslF_8mzo" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-22 12:23:43,453 - INFO - Successfully loaded the BBC News dataset with 2687 rows.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Loaded the BBC News dataset with 2687 rows\n" - ] - } - ], - "source": [ - "try:\n", - " news_dataset = load_dataset(\n", - " \"RealTimeData/bbc_news_alltime\", \"2024-12\", split=\"train\"\n", - " )\n", - " print(f\"Loaded the BBC News dataset with {len(news_dataset)} rows\")\n", - " logging.info(f\"Successfully loaded the BBC News dataset with {len(news_dataset)} rows.\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Error loading the BBC News dataset: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Cleaning up the Data\n", - "We will use the content of the news articles for our RAG system.\n", - "\n", - "The dataset contains a few duplicate records. We are removing them to avoid duplicate results in the retrieval stage of our RAG system." - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "We have 1749 unique articles in our database.\n" - ] - } - ], - "source": [ - "news_articles = news_dataset[\"content\"]\n", - "unique_articles = set()\n", - "for article in news_articles:\n", - " if article:\n", - " unique_articles.add(article)\n", - "unique_news_articles = list(unique_articles)\n", - "print(f\"We have {len(unique_news_articles)} unique articles in our database.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "7FvxRsg38m3G" - }, - "source": [ - "# Creating AzureOpenAI Embeddings\n", - "Embeddings are at the heart of semantic search. They are numerical representations of text that capture the semantic meaning of the words and phrases. Unlike traditional keyword-based search, which looks for exact matches, embeddings allow our search engine to understand the context and nuances of language, enabling it to retrieve documents that are semantically similar to the query, even if they don't contain the exact keywords. By creating embeddings using AzureOpenAI, we equip our search engine with the ability to understand and process natural language in a way that's much closer to how humans understand language. This step transforms our raw text data into a format that the search engine can use to find and rank relevant documents.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": { - "id": "_75ZyCRh8m6m" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-22 12:23:51,333 - INFO - Successfully created AzureOpenAIEmbeddings\n" - ] - } - ], - "source": [ - "try:\n", - " embeddings = AzureOpenAIEmbeddings(\n", - " deployment=AZURE_OPENAI_EMBEDDING_DEPLOYMENT,\n", - " openai_api_key=AZURE_OPENAI_KEY,\n", - " azure_endpoint=AZURE_OPENAI_ENDPOINT\n", - " )\n", - " logging.info(\"Successfully created AzureOpenAIEmbeddings\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Error creating AzureOpenAIEmbeddings: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "8IwZMUnF8m-N" - }, - "source": [ - "# Setting Up the Couchbase Query Vector Store\n", - "A vector store is where we'll keep our embeddings. 
The query vector store is specifically designed to handle embeddings and perform similarity searches. When a user inputs a query, the query is converted into an embedding with the same embedding model and compared against the embeddings stored in the vector store. This allows the engine to find documents that are semantically similar to the query, even if they don't contain the exact same words. By setting up the vector store in Couchbase, we create a powerful tool that enables us to understand and retrieve information based on the meaning and context of the query, rather than just the specific words used.\n", - "\n", - "The vector store requires a distance metric to determine how similarity between vectors is calculated. This is crucial for accurate semantic search results, as different distance metrics can yield different similarity rankings. Some of the supported distance strategies are dot, l2, euclidean, cosine, l2_squared, and euclidean_squared. In our implementation, we will use cosine, which is particularly effective for text embeddings." - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": { - "id": "DwIJQjYT9RV_" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-22 12:24:25,546 - INFO - Successfully created vector store\n" - ] - } - ], - "source": [ - "try:\n", - " vector_store = CouchbaseQueryVectorStore(\n", - " cluster=cluster,\n", - " bucket_name=CB_BUCKET_NAME,\n", - " scope_name=SCOPE_NAME,\n", - " collection_name=COLLECTION_NAME,\n", - " embedding=embeddings,\n", - " distance_metric=DistanceStrategy.COSINE\n", - " )\n", - " logging.info(\"Successfully created vector store\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Failed to create vector store: {str(e)}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "C6DJVz7A9RZA" - }, - "source": [ - "## Saving Data to the Vector Store\n", - "To efficiently handle the large number of articles, we process them in batches of 50 articles at a time. This batch processing approach helps manage memory usage and provides better control over the ingestion process.\n", - "\n", - "We first filter out any articles that exceed 50,000 characters to avoid potential issues with token limits. Then, using the vector store's add_texts method, we add the filtered articles to our vector database. The batch_size parameter controls how many articles are processed in each iteration.\n", - "\n", - "This approach offers several benefits:\n", - "1. Memory Efficiency: Processing in smaller batches prevents memory overload\n", - "2. Progress Tracking: Easier to monitor and track the ingestion progress\n", - "3. Resource Management: Better control over CPU and network resource utilization\n", - "\n", - "We use a conservative batch size of 50 to ensure reliable operation.\n", - "The optimal batch size depends on many factors, including:\n", - "- Document sizes being inserted\n", - "- Available system resources\n", - "- Network conditions\n", - "- Concurrent workload\n", - "\n", - "Consider measuring performance with your specific workload before adjusting.\n",
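- "\n", - "For intuition, passing `batch_size` to `add_texts` behaves roughly like the manual loop sketched below (illustrative only; it assumes the `articles` list and `batch_size` defined in the next cell and reuses the `tqdm` progress bar imported earlier):\n", - "\n", - "```python\n", - "for start in tqdm(range(0, len(articles), batch_size)):\n", - "    chunk = articles[start : start + batch_size]\n", - "    vector_store.add_texts(texts=chunk)  # embed and upsert one batch at a time\n", - "```"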
- ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": { - "id": "_6opqqvx9Rb_" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-22 12:36:18,756 - INFO - Document ingestion completed successfully.\n" - ] - } - ], - "source": [ - "batch_size = 50\n", - "\n", - "# Automatic Batch Processing\n", - "articles = [article for article in unique_news_articles if article and len(article) <= 50000]\n", - "\n", - "try:\n", - " vector_store.add_texts(\n", - " texts=articles,\n", - " batch_size=batch_size\n", - " )\n", - " logging.info(\"Document ingestion completed successfully.\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Failed to save documents to vector store: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "uehAx36o9Rlm" - }, - "source": [ - "# Using the AzureChatOpenAI Language Model (LLM)\n", - "Language models are AI systems that are trained to understand and generate human language. We'll be using the `AzureChatOpenAI` language model to process user queries and generate meaningful responses. This model is a key component of our semantic search engine, allowing it to go beyond simple keyword matching and truly understand the intent behind a query. By creating this language model, we equip our search engine with the ability to interpret complex queries, understand the nuances of language, and provide more accurate and contextually relevant responses.\n", - "\n", - "The language model's ability to understand context and generate coherent responses is what makes our search engine truly intelligent. It can not only find the right information but also present it in a way that is useful and understandable to the user.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": { - "id": "yRAfBRLH9RpO" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-22 12:39:45,695 - INFO - Successfully created Azure OpenAI Chat model\n" - ] - } - ], - "source": [ - "try:\n", - " llm = AzureChatOpenAI(\n", - " deployment_name=AZURE_OPENAI_CHAT_DEPLOYMENT,\n", - " openai_api_key=AZURE_OPENAI_KEY,\n", - " azure_endpoint=AZURE_OPENAI_ENDPOINT,\n", - " openai_api_version=\"2024-10-21\"\n", - " )\n", - " logging.info(\"Successfully created Azure OpenAI Chat model\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Error creating Azure OpenAI Chat model: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "k_XDfCx19UvG" - }, - "source": [ - "# Perform Semantic Search\n", - "Semantic search in Couchbase involves converting queries and documents into vector representations using an embedding model. These vectors capture the semantic meaning of the text and are stored directly in Couchbase. When a query is made, Couchbase performs a similarity search by comparing the query vector against the stored document vectors. The similarity metric used for this comparison is configurable, allowing flexibility in how the relevance of documents is determined. Common metrics include cosine similarity, Euclidean distance, or dot product, but other metrics can be implemented based on specific use cases. 
Different embedding models like BERT, Word2Vec, or GloVe can also be used depending on the application's needs, with the vectors generated by these models stored and searched within Couchbase itself.\n", - "\n", - "In the provided code, the search process begins by recording the start time, followed by executing the `similarity_search_with_score` method of the `CouchbaseQueryVectorStore`. This method searches Couchbase for the most relevant documents based on the vector similarity to the query. The search results include the document content and the distance that reflects how closely each document aligns with the query in the defined semantic space. The time taken to perform this search is then calculated and logged, and the results are displayed, showing the most relevant documents along with their similarity scores. This approach leverages Couchbase as both a storage and retrieval engine for vector data, enabling efficient and scalable semantic searches. The integration of vector storage and search capabilities within Couchbase allows for sophisticated semantic search operations without relying on external services for vector storage or comparison." - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "metadata": { - "id": "Pk-oFbnC9Uym" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-22 12:41:51,036 - INFO - Semantic search completed in 2.55 seconds\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Semantic Search Results (completed in 2.55 seconds):\n", - "Distance: 0.3697, Text: The Littler effect - how darts hit the bullseye\n", - "\n", - "Teenager Luke Littler began his bid to win the 2025 PDC World Darts Championship with a second-round win against Ryan Meikle. Here we assess Littler's impact after a remarkable rise which saw him named BBC Young Sports Personality of the Year and runner-up in the main award to athlete Keely Hodgkinson.\n", - "\n", - "One year ago, he was barely a household name in his own home. Now he is a sporting phenomenon. After emerging from obscurity aged 16 to reach the World Championship final, the life of Luke Littler and the sport he loves has been transformed. Viewing figures, ticket sales and social media interest have rocketed. Darts has hit the bullseye. This Christmas more than 100,000 children are expected to be opening Littler-branded magnetic dartboards as presents. His impact has helped double the number of junior academies, prompted plans to expand the World Championship and generated interest in darts from Saudi Arabian backers.\n", - "\n", - "Just months after taking his GCSE exams and ranked 164th in the world, Littler beat former champions Raymond van Barneveld and Rob Cross en route to the PDC World Championship final in January, before his run ended with a 7-4 loss to Luke Humphries. With his nickname 'The Nuke' on his purple and yellow shirt and the Alexandra Palace crowd belting out his walk-on song, Pitbull's tune Greenlight, he became an instant hit. Electric on the stage, calm off it. The down-to-earth teenager celebrated with a kebab and computer games. \"We've been watching his progress since he was about seven. He was on our radar, but we never anticipated what would happen. The next thing we know 'Littlermania' is spreading everywhere,\" PDC president Barry Hearn told BBC Sport. A peak TV audience of 3.7 million people watched the final - easily Sky's biggest figure for a non-football sporting event. 
The teenager from Warrington in Cheshire was too young to legally drive or drink alcohol, but earned £200,000 for finishing second - part of £1m prize money in his first year as a professional - and an invitation to the elite Premier League competition. He turned 17 later in January but was he too young for the demanding event over 17 Thursday nights in 17 locations? He ended up winning the whole thing, and hit a nine-dart finish against Humphries in the final. From Bahrain to Wolverhampton, Littler claimed 10 titles in 2024 and is now eyeing the World Championship.\n", - "\n", - "As he progressed at the Ally Pally, the Manchester United fan was sent a good luck message by the club's former midfielder and ex-England captain David Beckham. In 12 months, Littler's Instagram followers have risen from 4,000 to 1.3m. Commercial backers include a clothing range, cereal firm and train company and he will appear in a reboot of the TV darts show Bullseye. Google say he was the most searched-for athlete online in the UK during 2024. On the back of his success, Littler darts, boards, cabinets, shirts are being snapped up in big numbers. \"This Christmas the junior magnetic dartboard is selling out, we're talking over 100,000. They're 20 quid and a great introduction for young children,\" said Garry Plummer, the boss of sponsors Target Darts, who first signed a deal with Littler's family when he was aged 12. \"All the toy shops want it, they all want him - 17, clean, doesn't drink, wonderful.\"\n", - "\n", - "Littler beat Luke Humphries to win the Premier League title in May\n", - "\n", - "The number of academies for children under the age of 16 has doubled in the last year, says Junior Darts Corporation chairman Steve Brown. There are 115 dedicated groups offering youngsters equipment, tournaments and a place to develop, with bases including Australia, Bulgaria, Greece, Norway, USA and Mongolia. \"We've seen so many inquiries from around the world, it's been such a boom. It took us 14 years to get 1,600 members and within 12 months we have over 3,000, and waiting lists,\" said Brown. \"When I played darts as a child, I was quite embarrassed to tell my friends what my hobby was. All these kids playing darts now are pretty popular at school. It's a bit rock 'n roll and recognised as a cool thing to do.\" Plans are being hatched to extend the World Championship by four days and increase the number of players from 96 to 128. That will boost the number of tickets available by 25,000 to 115,000 but Hearn reckons he could sell three times as many. He says Saudi Arabia wants to host a tournament, which is likely to happen if no-alcohol regulations are relaxed. \"They will change their rules in the next 12 months probably for certain areas having alcohol, and we'll take darts there and have a party in Saudi,\" he said. \"When I got involved in darts, the total prize money was something like £300,000 for the year. This year it will go to £20m. I expect in five years' time, we'll be playing for £40m.\"\n", - "\n", - "Former electrician Cross charged to the 2018 world title in his first full season, while Adrian Lewis and Michael van Gerwen were multiple victors in their 20s and 16-time champion Phil ‘The Power’ Taylor is widely considered the greatest of all time. Littler is currently fourth in the world rankings, although that is based on a two-year Order of Merit. 
There have been suggestions from others the spotlight on the teenager means world number one Humphries, 29, has been denied the coverage he deserves, but no darts player has made a mark at such a young age as Littler. \"Luke Humphries is another fabulous player who is going to be around for years. Sport is a very brutal world. It is about winning and claiming the high ground. There will be envy around,\" Hearn said. \"Luke Littler is the next Tiger Woods for darts so they better get used to it, and the only way to compete is to get better.\" World number 38 Martin Lukeman was awestruck as he described facing a peak Littler after being crushed 16-3 in the Grand Slam final, with the teenager winning 15 consecutive legs. \"I can't compete with that, it was like Godly. He was relentless, he is so good it's ridiculous,\" he said. Lukeman can still see the benefits he brings, adding: \"What he's done for the sport is brilliant. If it wasn't for him, our wages wouldn't be going up. There's more sponsors, more money coming in, all good.\" Hearn feels future competition may come from players even younger than Littler. \"I watched a 10-year-old a few months ago who averaged 104.89 and checked out a 4-3 win with a 136 finish. They smell the money, the fame and put the hard work in,\" he said. How much better Littler can get is guesswork, although Plummer believes he wants to reach new heights. \"He never says 'how good was I?' But I think he wants to break records and beat Phil Taylor's 16 World Championships and 16 World Matchplay titles,\" he said. \"He's young enough to do it.\" A version of this article was originally published on 29 November.\n", - "• None Know a lot about Littler? Take our quiz\n", - "Distance: 0.3901, Text: Luke Littler has risen from 164th to fourth in the rankings in a year\n", - "\n", - "A tearful Luke Littler hit a tournament record 140.91 set average as he started his bid for the PDC World Championship title with a dramatic 3-1 win over Ryan Meikle. The 17-year-old made headlines around the world when he reached the tournament final in January, where he lost to Luke Humphries. Starting this campaign on Saturday, Littler was millimetres away from a nine-darter when he missed double 12 as he blew Meikle away in the fourth and final set of the second-round match. Littler was overcome with emotion at the end, cutting short his on-stage interview. \"It was probably the toughest game I've ever played. I had to fight until the end,\" he said later in a news conference. \"As soon as the question came on stage and then boom, the tears came. It was just a bit too much to speak on stage. \"It is the worst game I have played. I have never felt anything like that tonight.\" Admitting to nerves during the match, he told Sky Sports: \"Yes, probably the biggest time it's hit me. Coming into it I was fine, but as soon as [referee] George Noble said 'game on', I couldn't throw them.\" Littler started slowly against Meikle, who had two darts for the opening set, but he took the lead by twice hitting double 20. Meikle did not look overawed against his fellow Englishman and levelled, but Littler won the third set and exploded into life in the fourth. The tournament favourite hit four maximum 180s as he clinched three straight legs in 11, 10 and 11 darts for a record set average, and 100.85 overall. Meanwhile, two seeds crashed out on Saturday night – five-time world champion Raymond van Barneveld lost to Welshman Nick Kenny, while England's Ryan Joyce beat Danny Noppert. 
Australian Damon Heta was another to narrowly miss out on a nine-darter, just failing on double 12 when throwing for the match in a 3-1 win over Connor Scutt. Ninth seed Heta hit four 100-plus checkouts to come from a set down against Scutt in a match in which both men averaged more than 97.\n", - "\n", - "Littler was hugged by his parents after victory over Meikle\n", - "\n", - "Littler returned to Alexandra Palace to a boisterous reception from more than 3,000 spectators and delivered an astonishing display in the fourth set. He was on for a nine-darter after his opening two throws in both of the first two legs and completed the set in 32 darts - the minimum possible is 27. The teenager will next play after Christmas against European Championship winner Ritchie Edhouse, the 29th seed, or Ian White, and is seeded to meet Humphries in the semi-finals. Having entered last year's event ranked 164th, Littler is up to fourth in the world and will go to number two if he reaches the final again this time. He has won 10 titles in his debut professional year, including the Premier League and Grand Slam of Darts. After reaching the World Championship final as a debutant aged just 16, Littler's life has been transformed and interest in darts has rocketed. Google say he was the most searched-for athlete online in the UK during 2024. This Christmas, more than 100,000 children are expected to be opening Littler-branded magnetic dartboards as presents. His impact has helped double the number of junior academies and has prompted plans to expand the World Championship. Littler was named BBC Young Sports Personality of the Year on Tuesday and was runner-up to athlete Keely Hodgkinson for the main award.\n", - "\n", - "... (output truncated for brevity)\n" - ] - } - ], - "source": [ - "query = \"What were Luke Littler's key achievements and records in his recent PDC World Championship match?\"\n", - "\n", - "try:\n", - " # Perform the semantic search\n", - " start_time = time.time()\n", - " search_results = vector_store.similarity_search_with_score(query, k=10)\n", - " search_elapsed_time = time.time() - start_time\n", - "\n", - " logging.info(f\"Semantic search completed in {search_elapsed_time:.2f} seconds\")\n", - "\n", - " # Display search results\n", - " print(f\"\\nSemantic Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", - " for doc, score in search_results:\n", - " print(f\"Distance: {score:.4f}, Text: {doc.page_content}\")\n", - "\n", - "except CouchbaseException as e:\n", - " raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n", - "except Exception as e:\n", - " raise RuntimeError(f\"Unexpected error: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Optimizing Vector Search with Global Secondary Index (GSI)\n", - "\n", - "While the above semantic search using similarity_search_with_score works effectively, we can significantly improve query performance by leveraging Global Secondary Index (GSI) in Couchbase.\n", - "\n", - "Couchbase offers three types of vector indexes, but for GSI-based vector search we focus on two main types:\n", - "\n", - "Hyperscale Vector Indexes (BHIVE)\n", - "- Best for pure vector searches - content discovery, recommendations, semantic search\n", - "- High performance with low memory footprint - designed to scale to billions of vectors\n", - "- Optimized for concurrent operations - supports simultaneous searches and inserts\n", - "- Use when: You primarily perform vector-only queries without 
complex scalar filtering\n", - " - Ideal for: Large-scale semantic search, recommendation systems, content discovery\n", - "\n", - "Composite Vector Indexes\n", - " - Best for filtered vector searches - combines vector search with scalar value filtering\n", - " - Efficient pre-filtering - scalar attributes reduce the vector comparison scope\n", - " - Use when: Your queries combine vector similarity with scalar filters that eliminate large portions of data\n", - " - Ideal for: Compliance-based filtering, user-specific searches, time-bounded queries\n", - "\n", - "Choosing the Right Index Type\n", - " - Start with Hyperscale Vector Index for pure vector searches and large datasets\n", - " - Use Composite Vector Index when scalar filters significantly reduce your search space\n", - " - Consider your dataset size: Hyperscale scales to billions, Composite works well for tens of millions to billions\n", - "\n", - "For more details, see the [Couchbase Vector Index documentation](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html).\n", - "\n", - "\n", - "## Understanding Index Configuration (Couchbase 8.0 Feature)\n", - "\n", - "The index_description parameter controls how Couchbase optimizes vector storage and search performance through centroids and quantization:\n", - "\n", - "Format: `IVF[<num_centroids>],{SQ<bits>|PQ<subquantizers>x<bits>}`\n", - "\n", - "Centroids (IVF - Inverted File):\n", - "- Controls how the dataset is subdivided for faster searches\n", - "- More centroids = faster search, slower training\n", - "- Fewer centroids = slower search, faster training\n", - "- If omitted (like IVF,SQ8), Couchbase auto-selects based on dataset size\n", - "\n", - "Quantization Options:\n", - "- SQ (Scalar Quantization): SQ4, SQ6, SQ8 (4, 6, or 8 bits per dimension)\n", - "- PQ (Product Quantization): PQ<subquantizers>x<bits> (e.g., PQ32x8)\n", - "- Higher values = better accuracy, larger index size\n", - "\n", - "Common Examples:\n", - "- IVF,SQ8 - Auto centroids, 8-bit scalar quantization (good default)\n", - "- IVF1000,SQ6 - 1000 centroids, 6-bit scalar quantization\n", - "- IVF,PQ32x8 - Auto centroids, 32 subquantizers with 8 bits\n", - "\n", - "For detailed configuration options, see the [Quantization & Centroid Settings](https://docs.couchbase.com/cloud/vector-index/hyperscale-vector-index.html#algo_settings).\n", - "\n", - "In the code below, we demonstrate creating a BHIVE index. This method takes an index type (BHIVE or COMPOSITE) and a description parameter for optimization settings. Alternatively, GSI indexes can be created manually from the Couchbase UI." - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "metadata": {}, - "outputs": [], - "source": [ - "vector_store.create_index(index_type=IndexType.BHIVE, index_name=\"azure_bhive_index\", index_description=\"IVF,SQ8\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The example below shows running the same similarity search, but now using the BHIVE GSI index we created above. You'll notice improved performance as the index efficiently retrieves data.\n", - "\n", - "**Important**: When using Composite indexes, scalar filters take precedence over vector similarity, which can improve performance for filtered searches but may miss some semantically relevant results that don't match the scalar criteria.\n", - "\n", - "Note: In GSI vector search, the distance represents the vector distance between the query and document embeddings. Lower distances indicate higher similarity, while higher distances indicate lower similarity.\n",
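- "\n", - "Because the store was configured with `DistanceStrategy.COSINE`, a returned distance can be read as `1 - cosine_similarity` (the usual definition of cosine distance), so a distance of 0.37 corresponds to a similarity of roughly 0.63. A small illustrative sketch:\n", - "\n", - "```python\n", - "def similarity_from_cosine_distance(distance: float) -> float:\n", - "    # cosine distance = 1 - cosine similarity\n", - "    return 1.0 - distance\n", - "\n", - "for doc, distance in search_results:\n", - "    print(f\"similarity={similarity_from_cosine_distance(distance):.4f}\")\n", - "```"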
- ] - }, - { - "cell_type": "code", - "execution_count": 21, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-22 12:42:10,244 - INFO - Semantic search completed in 1.30 seconds\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Semantic Search Results (completed in 1.30 seconds):\n", - "--------------------------------------------------------------------------------\n", - "Distance: 0.3697, Text: The Littler effect - how darts hit the bullseye\n", - "\n", - "Teenager Luke Littler began his bid to win the 2025 PDC World Darts Championship with a second-round win against Ryan Meikle. Here we assess Littler's impact after a remarkable rise which saw him named BBC Young Sports Personality of the Year and runner-up in the main award to athlete Keely Hodgkinson.\n", - "\n", - "One year ago, he was barely a household name in his own home. Now he is a sporting phenomenon. After emerging from obscurity aged 16 to reach the World Championship final, the life of Luke Littler and the sport he loves has been transformed. Viewing figures, ticket sales and social media interest have rocketed. Darts has hit the bullseye. This Christmas more than 100,000 children are expected to be opening Littler-branded magnetic dartboards as presents. His impact has helped double the number of junior academies, prompted plans to expand the World Championship and generated interest in darts from Saudi Arabian backers.\n", - "\n", - "Just months after taking his GCSE exams and ranked 164th in the world, Littler beat former champions Raymond van Barneveld and Rob Cross en route to the PDC World Championship final in January, before his run ended with a 7-4 loss to Luke Humphries. With his nickname 'The Nuke' on his purple and yellow shirt and the Alexandra Palace crowd belting out his walk-on song, Pitbull's tune Greenlight, he became an instant hit. Electric on the stage, calm off it. The down-to-earth teenager celebrated with a kebab and computer games. \"We've been watching his progress since he was about seven. He was on our radar, but we never anticipated what would happen. The next thing we know 'Littlermania' is spreading everywhere,\" PDC president Barry Hearn told BBC Sport. A peak TV audience of 3.7 million people watched the final - easily Sky's biggest figure for a non-football sporting event. The teenager from Warrington in Cheshire was too young to legally drive or drink alcohol, but earned £200,000 for finishing second - part of £1m prize money in his first year as a professional - and an invitation to the elite Premier League competition. He turned 17 later in January but was he too young for the demanding event over 17 Thursday nights in 17 locations? He ended up winning the whole thing, and hit a nine-dart finish against Humphries in the final. From Bahrain to Wolverhampton, Littler claimed 10 titles in 2024 and is now eyeing the World Championship.\n", - "\n", - "As he progressed at the Ally Pally, the Manchester United fan was sent a good luck message by the club's former midfielder and ex-England captain David Beckham. In 12 months, Littler's Instagram followers have risen from 4,000 to 1.3m. Commercial backers include a clothing range, cereal firm and train company and he will appear in a reboot of the TV darts show Bullseye. Google say he was the most searched-for athlete online in the UK during 2024. On the back of his success, Littler darts, boards, cabinets, shirts are being snapped up in big numbers. 
\"This Christmas the junior magnetic dartboard is selling out, we're talking over 100,000. They're 20 quid and a great introduction for young children,\" said Garry Plummer, the boss of sponsors Target Darts, who first signed a deal with Littler's family when he was aged 12. \"All the toy shops want it, they all want him - 17, clean, doesn't drink, wonderful.\"\n", - "\n", - "Littler beat Luke Humphries to win the Premier League title in May\n", - "\n", - "The number of academies for children under the age of 16 has doubled in the last year, says Junior Darts Corporation chairman Steve Brown. There are 115 dedicated groups offering youngsters equipment, tournaments and a place to develop, with bases including Australia, Bulgaria, Greece, Norway, USA and Mongolia. \"We've seen so many inquiries from around the world, it's been such a boom. It took us 14 years to get 1,600 members and within 12 months we have over 3,000, and waiting lists,\" said Brown. \"When I played darts as a child, I was quite embarrassed to tell my friends what my hobby was. All these kids playing darts now are pretty popular at school. It's a bit rock 'n roll and recognised as a cool thing to do.\" Plans are being hatched to extend the World Championship by four days and increase the number of players from 96 to 128. That will boost the number of tickets available by 25,000 to 115,000 but Hearn reckons he could sell three times as many. He says Saudi Arabia wants to host a tournament, which is likely to happen if no-alcohol regulations are relaxed. \"They will change their rules in the next 12 months probably for certain areas having alcohol, and we'll take darts there and have a party in Saudi,\" he said. \"When I got involved in darts, the total prize money was something like £300,000 for the year. This year it will go to £20m. I expect in five years' time, we'll be playing for £40m.\"\n", - "\n", - "Former electrician Cross charged to the 2018 world title in his first full season, while Adrian Lewis and Michael van Gerwen were multiple victors in their 20s and 16-time champion Phil ‘The Power’ Taylor is widely considered the greatest of all time. Littler is currently fourth in the world rankings, although that is based on a two-year Order of Merit. There have been suggestions from others the spotlight on the teenager means world number one Humphries, 29, has been denied the coverage he deserves, but no darts player has made a mark at such a young age as Littler. \"Luke Humphries is another fabulous player who is going to be around for years. Sport is a very brutal world. It is about winning and claiming the high ground. There will be envy around,\" Hearn said. \"Luke Littler is the next Tiger Woods for darts so they better get used to it, and the only way to compete is to get better.\" World number 38 Martin Lukeman was awestruck as he described facing a peak Littler after being crushed 16-3 in the Grand Slam final, with the teenager winning 15 consecutive legs. \"I can't compete with that, it was like Godly. He was relentless, he is so good it's ridiculous,\" he said. Lukeman can still see the benefits he brings, adding: \"What he's done for the sport is brilliant. If it wasn't for him, our wages wouldn't be going up. There's more sponsors, more money coming in, all good.\" Hearn feels future competition may come from players even younger than Littler. \"I watched a 10-year-old a few months ago who averaged 104.89 and checked out a 4-3 win with a 136 finish. 
They smell the money, the fame and put the hard work in,\" he said. How much better Littler can get is guesswork, although Plummer believes he wants to reach new heights. \"He never says 'how good was I?' But I think he wants to break records and beat Phil Taylor's 16 World Championships and 16 World Matchplay titles,\" he said. \"He's young enough to do it.\" A version of this article was originally published on 29 November.\n", - "• None Know a lot about Littler? Take our quiz\n", - "--------------------------------------------------------------------------------\n", - "Distance: 0.3901, Text: Luke Littler has risen from 164th to fourth in the rankings in a year\n", - "\n", - "A tearful Luke Littler hit a tournament record 140.91 set average as he started his bid for the PDC World Championship title with a dramatic 3-1 win over Ryan Meikle. The 17-year-old made headlines around the world when he reached the tournament final in January, where he lost to Luke Humphries. Starting this campaign on Saturday, Littler was millimetres away from a nine-darter when he missed double 12 as he blew Meikle away in the fourth and final set of the second-round match. Littler was overcome with emotion at the end, cutting short his on-stage interview. \"It was probably the toughest game I've ever played. I had to fight until the end,\" he said later in a news conference. \"As soon as the question came on stage and then boom, the tears came. It was just a bit too much to speak on stage. \"It is the worst game I have played. I have never felt anything like that tonight.\" Admitting to nerves during the match, he told Sky Sports: \"Yes, probably the biggest time it's hit me. Coming into it I was fine, but as soon as [referee] George Noble said 'game on', I couldn't throw them.\" Littler started slowly against Meikle, who had two darts for the opening set, but he took the lead by twice hitting double 20. Meikle did not look overawed against his fellow Englishman and levelled, but Littler won the third set and exploded into life in the fourth. The tournament favourite hit four maximum 180s as he clinched three straight legs in 11, 10 and 11 darts for a record set average, and 100.85 overall. Meanwhile, two seeds crashed out on Saturday night – five-time world champion Raymond van Barneveld lost to Welshman Nick Kenny, while England's Ryan Joyce beat Danny Noppert. Australian Damon Heta was another to narrowly miss out on a nine-darter, just failing on double 12 when throwing for the match in a 3-1 win over Connor Scutt. Ninth seed Heta hit four 100-plus checkouts to come from a set down against Scutt in a match in which both men averaged more than 97.\n", - "\n", - "Littler was hugged by his parents after victory over Meikle\n", - "\n", - "... 
(output truncated for brevity)\n" - ] - } - ], - "source": [ - "query = \"What were Luke Littler's key achievements and records in his recent PDC World Championship match?\"\n", - "\n", - "try:\n", - " # Perform the semantic search\n", - " start_time = time.time()\n", - " search_results = vector_store.similarity_search_with_score(query, k=10)\n", - " search_elapsed_time = time.time() - start_time\n", - "\n", - " logging.info(f\"Semantic search completed in {search_elapsed_time:.2f} seconds\")\n", - "\n", - " # Display search results\n", - " print(f\"\\nSemantic Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", - " print(\"-\" * 80)\n", - "\n", - " for doc, score in search_results:\n", - " print(f\"Distance: {score:.4f}, Text: {doc.page_content}\")\n", - " print(\"-\" * 80)\n", - "\n", - "except CouchbaseException as e:\n", - " raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n", - "except Exception as e:\n", - " raise RuntimeError(f\"Unexpected error: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Note: To create a COMPOSITE index, the code below can be used.\n", - "Choose based on your specific use case and query patterns. For this tutorial's question-answering scenario over the BBC News articles, either index type would work, but BHIVE might be more efficient for pure semantic search across the articles." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vector_store.create_index(index_type=IndexType.COMPOSITE, index_name=\"azure_composite_index\", index_description=\"IVF,SQ8\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Setting Up a Couchbase Cache\n", - "To further optimize our system, we set up a Couchbase-based cache. A cache is a temporary storage layer that holds frequently accessed data, speeding up operations by reducing the need to repeatedly retrieve the same information from the database. In our setup, the cache will help us accelerate repetitive tasks, such as answering previously seen queries. By implementing a cache, we enhance the overall performance of our search engine, ensuring that it can handle high query volumes and deliver results quickly.\n", - "\n", - "Caching is particularly valuable in scenarios where users may submit similar queries multiple times or where certain pieces of information are frequently requested. By storing these responses in a cache, we can significantly reduce the time it takes to respond to these queries, improving the user experience.\n",
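- "\n", - "Once the RAG queries later in this tutorial have run, you can confirm that responses are landing in the cache collection with a quick SQL++ count (a sketch, reusing the names configured earlier):\n", - "\n", - "```python\n", - "result = cluster.query(\n", - "    f\"SELECT COUNT(*) AS cached FROM `{CB_BUCKET_NAME}`.`{SCOPE_NAME}`.`{CACHE_COLLECTION}`\"\n", - ")\n", - "for row in result:\n", - "    print(f\"Cached entries: {row['cached']}\")\n", - "```"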
- ] - }, - { - "cell_type": "code", - "execution_count": 22, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-22 12:42:21,917 - INFO - Successfully created cache\n" - ] - } - ], - "source": [ - "try:\n", - " cache = CouchbaseCache(\n", - " cluster=cluster,\n", - " bucket_name=CB_BUCKET_NAME,\n", - " scope_name=SCOPE_NAME,\n", - " collection_name=CACHE_COLLECTION,\n", - " )\n", - " logging.info(\"Successfully created cache\")\n", - " set_llm_cache(cache)\n", - "except Exception as e:\n", - " raise ValueError(f\"Failed to create cache: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "sS0FebHI9U1l" - }, - "source": [ - "# Retrieval-Augmented Generation (RAG) with Couchbase and Langchain\n", - "Couchbase and LangChain can be seamlessly integrated to create RAG (Retrieval-Augmented Generation) chains, enhancing the process of generating contextually relevant responses. In this setup, Couchbase serves as the vector store, where embeddings of documents are stored. When a query is made, LangChain retrieves the most relevant documents from Couchbase by comparing the query’s embedding with the stored document embeddings. These documents, which provide contextual information, are then passed to a generative language model within LangChain.\n", - "\n", - "The language model, equipped with the context from the retrieved documents, generates a response that is both informed and contextually accurate. This integration allows the RAG chain to leverage Couchbase’s efficient storage and retrieval capabilities, while LangChain handles the generation of responses based on the context provided by the retrieved documents. Together, they create a powerful system that can deliver highly relevant and accurate answers by combining the strengths of both retrieval and generation." - ] - }, - { - "cell_type": "code", - "execution_count": 36, - "metadata": { - "id": "ZGUXQQmv9ge4" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-16 13:41:05,596 - INFO - Successfully created RAG chain\n" - ] - } - ], - "source": [ - "# Create RAG prompt template\n", - "rag_prompt = ChatPromptTemplate.from_messages([\n", - " (\"system\", \"You are a helpful assistant that answers questions based on the provided context.\"),\n", - " (\"human\", \"Context: {context}\\n\\nQuestion: {question}\")\n", - "])\n", - "\n", - "# Create RAG chain\n", - "rag_chain = (\n", - " {\"context\": vector_store.as_retriever(), \"question\": RunnablePassthrough()}\n", - " | rag_prompt\n", - " | llm\n", - " | StrOutputParser()\n", - ")\n", - "logging.info(\"Successfully created RAG chain\")" - ] - }, - { - "cell_type": "code", - "execution_count": 37, - "metadata": { - "id": "Mia7XxM9978M" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "RAG Response: In his recent PDC World Championship match, Luke Littler achieved several key milestones and records:\n", - "\n", - "1. **Tournament Record Average**: Littler set a tournament record with a 140.91 set average during the fourth and final set of his second-round match against Ryan Meikle.\n", - "\n", - "2. **Nine-Darter Attempt**: He came close to achieving a nine-darter but narrowly missed double 12.\n", - "\n", - "3. **Dramatic Victory**: Littler defeated Meikle 3-1 in a match described as emotionally challenging for the 17-year-old.\n", - "\n", - "4. 
**Fourth Set Dominance**: In the final set, Littler exploded into life, hitting four maximum 180s and winning three straight legs in 11, 10, and 11 darts.\n", - "\n", - "5. **Overall Set Performance**: He completed the fourth set in 32 darts (the minimum possible is 27) and achieved a match average of 100.85.\n", - "\n", - "These achievements highlight Littler's exceptional talent and his continued rise in professional darts.\n", - "RAG response generated in 5.81 seconds\n" - ] - } - ], - "source": [ - "start_time = time.time()\n", - "# Turn off excessive logging\n", - "logging.basicConfig(level=logging.WARNING, format='%(asctime)s - %(levelname)s - %(message)s', force=True)\n", - "\n", - "try:\n", - " rag_response = rag_chain.invoke(query)\n", - " rag_elapsed_time = time.time() - start_time\n", - " print(f\"RAG Response: {rag_response}\")\n", - " print(f\"RAG response generated in {rag_elapsed_time:.2f} seconds\")\n", - "except InternalServerFailureException as e:\n", - " if \"query request rejected\" in str(e):\n", - " print(\"Error: Search request was rejected due to rate limiting. Please try again later.\")\n", - " else:\n", - " print(f\"Internal server error occurred: {str(e)}\")\n", - "except Exception as e:\n", - " print(f\"Unexpected error occurred: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "aIdayPzw9glT" - }, - "source": [ - "# Using Couchbase as a caching mechanism\n", - "Couchbase can be effectively used as a caching mechanism for RAG (Retrieval-Augmented Generation) responses by storing and retrieving precomputed results for specific queries. This approach enhances the system's efficiency and speed, particularly when dealing with repeated or similar queries. When a query is first processed, the RAG chain retrieves relevant documents, generates a response using the language model, and then stores this response in Couchbase, with the prompt sent to the LLM serving as the cache key.\n", - "\n", - "For subsequent requests with the same query, the system checks Couchbase first. If a cached response is found, it is retrieved directly from Couchbase, bypassing the need to re-run the entire RAG process. This significantly reduces response time because the computationally expensive steps of document retrieval and response generation are skipped. Couchbase's role in this setup is to provide a fast and scalable storage solution for caching these responses, ensuring that frequently asked queries can be answered more quickly and efficiently.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 38, - "metadata": { - "id": "0xM2G3ef-GS2" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Query 1: What happened in the match between Fulham and Liverpool?\n", - "Response: In the Premier League match between Fulham and Liverpool, the game ended in a 2-2 draw at Anfield. Liverpool played the majority of the game with ten men after Andy Robertson was shown a red card in the 17th minute for denying Harry Wilson a goalscoring opportunity. Despite their numerical disadvantage, Liverpool demonstrated resilience and strong performance.\n", - "\n", - "Fulham took the lead twice during the match, but Liverpool managed to equalize on both occasions. Diogo Jota, returning from injury, scored the crucial 86th-minute equalizer for Liverpool. Even with 10 players, Liverpool maintained over 60% possession and led various attacking metrics, including shots, big chances, and touches in the opposition box. 
\n", - "\n", - "Fulham's left-back Antonee Robinson praised Liverpool’s performance, stating that it didn’t feel like they had 10 men on the field due to their attacking risks and relentless pressure. Liverpool head coach Arne Slot called his team's performance \"impressive\" and lauded their character and fight in adversity.\n", - "Time taken: 6.69 seconds\n", - "\n", - "Query 2: What were Luke Littler's key achievements and records in his recent PDC World Championship match?\n", - "Response: In his recent PDC World Championship match, Luke Littler achieved several key milestones and records:\n", - "\n", - "1. **Tournament Record Average**: Littler set a tournament record with a 140.91 set average during the fourth and final set of his second-round match against Ryan Meikle.\n", - "\n", - "2. **Nine-Darter Attempt**: He came close to achieving a nine-darter but narrowly missed double 12.\n", - "\n", - "3. **Dramatic Victory**: Littler defeated Meikle 3-1 in a match described as emotionally challenging for the 17-year-old.\n", - "\n", - "4. **Fourth Set Dominance**: In the final set, Littler exploded into life, hitting four maximum 180s and winning three straight legs in 11, 10, and 11 darts.\n", - "\n", - "5. **Overall Set Performance**: He completed the fourth set in 32 darts (the minimum possible is 27) and achieved a match average of 100.85.\n", - "\n", - "These achievements highlight Littler's exceptional talent and his continued rise in professional darts.\n", - "Time taken: 1.09 seconds\n", - "\n", - "\n", - "... (output truncated for brevity)\n" - ] - } - ], - "source": [ - "try:\n", - " queries = [\n", - " \"What happened in the match between Fullham and Liverpool?\",\n", - " \"What were Luke Littler's key achievements and records in his recent PDC World Championship match?\",\n", - " \"What happened in the match between Fullham and Liverpool?\", # Repeated query\n", - " ]\n", - "\n", - " for i, query in enumerate(queries, 1):\n", - " print(f\"\\nQuery {i}: {query}\")\n", - " start_time = time.time()\n", - "\n", - " response = rag_chain.invoke(query)\n", - " elapsed_time = time.time() - start_time\n", - " print(f\"Response: {response}\")\n", - " print(f\"Time taken: {elapsed_time:.2f} seconds\")\n", - "\n", - "except InternalServerFailureException as e:\n", - " if \"query request rejected\" in str(e):\n", - " print(\"Error: Search request was rejected due to rate limiting. Please try again later.\")\n", - " else:\n", - " print(f\"Internal server error occurred: {str(e)}\")\n", - "except Exception as e:\n", - " print(f\"Unexpected error occurred: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "yJQ5P8E29go1" - }, - "source": [ - "By following these steps, you'll have a fully functional semantic search engine that leverages the strengths of Couchbase and AzureOpenAI. This guide is designed not just to show you how to build the system, but also to explain why each step is necessary, giving you a deeper understanding of the principles behind semantic search and how it improves querying data more efficiently using GSI which can significantly improve your RAG performance. Whether you're a newcomer to software development or an experienced developer looking to expand your skills, this guide will provide you with the knowledge and tools you need to create a powerful, AI-driven search engine." 
- ] - } - ], - "metadata": { - "colab": { - "provenance": [], - "toc_visible": true - }, - "kernelspec": { - "display_name": ".venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.13.3" - } - }, - "nbformat": 4, - "nbformat_minor": 0 -} diff --git a/azure/gsi/.env.sample b/azure/query_based/.env.sample similarity index 100% rename from azure/gsi/.env.sample rename to azure/query_based/.env.sample diff --git a/azure/query_based/RAG_with_Couchbase_and_AzureOpenAI.ipynb b/azure/query_based/RAG_with_Couchbase_and_AzureOpenAI.ipynb new file mode 100644 index 00000000..364dc00d --- /dev/null +++ b/azure/query_based/RAG_with_Couchbase_and_AzureOpenAI.ipynb @@ -0,0 +1,1103 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "kNdImxzypDlm" + }, + "source": [ + "# Introduction\n", + "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database and [AzureOpenAI](https://azure.microsoft.com/) as the AI-powered embedding and language model provider. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system using Couchbase Hyperscale and Composite Vector Indexes from scratch. For guidance on choosing the right vector index for your use case, see the [Couchbase documentation](https://docs.couchbase.com/server/current/vector-search/choose-the-right-vector-index.html). Alternatively, if you want to perform semantic search using the Couchbase Search Vector Index, please take a look at [this tutorial](https://developer.couchbase.com/tutorial-azure-openai-couchbase-rag-with-search-vector-index/)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# How to run this tutorial\n", + "\n", + "This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/azure/query_based/RAG_with_Couchbase_and_AzureOpenAI.ipynb).\n", + "\n", + "You can either download the notebook file and run it on [Google Colab](https://colab.research.google.com/) or run it on your system by setting up the Python environment." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Before you start\n", + "\n", + "## Get Credentials for Azure OpenAI\n", + "\n", + "Please follow the [instructions](https://learn.microsoft.com/en-us/azure/ai-services/openai/reference) to generate the Azure OpenAI credentials.\n", + "\n", + "## Create and Deploy Your Free Tier Operational cluster on Capella\n", + "\n", + "To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. 
This account provides you with an environment where you can explore and learn about Capella with no time constraint.\n", + "\n", + "To know more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n", + "\n", + "Note: To run this tutorial, you will need Capella with Couchbase Server version 8.0 or above, as Hyperscale and Composite Vector Indexes are supported only from version 8.0.\n", + "\n", + "### Couchbase Capella Configuration\n", + "\n", + "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met.\n", + "\n", + "* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the bucket (Read and Write) used in the application.\n", + "* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NH2o6pqa69oG" + }, + "source": [ + "# Setting the Stage: Installing Necessary Libraries\n", + "To build our semantic search engine, we need a robust set of tools. The libraries we install handle everything from connecting to databases to performing complex machine learning tasks. Each library has a specific role: Couchbase libraries manage database operations, LangChain handles AI model integrations, and AzureOpenAI provides advanced AI models for generating embeddings and understanding natural language. By setting up these libraries, we ensure our environment is equipped to handle the data-intensive and computationally complex tasks required for semantic search." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "DYhPj0Ta8l_A" + }, + "outputs": [], + "source": [ + "!pip install --quiet datasets==3.5.0 langchain-couchbase==0.5.0 langchain-openai==0.3.32" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1pp7GtNg8mB9" + }, + "source": [ + "# Importing Necessary Libraries\n", + "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, embedding generation, and dataset loading. These libraries provide essential functions for working with data, managing database connections, and processing machine learning models."
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "id": "8GzS6tfL8mFP" + }, + "outputs": [], + "source": [ + "import getpass\n", + "import json\n", + "import logging\n", + "import sys\n", + "import os\n", + "import time\n", + "from datetime import timedelta\n", + "from uuid import uuid4\n", + "\n", + "from couchbase.auth import PasswordAuthenticator\n", + "from couchbase.cluster import Cluster\n", + "from couchbase.exceptions import (\n", + "    CouchbaseException,\n", + "    InternalServerFailureException,\n", + "    QueryIndexAlreadyExistsException,\n", + ")\n", + "# CreateBucketSettings is used by setup_collection() below when the bucket does not exist yet\n", + "from couchbase.management.buckets import CreateBucketSettings\n", + "from couchbase.options import ClusterOptions\n", + "from datasets import load_dataset\n", + "from langchain_core.documents import Document\n", + "from langchain_core.globals import set_llm_cache\n", + "from langchain_core.output_parsers import StrOutputParser\n", + "from langchain_core.prompts.chat import ChatPromptTemplate\n", + "from langchain_core.runnables import RunnablePassthrough\n", + "from langchain_couchbase.cache import CouchbaseCache\n", + "from langchain_couchbase.vectorstores import (\n", + "    CouchbaseQueryVectorStore,\n", + "    DistanceStrategy,\n", + "    IndexType,\n", + ")\n", + "from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings\n", + "from tqdm import tqdm" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pBnMp5vb8mIb" + }, + "source": [ + "# Setup Logging\n", + "Logging is configured to track the progress of the script and capture any errors or warnings. This is crucial for debugging and understanding the flow of execution. The logging output includes timestamps, log levels (e.g., INFO, ERROR), and messages that describe what is happening in the script.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "id": "Yv8kWcuf8mLx" + }, + "outputs": [], + "source": [ + "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', force=True)\n", + "\n", + "# Suppress verbose HTTP request logging\n", + "logging.getLogger(\"httpx\").setLevel(logging.WARNING)\n", + "logging.getLogger(\"openai\").setLevel(logging.WARNING)\n", + "logging.getLogger(\"urllib3\").setLevel(logging.WARNING)\n", + "logging.getLogger(\"azure\").setLevel(logging.WARNING)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K9G5a0en8mPA" + }, + "source": [ + "# Loading Sensitive Information\n", + "In this section, we prompt the user to input the essential configuration settings. These settings include sensitive information like API keys, database credentials, and specific configuration names. Instead of hardcoding these details into the script, we request the user to provide them at runtime, ensuring flexibility and security.\n", + "\n", + "The script also validates that all required inputs are provided, raising an error if any crucial information is missing. This approach ensures that your integration is both secure and correctly configured without hardcoding sensitive information, enhancing the overall security and maintainability of your code."
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "id": "PFGyHll18mSe" + }, + "outputs": [], + "source": [ + "AZURE_OPENAI_KEY = os.getenv('AZURE_OPENAI_KEY') or getpass.getpass('Enter your Azure OpenAI Key: ')\n", + "AZURE_OPENAI_ENDPOINT = os.getenv('AZURE_OPENAI_ENDPOINT') or input('Enter your Azure OpenAI Endpoint: ')\n", + "AZURE_OPENAI_EMBEDDING_DEPLOYMENT = os.getenv('AZURE_OPENAI_EMBEDDING_DEPLOYMENT') or input('Enter your Azure OpenAI Embedding Deployment: ')\n", + "AZURE_OPENAI_CHAT_DEPLOYMENT = os.getenv('AZURE_OPENAI_CHAT_DEPLOYMENT') or input('Enter your Azure OpenAI Chat Deployment: ')\n", + "\n", + "CB_HOST = os.getenv('CB_HOST') or input('Enter your Couchbase host (default: couchbase://localhost): ') or 'couchbase://localhost'\n", + "CB_USERNAME = os.getenv('CB_USERNAME') or input('Enter your Couchbase username (default: Administrator): ') or 'Administrator'\n", + "CB_PASSWORD = os.getenv('CB_PASSWORD') or getpass.getpass('Enter your Couchbase password (default: password): ') or 'password'\n", + "CB_BUCKET_NAME = os.getenv('CB_BUCKET_NAME') or input('Enter your Couchbase bucket name (default: query-vector-search-testing): ') or 'query-vector-search-testing'\n", + "SCOPE_NAME = os.getenv('SCOPE_NAME') or input('Enter your scope name (default: shared): ') or 'shared'\n", + "COLLECTION_NAME = os.getenv('COLLECTION_NAME') or input('Enter your collection name (default: azure): ') or 'azure'\n", + "CACHE_COLLECTION = os.getenv('CACHE_COLLECTION') or input('Enter your cache collection name (default: cache): ') or 'cache'\n", + "\n", + "# Check if the variables are correctly loaded\n", + "if not all([AZURE_OPENAI_KEY, AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_EMBEDDING_DEPLOYMENT, AZURE_OPENAI_CHAT_DEPLOYMENT]):\n", + " raise ValueError(\"Missing required Azure OpenAI variables\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qtGrYzUY8mV3" + }, + "source": [ + "# Connecting to the Couchbase Cluster\n", + "Connecting to a Couchbase cluster is the foundation of our project. Couchbase will serve as our primary data store, handling all the storage and retrieval operations required for our semantic search engine. By establishing this connection, we enable our application to interact with the database, allowing us to perform operations such as storing embeddings, querying data, and managing collections. This connection is the gateway through which all data will flow, so ensuring it's set up correctly is paramount.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "id": "Zb3kK-7W8mZK" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-22 12:23:15,245 - INFO - Successfully connected to Couchbase\n" + ] + } + ], + "source": [ + "try:\n", + " auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)\n", + " options = ClusterOptions(auth)\n", + " cluster = Cluster(CB_HOST, options)\n", + " cluster.wait_until_ready(timedelta(seconds=5))\n", + " logging.info(\"Successfully connected to Couchbase\")\n", + "except Exception as e:\n", + " raise ConnectionError(f\"Failed to connect to Couchbase: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C_Gpy32N8mcZ" + }, + "source": [ + "## Setting Up Collections in Couchbase\n", + "\n", + "The setup_collection() function handles creating and configuring the hierarchical data organization in Couchbase:\n", + "\n", + "1. 
Bucket Creation:\n", + " - Checks if specified bucket exists, creates it if not\n", + " - Sets bucket properties like RAM quota (1024MB) and replication (disabled)\n", + " - Note: You will not be able to create a bucket on Capella\n", + "\n", + "2. Scope Management: \n", + " - Verifies if requested scope exists within bucket\n", + " - Creates new scope if needed (unless it's the default \"_default\" scope)\n", + "\n", + "3. Collection Setup:\n", + " - Checks for collection existence within scope\n", + " - Creates collection if it doesn't exist\n", + " - Waits 2 seconds for collection to be ready\n", + "\n", + "Additional Tasks:\n", + "- Clears any existing documents for clean state\n", + "- Implements comprehensive error handling and logging\n", + "\n", + "The function is called twice to set up:\n", + "1. Main collection for vector embeddings\n", + "2. Cache collection for storing results" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "id": "ACZcwUnG8mf2" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-22 12:23:20,911 - INFO - Bucket 'query-vector-search-testing' exists.\n", + "2025-09-22 12:23:20,927 - INFO - Collection 'azure' already exists. Skipping creation.\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-22 12:23:23,264 - INFO - All documents cleared from the collection.\n", + "2025-09-22 12:23:23,265 - INFO - Bucket 'query-vector-search-testing' exists.\n", + "2025-09-22 12:23:23,280 - INFO - Collection 'cache' already exists. Skipping creation.\n", + "2025-09-22 12:23:25,419 - INFO - All documents cleared from the collection.\n" + ] + }, + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def setup_collection(cluster, bucket_name, scope_name, collection_name):\n", + " try:\n", + " # Check if bucket exists, create if it doesn't\n", + " try:\n", + " bucket = cluster.bucket(bucket_name)\n", + " logging.info(f\"Bucket '{bucket_name}' exists.\")\n", + " except Exception as e:\n", + " logging.info(f\"Bucket '{bucket_name}' does not exist. Creating it...\")\n", + " bucket_settings = CreateBucketSettings(\n", + " name=bucket_name,\n", + " bucket_type='couchbase',\n", + " ram_quota_mb=1024,\n", + " flush_enabled=True,\n", + " num_replicas=0\n", + " )\n", + " cluster.buckets().create_bucket(bucket_settings)\n", + " time.sleep(2) # Wait for bucket creation to complete and become available\n", + " bucket = cluster.bucket(bucket_name)\n", + " logging.info(f\"Bucket '{bucket_name}' created successfully.\")\n", + "\n", + " bucket_manager = bucket.collections()\n", + "\n", + " # Check if scope exists, create if it doesn't\n", + " scopes = bucket_manager.get_all_scopes()\n", + " scope_exists = any(scope.name == scope_name for scope in scopes)\n", + " \n", + " if not scope_exists and scope_name != \"_default\":\n", + " logging.info(f\"Scope '{scope_name}' does not exist. 
Creating it...\")\n", + " bucket_manager.create_scope(scope_name)\n", + " logging.info(f\"Scope '{scope_name}' created successfully.\")\n", + "\n", + " # Check if collection exists, create if it doesn't\n", + " collections = bucket_manager.get_all_scopes()\n", + " collection_exists = any(\n", + " scope.name == scope_name and collection_name in [col.name for col in scope.collections]\n", + " for scope in collections\n", + " )\n", + "\n", + " if not collection_exists:\n", + " logging.info(f\"Collection '{collection_name}' does not exist. Creating it...\")\n", + " bucket_manager.create_collection(scope_name, collection_name)\n", + " logging.info(f\"Collection '{collection_name}' created successfully.\")\n", + " else:\n", + " logging.info(f\"Collection '{collection_name}' already exists. Skipping creation.\")\n", + "\n", + " # Wait for collection to be ready\n", + " collection = bucket.scope(scope_name).collection(collection_name)\n", + " time.sleep(2) # Give the collection time to be ready for queries\n", + "\n", + " # Clear all documents in the collection\n", + " try:\n", + " query = f\"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`\"\n", + " cluster.query(query).execute()\n", + " logging.info(\"All documents cleared from the collection.\")\n", + " except Exception as e:\n", + " logging.warning(f\"Error while clearing documents: {str(e)}. The collection might be empty.\")\n", + "\n", + " return collection\n", + " except Exception as e:\n", + " raise RuntimeError(f\"Error setting up collection: {str(e)}\")\n", + " \n", + "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME)\n", + "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, CACHE_COLLECTION)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QRV4k06L8mwS" + }, + "source": [ + "# Load the BBC News Dataset\n", + "To build a search engine, we need data to search through. We use the BBC News dataset from RealTimeData, which provides real-world news articles. This dataset contains news articles from BBC covering various topics and time periods. Loading the dataset is a crucial step because it provides the raw material that our search engine will work with. The quality and diversity of the news articles make it an excellent choice for testing and refining our search engine, ensuring it can handle real-world news content effectively.\n", + "\n", + "The BBC News dataset allows us to work with authentic news articles, enabling us to build and test a search engine that can effectively process and retrieve relevant news content. The dataset is loaded using the Hugging Face datasets library, specifically accessing the \"RealTimeData/bbc_news_alltime\" dataset with the \"2024-12\" version." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "id": "TRfRslF_8mzo" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-22 12:23:43,453 - INFO - Successfully loaded the BBC News dataset with 2687 rows.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loaded the BBC News dataset with 2687 rows\n" + ] + } + ], + "source": [ + "try:\n", + " news_dataset = load_dataset(\n", + " \"RealTimeData/bbc_news_alltime\", \"2024-12\", split=\"train\"\n", + " )\n", + " print(f\"Loaded the BBC News dataset with {len(news_dataset)} rows\")\n", + " logging.info(f\"Successfully loaded the BBC News dataset with {len(news_dataset)} rows.\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Error loading the BBC News dataset: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Cleaning up the Data\n", + "We will use the content of the news articles for our RAG system.\n", + "\n", + "The dataset contains a few duplicate records. We are removing them to avoid duplicate results in the retrieval stage of our RAG system." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "We have 1749 unique articles in our database.\n" + ] + } + ], + "source": [ + "news_articles = news_dataset[\"content\"]\n", + "unique_articles = set()\n", + "for article in news_articles:\n", + " if article:\n", + " unique_articles.add(article)\n", + "unique_news_articles = list(unique_articles)\n", + "print(f\"We have {len(unique_news_articles)} unique articles in our database.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7FvxRsg38m3G" + }, + "source": [ + "# Creating AzureOpenAI Embeddings\n", + "Embeddings are at the heart of semantic search. They are numerical representations of text that capture the semantic meaning of the words and phrases. Unlike traditional keyword-based search, which looks for exact matches, embeddings allow our search engine to understand the context and nuances of language, enabling it to retrieve documents that are semantically similar to the query, even if they don't contain the exact keywords. By creating embeddings using AzureOpenAI, we equip our search engine with the ability to understand and process natural language in a way that's much closer to how humans understand language. This step transforms our raw text data into a format that the search engine can use to find and rank relevant documents.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "id": "_75ZyCRh8m6m" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-22 12:23:51,333 - INFO - Successfully created AzureOpenAIEmbeddings\n" + ] + } + ], + "source": [ + "try:\n", + " embeddings = AzureOpenAIEmbeddings(\n", + " deployment=AZURE_OPENAI_EMBEDDING_DEPLOYMENT,\n", + " openai_api_key=AZURE_OPENAI_KEY,\n", + " azure_endpoint=AZURE_OPENAI_ENDPOINT\n", + " )\n", + " logging.info(\"Successfully created AzureOpenAIEmbeddings\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Error creating AzureOpenAIEmbeddings: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8IwZMUnF8m-N" + }, + "source": [ + "# Setting Up the Couchbase Query Vector Store\n", + "A vector store is where we'll keep our embeddings. 
The query vector store is specifically designed to handle embeddings and perform similarity searches. When a user inputs a query, the embedding model converts it into a vector, which is then compared against the embeddings stored in the vector store. This allows the engine to find documents that are semantically similar to the query, even if they don't contain the exact same words. By setting up the vector store in Couchbase, we create a powerful tool that enables us to understand and retrieve information based on the meaning and context of the query, rather than just the specific words used.\n", + "\n", + "The vector store requires a distance metric to determine how similarity between vectors is calculated. This is crucial for accurate semantic search results, as different distance metrics can yield different similarity rankings. The supported distance strategies include dot, l2, euclidean, cosine, l2_squared, and euclidean_squared. In our implementation, we will use cosine, which is particularly effective for text embeddings." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "id": "DwIJQjYT9RV_" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-22 12:24:25,546 - INFO - Successfully created vector store\n" + ] + } + ], + "source": [ + "try:\n", + "    vector_store = CouchbaseQueryVectorStore(\n", + "        cluster=cluster,\n", + "        bucket_name=CB_BUCKET_NAME,\n", + "        scope_name=SCOPE_NAME,\n", + "        collection_name=COLLECTION_NAME,\n", + "        embedding=embeddings,\n", + "        distance_metric=DistanceStrategy.COSINE\n", + "    )\n", + "    logging.info(\"Successfully created vector store\")\n", + "except Exception as e:\n", + "    raise ValueError(f\"Failed to create vector store: {str(e)}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C6DJVz7A9RZA" + }, + "source": [ + "## Saving Data to the Vector Store\n", + "To efficiently handle the large number of articles, we process them in batches of 50 articles at a time. This batch processing approach helps manage memory usage and provides better control over the ingestion process.\n", + "\n", + "We first filter out any articles that exceed 50,000 characters to avoid potential issues with token limits. Then, using the vector store's add_texts method, we add the filtered articles to our vector database. The batch_size parameter controls how many articles are processed in each iteration.\n", + "\n", + "This approach offers several benefits:\n", + "1. Memory Efficiency: Processing in smaller batches prevents memory overload\n", + "2. Progress Tracking: Easier to monitor and track the ingestion progress\n", + "3. Resource Management: Better control over CPU and network resource utilization\n", + "\n", + "We use a conservative batch size of 50 to ensure reliable operation.\n", + "The optimal batch size depends on many factors including:\n", + "- Document sizes being inserted\n", + "- Available system resources\n", + "- Network conditions\n", + "- Concurrent workload\n", + "\n", + "Consider measuring performance with your specific workload before adjusting, as sketched below."
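+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As a rough illustration of how you might measure this, the sketch below times `add_texts` over a small sample for two candidate batch sizes. This is only a sketch, not part of the tutorial's pipeline: the sample size and the candidate values are arbitrary assumptions, and each trial writes the sample documents into the collection, so you may want to clear the collection afterwards before running the full ingestion." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Timing sketch (illustrative only): compare candidate batch sizes on a small sample.\n", + "# Note: each trial writes the sample documents into the collection.\n", + "sample = [a for a in unique_news_articles if a and len(a) <= 50000][:100]\n", + "\n", + "for candidate in (25, 50):\n", + "    start = time.time()\n", + "    vector_store.add_texts(texts=sample, batch_size=candidate)\n", + "    print(f\"batch_size={candidate}: {time.time() - start:.2f}s for {len(sample)} articles\")"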
+ ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "id": "_6opqqvx9Rb_" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-22 12:36:18,756 - INFO - Document ingestion completed successfully.\n" + ] + } + ], + "source": [ + "batch_size = 50\n", + "\n", + "# Filter out overly long articles, then ingest them in batches\n", + "articles = [article for article in unique_news_articles if article and len(article) <= 50000]\n", + "\n", + "try:\n", + "    vector_store.add_texts(\n", + "        texts=articles,\n", + "        batch_size=batch_size\n", + "    )\n", + "    logging.info(\"Document ingestion completed successfully.\")\n", + "except Exception as e:\n", + "    raise ValueError(f\"Failed to save documents to vector store: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uehAx36o9Rlm" + }, + "source": [ + "# Using the AzureChatOpenAI Language Model (LLM)\n", + "Language models are AI systems that are trained to understand and generate human language. We'll be using the `AzureChatOpenAI` language model to process user queries and generate meaningful responses. This model is a key component of our semantic search engine, allowing it to go beyond simple keyword matching and truly understand the intent behind a query. By creating this language model, we equip our search engine with the ability to interpret complex queries, understand the nuances of language, and provide more accurate and contextually relevant responses.\n", + "\n", + "The language model's ability to understand context and generate coherent responses is what makes our search engine truly intelligent. It can not only find the right information but also present it in a way that is useful and understandable to the user.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "id": "yRAfBRLH9RpO" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-22 12:39:45,695 - INFO - Successfully created Azure OpenAI Chat model\n" + ] + } + ], + "source": [ + "try:\n", + "    llm = AzureChatOpenAI(\n", + "        deployment_name=AZURE_OPENAI_CHAT_DEPLOYMENT,\n", + "        openai_api_key=AZURE_OPENAI_KEY,\n", + "        azure_endpoint=AZURE_OPENAI_ENDPOINT,\n", + "        openai_api_version=\"2024-10-21\"\n", + "    )\n", + "    logging.info(\"Successfully created Azure OpenAI Chat model\")\n", + "except Exception as e:\n", + "    raise ValueError(f\"Error creating Azure OpenAI Chat model: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "k_XDfCx19UvG" + }, + "source": [ + "# Perform Semantic Search\n", + "Semantic search in Couchbase involves converting queries and documents into vector representations using an embeddings model. These vectors capture the semantic meaning of the text and are stored directly in Couchbase. When a query is made, Couchbase performs a similarity search by comparing the query vector against the stored document vectors. The similarity metric used for this comparison is configurable, allowing flexibility in how the relevance of documents is determined. Common metrics include cosine similarity, Euclidean distance, or dot product, but other metrics can be implemented based on specific use cases. 
Different embedding models like BERT, Word2Vec, or GloVe can also be used depending on the application's needs, with the vectors generated by these models stored and searched within Couchbase itself.\n", + "\n", + "In the provided code, the search process begins by recording the start time, followed by executing the `similarity_search_with_score` method of the `CouchbaseQueryVectorStore`. This method searches Couchbase for the most relevant documents based on the vector similarity to the query. The search results include the document content and the distance that reflects how closely each document aligns with the query in the defined semantic space. The time taken to perform this search is then calculated and logged, and the results are displayed, showing the most relevant documents along with their similarity scores. This approach leverages Couchbase as both a storage and retrieval engine for vector data, enabling efficient and scalable semantic searches. The integration of vector storage and search capabilities within Couchbase allows for sophisticated semantic search operations without relying on external services for vector storage or comparison." + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": { + "id": "Pk-oFbnC9Uym" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-22 12:41:51,036 - INFO - Semantic search completed in 2.55 seconds\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Semantic Search Results (completed in 2.55 seconds):\n", + "Distance: 0.3697, Text: The Littler effect - how darts hit the bullseye\n", + "\n", + "Teenager Luke Littler began his bid to win the 2025 PDC World Darts Championship with a second-round win against Ryan Meikle. Here we assess Littler's impact after a remarkable rise which saw him named BBC Young Sports Personality of the Year and runner-up in the main award to athlete Keely Hodgkinson.\n", + "\n", + "One year ago, he was barely a household name in his own home. Now he is a sporting phenomenon. After emerging from obscurity aged 16 to reach the World Championship final, the life of Luke Littler and the sport he loves has been transformed. Viewing figures, ticket sales and social media interest have rocketed. Darts has hit the bullseye. This Christmas more than 100,000 children are expected to be opening Littler-branded magnetic dartboards as presents. His impact has helped double the number of junior academies, prompted plans to expand the World Championship and generated interest in darts from Saudi Arabian backers.\n", + "\n", + "Just months after taking his GCSE exams and ranked 164th in the world, Littler beat former champions Raymond van Barneveld and Rob Cross en route to the PDC World Championship final in January, before his run ended with a 7-4 loss to Luke Humphries. With his nickname 'The Nuke' on his purple and yellow shirt and the Alexandra Palace crowd belting out his walk-on song, Pitbull's tune Greenlight, he became an instant hit. Electric on the stage, calm off it. The down-to-earth teenager celebrated with a kebab and computer games. \"We've been watching his progress since he was about seven. He was on our radar, but we never anticipated what would happen. The next thing we know 'Littlermania' is spreading everywhere,\" PDC president Barry Hearn told BBC Sport. A peak TV audience of 3.7 million people watched the final - easily Sky's biggest figure for a non-football sporting event. 
The teenager from Warrington in Cheshire was too young to legally drive or drink alcohol, but earned \u00a3200,000 for finishing second - part of \u00a31m prize money in his first year as a professional - and an invitation to the elite Premier League competition. He turned 17 later in January but was he too young for the demanding event over 17 Thursday nights in 17 locations? He ended up winning the whole thing, and hit a nine-dart finish against Humphries in the final. From Bahrain to Wolverhampton, Littler claimed 10 titles in 2024 and is now eyeing the World Championship.\n", + "\n", + "As he progressed at the Ally Pally, the Manchester United fan was sent a good luck message by the club's former midfielder and ex-England captain David Beckham. In 12 months, Littler's Instagram followers have risen from 4,000 to 1.3m. Commercial backers include a clothing range, cereal firm and train company and he will appear in a reboot of the TV darts show Bullseye. Google say he was the most searched-for athlete online in the UK during 2024. On the back of his success, Littler darts, boards, cabinets, shirts are being snapped up in big numbers. \"This Christmas the junior magnetic dartboard is selling out, we're talking over 100,000. They're 20 quid and a great introduction for young children,\" said Garry Plummer, the boss of sponsors Target Darts, who first signed a deal with Littler's family when he was aged 12. \"All the toy shops want it, they all want him - 17, clean, doesn't drink, wonderful.\"\n", + "\n", + "Littler beat Luke Humphries to win the Premier League title in May\n", + "\n", + "The number of academies for children under the age of 16 has doubled in the last year, says Junior Darts Corporation chairman Steve Brown. There are 115 dedicated groups offering youngsters equipment, tournaments and a place to develop, with bases including Australia, Bulgaria, Greece, Norway, USA and Mongolia. \"We've seen so many inquiries from around the world, it's been such a boom. It took us 14 years to get 1,600 members and within 12 months we have over 3,000, and waiting lists,\" said Brown. \"When I played darts as a child, I was quite embarrassed to tell my friends what my hobby was. All these kids playing darts now are pretty popular at school. It's a bit rock 'n roll and recognised as a cool thing to do.\" Plans are being hatched to extend the World Championship by four days and increase the number of players from 96 to 128. That will boost the number of tickets available by 25,000 to 115,000 but Hearn reckons he could sell three times as many. He says Saudi Arabia wants to host a tournament, which is likely to happen if no-alcohol regulations are relaxed. \"They will change their rules in the next 12 months probably for certain areas having alcohol, and we'll take darts there and have a party in Saudi,\" he said. \"When I got involved in darts, the total prize money was something like \u00a3300,000 for the year. This year it will go to \u00a320m. I expect in five years' time, we'll be playing for \u00a340m.\"\n", + "\n", + "Former electrician Cross charged to the 2018 world title in his first full season, while Adrian Lewis and Michael van Gerwen were multiple victors in their 20s and 16-time champion Phil \u2018The Power\u2019 Taylor is widely considered the greatest of all time. Littler is currently fourth in the world rankings, although that is based on a two-year Order of Merit. 
There have been suggestions from others the spotlight on the teenager means world number one Humphries, 29, has been denied the coverage he deserves, but no darts player has made a mark at such a young age as Littler. \"Luke Humphries is another fabulous player who is going to be around for years. Sport is a very brutal world. It is about winning and claiming the high ground. There will be envy around,\" Hearn said. \"Luke Littler is the next Tiger Woods for darts so they better get used to it, and the only way to compete is to get better.\" World number 38 Martin Lukeman was awestruck as he described facing a peak Littler after being crushed 16-3 in the Grand Slam final, with the teenager winning 15 consecutive legs. \"I can't compete with that, it was like Godly. He was relentless, he is so good it's ridiculous,\" he said. Lukeman can still see the benefits he brings, adding: \"What he's done for the sport is brilliant. If it wasn't for him, our wages wouldn't be going up. There's more sponsors, more money coming in, all good.\" Hearn feels future competition may come from players even younger than Littler. \"I watched a 10-year-old a few months ago who averaged 104.89 and checked out a 4-3 win with a 136 finish. They smell the money, the fame and put the hard work in,\" he said. How much better Littler can get is guesswork, although Plummer believes he wants to reach new heights. \"He never says 'how good was I?' But I think he wants to break records and beat Phil Taylor's 16 World Championships and 16 World Matchplay titles,\" he said. \"He's young enough to do it.\" A version of this article was originally published on 29 November.\n", + "\u2022 None Know a lot about Littler? Take our quiz\n", + "Distance: 0.3901, Text: Luke Littler has risen from 164th to fourth in the rankings in a year\n", + "\n", + "A tearful Luke Littler hit a tournament record 140.91 set average as he started his bid for the PDC World Championship title with a dramatic 3-1 win over Ryan Meikle. The 17-year-old made headlines around the world when he reached the tournament final in January, where he lost to Luke Humphries. Starting this campaign on Saturday, Littler was millimetres away from a nine-darter when he missed double 12 as he blew Meikle away in the fourth and final set of the second-round match. Littler was overcome with emotion at the end, cutting short his on-stage interview. \"It was probably the toughest game I've ever played. I had to fight until the end,\" he said later in a news conference. \"As soon as the question came on stage and then boom, the tears came. It was just a bit too much to speak on stage. \"It is the worst game I have played. I have never felt anything like that tonight.\" Admitting to nerves during the match, he told Sky Sports: \"Yes, probably the biggest time it's hit me. Coming into it I was fine, but as soon as [referee] George Noble said 'game on', I couldn't throw them.\" Littler started slowly against Meikle, who had two darts for the opening set, but he took the lead by twice hitting double 20. Meikle did not look overawed against his fellow Englishman and levelled, but Littler won the third set and exploded into life in the fourth. The tournament favourite hit four maximum 180s as he clinched three straight legs in 11, 10 and 11 darts for a record set average, and 100.85 overall. Meanwhile, two seeds crashed out on Saturday night \u2013 five-time world champion Raymond van Barneveld lost to Welshman Nick Kenny, while England's Ryan Joyce beat Danny Noppert. 
Australian Damon Heta was another to narrowly miss out on a nine-darter, just failing on double 12 when throwing for the match in a 3-1 win over Connor Scutt. Ninth seed Heta hit four 100-plus checkouts to come from a set down against Scutt in a match in which both men averaged more than 97.\n", + "\n", + "Littler was hugged by his parents after victory over Meikle\n", + "\n", + "Littler returned to Alexandra Palace to a boisterous reception from more than 3,000 spectators and delivered an astonishing display in the fourth set. He was on for a nine-darter after his opening two throws in both of the first two legs and completed the set in 32 darts - the minimum possible is 27. The teenager will next play after Christmas against European Championship winner Ritchie Edhouse, the 29th seed, or Ian White, and is seeded to meet Humphries in the semi-finals. Having entered last year's event ranked 164th, Littler is up to fourth in the world and will go to number two if he reaches the final again this time. He has won 10 titles in his debut professional year, including the Premier League and Grand Slam of Darts. After reaching the World Championship final as a debutant aged just 16, Littler's life has been transformed and interest in darts has rocketed. Google say he was the most searched-for athlete online in the UK during 2024. This Christmas, more than 100,000 children are expected to be opening Littler-branded magnetic dartboards as presents. His impact has helped double the number of junior academies and has prompted plans to expand the World Championship. Littler was named BBC Young Sports Personality of the Year on Tuesday and was runner-up to athlete Keely Hodgkinson for the main award.\n", + "\n", + "... (output truncated for brevity)\n" + ] + } + ], + "source": [ + "query = \"What were Luke Littler's key achievements and records in his recent PDC World Championship match?\"\n", + "\n", + "try:\n", + "    # Perform the semantic search\n", + "    start_time = time.time()\n", + "    search_results = vector_store.similarity_search_with_score(query, k=10)\n", + "    search_elapsed_time = time.time() - start_time\n", + "\n", + "    logging.info(f\"Semantic search completed in {search_elapsed_time:.2f} seconds\")\n", + "\n", + "    # Display search results\n", + "    print(f\"\\nSemantic Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", + "    for doc, score in search_results:\n", + "        print(f\"Distance: {score:.4f}, Text: {doc.page_content}\")\n", + "\n", + "except CouchbaseException as e:\n", + "    raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n", + "except Exception as e:\n", + "    raise RuntimeError(f\"Unexpected error: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Optimizing Vector Search with Hyperscale and Composite Vector Indexes\n", + "\n", + "While the above semantic search using similarity_search_with_score works effectively, we can significantly improve query performance by leveraging Hyperscale and Composite Vector Indexes in Couchbase.\n", + "\n", + "Couchbase offers three types of vector indexes, but for query-based vector search we focus on the two main types:\n", + "\n", + "Hyperscale Vector Indexes (BHIVE)\n", + "- Best for pure vector searches - content discovery, recommendations, semantic search\n", + "- High performance with low memory footprint - designed to scale to billions of vectors\n", + "- Optimized for concurrent operations - supports simultaneous searches and inserts\n", + "- Use when: You primarily 
perform vector-only queries without complex scalar filtering\n", + "- Ideal for: Large-scale semantic search, recommendation systems, content discovery\n", + "\n", + "Composite Vector Indexes\n", + "- Best for filtered vector searches - combines vector search with scalar value filtering\n", + "- Efficient pre-filtering - scalar attributes reduce the vector comparison scope\n", + "- Use when: Your queries combine vector similarity with scalar filters that eliminate large portions of data\n", + "- Ideal for: Compliance-based filtering, user-specific searches, time-bounded queries\n", + "\n", + "Choosing the Right Index Type\n", + "- Start with Hyperscale Vector Index for pure vector searches and large datasets\n", + "- Use Composite Vector Index when scalar filters significantly reduce your search space\n", + "- Consider your dataset size: Hyperscale scales to billions, Composite works well for tens of millions to billions\n", + "\n", + "For more details, see the [Couchbase Vector Index documentation](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html).\n", + "\n", + "\n", + "## Understanding Index Configuration (Couchbase 8.0 Feature)\n", + "\n", + "The index_description parameter controls how Couchbase optimizes vector storage and search performance through centroids and quantization:\n", + "\n", + "Format: `IVF[<centroids>],{PQ<subquantizers>x<bits>|SQ<bits>}`\n", + "\n", + "Centroids (IVF - Inverted File):\n", + "- Controls how the dataset is subdivided for faster searches\n", + "- More centroids = faster search, slower training\n", + "- Fewer centroids = slower search, faster training\n", + "- If omitted (like IVF,SQ8), Couchbase auto-selects based on dataset size\n", + "\n", + "Quantization Options:\n", + "- SQ (Scalar Quantization): SQ4, SQ6, SQ8 (4, 6, or 8 bits per dimension)\n", + "- PQ (Product Quantization): PQ<subquantizers>x<bits> (e.g., PQ32x8)\n", + "- Higher values = better accuracy, larger index size\n", + "\n", + "Common Examples:\n", + "- IVF,SQ8 - Auto centroids, 8-bit scalar quantization (good default)\n", + "- IVF1000,SQ6 - 1000 centroids, 6-bit scalar quantization\n", + "- IVF,PQ32x8 - Auto centroids, 32 subquantizers with 8 bits\n", + "\n", + "For detailed configuration options, see the [Quantization & Centroid Settings](https://docs.couchbase.com/cloud/vector-index/hyperscale-vector-index.html#algo_settings).\n", + "\n", + "In the code below, we demonstrate creating a BHIVE index. This method takes an index type (BHIVE or COMPOSITE) and a description parameter for optimization settings. Alternatively, Hyperscale and Composite Vector indexes can be created manually from the Couchbase UI." + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [], + "source": [ + "vector_store.create_index(index_type=IndexType.BHIVE, index_name=\"azure_bhive_index\", index_description=\"IVF,SQ8\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The example below shows running the same similarity search, but now using the BHIVE index we created above. You'll notice improved performance as the index efficiently retrieves data.\n", + "\n", + "**Important**: When using Composite indexes, scalar filters take precedence over vector similarity, which can improve performance for filtered searches but may miss some semantically relevant results that don't match the scalar criteria.\n", + "\n", + "Note: In query-based vector search, the distance represents the vector distance between the query and document embeddings. 
Lower distances indicate higher similarity, while higher distances indicate lower similarity." + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-22 12:42:10,244 - INFO - Semantic search completed in 1.30 seconds\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Semantic Search Results (completed in 1.30 seconds):\n", + "--------------------------------------------------------------------------------\n", + "Distance: 0.3697, Text: The Littler effect - how darts hit the bullseye\n", + "\n", + "Teenager Luke Littler began his bid to win the 2025 PDC World Darts Championship with a second-round win against Ryan Meikle. Here we assess Littler's impact after a remarkable rise which saw him named BBC Young Sports Personality of the Year and runner-up in the main award to athlete Keely Hodgkinson.\n", + "\n", + "One year ago, he was barely a household name in his own home. Now he is a sporting phenomenon. After emerging from obscurity aged 16 to reach the World Championship final, the life of Luke Littler and the sport he loves has been transformed. Viewing figures, ticket sales and social media interest have rocketed. Darts has hit the bullseye. This Christmas more than 100,000 children are expected to be opening Littler-branded magnetic dartboards as presents. His impact has helped double the number of junior academies, prompted plans to expand the World Championship and generated interest in darts from Saudi Arabian backers.\n", + "\n", + "Just months after taking his GCSE exams and ranked 164th in the world, Littler beat former champions Raymond van Barneveld and Rob Cross en route to the PDC World Championship final in January, before his run ended with a 7-4 loss to Luke Humphries. With his nickname 'The Nuke' on his purple and yellow shirt and the Alexandra Palace crowd belting out his walk-on song, Pitbull's tune Greenlight, he became an instant hit. Electric on the stage, calm off it. The down-to-earth teenager celebrated with a kebab and computer games. \"We've been watching his progress since he was about seven. He was on our radar, but we never anticipated what would happen. The next thing we know 'Littlermania' is spreading everywhere,\" PDC president Barry Hearn told BBC Sport. A peak TV audience of 3.7 million people watched the final - easily Sky's biggest figure for a non-football sporting event. The teenager from Warrington in Cheshire was too young to legally drive or drink alcohol, but earned \u00a3200,000 for finishing second - part of \u00a31m prize money in his first year as a professional - and an invitation to the elite Premier League competition. He turned 17 later in January but was he too young for the demanding event over 17 Thursday nights in 17 locations? He ended up winning the whole thing, and hit a nine-dart finish against Humphries in the final. From Bahrain to Wolverhampton, Littler claimed 10 titles in 2024 and is now eyeing the World Championship.\n", + "\n", + "As he progressed at the Ally Pally, the Manchester United fan was sent a good luck message by the club's former midfielder and ex-England captain David Beckham. In 12 months, Littler's Instagram followers have risen from 4,000 to 1.3m. Commercial backers include a clothing range, cereal firm and train company and he will appear in a reboot of the TV darts show Bullseye. Google say he was the most searched-for athlete online in the UK during 2024. 
On the back of his success, Littler darts, boards, cabinets, shirts are being snapped up in big numbers. \"This Christmas the junior magnetic dartboard is selling out, we're talking over 100,000. They're 20 quid and a great introduction for young children,\" said Garry Plummer, the boss of sponsors Target Darts, who first signed a deal with Littler's family when he was aged 12. \"All the toy shops want it, they all want him - 17, clean, doesn't drink, wonderful.\"\n", + "\n", + "Littler beat Luke Humphries to win the Premier League title in May\n", + "\n", + "The number of academies for children under the age of 16 has doubled in the last year, says Junior Darts Corporation chairman Steve Brown. There are 115 dedicated groups offering youngsters equipment, tournaments and a place to develop, with bases including Australia, Bulgaria, Greece, Norway, USA and Mongolia. \"We've seen so many inquiries from around the world, it's been such a boom. It took us 14 years to get 1,600 members and within 12 months we have over 3,000, and waiting lists,\" said Brown. \"When I played darts as a child, I was quite embarrassed to tell my friends what my hobby was. All these kids playing darts now are pretty popular at school. It's a bit rock 'n roll and recognised as a cool thing to do.\" Plans are being hatched to extend the World Championship by four days and increase the number of players from 96 to 128. That will boost the number of tickets available by 25,000 to 115,000 but Hearn reckons he could sell three times as many. He says Saudi Arabia wants to host a tournament, which is likely to happen if no-alcohol regulations are relaxed. \"They will change their rules in the next 12 months probably for certain areas having alcohol, and we'll take darts there and have a party in Saudi,\" he said. \"When I got involved in darts, the total prize money was something like \u00a3300,000 for the year. This year it will go to \u00a320m. I expect in five years' time, we'll be playing for \u00a340m.\"\n", + "\n", + "Former electrician Cross charged to the 2018 world title in his first full season, while Adrian Lewis and Michael van Gerwen were multiple victors in their 20s and 16-time champion Phil \u2018The Power\u2019 Taylor is widely considered the greatest of all time. Littler is currently fourth in the world rankings, although that is based on a two-year Order of Merit. There have been suggestions from others the spotlight on the teenager means world number one Humphries, 29, has been denied the coverage he deserves, but no darts player has made a mark at such a young age as Littler. \"Luke Humphries is another fabulous player who is going to be around for years. Sport is a very brutal world. It is about winning and claiming the high ground. There will be envy around,\" Hearn said. \"Luke Littler is the next Tiger Woods for darts so they better get used to it, and the only way to compete is to get better.\" World number 38 Martin Lukeman was awestruck as he described facing a peak Littler after being crushed 16-3 in the Grand Slam final, with the teenager winning 15 consecutive legs. \"I can't compete with that, it was like Godly. He was relentless, he is so good it's ridiculous,\" he said. Lukeman can still see the benefits he brings, adding: \"What he's done for the sport is brilliant. If it wasn't for him, our wages wouldn't be going up. There's more sponsors, more money coming in, all good.\" Hearn feels future competition may come from players even younger than Littler. 
\"I watched a 10-year-old a few months ago who averaged 104.89 and checked out a 4-3 win with a 136 finish. They smell the money, the fame and put the hard work in,\" he said. How much better Littler can get is guesswork, although Plummer believes he wants to reach new heights. \"He never says 'how good was I?' But I think he wants to break records and beat Phil Taylor's 16 World Championships and 16 World Matchplay titles,\" he said. \"He's young enough to do it.\" A version of this article was originally published on 29 November.\n", + "\u2022 None Know a lot about Littler? Take our quiz\n", + "--------------------------------------------------------------------------------\n", + "Distance: 0.3901, Text: Luke Littler has risen from 164th to fourth in the rankings in a year\n", + "\n", + "A tearful Luke Littler hit a tournament record 140.91 set average as he started his bid for the PDC World Championship title with a dramatic 3-1 win over Ryan Meikle. The 17-year-old made headlines around the world when he reached the tournament final in January, where he lost to Luke Humphries. Starting this campaign on Saturday, Littler was millimetres away from a nine-darter when he missed double 12 as he blew Meikle away in the fourth and final set of the second-round match. Littler was overcome with emotion at the end, cutting short his on-stage interview. \"It was probably the toughest game I've ever played. I had to fight until the end,\" he said later in a news conference. \"As soon as the question came on stage and then boom, the tears came. It was just a bit too much to speak on stage. \"It is the worst game I have played. I have never felt anything like that tonight.\" Admitting to nerves during the match, he told Sky Sports: \"Yes, probably the biggest time it's hit me. Coming into it I was fine, but as soon as [referee] George Noble said 'game on', I couldn't throw them.\" Littler started slowly against Meikle, who had two darts for the opening set, but he took the lead by twice hitting double 20. Meikle did not look overawed against his fellow Englishman and levelled, but Littler won the third set and exploded into life in the fourth. The tournament favourite hit four maximum 180s as he clinched three straight legs in 11, 10 and 11 darts for a record set average, and 100.85 overall. Meanwhile, two seeds crashed out on Saturday night \u2013 five-time world champion Raymond van Barneveld lost to Welshman Nick Kenny, while England's Ryan Joyce beat Danny Noppert. Australian Damon Heta was another to narrowly miss out on a nine-darter, just failing on double 12 when throwing for the match in a 3-1 win over Connor Scutt. Ninth seed Heta hit four 100-plus checkouts to come from a set down against Scutt in a match in which both men averaged more than 97.\n", + "\n", + "Littler was hugged by his parents after victory over Meikle\n", + "\n", + "... 
(output truncated for brevity)\n" + ] + } + ], + "source": [ + "query = \"What were Luke Littler's key achievements and records in his recent PDC World Championship match?\"\n", + "\n", + "try:\n", + "    # Perform the semantic search\n", + "    start_time = time.time()\n", + "    search_results = vector_store.similarity_search_with_score(query, k=10)\n", + "    search_elapsed_time = time.time() - start_time\n", + "\n", + "    logging.info(f\"Semantic search completed in {search_elapsed_time:.2f} seconds\")\n", + "\n", + "    # Display search results\n", + "    print(f\"\\nSemantic Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", + "    print(\"-\" * 80)\n", + "\n", + "    for doc, score in search_results:\n", + "        print(f\"Distance: {score:.4f}, Text: {doc.page_content}\")\n", + "        print(\"-\" * 80)\n", + "\n", + "except CouchbaseException as e:\n", + "    raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n", + "except Exception as e:\n", + "    raise RuntimeError(f\"Unexpected error: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note: To create a COMPOSITE index, the code below can be used.\n", + "Choose based on your specific use case and query patterns. For this tutorial's question-answering scenario using the BBC News dataset, either index type would work, but BHIVE might be more efficient for pure semantic search across articles." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "vector_store.create_index(index_type=IndexType.COMPOSITE, index_name=\"azure_composite_index\", index_description=\"IVF,SQ8\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Setting Up a Couchbase Cache\n", + "To further optimize our system, we set up a Couchbase-based cache. A cache is a temporary storage layer that holds data that is frequently accessed, speeding up operations by reducing the need to repeatedly retrieve the same information from the database. In our setup, the cache will help us accelerate repetitive tasks, such as looking up similar documents. By implementing a cache, we enhance the overall performance of our search engine, ensuring that it can handle high query volumes and deliver results quickly.\n", + "\n", + "Caching is particularly valuable in scenarios where users may submit similar queries multiple times or where certain pieces of information are frequently requested. By storing these in a cache, we can significantly reduce the time it takes to respond to these queries, improving the user experience. A minimal sketch of the underlying check-then-compute flow follows below, before we wire the cache into LangChain."
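+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The sketch below is illustrative only: the helper name `cached_answer`, the key scheme, and the use of the cache collection as a plain key-value store are assumptions made for the example. `CouchbaseCache` with `set_llm_cache` (created in the next cell) automates this flow for every LLM call, so you do not need this code in the tutorial." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import hashlib\n", + "\n", + "from couchbase.exceptions import DocumentNotFoundException\n", + "\n", + "# Illustrative sketch of the check-then-compute flow that CouchbaseCache automates.\n", + "# The doc id scheme and helper name are hypothetical, not part of the tutorial.\n", + "cache_kv = cluster.bucket(CB_BUCKET_NAME).scope(SCOPE_NAME).collection(CACHE_COLLECTION)\n", + "\n", + "def cached_answer(question, compute):\n", + "    key = \"rag::\" + hashlib.sha256(question.encode()).hexdigest()\n", + "    try:\n", + "        return cache_kv.get(key).content_as[str]  # cache hit: skip retrieval and generation\n", + "    except DocumentNotFoundException:\n", + "        answer = compute(question)  # cache miss: run the full pipeline\n", + "        cache_kv.upsert(key, answer)  # store for next time, keyed by the query\n", + "        return answer\n", + "\n", + "# Once the RAG chain exists (defined below): cached_answer(query, rag_chain.invoke)"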
+ ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-22 12:42:21,917 - INFO - Successfully created cache\n" + ] + } + ], + "source": [ + "try:\n", + " cache = CouchbaseCache(\n", + " cluster=cluster,\n", + " bucket_name=CB_BUCKET_NAME,\n", + " scope_name=SCOPE_NAME,\n", + " collection_name=CACHE_COLLECTION,\n", + " )\n", + " logging.info(\"Successfully created cache\")\n", + " set_llm_cache(cache)\n", + "except Exception as e:\n", + " raise ValueError(f\"Failed to create cache: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sS0FebHI9U1l" + }, + "source": [ + "# Retrieval-Augmented Generation (RAG) with Couchbase and Langchain\n", + "Couchbase and LangChain can be seamlessly integrated to create RAG (Retrieval-Augmented Generation) chains, enhancing the process of generating contextually relevant responses. In this setup, Couchbase serves as the vector store, where embeddings of documents are stored. When a query is made, LangChain retrieves the most relevant documents from Couchbase by comparing the query\u2019s embedding with the stored document embeddings. These documents, which provide contextual information, are then passed to a generative language model within LangChain.\n", + "\n", + "The language model, equipped with the context from the retrieved documents, generates a response that is both informed and contextually accurate. This integration allows the RAG chain to leverage Couchbase\u2019s efficient storage and retrieval capabilities, while LangChain handles the generation of responses based on the context provided by the retrieved documents. Together, they create a powerful system that can deliver highly relevant and accurate answers by combining the strengths of both retrieval and generation." + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": { + "id": "ZGUXQQmv9ge4" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-16 13:41:05,596 - INFO - Successfully created RAG chain\n" + ] + } + ], + "source": [ + "# Create RAG prompt template\n", + "rag_prompt = ChatPromptTemplate.from_messages([\n", + " (\"system\", \"You are a helpful assistant that answers questions based on the provided context.\"),\n", + " (\"human\", \"Context: {context}\\n\\nQuestion: {question}\")\n", + "])\n", + "\n", + "# Create RAG chain\n", + "rag_chain = (\n", + " {\"context\": vector_store.as_retriever(), \"question\": RunnablePassthrough()}\n", + " | rag_prompt\n", + " | llm\n", + " | StrOutputParser()\n", + ")\n", + "logging.info(\"Successfully created RAG chain\")" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": { + "id": "Mia7XxM9978M" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "RAG Response: In his recent PDC World Championship match, Luke Littler achieved several key milestones and records:\n", + "\n", + "1. **Tournament Record Average**: Littler set a tournament record with a 140.91 set average during the fourth and final set of his second-round match against Ryan Meikle.\n", + "\n", + "2. **Nine-Darter Attempt**: He came close to achieving a nine-darter but narrowly missed double 12.\n", + "\n", + "3. **Dramatic Victory**: Littler defeated Meikle 3-1 in a match described as emotionally challenging for the 17-year-old.\n", + "\n", + "4. 
**Fourth Set Dominance**: In the final set, Littler exploded into life, hitting four maximum 180s and winning three straight legs in 11, 10, and 11 darts.\n", + "\n", + "5. **Overall Set Performance**: He completed the fourth set in 32 darts (the minimum possible is 27) and achieved a match average of 100.85.\n", + "\n", + "These achievements highlight Littler's exceptional talent and his continued rise in professional darts.\n", + "RAG response generated in 5.81 seconds\n" + ] + } + ], + "source": [ + "start_time = time.time()\n", + "# Turn off excessive logging\n", + "logging.basicConfig(level=logging.WARNING, format='%(asctime)s - %(levelname)s - %(message)s', force=True)\n", + "\n", + "try:\n", + " rag_response = rag_chain.invoke(query)\n", + " rag_elapsed_time = time.time() - start_time\n", + " print(f\"RAG Response: {rag_response}\")\n", + " print(f\"RAG response generated in {rag_elapsed_time:.2f} seconds\")\n", + "except InternalServerFailureException as e:\n", + " if \"query request rejected\" in str(e):\n", + " print(\"Error: Search request was rejected due to rate limiting. Please try again later.\")\n", + " else:\n", + " print(f\"Internal server error occurred: {str(e)}\")\n", + "except Exception as e:\n", + " print(f\"Unexpected error occurred: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aIdayPzw9glT" + }, + "source": [ + "# Using Couchbase as a caching mechanism\n", + "Couchbase can be effectively used as a caching mechanism for RAG (Retrieval-Augmented Generation) responses by storing and retrieving precomputed results for specific queries. This approach enhances the system's efficiency and speed, particularly when dealing with repeated or similar queries. When a query is first processed, the RAG chain retrieves relevant documents, generates a response using the language model, and then stores this response in Couchbase, with the query serving as the key.\n", + "\n", + "For subsequent requests with the same query, the system checks Couchbase first. If a cached response is found, it is retrieved directly from Couchbase, bypassing the need to re-run the entire RAG process. This significantly reduces response time because the computationally expensive steps of document retrieval and response generation are skipped. Couchbase's role in this setup is to provide a fast and scalable storage solution for caching these responses, ensuring that frequently asked queries can be answered more quickly and efficiently.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": { + "id": "0xM2G3ef-GS2" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Query 1: What happened in the match between Fulham and Liverpool?\n", + "Response: In the Premier League match between Fulham and Liverpool, the game ended in a 2-2 draw at Anfield. Liverpool played the majority of the game with ten men after Andy Robertson was shown a red card in the 17th minute for denying Harry Wilson a goalscoring opportunity. Despite their numerical disadvantage, Liverpool demonstrated resilience and strong performance.\n", + "\n", + "Fulham took the lead twice during the match, but Liverpool managed to equalize on both occasions. Diogo Jota, returning from injury, scored the crucial 86th-minute equalizer for Liverpool. Even with 10 players, Liverpool maintained over 60% possession and led various attacking metrics, including shots, big chances, and touches in the opposition box.
\n", + "\n", + "Fulham's left-back Antonee Robinson praised Liverpool\u2019s performance, stating that it didn\u2019t feel like they had 10 men on the field due to their attacking risks and relentless pressure. Liverpool head coach Arne Slot called his team's performance \"impressive\" and lauded their character and fight in adversity.\n", + "Time taken: 6.69 seconds\n", + "\n", + "Query 2: What were Luke Littler's key achievements and records in his recent PDC World Championship match?\n", + "Response: In his recent PDC World Championship match, Luke Littler achieved several key milestones and records:\n", + "\n", + "1. **Tournament Record Average**: Littler set a tournament record with a 140.91 set average during the fourth and final set of his second-round match against Ryan Meikle.\n", + "\n", + "2. **Nine-Darter Attempt**: He came close to achieving a nine-darter but narrowly missed double 12.\n", + "\n", + "3. **Dramatic Victory**: Littler defeated Meikle 3-1 in a match described as emotionally challenging for the 17-year-old.\n", + "\n", + "4. **Fourth Set Dominance**: In the final set, Littler exploded into life, hitting four maximum 180s and winning three straight legs in 11, 10, and 11 darts.\n", + "\n", + "5. **Overall Set Performance**: He completed the fourth set in 32 darts (the minimum possible is 27) and achieved a match average of 100.85.\n", + "\n", + "These achievements highlight Littler's exceptional talent and his continued rise in professional darts.\n", + "Time taken: 1.09 seconds\n", + "\n", + "\n", + "... (output truncated for brevity)\n" + ] + } + ], + "source": [ + "try:\n", + " queries = [\n", + " \"What happened in the match between Fullham and Liverpool?\",\n", + " \"What were Luke Littler's key achievements and records in his recent PDC World Championship match?\",\n", + " \"What happened in the match between Fullham and Liverpool?\", # Repeated query\n", + " ]\n", + "\n", + " for i, query in enumerate(queries, 1):\n", + " print(f\"\\nQuery {i}: {query}\")\n", + " start_time = time.time()\n", + "\n", + " response = rag_chain.invoke(query)\n", + " elapsed_time = time.time() - start_time\n", + " print(f\"Response: {response}\")\n", + " print(f\"Time taken: {elapsed_time:.2f} seconds\")\n", + "\n", + "except InternalServerFailureException as e:\n", + " if \"query request rejected\" in str(e):\n", + " print(\"Error: Search request was rejected due to rate limiting. Please try again later.\")\n", + " else:\n", + " print(f\"Internal server error occurred: {str(e)}\")\n", + "except Exception as e:\n", + " print(f\"Unexpected error occurred: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yJQ5P8E29go1" + }, + "source": [ + "By following these steps, you'll have a fully functional semantic search engine that leverages the strengths of Couchbase and AzureOpenAI. This guide is designed not just to show you how to build the system, but also to explain why each step is necessary, giving you a deeper understanding of the principles behind semantic search and how it improves querying data more efficiently using Hyperscale and Composite Vector Indexes which can significantly improve your RAG performance. Whether you're a newcomer to software development or an experienced developer looking to expand your skills, this guide will provide you with the knowledge and tools you need to create a powerful, AI-driven search engine." 
+ ] + } + ], + "metadata": { + "colab": { + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.3" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/azure/gsi/frontmatter.md b/azure/query_based/frontmatter.md similarity index 100% rename from azure/gsi/frontmatter.md rename to azure/query_based/frontmatter.md diff --git a/azure/search_based/RAG_with_Couchbase_and_AzureOpenAI.ipynb b/azure/search_based/RAG_with_Couchbase_and_AzureOpenAI.ipynb new file mode 100644 index 00000000..82dd526a --- /dev/null +++ b/azure/search_based/RAG_with_Couchbase_and_AzureOpenAI.ipynb @@ -0,0 +1,947 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "kNdImxzypDlm" + }, + "source": [ + "# Introduction\n", + "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database, [AzureOpenAI](https://azure.microsoft.com/) as the AI-powered embedding and language model provider. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system using Couchbase Search Vector Index from scratch. For guidance on choosing the right vector index for your use case, see the [Couchbase documentation](https://docs.couchbase.com/server/current/vector-search/choose-the-right-vector-index.html). Alternatively if you want to perform semantic search using Couchbase Hyperscale or Composite Vector Indexes, please take a look at [this.](https://developer.couchbase.com/tutorial-azure-openai-couchbase-rag-with-hyperscale-or-composite-vector-index/)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# How to run this tutorial\n", + "\n", + "This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/azure/fts/RAG_with_Couchbase_and_AzureOpenAI.ipynb).\n", + "\n", + "You can either download the notebook file and run it on [Google Colab](https://colab.research.google.com/) or run it on your system by setting up the Python environment." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Before you start\n", + "\n", + "## Get Credentials for Azure OpenAI\n", + "\n", + "Please follow the [instructions](https://learn.microsoft.com/en-us/azure/ai-services/openai/reference) to generate the Azure OpenAI credentials.\n", + "\n", + "## Create and Deploy Your Free Tier Operational cluster on Capella\n", + "\n", + "To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. 
This account provides you with an environment where you can explore and learn about Capella with no time constraint.\n", + "\n", + "To learn more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n", + "\n", + "### Couchbase Capella Configuration\n", + "\n", + "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met.\n", + "\n", + "* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the required bucket (Read and Write) used in the application.\n", + "* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NH2o6pqa69oG" + }, + "source": [ + "# Setting the Stage: Installing Necessary Libraries\n", + "To build our semantic search engine, we need a robust set of tools. The libraries we install handle everything from connecting to databases to performing complex machine learning tasks. Each library has a specific role: Couchbase libraries manage database operations, LangChain handles AI model integrations, and AzureOpenAI provides advanced AI models for generating embeddings and understanding natural language. By setting up these libraries, we ensure our environment is equipped to handle the data-intensive and computationally complex tasks required for semantic search." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "id": "DYhPj0Ta8l_A" + }, + "outputs": [], + "source": [ + "!pip install datasets==3.5.0 langchain-couchbase==0.3.0 langchain-openai==0.3.13" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1pp7GtNg8mB9" + }, + "source": [ + "# Importing Necessary Libraries\n", + "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, embedding generation, and dataset loading. These libraries provide essential functions for working with data, managing database connections, and processing machine learning models." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "8GzS6tfL8mFP" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/aayush.tyagi/Documents/AI/vector-search-cookbook/.venv/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets.
See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", + " from .autonotebook import tqdm as notebook_tqdm\n" + ] + } + ], + "source": [ + "import getpass\n", + "import json\n", + "import logging\n", + "import sys\n", + "import time\n", + "from datetime import timedelta\n", + "from uuid import uuid4\n", + "\n", + "from couchbase.auth import PasswordAuthenticator\n", + "from couchbase.cluster import Cluster\n", + "from couchbase.exceptions import (\n", + " CouchbaseException,\n", + " InternalServerFailureException,\n", + " QueryIndexAlreadyExistsException,\n", + ")\n", + "from couchbase.management.search import SearchIndex\n", + "from couchbase.options import ClusterOptions\n", + "from datasets import load_dataset\n", + "from langchain_core.documents import Document\n", + "from langchain_core.globals import set_llm_cache\n", + "from langchain_core.output_parsers import StrOutputParser\n", + "from langchain_core.prompts import ChatPromptTemplate\n", + "from langchain_core.runnables import RunnablePassthrough\n", + "from langchain_couchbase.cache import CouchbaseCache\n", + "from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore\n", + "from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings\n", + "from tqdm import tqdm" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pBnMp5vb8mIb" + }, + "source": [ + "# Setup Logging\n", + "Logging is configured to track the progress of the script and capture any errors or warnings. This is crucial for debugging and understanding the flow of execution. The logging output includes timestamps, log levels (e.g., INFO, ERROR), and messages that describe what is happening in the script.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "id": "Yv8kWcuf8mLx" + }, + "outputs": [], + "source": [ + "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', force=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K9G5a0en8mPA" + }, + "source": [ + "# Loading Sensitive Information\n", + "In this section, we prompt the user to input essential configuration settings needed. These settings include sensitive information like API keys, database credentials, and specific configuration names. Instead of hardcoding these details into the script, we request the user to provide them at runtime, ensuring flexibility and security.\n", + "\n", + "The script also validates that all required inputs are provided, raising an error if any crucial information is missing. This approach ensures that your integration is both secure and correctly configured without hardcoding sensitive information, enhancing the overall security and maintainability of your code." 
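+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you prefer not to type these values interactively, they can also be loaded from the environment. The next cell is a minimal sketch, assuming `python-dotenv` is installed and a local `.env` file (or exported environment variables) defines names matching the prompts below; the names shown are illustrative." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Minimal sketch (alternative to the interactive prompts in the next cell):\n", + "# load settings from a .env file. Assumes python-dotenv is installed and the\n", + "# .env file defines these illustrative names.\n", + "import os\n", + "from dotenv import load_dotenv\n", + "\n", + "load_dotenv()\n", + "AZURE_OPENAI_KEY = os.getenv('AZURE_OPENAI_KEY')\n", + "AZURE_OPENAI_ENDPOINT = os.getenv('AZURE_OPENAI_ENDPOINT')"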
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "id": "PFGyHll18mSe" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Enter your Azure OpenAI Key: \u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\n", + "Enter your Azure OpenAI Endpoint: https://first-couchbase-instance.openai.azure.com/\n", + "Enter your Azure OpenAI Embedding Deployment: text-embedding-ada-002\n", + "Enter your Azure OpenAI Chat Deployment: gpt-4o\n", + "Enter your Couchbase host (default: couchbase://localhost): couchbases://cb.hlcup4o4jmjr55yf.cloud.couchbase.com\n", + "Enter your Couchbase username (default: Administrator): vector-search-rag-demos\n", + "Enter your Couchbase password (default: password): \u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\n", + "Enter your Couchbase bucket name (default: vector-search-testing): \n", + "Enter your index name (default: vector_search_azure): \n", + "Enter your scope name (default: shared): \n", + "Enter your collection name (default: azure): \n", + "Enter your cache collection name (default: cache): \n" + ] + } + ], + "source": [ + "AZURE_OPENAI_KEY = getpass.getpass('Enter your Azure OpenAI Key: ')\n", + "AZURE_OPENAI_ENDPOINT = input('Enter your Azure OpenAI Endpoint: ')\n", + "AZURE_OPENAI_EMBEDDING_DEPLOYMENT = input('Enter your Azure OpenAI Embedding Deployment: ')\n", + "AZURE_OPENAI_CHAT_DEPLOYMENT = input('Enter your Azure OpenAI Chat Deployment: ')\n", + "\n", + "CB_HOST = input('Enter your Couchbase host (default: couchbase://localhost): ') or 'couchbase://localhost'\n", + "CB_USERNAME = input('Enter your Couchbase username (default: Administrator): ') or 'Administrator'\n", + "CB_PASSWORD = getpass.getpass('Enter your Couchbase password (default: password): ') or 'password'\n", + "CB_BUCKET_NAME = input('Enter your Couchbase bucket name (default: vector-search-testing): ') or 'vector-search-testing'\n", + "INDEX_NAME = input('Enter your index name (default: vector_search_azure): ') or 'vector_search_azure'\n", + "SCOPE_NAME = input('Enter your scope name (default: shared): ') or 'shared'\n", + "COLLECTION_NAME = input('Enter your collection name (default: azure): ') or 'azure'\n", + "CACHE_COLLECTION = input('Enter your cache collection name (default: cache): ') or 'cache'\n", + "\n", + "# Check if the variables are correctly loaded\n", + "if not all([AZURE_OPENAI_KEY, AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_EMBEDDING_DEPLOYMENT, AZURE_OPENAI_CHAT_DEPLOYMENT]):\n", + " raise ValueError(\"Missing required Azure OpenAI variables\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qtGrYzUY8mV3" + }, + "source": [ + "# Connecting to the Couchbase Cluster\n", + "Connecting to a Couchbase cluster is the foundation of our project. Couchbase will serve as our primary data store, handling all the storage and retrieval operations required for our semantic search engine. By establishing this connection, we enable our application to interact with the database, allowing us to perform operations such as storing embeddings, querying data, and managing collections. 
This connection is the gateway through which all data will flow, so ensuring it's set up correctly is paramount.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "id": "Zb3kK-7W8mZK" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2024-09-06 07:29:16,632 - INFO - Successfully connected to Couchbase\n" + ] + } + ], + "source": [ + "try:\n", + " auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)\n", + " options = ClusterOptions(auth)\n", + " cluster = Cluster(CB_HOST, options)\n", + " cluster.wait_until_ready(timedelta(seconds=5))\n", + " logging.info(\"Successfully connected to Couchbase\")\n", + "except Exception as e:\n", + " raise ConnectionError(f\"Failed to connect to Couchbase: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C_Gpy32N8mcZ" + }, + "source": [ + "# Setting Up Collections in Couchbase\n", + "In Couchbase, data is organized in buckets, which can be further divided into scopes and collections. Think of a collection as a table in a traditional SQL database. Before we can store any data, we need to ensure that our collections exist. If they don't, we must create them. This step is important because it prepares the database to handle the specific types of data our application will process. By setting up collections, we define the structure of our data storage, which is essential for efficient data retrieval and management.\n", + "\n", + "Moreover, setting up collections allows us to isolate different types of data within the same bucket, providing a more organized and scalable data structure. This is particularly useful when dealing with large datasets, as it ensures that related data is stored together, making it easier to manage and query." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "id": "ACZcwUnG8mf2" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2024-09-06 07:29:17,029 - INFO - Collection 'azure' already exists. Skipping creation.\n", + "2024-09-06 07:29:17,095 - INFO - Primary index present or created successfully.\n", + "2024-09-06 07:29:17,775 - INFO - All documents cleared from the collection.\n", + "2024-09-06 07:29:17,841 - INFO - Collection 'cache' already exists. Skipping creation.\n", + "2024-09-06 07:29:17,907 - INFO - Primary index present or created successfully.\n", + "2024-09-06 07:29:17,973 - INFO - All documents cleared from the collection.\n" + ] + }, + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def setup_collection(cluster, bucket_name, scope_name, collection_name):\n", + " try:\n", + " bucket = cluster.bucket(bucket_name)\n", + " bucket_manager = bucket.collections()\n", + "\n", + " # Check if collection exists, create if it doesn't\n", + " collections = bucket_manager.get_all_scopes()\n", + " collection_exists = any(\n", + " scope.name == scope_name and collection_name in [col.name for col in scope.collections]\n", + " for scope in collections\n", + " )\n", + "\n", + " if not collection_exists:\n", + " logging.info(f\"Collection '{collection_name}' does not exist.
Creating it...\")\n", + " bucket_manager.create_collection(scope_name, collection_name)\n", + " logging.info(f\"Collection '{collection_name}' created successfully.\")\n", + " else:\n", + " logging.info(f\"Collection '{collection_name}' already exists.Skipping creation.\")\n", + "\n", + " collection = bucket.scope(scope_name).collection(collection_name)\n", + " time.sleep(2) # Give the collection time to be ready for queries\n", + "\n", + " # Ensure primary index exists\n", + " try:\n", + " cluster.query(f\"CREATE PRIMARY INDEX IF NOT EXISTS ON `{bucket_name}`.`{scope_name}`.`{collection_name}`\").execute()\n", + " logging.info(\"Primary index present or created successfully.\")\n", + " except Exception as e:\n", + " logging.warning(f\"Error creating primary index: {str(e)}\")\n", + "\n", + " # Clear all documents in the collection\n", + " try:\n", + " query = f\"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`\"\n", + " cluster.query(query).execute()\n", + " logging.info(\"All documents cleared from the collection.\")\n", + " except Exception as e:\n", + " logging.warning(f\"Error while clearing documents: {str(e)}. The collection might be empty.\")\n", + "\n", + " return collection\n", + " except Exception as e:\n", + " raise RuntimeError(f\"Error setting up collection: {str(e)}\")\n", + "\n", + "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME)\n", + "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, CACHE_COLLECTION)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NMJ7RRYp8mjV" + }, + "source": [ + "# Loading Couchbase Vector Search Index\n", + "\n", + "Semantic search requires an efficient way to retrieve relevant documents based on a user's query. This is where the Couchbase **Vector Search Index** comes into play. In this step, we load the Vector Search Index definition from a JSON file, which specifies how the index should be structured. 
This includes the fields to be indexed, the dimensions of the vectors, and other parameters that determine how the search engine processes queries based on vector similarity.\n", + "\n", + "For more information on creating a vector search index, please follow the [instructions](https://docs.couchbase.com/cloud/vector-search/create-vector-search-index-ui.html).\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "id": "y7xiCrOc8mmj" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Upload your index definition file\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Saving azure_index.json to azure_index.json\n" + ] + } + ], + "source": [ + "# If you are running this script locally (not in Google Colab), uncomment the following line\n", + "# and provide the path to your index definition file.\n", + "\n", + "# index_definition_path = '/path_to_your_index_file/azure_index.json' # Local setup: specify your file path here\n", + "\n", + "# If you are running in Google Colab, use the following code to upload the index definition file\n", + "from google.colab import files\n", + "print(\"Upload your index definition file\")\n", + "uploaded = files.upload()\n", + "index_definition_path = list(uploaded.keys())[0]\n", + "\n", + "try:\n", + " with open(index_definition_path, 'r') as file:\n", + " index_definition = json.load(file)\n", + "except Exception as e:\n", + " raise ValueError(f\"Error loading index definition from {index_definition_path}: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "v_ddPQ_Y8mpm" + }, + "source": [ + "# Creating or Updating Search Indexes\n", + "\n", + "With the index definition loaded, the next step is to create or update the **Vector Search Index** in Couchbase. This step is crucial because it optimizes our database for vector similarity search operations, allowing us to perform searches based on the semantic content of documents rather than just keywords. By creating or updating a Vector Search Index, we enable our search engine to handle complex queries that involve finding semantically similar documents using vector embeddings, which is essential for a robust semantic search engine." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "id": "bHEpUu1l8msx" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2024-09-06 07:30:01,070 - INFO - Index 'vector_search_azure' found\n", + "2024-09-06 07:30:01,373 - INFO - Index 'vector_search_azure' already exists. 
Skipping creation/update.\n" + ] + } + ], + "source": [ + "try:\n", + " scope_index_manager = cluster.bucket(CB_BUCKET_NAME).scope(SCOPE_NAME).search_indexes()\n", + "\n", + " # Check if index already exists\n", + " existing_indexes = scope_index_manager.get_all_indexes()\n", + " index_name = index_definition[\"name\"]\n", + "\n", + " if index_name in [index.name for index in existing_indexes]:\n", + " logging.info(f\"Index '{index_name}' found\")\n", + " else:\n", + " logging.info(f\"Creating new index '{index_name}'...\")\n", + "\n", + " # Create SearchIndex object from JSON definition\n", + " search_index = SearchIndex.from_json(index_definition)\n", + "\n", + " # Upsert the index (create if not exists, update if exists)\n", + " scope_index_manager.upsert_index(search_index)\n", + " logging.info(f\"Index '{index_name}' successfully created/updated.\")\n", + "\n", + "except QueryIndexAlreadyExistsException:\n", + " logging.info(f\"Index '{index_name}' already exists. Skipping creation/update.\")\n", + "\n", + "except InternalServerFailureException as e:\n", + " error_message = str(e)\n", + " logging.error(f\"InternalServerFailureException raised: {error_message}\")\n", + "\n", + " try:\n", + " # Accessing the response_body attribute from the context\n", + " error_context = e.context\n", + " response_body = error_context.response_body\n", + " if response_body:\n", + " error_details = json.loads(response_body)\n", + " error_message = error_details.get('error', '')\n", + "\n", + " if \"collection: 'azure' doesn't belong to scope: 'shared'\" in error_message:\n", + " raise ValueError(\"Collection 'azure' does not belong to scope 'shared'. Please check the collection and scope names.\")\n", + "\n", + " except ValueError as ve:\n", + " logging.error(str(ve))\n", + " raise\n", + "\n", + " except Exception as json_error:\n", + " logging.error(f\"Failed to parse the error message: {json_error}\")\n", + " raise RuntimeError(f\"Internal server error while creating/updating search index: {error_message}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QRV4k06L8mwS" + }, + "source": [ + "# Load the TREC Dataset\n", + "To build a search engine, we need data to search through. We use the TREC dataset, a well-known benchmark in the field of information retrieval. This dataset contains a wide variety of text data that we'll use to train our search engine. Loading the dataset is a crucial step because it provides the raw material that our search engine will work with. The quality and diversity of the data in the TREC dataset make it an excellent choice for testing and refining our search engine, ensuring that it can handle a wide range of queries effectively.\n", + "\n", + "The TREC dataset's rich content allows us to simulate real-world scenarios where users ask complex questions, enabling us to fine-tune our search engine's ability to understand and respond to various types of queries." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "id": "TRfRslF_8mzo" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: \n", + "The secret `HF_TOKEN` does not exist in your Colab secrets.\n", + "To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n", + "You will be able to reuse this secret in all of your notebooks.\n", + "Please note that authentication is recommended but still optional to access public models or datasets.\n", + " warnings.warn(\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The repository for trec contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/trec.\n", + "You can avoid this prompt in future by passing the argument `trust_remote_code=True`.\n", + "\n", + "Do you wish to run the custom code? [y/N] y\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2024-09-06 07:30:12,308 - INFO - Successfully loaded TREC dataset with 1000 samples\n" + ] + } + ], + "source": [ + "try:\n", + " trec = load_dataset('trec', split='train[:1000]')\n", + " logging.info(f\"Successfully loaded TREC dataset with {len(trec)} samples\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Error loading TREC dataset: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7FvxRsg38m3G" + }, + "source": [ + "# Creating AzureOpenAI Embeddings\n", + "Embeddings are at the heart of semantic search. They are numerical representations of text that capture the semantic meaning of the words and phrases. Unlike traditional keyword-based search, which looks for exact matches, embeddings allow our search engine to understand the context and nuances of language, enabling it to retrieve documents that are semantically similar to the query, even if they don't contain the exact keywords. By creating embeddings using AzureOpenAI, we equip our search engine with the ability to understand and process natural language in a way that's much closer to how humans understand language. This step transforms our raw text data into a format that the search engine can use to find and rank relevant documents.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "id": "_75ZyCRh8m6m" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2024-09-06 07:30:13,014 - INFO - Successfully created AzureOpenAIEmbeddings\n" + ] + } + ], + "source": [ + "try:\n", + " embeddings = AzureOpenAIEmbeddings(\n", + " deployment=AZURE_OPENAI_EMBEDDING_DEPLOYMENT,\n", + " openai_api_key=AZURE_OPENAI_KEY,\n", + " azure_endpoint=AZURE_OPENAI_ENDPOINT\n", + " )\n", + " logging.info(\"Successfully created AzureOpenAIEmbeddings\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Error creating AzureOpenAIEmbeddings: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8IwZMUnF8m-N" + }, + "source": [ + "# Setting Up the Couchbase Vector Store\n", + "The vector store is set up to manage the embeddings created in the previous step. The vector store is essentially a database optimized for storing and retrieving high-dimensional vectors. 
In this case, the vector store is built on top of Couchbase, allowing the script to store the embeddings in a way that can be efficiently searched." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "id": "DwIJQjYT9RV_" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2024-09-06 07:30:14,043 - INFO - Successfully created vector store\n" + ] + } + ], + "source": [ + "try:\n", + " vector_store = CouchbaseSearchVectorStore(\n", + " cluster=cluster,\n", + " bucket_name=CB_BUCKET_NAME,\n", + " scope_name=SCOPE_NAME,\n", + " collection_name=COLLECTION_NAME,\n", + " embedding=embeddings,\n", + " index_name=INDEX_NAME,\n", + " )\n", + " logging.info(\"Successfully created vector store\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Failed to create vector store: {str(e)}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C6DJVz7A9RZA" + }, + "source": [ + "# Saving Data to the Vector Store\n", + "With the vector store set up, the next step is to populate it with data. We save the TREC dataset to the vector store in batches. This method is efficient and ensures that our search engine can handle large datasets without running into performance issues. By saving the data in this way, we prepare our search engine to quickly and accurately respond to user queries. This step is essential for making the dataset searchable, transforming raw data into a format that can be easily queried by our search engine.\n", + "\n", + "Batch processing is particularly important when dealing with large datasets, as it prevents memory overload and ensures that the data is stored in a structured and retrievable manner. This approach not only optimizes performance but also ensures the scalability of our system." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "id": "_6opqqvx9Rb_" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Processing Batches: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 20/20 [00:37<00:00, 1.87s/it]\n" + ] + } + ], + "source": [ + "try:\n", + " batch_size = 50\n", + " logging.disable(sys.maxsize) # Disable logging to prevent tqdm output\n", + " for i in tqdm(range(0, len(trec['text']), batch_size), desc=\"Processing Batches\"):\n", + " batch = trec['text'][i:i + batch_size]\n", + " documents = [Document(page_content=text) for text in batch]\n", + " uuids = [str(uuid4()) for _ in range(len(documents))]\n", + " vector_store.add_documents(documents=documents, ids=uuids)\n", + " logging.disable(logging.NOTSET) # Re-enable logging\n", + "except Exception as e:\n", + " raise RuntimeError(f\"Failed to save documents to vector store: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8Pn8-dQw9RfQ" + }, + "source": [ + "# Setting Up a Couchbase Cache\n", + "To further optimize our system, we set up a Couchbase-based cache. A cache is a temporary storage layer that holds data that is frequently accessed, speeding up operations by reducing the need to repeatedly retrieve the same information from the database. In our setup, the cache will help us accelerate repetitive tasks, such as looking up similar documents. 
By implementing a cache, we enhance the overall performance of our search engine, ensuring that it can handle high query volumes and deliver results quickly.\n", + "\n", + "Caching is particularly valuable in scenarios where users may submit similar queries multiple times or where certain pieces of information are frequently requested. By storing these in a cache, we can significantly reduce the time it takes to respond to these queries, improving the user experience.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "id": "V2y7dyjf9Rid" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2024-09-06 07:30:52,165 - INFO - Successfully created cache\n" + ] + } + ], + "source": [ + "try:\n", + " cache = CouchbaseCache(\n", + " cluster=cluster,\n", + " bucket_name=CB_BUCKET_NAME,\n", + " scope_name=SCOPE_NAME,\n", + " collection_name=CACHE_COLLECTION,\n", + " )\n", + " logging.info(\"Successfully created cache\")\n", + " set_llm_cache(cache)\n", + "except Exception as e:\n", + " raise ValueError(f\"Failed to create cache: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uehAx36o9Rlm" + }, + "source": [ + "# Using the AzureChatOpenAI Language Model (LLM)\n", + "Language models are AI systems that are trained to understand and generate human language. We'll be using `AzureChatOpenAI` language model to process user queries and generate meaningful responses. This model is a key component of our semantic search engine, allowing it to go beyond simple keyword matching and truly understand the intent behind a query. By creating this language model, we equip our search engine with the ability to interpret complex queries, understand the nuances of language, and provide more accurate and contextually relevant responses.\n", + "\n", + "The language model's ability to understand context and generate coherent responses is what makes our search engine truly intelligent. It can not only find the right information but also present it in a way that is useful and understandable to the user.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "id": "yRAfBRLH9RpO" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2024-09-06 07:30:52,298 - INFO - Successfully created Azure OpenAI Chat model\n" + ] + } + ], + "source": [ + "try:\n", + " llm = AzureChatOpenAI(\n", + " deployment_name=AZURE_OPENAI_CHAT_DEPLOYMENT,\n", + " openai_api_key=AZURE_OPENAI_KEY,\n", + " azure_endpoint=AZURE_OPENAI_ENDPOINT,\n", + " openai_api_version=\"2024-07-01-preview\"\n", + " )\n", + " logging.info(\"Successfully created Azure OpenAI Chat model\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Error creating Azure OpenAI Chat model: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "k_XDfCx19UvG" + }, + "source": [ + "# Perform Semantic Search\n", + "Semantic search in Couchbase involves converting queries and documents into vector representations using an embeddings model. These vectors capture the semantic meaning of the text and are stored directly in Couchbase. When a query is made, Couchbase performs a similarity search by comparing the query vector against the stored document vectors. The similarity metric used for this comparison is configurable, allowing flexibility in how the relevance of documents is determined. 
Common metrics include cosine similarity, Euclidean distance, or dot product, but other metrics can be implemented based on specific use cases. Different embedding models like BERT, Word2Vec, or GloVe can also be used depending on the application's needs, with the vectors generated by these models stored and searched within Couchbase itself.\n", + "\n", + "In the provided code, the search process begins by recording the start time, followed by executing the similarity_search_with_score method of the CouchbaseSearchVectorStore. This method searches Couchbase for the most relevant documents based on the vector similarity to the query. The search results include the document content and a similarity score that reflects how closely each document aligns with the query in the defined semantic space. The time taken to perform this search is then calculated and logged, and the results are displayed, showing the most relevant documents along with their similarity scores. This approach leverages Couchbase as both a storage and retrieval engine for vector data, enabling efficient and scalable semantic searches. The integration of vector storage and search capabilities within Couchbase allows for sophisticated semantic search operations without relying on external services for vector storage or comparison." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "id": "Pk-oFbnC9Uym" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2024-09-06 07:30:52,532 - INFO - HTTP Request: POST https://first-couchbase-instance.openai.azure.com//openai/deployments/text-embedding-ada-002/embeddings?api-version=2023-05-15 \"HTTP/1.1 200 OK\"\n", + "2024-09-06 07:30:52,839 - INFO - Semantic search completed in 0.53 seconds\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Semantic Search Results (completed in 0.53 seconds):\n", + "Distance: 0.9178, Text: Why did the world enter a global depression in 1929 ?\n", + "Distance: 0.8714, Text: When was `` the Great Depression '' ?\n", + "Distance: 0.8113, Text: What crop failure caused the Irish Famine ?\n", + "Distance: 0.7984, Text: What historical event happened in Dogtown in 1899 ?\n", + "Distance: 0.7917, Text: What caused the Lynmouth floods ?\n", + "Distance: 0.7915, Text: When was the first Wall Street Journal published ?\n", + "Distance: 0.7911, Text: When did the Dow first reach ?\n", + "Distance: 0.7885, Text: What were popular songs and types of songs in the 1920s ?\n", + "Distance: 0.7857, Text: When did World War I start ?\n", + "Distance: 0.7842, Text: What caused Harry Houdini 's death ?\n" + ] + } + ], + "source": [ + "query = \"What caused the 1929 Great Depression?\"\n", + "\n", + "try:\n", + " # Perform the semantic search\n", + " start_time = time.time()\n", + " search_results = vector_store.similarity_search_with_score(query, k=10)\n", + " search_elapsed_time = time.time() - start_time\n", + "\n", + " logging.info(f\"Semantic search completed in {search_elapsed_time:.2f} seconds\")\n", + "\n", + " # Display search results\n", + " print(f\"\\nSemantic Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", + " for doc, score in search_results:\n", + " print(f\"Distance: {score:.4f}, Text: {doc.page_content}\")\n", + "\n", + "except CouchbaseException as e:\n", + " raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n", + "except Exception as e:\n", + " raise RuntimeError(f\"Unexpected error: {str(e)}\")" + ] + }, + { + 
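"cell_type": "markdown", + "metadata": {}, + "source": [ + "Before wiring the vector store into a RAG chain in the next section, it can also be exposed as a LangChain retriever and exercised directly. The next cell is a minimal sketch that reuses the `query` defined above; the `k` value is illustrative:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Minimal sketch: use the vector store as a LangChain retriever (k is illustrative).\n", + "retriever = vector_store.as_retriever(search_kwargs={\"k\": 4})\n", + "docs = retriever.invoke(query)\n", + "for doc in docs:\n", + " print(doc.page_content)" + ] + }, + {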
"cell_type": "markdown", + "metadata": { + "id": "sS0FebHI9U1l" + }, + "source": [ + "# Retrieval-Augmented Generation (RAG) with Couchbase and Langchain\n", + "Couchbase and LangChain can be seamlessly integrated to create RAG (Retrieval-Augmented Generation) chains, enhancing the process of generating contextually relevant responses. In this setup, Couchbase serves as the vector store, where embeddings of documents are stored. When a query is made, LangChain retrieves the most relevant documents from Couchbase by comparing the query\u2019s embedding with the stored document embeddings. These documents, which provide contextual information, are then passed to a generative language model within LangChain.\n", + "\n", + "The language model, equipped with the context from the retrieved documents, generates a response that is both informed and contextually accurate. This integration allows the RAG chain to leverage Couchbase\u2019s efficient storage and retrieval capabilities, while LangChain handles the generation of responses based on the context provided by the retrieved documents. Together, they create a powerful system that can deliver highly relevant and accurate answers by combining the strengths of both retrieval and generation." + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "id": "ZGUXQQmv9ge4" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2024-09-06 07:30:52,860 - INFO - Successfully created RAG chain\n" + ] + } + ], + "source": [ + "template = \"\"\"You are a helpful bot. If you cannot answer based on the context provided, respond with a generic answer. Answer the question as truthfully as possible using the context below:\n", + " {context}\n", + " Question: {question}\"\"\"\n", + "prompt = ChatPromptTemplate.from_template(template)\n", + "rag_chain = (\n", + " {\"context\": vector_store.as_retriever(), \"question\": RunnablePassthrough()}\n", + " | prompt\n", + " | llm\n", + " | StrOutputParser()\n", + ")\n", + "logging.info(\"Successfully created RAG chain\")" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "id": "Mia7XxM9978M" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "RAG Response: The 1929 Great Depression was caused by a combination of factors, including the stock market crash of October 1929, bank failures, reduction in consumer spending and investment, and poor economic policies.\n", + "RAG response generated in 2.32 seconds\n" + ] + } + ], + "source": [ + "# Get responses\n", + "logging.disable(sys.maxsize) # Disable logging to prevent tqdm output\n", + "start_time = time.time()\n", + "rag_response = rag_chain.invoke(query)\n", + "rag_elapsed_time = time.time() - start_time\n", + "\n", + "print(f\"RAG Response: {rag_response}\")\n", + "print(f\"RAG response generated in {rag_elapsed_time:.2f} seconds\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aIdayPzw9glT" + }, + "source": [ + "# Using Couchbase as a caching mechanism\n", + "Couchbase can be effectively used as a caching mechanism for RAG (Retrieval-Augmented Generation) responses by storing and retrieving precomputed results for specific queries. This approach enhances the system's efficiency and speed, particularly when dealing with repeated or similar queries. 
When a query is first processed, the RAG chain retrieves relevant documents, generates a response using the language model, and then stores this response in Couchbase, with the query serving as the key.\n", + "\n", + "For subsequent requests with the same query, the system checks Couchbase first. If a cached response is found, it is retrieved directly from Couchbase, bypassing the need to re-run the entire RAG process. This significantly reduces response time because the computationally expensive steps of document retrieval and response generation are skipped. Couchbase's role in this setup is to provide a fast and scalable storage solution for caching these responses, ensuring that frequently asked queries can be answered more quickly and efficiently.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "id": "0xM2G3ef-GS2" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Query 1: Why do heavier objects travel downhill faster?\n", + "Response: Heavier objects travel downhill faster primarily due to the force of gravity acting on them. Gravity accelerates all objects at the same rate, but heavier objects may encounter less air resistance relative to their weight, allowing them to maintain higher speeds as they descend. Additionally, factors such as surface friction and the distribution of mass can influence the speed at which an object travels downhill.\n", + "Time taken: 61.73 seconds\n", + "\n", + "Query 2: What is the capital of France?\n", + "Response: The capital of France is Paris.\n", + "Time taken: 60.63 seconds\n", + "\n", + "Query 3: What caused the 1929 Great Depression?\n", + "Response: The 1929 Great Depression was caused by a combination of factors, including the stock market crash of October 1929, bank failures, reduction in consumer spending and investment, and poor economic policies.\n", + "Time taken: 1.49 seconds\n", + "\n", + "Query 4: Why do heavier objects travel downhill faster?\n", + "Response: Heavier objects travel downhill faster primarily due to the force of gravity acting on them. Gravity accelerates all objects at the same rate, but heavier objects may encounter less air resistance relative to their weight, allowing them to maintain higher speeds as they descend. Additionally, factors such as surface friction and the distribution of mass can influence the speed at which an object travels downhill.\n", + "Time taken: 0.60 seconds\n" + ] + } + ], + "source": [ + "try:\n", + " queries = [\n", + " \"Why do heavier objects travel downhill faster?\",\n", + " \"What is the capital of France?\",\n", + " \"What caused the 1929 Great Depression?\", # Repeated query\n", + " \"Why do heavier objects travel downhill faster?\", # Repeated query\n", + " ]\n", + "\n", + " for i, query in enumerate(queries, 1):\n", + " print(f\"\\nQuery {i}: {query}\")\n", + " start_time = time.time()\n", + " response = rag_chain.invoke(query)\n", + " elapsed_time = time.time() - start_time\n", + " print(f\"Response: {response}\")\n", + " print(f\"Time taken: {elapsed_time:.2f} seconds\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Error generating RAG response: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yJQ5P8E29go1" + }, + "source": [ + "By following these steps, you'll have a fully functional semantic search engine that leverages the strengths of Couchbase and AzureOpenAI. 
This guide is designed not just to show you how to build the system, but also to explain why each step is necessary, giving you a deeper understanding of the principles behind semantic search and how to implement it effectively. Whether you're a newcomer to software development or an experienced developer looking to expand your skills, this guide will provide you with the knowledge and tools you need to create a powerful, AI-driven search engine." + ] + } + ], + "metadata": { + "colab": { + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.0" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/azure/fts/azure_index.json b/azure/search_based/azure_index.json similarity index 100% rename from azure/fts/azure_index.json rename to azure/search_based/azure_index.json diff --git a/azure/fts/frontmatter.md b/azure/search_based/frontmatter.md similarity index 100% rename from azure/fts/frontmatter.md rename to azure/search_based/frontmatter.md diff --git a/claudeai/fts/RAG_with_Couchbase_and_Claude(by_Anthropic).ipynb b/claudeai/fts/RAG_with_Couchbase_and_Claude(by_Anthropic).ipynb deleted file mode 100644 index 08d3e782..00000000 --- a/claudeai/fts/RAG_with_Couchbase_and_Claude(by_Anthropic).ipynb +++ /dev/null @@ -1,1049 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "kNdImxzypDlm" - }, - "source": [ - "# Introduction\n", - "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database, [OpenAI](https://openai.com/) as the AI-powered embedding and [Anthropic](https://claude.ai/) as the language model provider. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system using the FTS service from scratch. Alternatively if you want to perform semantic search using the GSI index, please take a look at [this.](https://developer.couchbase.com/tutorial-openai-claude-couchbase-rag-with-global-secondary-index/)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# How to run this tutorial\n", - "\n", - "This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/claudeai/fts/RAG_with_Couchbase_and_Claude(by_Anthropic).ipynb).\n", - "\n", - "You can either download the notebook file and run it on [Google Colab](https://colab.research.google.com/) or run it on your system by setting up the Python environment." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Before you start\n", - "\n", - "## Get Credentials for OpenAI and Anthropic\n", - "\n", - "* Please follow the [instructions](https://platform.openai.com/docs/quickstart) to generate the OpenAI credentials.\n", - "* Please follow the [instructions](https://docs.anthropic.com/en/api/getting-started) to generate the Anthropic credentials.\n", - "\n", - "## Create and Deploy Your Free Tier Operational cluster on Capella\n", - "\n", - "To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. This account provides you with an environment where you can explore and learn about Capella with no time constraint.\n", - "\n", - "To learn more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n", - "\n", - "### Couchbase Capella Configuration\n", - "\n", - "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met.\n", - "\n", - "* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the required bucket (Read and Write) used in the application.\n", - "\n", - "* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NH2o6pqa69oG" - }, - "source": [ - "# Setting the Stage: Installing Necessary Libraries\n", - "To build our semantic search engine, we need a robust set of tools. The libraries we install handle everything from connecting to databases to performing complex machine learning tasks. Each library has a specific role: Couchbase libraries manage database operations, LangChain handles AI model integrations, and OpenAI provides advanced AI models for generating embeddings and Claude(by Anthropic) for understanding natural language. By setting up these libraries, we ensure our environment is equipped to handle the data-intensive and computationally complex tasks required for semantic search." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": { - "id": "DYhPj0Ta8l_A" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Note: you may need to restart the kernel to use updated packages.\n" - ] - } - ], - "source": [ - "%pip install --quiet datasets==3.5.0 langchain-couchbase==0.3.0 langchain-anthropic==0.3.11 langchain-openai==0.3.13 python-dotenv==1.1.0" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "1pp7GtNg8mB9" - }, - "source": [ - "# Importing Necessary Libraries\n", - "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, embedding generation, and dataset loading. These libraries provide essential functions for working with data, managing database connections, and processing machine learning models." 
- ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": { - "id": "8GzS6tfL8mFP" - }, - "outputs": [], - "source": [ - "import getpass\n", - "import json\n", - "import logging\n", - "import os\n", - "import time\n", - "from datetime import timedelta\n", - "from multiprocessing import AuthenticationError\n", - "\n", - "from couchbase.auth import PasswordAuthenticator\n", - "from couchbase.cluster import Cluster\n", - "from couchbase.exceptions import (CouchbaseException,\n", - "                                  InternalServerFailureException,\n", - "                                  QueryIndexAlreadyExistsException,\n", - "                                  ServiceUnavailableException)\n", - "from couchbase.management.buckets import CreateBucketSettings\n", - "from couchbase.management.search import SearchIndex\n", - "from couchbase.options import ClusterOptions\n", - "from datasets import load_dataset\n", - "from dotenv import load_dotenv\n", - "from langchain_anthropic import ChatAnthropic\n", - "from langchain_core.globals import set_llm_cache\n", - "from langchain_core.prompts.chat import (ChatPromptTemplate,\n", - "                                         HumanMessagePromptTemplate,\n", - "                                         SystemMessagePromptTemplate)\n", - "from langchain_core.runnables import RunnablePassthrough\n", - "from langchain_couchbase.cache import CouchbaseCache\n", - "from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore\n", - "from langchain_openai import OpenAIEmbeddings" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pBnMp5vb8mIb" - }, - "source": [ - "# Setup Logging\n", - "Logging is configured to track the progress of the script and capture any errors or warnings. This is crucial for debugging and understanding the flow of execution. The logging output includes timestamps, log levels (e.g., INFO, ERROR), and messages that describe what is happening in the script.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": { - "id": "Yv8kWcuf8mLx" - }, - "outputs": [], - "source": [ - "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', force=True)\n", - "\n", - "# Disable all logging except critical to prevent OpenAI API request logs\n", - "logging.getLogger(\"httpx\").setLevel(logging.CRITICAL)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "K9G5a0en8mPA" - }, - "source": [ - "# Loading Sensitive Information\n", - "In this section, we prompt the user to input the essential configuration settings. These settings include sensitive information like API keys, database credentials, and specific configuration names. Instead of hardcoding these details into the script, we request the user to provide them at runtime, ensuring flexibility and security.\n", - "\n", - "The script also validates that all required inputs are provided, raising an error if any crucial information is missing. This approach ensures that your integration is both secure and correctly configured without hardcoding sensitive information, enhancing the overall security and maintainability of your code."
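The validation in the next cell only checks the two API keys explicitly; the same fail-fast idea extends naturally to every required setting. A minimal sketch, assuming the variable names used in that cell (the helper itself is illustrative, not part of the notebook):

```python
# Illustrative helper: fail fast if any required setting is empty.
def require_settings(**settings):
    missing = [name for name, value in settings.items() if not value]
    if missing:
        raise ValueError(f"Missing required settings: {', '.join(missing)}")

# Example usage once the variables below are loaded:
# require_settings(ANTHROPIC_API_KEY=ANTHROPIC_API_KEY,
#                  OPENAI_API_KEY=OPENAI_API_KEY,
#                  CB_HOST=CB_HOST)
```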
- ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": { - "id": "PFGyHll18mSe" - }, - "outputs": [], - "source": [ - "load_dotenv()\n", - "\n", - "# Load from environment variables or prompt for input in one-liners\n", - "ANTHROPIC_API_KEY = os.getenv('ANTHROPIC_API_KEY') or getpass.getpass('Enter your Anthropic API key: ')\n", - "OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or getpass.getpass('Enter your OpenAI API key: ')\n", - "CB_HOST = os.getenv('CB_HOST', 'couchbase://localhost') or input('Enter your Couchbase host (default: couchbase://localhost): ') or 'couchbase://localhost'\n", - "CB_USERNAME = os.getenv('CB_USERNAME', 'Administrator') or input('Enter your Couchbase username (default: Administrator): ') or 'Administrator'\n", - "CB_PASSWORD = os.getenv('CB_PASSWORD', 'password') or getpass.getpass('Enter your Couchbase password (default: password): ') or 'password'\n", - "CB_BUCKET_NAME = os.getenv('CB_BUCKET_NAME', 'vector-search-testing') or input('Enter your Couchbase bucket name (default: vector-search-testing): ') or 'vector-search-testing'\n", - "INDEX_NAME = os.getenv('INDEX_NAME', 'vector_search_claude') or input('Enter your index name (default: vector_search_claude): ') or 'vector_search_claude'\n", - "SCOPE_NAME = os.getenv('SCOPE_NAME', 'shared') or input('Enter your scope name (default: shared): ') or 'shared'\n", - "COLLECTION_NAME = os.getenv('COLLECTION_NAME', 'claude') or input('Enter your collection name (default: claude): ') or 'claude'\n", - "CACHE_COLLECTION = os.getenv('CACHE_COLLECTION', 'cache') or input('Enter your cache collection name (default: cache): ') or 'cache'\n", - "# Check if the variables are correctly loaded\n", - "if not ANTHROPIC_API_KEY:\n", - " raise ValueError(\"ANTHROPIC_API_KEY is not set in the environment.\")\n", - "if not OPENAI_API_KEY:\n", - " raise ValueError(\"OPENAI_API_KEY is not set in the environment.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "qtGrYzUY8mV3" - }, - "source": [ - "# Connecting to the Couchbase Cluster\n", - "Connecting to a Couchbase cluster is the foundation of our project. Couchbase will serve as our primary data store, handling all the storage and retrieval operations required for our semantic search engine. By establishing this connection, we enable our application to interact with the database, allowing us to perform operations such as storing embeddings, querying data, and managing collections. This connection is the gateway through which all data will flow, so ensuring it's set up correctly is paramount.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": { - "id": "Zb3kK-7W8mZK" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-02-25 21:48:21,579 - INFO - Successfully connected to Couchbase\n" - ] - } - ], - "source": [ - "try:\n", - " auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)\n", - " options = ClusterOptions(auth)\n", - " cluster = Cluster(CB_HOST, options)\n", - " cluster.wait_until_ready(timedelta(seconds=5))\n", - " logging.info(\"Successfully connected to Couchbase\")\n", - "except Exception as e:\n", - " raise ConnectionError(f\"Failed to connect to Couchbase: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "C_Gpy32N8mcZ" - }, - "source": [ - "## Setting Up Collections in Couchbase\n", - "\n", - "The setup_collection() function handles creating and configuring the hierarchical data organization in Couchbase:\n", - "\n", - "1. 
Bucket Creation:\n", - " - Checks if specified bucket exists, creates it if not\n", - " - Sets bucket properties like RAM quota (1024MB) and replication (disabled)\n", - " - Note: You will not be able to create a bucket on Capella\n", - "\n", - "\n", - "2. Scope Management: \n", - " - Verifies if requested scope exists within bucket\n", - " - Creates new scope if needed (unless it's the default \"_default\" scope)\n", - "\n", - "3. Collection Setup:\n", - " - Checks for collection existence within scope\n", - " - Creates collection if it doesn't exist\n", - " - Waits 2 seconds for collection to be ready\n", - "\n", - "Additional Tasks:\n", - "- Creates primary index on collection for query performance\n", - "- Clears any existing documents for clean state\n", - "- Implements comprehensive error handling and logging\n", - "\n", - "The function is called twice to set up:\n", - "1. Main collection for vector embeddings\n", - "2. Cache collection for storing results\n" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": { - "id": "ACZcwUnG8mf2" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-02-25 21:48:28,237 - INFO - Bucket 'vector-search-testing' does not exist. Creating it...\n", - "2025-02-25 21:48:28,800 - INFO - Bucket 'vector-search-testing' created successfully.\n", - "2025-02-25 21:48:28,802 - INFO - Scope 'shared' does not exist. Creating it...\n", - "2025-02-25 21:48:28,851 - INFO - Scope 'shared' created successfully.\n", - "2025-02-25 21:48:28,855 - INFO - Collection 'claude' does not exist. Creating it...\n", - "2025-02-25 21:48:28,943 - INFO - Collection 'claude' created successfully.\n", - "2025-02-25 21:48:32,802 - INFO - Primary index present or created successfully.\n", - "2025-02-25 21:48:41,954 - INFO - All documents cleared from the collection.\n", - "2025-02-25 21:48:41,955 - INFO - Bucket 'vector-search-testing' exists.\n", - "2025-02-25 21:48:41,959 - INFO - Collection 'cache' does not exist. Creating it...\n", - "2025-02-25 21:48:42,003 - INFO - Collection 'cache' created successfully.\n", - "2025-02-25 21:48:46,902 - INFO - Primary index present or created successfully.\n", - "2025-02-25 21:48:46,904 - INFO - All documents cleared from the collection.\n" - ] - }, - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "def setup_collection(cluster, bucket_name, scope_name, collection_name):\n", - " try:\n", - " # Check if bucket exists, create if it doesn't\n", - " try:\n", - " bucket = cluster.bucket(bucket_name)\n", - " logging.info(f\"Bucket '{bucket_name}' exists.\")\n", - " except Exception as e:\n", - " logging.info(f\"Bucket '{bucket_name}' does not exist. 
Creating it...\")\n", - " bucket_settings = CreateBucketSettings(\n", - " name=bucket_name,\n", - " bucket_type='couchbase',\n", - " ram_quota_mb=1024,\n", - " flush_enabled=True,\n", - " num_replicas=0\n", - " )\n", - " cluster.buckets().create_bucket(bucket_settings)\n", - " time.sleep(2) # Wait for bucket creation to complete and become available\n", - " bucket = cluster.bucket(bucket_name)\n", - " logging.info(f\"Bucket '{bucket_name}' created successfully.\")\n", - "\n", - " bucket_manager = bucket.collections()\n", - "\n", - " # Check if scope exists, create if it doesn't\n", - " scopes = bucket_manager.get_all_scopes()\n", - " scope_exists = any(scope.name == scope_name for scope in scopes)\n", - " \n", - " if not scope_exists and scope_name != \"_default\":\n", - " logging.info(f\"Scope '{scope_name}' does not exist. Creating it...\")\n", - " bucket_manager.create_scope(scope_name)\n", - " logging.info(f\"Scope '{scope_name}' created successfully.\")\n", - "\n", - " # Check if collection exists, create if it doesn't\n", - " collections = bucket_manager.get_all_scopes()\n", - " collection_exists = any(\n", - " scope.name == scope_name and collection_name in [col.name for col in scope.collections]\n", - " for scope in collections\n", - " )\n", - "\n", - " if not collection_exists:\n", - " logging.info(f\"Collection '{collection_name}' does not exist. Creating it...\")\n", - " bucket_manager.create_collection(scope_name, collection_name)\n", - " logging.info(f\"Collection '{collection_name}' created successfully.\")\n", - " else:\n", - " logging.info(f\"Collection '{collection_name}' already exists. Skipping creation.\")\n", - "\n", - " # Wait for collection to be ready\n", - " collection = bucket.scope(scope_name).collection(collection_name)\n", - " time.sleep(2) # Give the collection time to be ready for queries\n", - "\n", - " # Ensure primary index exists\n", - " try:\n", - " cluster.query(f\"CREATE PRIMARY INDEX IF NOT EXISTS ON `{bucket_name}`.`{scope_name}`.`{collection_name}`\").execute()\n", - " logging.info(\"Primary index present or created successfully.\")\n", - " except Exception as e:\n", - " logging.warning(f\"Error creating primary index: {str(e)}\")\n", - "\n", - " # Clear all documents in the collection\n", - " try:\n", - " query = f\"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`\"\n", - " cluster.query(query).execute()\n", - " logging.info(\"All documents cleared from the collection.\")\n", - " except Exception as e:\n", - " logging.warning(f\"Error while clearing documents: {str(e)}. The collection might be empty.\")\n", - "\n", - " return collection\n", - " except Exception as e:\n", - " raise RuntimeError(f\"Error setting up collection: {str(e)}\")\n", - " \n", - "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME)\n", - "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, CACHE_COLLECTION)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NMJ7RRYp8mjV" - }, - "source": [ - "# Loading Couchbase Vector Search Index\n", - "\n", - "Semantic search requires an efficient way to retrieve relevant documents based on a user's query. This is where the Couchbase **Vector Search Index** comes into play. In this step, we load the Vector Search Index definition from a JSON file, which specifies how the index should be structured. 
This includes the fields to be indexed, the dimensions of the vectors, and other parameters that determine how the search engine processes queries based on vector similarity.\n", - "\n", - "For more information on creating a vector search index, please follow the [instructions](https://docs.couchbase.com/cloud/vector-search/create-vector-search-index-ui.html).\n", - "\n", - "> Note: The provided index definition targets the `vector-search-testing` bucket used in this tutorial; index creation will not fail as long as it is applied to that bucket rather than, for example, `travel-sample`.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": { - "id": "y7xiCrOc8mmj" - }, - "outputs": [], - "source": [ - "# If you are running this script locally (not in Google Colab), uncomment the following line\n", - "# and provide the path to your index definition file.\n", - "\n", - "# index_definition_path = '/path_to_your_index_file/claude_index.json' # Local setup: specify your file path here\n", - "\n", - "# # Version for Google Colab\n", - "# def load_index_definition_colab():\n", - "#     from google.colab import files\n", - "#     print(\"Upload your index definition file\")\n", - "#     uploaded = files.upload()\n", - "#     index_definition_path = list(uploaded.keys())[0]\n", - "\n", - "#     try:\n", - "#         with open(index_definition_path, 'r') as file:\n", - "#             index_definition = json.load(file)\n", - "#         return index_definition\n", - "#     except Exception as e:\n", - "#         raise ValueError(f\"Error loading index definition from {index_definition_path}: {str(e)}\")\n", - "\n", - "# Version for Local Environment\n", - "def load_index_definition_local(index_definition_path):\n", - "    try:\n", - "        with open(index_definition_path, 'r') as file:\n", - "            index_definition = json.load(file)\n", - "        return index_definition\n", - "    except Exception as e:\n", - "        raise ValueError(f\"Error loading index definition from {index_definition_path}: {str(e)}\")\n", - "\n", - "# Usage\n", - "# Uncomment the appropriate line based on your environment\n", - "# index_definition = load_index_definition_colab()\n", - "index_definition = load_index_definition_local('claude_index.json')" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "v_ddPQ_Y8mpm" - }, - "source": [ - "# Creating or Updating Search Indexes\n", - "\n", - "With the index definition loaded, the next step is to create or update the **Vector Search Index** in Couchbase. This step is crucial because it optimizes our database for vector similarity search operations, allowing us to perform searches based on the semantic content of documents rather than just keywords. By creating or updating a Vector Search Index, we enable our search engine to handle complex queries that involve finding semantically similar documents using vector embeddings, which is essential for a robust semantic search engine."
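Before running the upsert below, it helps to know roughly what the loaded definition contains. The following is a minimal, illustrative outline only, not the actual contents of `claude_index.json`; the field names follow the general Couchbase Search index JSON format, and the 1536 dimensions are an assumption matching the `text-embedding-3-small` model used later:

```python
# Rough shape of a Search vector index definition (illustrative, not the real file).
index_definition_sketch = {
    "name": "vector_search_claude",         # matches INDEX_NAME above
    "type": "fulltext-index",
    "sourceName": "vector-search-testing",  # bucket being indexed
    "params": {
        "doc_config": {"mode": "scope.collection.type_field"},
        "mapping": {
            "types": {
                "shared.claude": {          # scope.collection holding the documents
                    "enabled": True,
                    "properties": {
                        "embedding": {      # vector field written by the vector store
                            "fields": [{
                                "name": "embedding",
                                "type": "vector",
                                "dims": 1536,                # text-embedding-3-small
                                "similarity": "dot_product"  # assumed metric
                            }]
                        }
                    }
                }
            }
        }
    }
}
```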
- ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": { - "id": "bHEpUu1l8msx" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-02-25 21:48:52,980 - INFO - Creating new index 'vector_search_claude'...\n", - "2025-02-25 21:48:53,069 - INFO - Index 'vector_search_claude' successfully created/updated.\n" - ] - } - ], - "source": [ - "try:\n", - " scope_index_manager = cluster.bucket(CB_BUCKET_NAME).scope(SCOPE_NAME).search_indexes()\n", - "\n", - " # Check if index already exists\n", - " existing_indexes = scope_index_manager.get_all_indexes()\n", - " index_name = index_definition[\"name\"]\n", - "\n", - " if index_name in [index.name for index in existing_indexes]:\n", - " logging.info(f\"Index '{index_name}' found\")\n", - " else:\n", - " logging.info(f\"Creating new index '{index_name}'...\")\n", - "\n", - " # Create SearchIndex object from JSON definition\n", - " search_index = SearchIndex.from_json(index_definition)\n", - "\n", - " # Upsert the index (create if not exists, update if exists)\n", - " scope_index_manager.upsert_index(search_index)\n", - " logging.info(f\"Index '{index_name}' successfully created/updated.\")\n", - "\n", - "except QueryIndexAlreadyExistsException:\n", - " logging.info(f\"Index '{index_name}' already exists. Skipping creation/update.\")\n", - "except ServiceUnavailableException:\n", - " raise RuntimeError(\"Search service is not available. Please ensure the Search service is enabled in your Couchbase cluster.\")\n", - "except InternalServerFailureException as e:\n", - " logging.error(f\"Internal server error: {str(e)}\")\n", - " raise" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "7FvxRsg38m3G" - }, - "source": [ - "# Creating OpenAI Embeddings\n", - "Embeddings are at the heart of semantic search. They are numerical representations of text that capture the semantic meaning of the words and phrases. Unlike traditional keyword-based search, which looks for exact matches, embeddings allow our search engine to understand the context and nuances of language, enabling it to retrieve documents that are semantically similar to the query, even if they don't contain the exact keywords. By creating embeddings using OpenAI, we equip our search engine with the ability to understand and process natural language in a way that's much closer to how humans understand language. This step transforms our raw text data into a format that the search engine can use to find and rank relevant documents.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": { - "id": "_75ZyCRh8m6m" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-02-25 21:48:56,274 - INFO - Successfully created OpenAIEmbeddings\n" - ] - } - ], - "source": [ - "try:\n", - " embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY, model='text-embedding-3-small')\n", - " logging.info(\"Successfully created OpenAIEmbeddings\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Error creating OpenAIEmbeddings: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "8IwZMUnF8m-N" - }, - "source": [ - "# Setting Up the Couchbase Vector Store\n", - "A vector store is where we'll keep our embeddings. Unlike the FTS index, which is used for text-based search, the vector store is specifically designed to handle embeddings and perform similarity searches. 
When a user inputs a query, the search engine converts the query into an embedding and compares it against the embeddings stored in the vector store. This allows the engine to find documents that are semantically similar to the query, even if they don't contain the exact same words. By setting up the vector store in Couchbase, we create a powerful tool that enables our search engine to understand and retrieve information based on the meaning and context of the query, rather than just the specific words used." - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": { - "id": "DwIJQjYT9RV_" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-02-25 21:48:59,450 - INFO - Successfully created vector store\n" - ] - } - ], - "source": [ - "try:\n", - " vector_store = CouchbaseSearchVectorStore(\n", - " cluster=cluster,\n", - " bucket_name=CB_BUCKET_NAME,\n", - " scope_name=SCOPE_NAME,\n", - " collection_name=COLLECTION_NAME,\n", - " embedding=embeddings,\n", - " index_name=INDEX_NAME,\n", - " )\n", - " logging.info(\"Successfully created vector store\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Failed to create vector store: {str(e)}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Load the BBC News Dataset\n", - "To build a search engine, we need data to search through. We use the BBC News dataset from RealTimeData, which provides real-world news articles. This dataset contains news articles from BBC covering various topics and time periods. Loading the dataset is a crucial step because it provides the raw material that our search engine will work with. The quality and diversity of the news articles make it an excellent choice for testing and refining our search engine, ensuring it can handle real-world news content effectively.\n", - "\n", - "The BBC News dataset allows us to work with authentic news articles, enabling us to build and test a search engine that can effectively process and retrieve relevant news content. The dataset is loaded using the Hugging Face datasets library, specifically accessing the \"RealTimeData/bbc_news_alltime\" dataset with the \"2024-12\" version." - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-02-25 21:49:09,255 - INFO - Successfully loaded the BBC News dataset with 2687 rows.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Loaded the BBC News dataset with 2687 rows\n" - ] - } - ], - "source": [ - "try:\n", - " news_dataset = load_dataset(\n", - " \"RealTimeData/bbc_news_alltime\", \"2024-12\", split=\"train\"\n", - " )\n", - " print(f\"Loaded the BBC News dataset with {len(news_dataset)} rows\")\n", - " logging.info(f\"Successfully loaded the BBC News dataset with {len(news_dataset)} rows.\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Error loading the BBC News dataset: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Cleaning up the Data\n", - "We will use the content of the news articles for our RAG system.\n", - "\n", - "The dataset contains a few duplicate records. We are removing them to avoid duplicate results in the retrieval stage of our RAG system." 
- ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "We have 1749 unique articles in our database.\n" - ] - } - ], - "source": [ - "news_articles = news_dataset[\"content\"]\n", - "unique_articles = set()\n", - "for article in news_articles:\n", - " if article:\n", - " unique_articles.add(article)\n", - "unique_news_articles = list(unique_articles)\n", - "print(f\"We have {len(unique_news_articles)} unique articles in our database.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Saving Data to the Vector Store\n", - "To efficiently handle the large number of articles, we process them in batches of articles at a time. This batch processing approach helps manage memory usage and provides better control over the ingestion process.\n", - "\n", - "We first filter out any articles that exceed 50,000 characters to avoid potential issues with token limits. Then, using the vector store's add_texts method, we add the filtered articles to our vector database. The batch_size parameter controls how many articles are processed in each iteration.\n", - "\n", - "This approach offers several benefits:\n", - "1. Memory Efficiency: Processing in smaller batches prevents memory overload\n", - "2. Progress Tracking: Easier to monitor and track the ingestion progress\n", - "3. Resource Management: Better control over CPU and network resource utilization\n", - "\n", - "We use a conservative batch size of 100 to ensure reliable operation.\n", - "The optimal batch size depends on many factors including:\n", - "- Document sizes being inserted\n", - "- Available system resources\n", - "- Network conditions\n", - "- Concurrent workload\n", - "\n", - "Consider measuring performance with your specific workload before adjusting.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-02-25 21:50:15,064 - INFO - Document ingestion completed successfully.\n" - ] - } - ], - "source": [ - "batch_size = 100\n", - "\n", - "# Automatic Batch Processing\n", - "articles = [article for article in unique_news_articles if article and len(article) <= 50000]\n", - "\n", - "try:\n", - " vector_store.add_texts(\n", - " texts=articles,\n", - " batch_size=batch_size\n", - " )\n", - " logging.info(\"Document ingestion completed successfully.\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Failed to save documents to vector store: {str(e)}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "8Pn8-dQw9RfQ" - }, - "source": [ - "# Setting Up a Couchbase Cache\n", - "To further optimize our system, we set up a Couchbase-based cache. A cache is a temporary storage layer that holds data that is frequently accessed, speeding up operations by reducing the need to repeatedly retrieve the same information from the database. In our setup, the cache will help us accelerate repetitive tasks, such as looking up similar documents. By implementing a cache, we enhance the overall performance of our search engine, ensuring that it can handle high query volumes and deliver results quickly.\n", - "\n", - "Caching is particularly valuable in scenarios where users may submit similar queries multiple times or where certain pieces of information are frequently requested. 
By storing these in a cache, we can significantly reduce the time it takes to respond to these queries, improving the user experience.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": { - "id": "V2y7dyjf9Rid" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-02-25 21:50:48,836 - INFO - Successfully created cache\n" - ] - } - ], - "source": [ - "try:\n", - " cache = CouchbaseCache(\n", - " cluster=cluster,\n", - " bucket_name=CB_BUCKET_NAME,\n", - " scope_name=SCOPE_NAME,\n", - " collection_name=CACHE_COLLECTION,\n", - " )\n", - " logging.info(\"Successfully created cache\")\n", - " set_llm_cache(cache)\n", - "except Exception as e:\n", - " raise ValueError(f\"Failed to create cache: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "uehAx36o9Rlm" - }, - "source": [ - "# Using the Claude 4 Sonnet Language Model (LLM)\n", - "Language models are AI systems that are trained to understand and generate human language. We'll be using the `Claude 4 Sonnet` language model to process user queries and generate meaningful responses. This model is a key component of our semantic search engine, allowing it to go beyond simple keyword matching and truly understand the intent behind a query. By creating this language model, we equip our search engine with the ability to interpret complex queries, understand the nuances of language, and provide more accurate and contextually relevant responses.\n", - "\n", - "The language model's ability to understand context and generate coherent responses is what makes our search engine truly intelligent. It can not only find the right information but also present it in a way that is useful and understandable to the user.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "yRAfBRLH9RpO" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-02-25 21:50:52,173 - INFO - Successfully created ChatAnthropic\n" - ] - } - ], - "source": [ - "try:\n", - " llm = ChatAnthropic(temperature=0.1, anthropic_api_key=ANTHROPIC_API_KEY, model_name='claude-sonnet-4-20250514') \n", - " logging.info(\"Successfully created ChatAnthropic\")\n", - "except Exception as e:\n", - " logging.error(f\"Error creating ChatAnthropic: {str(e)}. Please check your API key and network connection.\")\n", - " raise" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "k_XDfCx19UvG" - }, - "source": [ - "# Perform Semantic Search\n", - "Semantic search in Couchbase involves converting queries and documents into vector representations using an embeddings model. These vectors capture the semantic meaning of the text and are stored directly in Couchbase. When a query is made, Couchbase performs a similarity search by comparing the query vector against the stored document vectors. The similarity metric used for this comparison is configurable, allowing flexibility in how the relevance of documents is determined. \n", - "\n", - "In the provided code, the search process begins by recording the start time, followed by executing the similarity_search_with_score method of the CouchbaseSearchVectorStore. This method searches Couchbase for the most relevant documents based on the vector similarity to the query. The search results include the document content and a similarity score that reflects how closely each document aligns with the query in the defined semantic space. 
The time taken to perform this search is then calculated and logged, and the results are displayed, showing the most relevant documents along with their similarity scores. This approach leverages Couchbase as both a storage and retrieval engine for vector data, enabling efficient and scalable semantic searches. The integration of vector storage and search capabilities within Couchbase allows for sophisticated semantic search operations without relying on external services for vector storage or comparison." - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": { - "id": "Pk-oFbnC9Uym" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-02-25 21:53:55,462 - INFO - Semantic search completed in 0.55 seconds\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Semantic Search Results (completed in 0.55 seconds):\n", - "--------------------------------------------------------------------------------\n", - "Score: 0.7498, Text: A map shown during the draw for the 2026 Fifa World Cup has been criticised by Ukraine as an \"unacceptable error\" after it appeared to exclude Crimea as part of the country. The graphic - showing countries that cannot be drawn to play each other for geopolitical reasons - highlighted Ukraine but did not include the peninsula that is internationally recognised to be part of it. Crimea has been under Russian occupation since 2014 and just a handful of countries recognise the peninsula as Russian territory. Ukraine Foreign Ministry spokesman Heorhiy Tykhy said that the nation expects \"a public apology\". Fifa said it was \"aware of an issue\" and the image had been removed.\n", - "\n", - "Writing on X, Tykhy said that Fifa had not only \"acted against international law\" but had also \"supported Russian propaganda, war crimes, and the crime of aggression against Ukraine\". He added a \"fixed\" version of the map to his post, highlighting Crimea as part of Ukraine's territory. Among the countries that cannot play each other are Ukraine and Belarus, Spain and Gibraltar and Kosovo versus either Bosnia and Herzegovina or Serbia.\n", - "\n", - "This Twitter post cannot be displayed in your browser. Please enable Javascript or try a different browser. View original content on Twitter The BBC is not responsible for the content of external sites. Skip twitter post by Heorhii Tykhyi This article contains content provided by Twitter. We ask for your permission before anything is loaded, as they may be using cookies and other technologies. You may want to read Twitter’s cookie policy, external and privacy policy, external before accepting. To view this content choose ‘accept and continue’. The BBC is not responsible for the content of external sites.\n", - "\n", - "The Ukrainian Football Association has also sent a letter to Fifa secretary-general Mathias Grafström and UEFA secretary-general Theodore Theodoridis over the matter. \"We appeal to you to express our deep concern about the infographic map [shown] on December 13, 2024,\" the letter reads. \"Taking into account a number of official decisions and resolutions adopted by the Fifa Council and the UEFA executive committee since 2014... we emphasize that today's version of the cartographic image of Ukraine... is completely unacceptable and looks like an inconsistent position of Fifa and UEFA.\" The 2026 World Cup will start on 11 June that year in Mexico City and end on 19 July in New Jersey. 
The expanded 48-team tournament will last a record 39 days. Ukraine were placed in Group D alongside Iceland, Azerbaijan and the yet-to-be-determined winners of France's Nations League quarter-final against Croatia.\n", - "--------------------------------------------------------------------------------\n", - "Score: 0.4302, Text: Defending champions Manchester City will face Juventus in the group stage of the Fifa Club World Cup next summer, while Chelsea meet Brazilian side Flamengo. Pep Guardiola's City, who beat Brazilian side Fluminense to win the tournament for the first time in 2023, begin their title defence against Morocco's Wydad and also play Al Ain of the United Arab Emirates in Group G. Chelsea, winners of the 2021 final, were also drawn alongside Mexico's Club Leon and Tunisian side Esperance Sportive de Tunisie in Group D. The revamped Fifa Club World Cup, which has been expanded to 32 teams, will take place in the United States between 15 June and 13 July next year.\n", - "\n", - "A complex and lengthy draw ceremony was held across two separate Miami locations and lasted more than 90 minutes, during which a new Club World Cup trophy was revealed. There was also a video message from incoming US president Donald Trump, whose daughter Ivanka drew the first team. Lionel Messi's Inter Miami will take on Egyptian side Al Ahly at the Hard Rock Stadium in the opening match, staged in Miami. Elsewhere, Paris St-Germain were drawn against Atletico Madrid in Group B, while Bayern Munich meet Benfica in another all-European group-stage match-up. Teams will play each other once in the group phase and the top two will progress to the knockout stage.\n", - "\n", - "This video can not be played To play this video you need to enable JavaScript in your browser. What is the Club World Cup?\n", - "\n", - "Teams from each of the six international football confederations will be represented at next summer's tournament, including 12 European clubs - the highest quota of any confederation. The European places were decided by clubs' Champions League performances over the past four seasons, with recent winners Chelsea, Manchester City and Real Madrid guaranteed places. Al Ain, the most successful club in the UAE with 14 league titles, are owned by the country's president Sheikh Mohamed bin Zayed Al Nahyan - the older brother of City owner Sheikh Mansour. Real, who lifted the Fifa Club World Cup trophy for a record-extending fifth time in 2022, will open up against Saudi Pro League champions Al-Hilal, who currently have Neymar in their ranks. One place was reserved for a club from the host nation, which Fifa controversially awarded to Inter Miami, who will contest the tournament curtain-raiser. Messi's side were winners of the regular-season MLS Supporters' Shield but beaten in the MLS play-offs, meaning they are not this season's champions.\n", - "• None How does the new Club World Cup work & why is it so controversial?\n", - "\n", - "Matches will be played across 12 venues in the US which, alongside Canada and Mexico, also host the 2026 World Cup. Fifa is facing legal action from player unions and leagues about the scheduling of the event, which begins two weeks after the Champions League final at the end of the 2024-25 European calendar and ends five weeks before the first Premier League match of the 2025-2026 season. But football's world governing body believes the dates allow sufficient rest time before the start of the domestic campaigns. 
The Club World Cup will now take place once every four years, when it was previously held annually and involved just seven teams. Streaming platform DAZN has secured exclusive rights to broadcast next summer's tournament, during which 63 matches will take place over 29 days.\n", - "--------------------------------------------------------------------------------\n", - "Score: 0.4207, Text: After Fifa awards Saudi Arabia the hosting rights for the men's 2034 World Cup, BBC analysis editor Ros Atkins looks at how we got here and the controversies surrounding the decision.\n", - "--------------------------------------------------------------------------------\n", - "Score: 0.4123, Text: FA still to decide on endorsing Saudi World Cup bid\n", - "... (output truncated for brevity)\n" - ] - } - ], - "source": [ - "query = \"What happened with the map shown during the 2026 FIFA World Cup draw regarding Ukraine and Crimea? What was the controversy?\"\n", - "\n", - "try:\n", - " # Perform the semantic search\n", - " start_time = time.time()\n", - " search_results = vector_store.similarity_search_with_score(query, k=10)\n", - " search_elapsed_time = time.time() - start_time\n", - "\n", - " logging.info(f\"Semantic search completed in {search_elapsed_time:.2f} seconds\")\n", - "\n", - " # Display search results\n", - " print(f\"\\nSemantic Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", - " print(\"-\" * 80) # Add separator line\n", - " for doc, score in search_results:\n", - " print(f\"Score: {score:.4f}, Text: {doc.page_content}\")\n", - " print(\"-\" * 80) # Add separator between results\n", - "\n", - "except CouchbaseException as e:\n", - " raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n", - "except Exception as e:\n", - " raise RuntimeError(f\"Unexpected error: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "sS0FebHI9U1l" - }, - "source": [ - "# Retrieval-Augmented Generation (RAG) with Couchbase and LangChain\n", - "Couchbase and LangChain can be seamlessly integrated to create RAG (Retrieval-Augmented Generation) chains, enhancing the process of generating contextually relevant responses. In this setup, Couchbase serves as the vector store, where embeddings of documents are stored. When a query is made, LangChain retrieves the most relevant documents from Couchbase by comparing the query’s embedding with the stored document embeddings. These documents, which provide contextual information, are then passed to a generative language model within LangChain.\n", - "\n", - "The language model, equipped with the context from the retrieved documents, generates a response that is both informed and contextually accurate. This integration allows the RAG chain to leverage Couchbase’s efficient storage and retrieval capabilities, while LangChain handles the generation of responses based on the context provided by the retrieved documents. Together, they create a powerful system that can deliver highly relevant and accurate answers by combining the strengths of both retrieval and generation." 
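As an aside before the chain is assembled below: the notebook wires retrieval in manually through a lambda, but the same chain can be expressed with LangChain's generic retriever interface. A minimal sketch, assuming the `vector_store`, `chat_prompt`, and `llm` objects created earlier (`as_retriever()` comes from the base LangChain `VectorStore` class, nothing Couchbase-specific):

```python
# Sketch: equivalent RAG chain built on a retriever instead of a lambda.
from langchain_core.runnables import RunnablePassthrough

retriever = vector_store.as_retriever(search_kwargs={"k": 4})

def format_docs(docs):
    # Join the retrieved documents into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain_alt = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | chat_prompt
    | llm
)
```

Piping the retriever into `format_docs` relies on LCEL coercing a plain function into a runnable, the same mechanism the cell below uses for its lambda.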
- ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": { - "id": "ZGUXQQmv9ge4" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-02-25 21:54:00,781 - INFO - Successfully created RAG chain\n" - ] - } - ], - "source": [ - "system_template = \"You are a helpful assistant that answers questions based on the provided context.\"\n", - "system_message_prompt = SystemMessagePromptTemplate.from_template(system_template)\n", - "\n", - "human_template = \"Context: {context}\\n\\nQuestion: {question}\"\n", - "human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)\n", - "\n", - "chat_prompt = ChatPromptTemplate.from_messages([\n", - " system_message_prompt,\n", - " human_message_prompt\n", - "])\n", - "\n", - "def format_docs(docs):\n", - " return \"\\n\\n\".join(doc.page_content for doc in docs)\n", - "\n", - "rag_chain = (\n", - " {\"context\": lambda x: format_docs(vector_store.similarity_search(x)), \"question\": RunnablePassthrough()}\n", - " | chat_prompt\n", - " | llm\n", - ")\n", - "logging.info(\"Successfully created RAG chain\")" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "metadata": { - "id": "Mia7XxM9978M" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "RAG Response: During the draw for the 2026 FIFA World Cup, a map was shown that excluded Crimea as part of Ukraine. This graphic, which was displaying countries that cannot be drawn to play each other for geopolitical reasons, highlighted Ukraine but did not include the Crimean peninsula, which is internationally recognized as Ukrainian territory.\n", - "\n", - "This omission sparked significant controversy because Crimea has been under Russian occupation since 2014, but only a handful of countries recognize it as Russian territory. The Ukrainian Foreign Ministry spokesman, Heorhiy Tykhy, called this an \"unacceptable error\" and stated that Ukraine expected \"a public apology\" from FIFA. He criticized FIFA for acting \"against international law\" and supporting \"Russian propaganda, war crimes, and the crime of aggression against Ukraine.\"\n", - "\n", - "The Ukrainian Football Association also sent a formal letter of complaint to FIFA and UEFA officials expressing their \"deep concern\" about the cartographic representation. FIFA acknowledged they were \"aware of an issue\" and subsequently removed the image.\n", - "RAG response generated in 6.58 seconds\n" - ] - } - ], - "source": [ - "try:\n", - " start_time = time.time()\n", - " rag_response = rag_chain.invoke(query)\n", - " rag_elapsed_time = time.time() - start_time\n", - "\n", - " print(f\"RAG Response: {rag_response.content}\")\n", - " print(f\"RAG response generated in {rag_elapsed_time:.2f} seconds\")\n", - "except AuthenticationError as e:\n", - " print(f\"Authentication error: {str(e)}\")\n", - "except InternalServerFailureException as e:\n", - " if \"query request rejected\" in str(e):\n", - " print(\"Error: Search request was rejected due to rate limiting. 
Please try again later.\")\n", - " else:\n", - " print(f\"Internal server error occurred: {str(e)}\")\n", - "except Exception as e:\n", - " print(f\"Unexpected error occurred: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "aIdayPzw9glT" - }, - "source": [ - "# Using Couchbase as a caching mechanism\n", - "Couchbase can be effectively used as a caching mechanism for RAG (Retrieval-Augmented Generation) responses by storing and retrieving precomputed results for specific queries. This approach enhances the system's efficiency and speed, particularly when dealing with repeated or similar queries. When a query is first processed, the RAG chain retrieves relevant documents, generates a response using the language model, and then stores this response in Couchbase, with the query serving as the key.\n", - "\n", - "For subsequent requests with the same query, the system checks Couchbase first. If a cached response is found, it is retrieved directly from Couchbase, bypassing the need to re-run the entire RAG process. This significantly reduces response time because the computationally expensive steps of document retrieval and response generation are skipped. Couchbase's role in this setup is to provide a fast and scalable storage solution for caching these responses, ensuring that frequently asked queries can be answered more quickly and efficiently.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "metadata": { - "id": "0xM2G3ef-GS2" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Query 1: What happened when Apple's AI feature generated a false BBC headline about a murder case in New York?\n", - "Response: According to the context, Apple Intelligence (an AI feature that summarizes notifications) generated a false headline that made it appear as if BBC News had published an article claiming Luigi Mangione, who was arrested for the murder of healthcare insurance CEO Brian Thompson in New York, had shot himself. This was completely false - Mangione had not shot himself.\n", - "\n", - "The BBC complained to Apple about this misrepresentation, with a BBC spokesperson stating they had \"contacted Apple to raise this concern and fix the problem.\" The BBC emphasized that as \"the most trusted news media in the world,\" it's essential that audiences can trust information published in their name, including notifications.\n", - "\n", - "This wasn't an isolated incident - the context mentions that Apple's AI feature also misrepresented a New York Times article, incorrectly summarizing it as \"Netanyahu arrested\" when the actual article was about the International Criminal Court issuing an arrest warrant for the Israeli prime minister.\n", - "Time taken: 6.66 seconds\n", - "\n", - "Query 2: What happened with the map shown during the 2026 FIFA World Cup draw regarding Ukraine and Crimea? What was the controversy?\n", - "Response: During the draw for the 2026 FIFA World Cup, a map was shown that excluded Crimea as part of Ukraine. This graphic, which was displaying countries that cannot be drawn to play each other for geopolitical reasons, highlighted Ukraine but did not include the Crimean peninsula, which is internationally recognized as Ukrainian territory.\n", - "\n", - "This omission sparked significant controversy because Crimea has been under Russian occupation since 2014, but only a handful of countries recognize it as Russian territory. 
The Ukrainian Foreign Ministry spokesman, Heorhiy Tykhy, called this an \"unacceptable error\" and stated that Ukraine expected \"a public apology\" from FIFA. He criticized FIFA for acting \"against international law\" and supporting \"Russian propaganda, war crimes, and the crime of aggression against Ukraine.\"\n", - "\n", - "The Ukrainian Football Association also sent a formal letter of complaint to FIFA and UEFA officials expressing their \"deep concern\" about the cartographic representation. FIFA acknowledged they were \"aware of an issue\" and subsequently removed the image.\n", - "Time taken: 0.62 seconds\n", - "\n", - "Query 3: What happened when Apple's AI feature generated a false BBC headline about a murder case in New York?\n", - "Response: According to the context, Apple Intelligence (an AI feature that summarizes notifications) generated a false headline that made it appear as if BBC News had published an article claiming Luigi Mangione, who was arrested for the murder of healthcare insurance CEO Brian Thompson in New York, had shot himself. This was completely false - Mangione had not shot himself.\n", - "\n", - "The BBC complained to Apple about this misrepresentation, with a BBC spokesperson stating they had \"contacted Apple to raise this concern and fix the problem.\" The BBC emphasized that as \"the most trusted news media in the world,\" it's essential that audiences can trust information published in their name, including notifications.\n", - "\n", - "This wasn't an isolated incident - the context mentions that Apple's AI feature also misrepresented a New York Times article, incorrectly summarizing it as \"Netanyahu arrested\" when the actual article was about the International Criminal Court issuing an arrest warrant for the Israeli prime minister.\n", - "Time taken: 0.51 seconds\n" - ] - } - ], - "source": [ - "try:\n", - " queries = [\n", - " \"What happened when Apple's AI feature generated a false BBC headline about a murder case in New York?\",\n", - " \"What happened with the map shown during the 2026 FIFA World Cup draw regarding Ukraine and Crimea? What was the controversy?\", # Repeated query\n", - " \"What happened when Apple's AI feature generated a false BBC headline about a murder case in New York?\", # Repeated query\n", - " ]\n", - "\n", - " for i, query in enumerate(queries, 1):\n", - " print(f\"\\nQuery {i}: {query}\")\n", - " start_time = time.time()\n", - "\n", - " response = rag_chain.invoke(query)\n", - " elapsed_time = time.time() - start_time\n", - " print(f\"Response: {response.content}\")\n", - " print(f\"Time taken: {elapsed_time:.2f} seconds\")\n", - "except AuthenticationError as e:\n", - " print(f\"Authentication error: {str(e)}\")\n", - "except InternalServerFailureException as e:\n", - " if \"query request rejected\" in str(e):\n", - " print(\"Error: Search request was rejected due to rate limiting. Please try again later.\")\n", - " else:\n", - " print(f\"Internal server error occurred: {str(e)}\")\n", - "except Exception as e:\n", - " print(f\"Unexpected error occurred: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "yJQ5P8E29go1" - }, - "source": [ - "## Conclusion\n", - "By following these steps, you’ll have a fully functional semantic search engine that leverages the strengths of Couchbase and Claude(by Anthropic). 
This guide is designed not just to show you how to build the system, but also to explain why each step is necessary, giving you a deeper understanding of the principles behind semantic search and how to implement it effectively. Whether you’re a newcomer to software development or an experienced developer looking to expand your skills, this guide will provide you with the knowledge and tools you need to create a powerful, AI-driven search engine." - ] - } - ], - "metadata": { - "colab": { - "provenance": [], - "toc_visible": true - }, - "kernelspec": { - "display_name": "Python 3", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.3" - } - }, - "nbformat": 4, - "nbformat_minor": 0 -} diff --git a/claudeai/gsi/RAG_with_Couchbase_and_Claude(by_Anthropic).ipynb b/claudeai/gsi/RAG_with_Couchbase_and_Claude(by_Anthropic).ipynb deleted file mode 100644 index 3efc06d0..00000000 --- a/claudeai/gsi/RAG_with_Couchbase_and_Claude(by_Anthropic).ipynb +++ /dev/null @@ -1,1089 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "kNdImxzypDlm" - }, - "source": [ - "# Introduction\n", - "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database, [OpenAI](https://openai.com/) as the AI-powered embedding and [Anthropic](https://claude.ai/) as the language model provider. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system using GSI( Global Secondary Index) from scratch. Alternatively if you want to perform semantic search using the FTS index, please take a look at [this.](https://developer.couchbase.com/tutorial-openai-claude-couchbase-rag-with-fts/)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# How to run this tutorial\n", - "\n", - "This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/claudeai/gsi/RAG_with_Couchbase_and_Claude(by_Anthropic).ipynb).\n", - "\n", - "You can either download the notebook file and run it on [Google Colab](https://colab.research.google.com/) or run it on your system by setting up the Python environment." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Before you start\n", - "\n", - "## Get Credentials for OpenAI and Anthropic\n", - "\n", - "* Please follow the [instructions](https://platform.openai.com/docs/quickstart) to generate the OpenAI credentials.\n", - "* Please follow the [instructions](https://docs.anthropic.com/en/api/getting-started) to generate the Anthropic credentials.\n", - "\n", - "## Create and Deploy Your Free Tier Operational cluster on Capella\n", - "\n", - "To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. 
This account provides you with an environment where you can explore and learn about Capella with no time constraint.\n", - "\n", - "To learn more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n", - "\n", - "Note: To run this tutorial, you will need Capella with Couchbase Server version 8.0 or above, as GSI vector search is supported only from version 8.0.\n", - "\n", - "### Couchbase Capella Configuration\n", - "\n", - "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met.\n", - "\n", - "* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the required bucket (Read and Write) used in the application.\n", - "\n", - "* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NH2o6pqa69oG" - }, - "source": [ - "# Setting the Stage: Installing Necessary Libraries\n", - "To build our semantic search engine, we need a robust set of tools. The libraries we install handle everything from connecting to databases to performing complex machine learning tasks. Each library has a specific role: Couchbase libraries manage database operations, LangChain handles AI model integrations, and OpenAI provides advanced AI models for generating embeddings and Claude(by Anthropic) for understanding natural language. By setting up these libraries, we ensure our environment is equipped to handle the data-intensive and computationally complex tasks required for semantic search." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "DYhPj0Ta8l_A" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Note: you may need to restart the kernel to use updated packages.\n" - ] - } - ], - "source": [ - "%pip install --quiet datasets==3.5.0 langchain-couchbase==0.5.0 langchain-anthropic==0.3.19 langchain-openai==0.3.32 python-dotenv==1.1.1" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "1pp7GtNg8mB9" - }, - "source": [ - "# Importing Necessary Libraries\n", - "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, embedding generation, and dataset loading. These libraries provide essential functions for working with data, managing database connections, and processing machine learning models."
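Three of the imports in the next cell are specific to this query-based variant of the tutorial; a short orientation (the role descriptions are inferred from the names and from how this series distinguishes query-based from search-based indexes, so treat them as a guide rather than API documentation):

```python
# Query-based-specific imports (all from langchain_couchbase.vectorstores, see below):
#   CouchbaseQueryVectorStore - vector store backed by Query-service vector indexes
#   DistanceStrategy          - selects the distance metric used for similarity
#   IndexType                 - selects which kind of vector index to create
```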
- ] - }, - { - "cell_type": "code", - "execution_count": 19, - "metadata": { - "id": "8GzS6tfL8mFP" - }, - "outputs": [], - "source": [ - "import getpass\n", - "import json\n", - "import logging\n", - "import os\n", - "import time\n", - "from datetime import timedelta\n", - "from multiprocessing import AuthenticationError\n", - "\n", - "from couchbase.auth import PasswordAuthenticator\n", - "from couchbase.cluster import Cluster\n", - "from couchbase.exceptions import (CouchbaseException,\n", - "                                  InternalServerFailureException,\n", - "                                  QueryIndexAlreadyExistsException,\n", - "                                  ServiceUnavailableException)\n", - "from couchbase.management.buckets import CreateBucketSettings\n", - "from couchbase.management.search import SearchIndex\n", - "from couchbase.options import ClusterOptions\n", - "from datasets import load_dataset\n", - "from dotenv import load_dotenv\n", - "from langchain_anthropic import ChatAnthropic\n", - "from langchain_core.globals import set_llm_cache\n", - "from langchain_core.prompts.chat import (ChatPromptTemplate,\n", - "                                         HumanMessagePromptTemplate,\n", - "                                         SystemMessagePromptTemplate)\n", - "from langchain_core.runnables import RunnablePassthrough\n", - "from langchain_couchbase.cache import CouchbaseCache\n", - "from langchain_couchbase.vectorstores import (CouchbaseQueryVectorStore,\n", - "                                              DistanceStrategy, IndexType)\n", - "from langchain_openai import OpenAIEmbeddings" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pBnMp5vb8mIb" - }, - "source": [ - "# Setup Logging\n", - "Logging is configured to track the progress of the script and capture any errors or warnings. This is crucial for debugging and understanding the flow of execution. The logging output includes timestamps, log levels (e.g., INFO, ERROR), and messages that describe what is happening in the script.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": { - "id": "Yv8kWcuf8mLx" - }, - "outputs": [], - "source": [ - "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', force=True)\n", - "\n", - "# Disable all logging except critical to prevent OpenAI API request logs\n", - "logging.getLogger(\"httpx\").setLevel(logging.CRITICAL)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "K9G5a0en8mPA" - }, - "source": [ - "# Loading Sensitive Information\n", - "In this section, we prompt the user to input the essential configuration settings. These settings include sensitive information like API keys, database credentials, and specific configuration names. Instead of hardcoding these details into the script, we request the user to provide them at runtime, ensuring flexibility and security.\n", - "\n", - "The script also validates that all required inputs are provided, raising an error if any crucial information is missing. This approach ensures that your integration is both secure and correctly configured without hardcoding sensitive information, enhancing the overall security and maintainability of your code."
- ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": { - "id": "PFGyHll18mSe" - }, - "outputs": [], - "source": [ - "load_dotenv()\n", - "\n", - "# Load from environment variables or prompt for input in one-liners\n", - "ANTHROPIC_API_KEY = os.getenv('ANTHROPIC_API_KEY') or getpass.getpass('Enter your Anthropic API key: ')\n", - "OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or getpass.getpass('Enter your OpenAI API key: ')\n", - "CB_HOST = os.getenv('CB_HOST', 'couchbase://localhost') or input('Enter your Couchbase host (default: couchbase://localhost): ') or 'couchbase://localhost'\n", - "CB_USERNAME = os.getenv('CB_USERNAME', 'Administrator') or input('Enter your Couchbase username (default: Administrator): ') or 'Administrator'\n", - "CB_PASSWORD = os.getenv('CB_PASSWORD', 'password') or getpass.getpass('Enter your Couchbase password (default: password): ') or 'password'\n", - "CB_BUCKET_NAME = os.getenv('CB_BUCKET_NAME', 'query-vector-search-testing') or input('Enter your Couchbase bucket name (default: query-vector-search-testing): ') or 'query-vector-search-testing'\n", - "SCOPE_NAME = os.getenv('SCOPE_NAME', 'shared') or input('Enter your scope name (default: shared): ') or 'shared'\n", - "COLLECTION_NAME = os.getenv('COLLECTION_NAME', 'claude') or input('Enter your collection name (default: claude): ') or 'claude'\n", - "CACHE_COLLECTION = os.getenv('CACHE_COLLECTION', 'cache') or input('Enter your cache collection name (default: cache): ') or 'cache'\n", - "# Check if the variables are correctly loaded\n", - "if not ANTHROPIC_API_KEY:\n", - " raise ValueError(\"ANTHROPIC_API_KEY is not set in the environment.\")\n", - "if not OPENAI_API_KEY:\n", - " raise ValueError(\"OPENAI_API_KEY is not set in the environment.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "qtGrYzUY8mV3" - }, - "source": [ - "# Connecting to the Couchbase Cluster\n", - "Connecting to a Couchbase cluster is the foundation of our project. Couchbase will serve as our primary data store, handling all the storage and retrieval operations required for our semantic search engine. By establishing this connection, we enable our application to interact with the database, allowing us to perform operations such as storing embeddings, querying data, and managing collections. This connection is the gateway through which all data will flow, so ensuring it's set up correctly is paramount.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": { - "id": "Zb3kK-7W8mZK" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-09 12:15:22,899 - INFO - Successfully connected to Couchbase\n" - ] - } - ], - "source": [ - "try:\n", - " auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)\n", - " options = ClusterOptions(auth)\n", - " cluster = Cluster(CB_HOST, options)\n", - " cluster.wait_until_ready(timedelta(seconds=5))\n", - " logging.info(\"Successfully connected to Couchbase\")\n", - "except Exception as e:\n", - " raise ConnectionError(f\"Failed to connect to Couchbase: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "C_Gpy32N8mcZ" - }, - "source": [ - "## Setting Up Collections in Couchbase\n", - "\n", - "The setup_collection() function handles creating and configuring the hierarchical data organization in Couchbase:\n", - "\n", - "1. 
Bucket Creation:\n", - " - Checks if specified bucket exists, creates it if not\n", - " - Sets bucket properties like RAM quota (1024MB) and replication (disabled)\n", - " - Note: You will not be able to create a bucket on Capella\n", - "\n", - "\n", - "2. Scope Management: \n", - " - Verifies if requested scope exists within bucket\n", - " - Creates new scope if needed (unless it's the default \"_default\" scope)\n", - "\n", - "3. Collection Setup:\n", - " - Checks for collection existence within scope\n", - " - Creates collection if it doesn't exist\n", - " - Waits 2 seconds for collection to be ready\n", - "\n", - "Additional Tasks:\n", - "- Clears any existing documents for clean state\n", - "- Implements comprehensive error handling and logging\n", - "\n", - "The function is called twice to set up:\n", - "1. Main collection for vector embeddings\n", - "2. Cache collection for storing results\n" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": { - "id": "ACZcwUnG8mf2" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-09 12:15:26,795 - INFO - Bucket 'query-vector-search-testing' exists.\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-09 12:15:26,808 - INFO - Collection 'claude' does not exist. Creating it...\n", - "2025-09-09 12:15:26,854 - INFO - Collection 'claude' created successfully.\n", - "2025-09-09 12:15:29,065 - INFO - All documents cleared from the collection.\n", - "2025-09-09 12:15:29,066 - INFO - Bucket 'query-vector-search-testing' exists.\n", - "2025-09-09 12:15:29,074 - INFO - Collection 'cache' already exists. Skipping creation.\n", - "2025-09-09 12:15:31,115 - INFO - All documents cleared from the collection.\n" - ] - }, - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "def setup_collection(cluster, bucket_name, scope_name, collection_name):\n", - " try:\n", - " # Check if bucket exists, create if it doesn't\n", - " try:\n", - " bucket = cluster.bucket(bucket_name)\n", - " logging.info(f\"Bucket '{bucket_name}' exists.\")\n", - " except Exception as e:\n", - " logging.info(f\"Bucket '{bucket_name}' does not exist. Creating it...\")\n", - " bucket_settings = CreateBucketSettings(\n", - " name=bucket_name,\n", - " bucket_type='couchbase',\n", - " ram_quota_mb=1024,\n", - " flush_enabled=True,\n", - " num_replicas=0\n", - " )\n", - " cluster.buckets().create_bucket(bucket_settings)\n", - " time.sleep(2) # Wait for bucket creation to complete and become available\n", - " bucket = cluster.bucket(bucket_name)\n", - " logging.info(f\"Bucket '{bucket_name}' created successfully.\")\n", - "\n", - " bucket_manager = bucket.collections()\n", - "\n", - " # Check if scope exists, create if it doesn't\n", - " scopes = bucket_manager.get_all_scopes()\n", - " scope_exists = any(scope.name == scope_name for scope in scopes)\n", - " \n", - " if not scope_exists and scope_name != \"_default\":\n", - " logging.info(f\"Scope '{scope_name}' does not exist. 
Creating it...\")\n", - " bucket_manager.create_scope(scope_name)\n", - " logging.info(f\"Scope '{scope_name}' created successfully.\")\n", - "\n", - " # Check if collection exists, create if it doesn't\n", - " collections = bucket_manager.get_all_scopes()\n", - " collection_exists = any(\n", - " scope.name == scope_name and collection_name in [col.name for col in scope.collections]\n", - " for scope in collections\n", - " )\n", - "\n", - " if not collection_exists:\n", - " logging.info(f\"Collection '{collection_name}' does not exist. Creating it...\")\n", - " bucket_manager.create_collection(scope_name, collection_name)\n", - " logging.info(f\"Collection '{collection_name}' created successfully.\")\n", - " else:\n", - " logging.info(f\"Collection '{collection_name}' already exists. Skipping creation.\")\n", - "\n", - " # Wait for collection to be ready\n", - " collection = bucket.scope(scope_name).collection(collection_name)\n", - " time.sleep(2) # Give the collection time to be ready for queries\n", - "\n", - " # Clear all documents in the collection\n", - " try:\n", - " query = f\"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`\"\n", - " cluster.query(query).execute()\n", - " logging.info(\"All documents cleared from the collection.\")\n", - " except Exception as e:\n", - " logging.warning(f\"Error while clearing documents: {str(e)}. The collection might be empty.\")\n", - "\n", - " return collection\n", - " except Exception as e:\n", - " raise RuntimeError(f\"Error setting up collection: {str(e)}\")\n", - " \n", - "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME)\n", - "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, CACHE_COLLECTION)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "7FvxRsg38m3G" - }, - "source": [ - "# Creating OpenAI Embeddings\n", - "Embeddings are at the heart of semantic search. They are numerical representations of text that capture the semantic meaning of the words and phrases. Unlike traditional keyword-based search, which looks for exact matches, embeddings allow our search engine to understand the context and nuances of language, enabling it to retrieve documents that are semantically similar to the query, even if they don't contain the exact keywords. By creating embeddings using OpenAI, we equip our search engine with the ability to understand and process natural language in a way that's much closer to how humans understand language. This step transforms our raw text data into a format that the search engine can use to find and rank relevant documents.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": { - "id": "_75ZyCRh8m6m" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-09 12:15:54,388 - INFO - Successfully created OpenAIEmbeddings\n" - ] - } - ], - "source": [ - "try:\n", - " embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY, model='text-embedding-3-small')\n", - " logging.info(\"Successfully created OpenAIEmbeddings\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Error creating OpenAIEmbeddings: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "8IwZMUnF8m-N" - }, - "source": [ - "# Setting Up the Couchbase Query Vector Store\n", - "A vector store is where we'll keep our embeddings. The query vector store is specifically designed to handle embeddings and perform similarity searches. 
When a user inputs a query, the query is converted into an embedding (by the same embedding model used for the documents) and compared against the embeddings stored in the vector store. This allows the engine to find documents that are semantically similar to the query, even if they don't contain the exact same words. By setting up the vector store in Couchbase, we create a powerful tool that enables us to understand and retrieve information based on the meaning and context of the query, rather than just the specific words used.\n", - "\n", - "The vector store requires a distance metric to determine how similarity between vectors is calculated. This is crucial for accurate semantic search results, as different distance metrics can yield different similarity rankings. Some of the supported distance strategies are dot, l2, euclidean, cosine, l2_squared, and euclidean_squared. In our implementation we will use cosine, which is particularly effective for text embeddings." - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": { - "id": "DwIJQjYT9RV_" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-09 12:16:02,578 - INFO - Successfully created vector store\n" - ] - } - ], - "source": [ - "try:\n", - " vector_store = CouchbaseQueryVectorStore(\n", - " cluster=cluster,\n", - " bucket_name=CB_BUCKET_NAME,\n", - " scope_name=SCOPE_NAME,\n", - " collection_name=COLLECTION_NAME,\n", - " embedding=embeddings,\n", - " distance_metric=DistanceStrategy.COSINE\n", - " )\n", - " logging.info(\"Successfully created vector store\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Failed to create vector store: {str(e)}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Load the BBC News Dataset\n", - "To build a search engine, we need data to search through. We use the BBC News dataset from RealTimeData, which provides real-world news articles. This dataset contains news articles from BBC covering various topics and time periods. Loading the dataset is a crucial step because it provides the raw material that our search engine will work with. The quality and diversity of the news articles make it an excellent choice for testing and refining our search engine, ensuring it can handle real-world news content effectively.\n", - "\n", - "The BBC News dataset allows us to work with authentic news articles, enabling us to build and test a search engine that can effectively process and retrieve relevant news content. The dataset is loaded using the Hugging Face datasets library, specifically accessing the \"RealTimeData/bbc_news_alltime\" dataset with the \"2024-12\" version."
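- , - "\n", - "Once the cell below has loaded the dataset, you can optionally peek at a single record to see the fields available. This is just an illustrative sketch; it assumes the standard Hugging Face datasets row-access API and the 'content' field used later in this tutorial:\n", - "\n", - "```python\n", - "# Illustrative: print the first 200 characters of the first article\n", - "print(news_dataset[0][\"content\"][:200])\n", - "```"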
- ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-09 12:16:16,461 - INFO - Successfully loaded the BBC News dataset with 2687 rows.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Loaded the BBC News dataset with 2687 rows\n" - ] - } - ], - "source": [ - "try:\n", - " news_dataset = load_dataset(\n", - " \"RealTimeData/bbc_news_alltime\", \"2024-12\", split=\"train\"\n", - " )\n", - " print(f\"Loaded the BBC News dataset with {len(news_dataset)} rows\")\n", - " logging.info(f\"Successfully loaded the BBC News dataset with {len(news_dataset)} rows.\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Error loading the BBC News dataset: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Cleaning up the Data\n", - "We will use the content of the news articles for our RAG system.\n", - "\n", - "The dataset contains a few duplicate records. We are removing them to avoid duplicate results in the retrieval stage of our RAG system." - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "We have 1749 unique articles in our database.\n" - ] - } - ], - "source": [ - "news_articles = news_dataset[\"content\"]\n", - "unique_articles = set()\n", - "for article in news_articles:\n", - " if article:\n", - " unique_articles.add(article)\n", - "unique_news_articles = list(unique_articles)\n", - "print(f\"We have {len(unique_news_articles)} unique articles in our database.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Saving Data to the Vector Store\n", - "To efficiently handle the large number of articles, we process them in batches. This batch processing approach helps manage memory usage and provides better control over the ingestion process.\n", - "\n", - "We first filter out any articles that exceed 50,000 characters to avoid potential issues with token limits. Then, using the vector store's add_texts method, we add the filtered articles to our vector database. The batch_size parameter controls how many articles are processed in each iteration.\n", - "\n", - "This approach offers several benefits:\n", - "1. Memory Efficiency: Processing in smaller batches prevents memory overload\n", - "2. Progress Tracking: Easier to monitor and track the ingestion progress\n", - "3. 
Resource Management: Better control over CPU and network resource utilization\n", - "\n", - "We use a conservative batch size of 100 to ensure reliable operation.\n", - "The optimal batch size depends on many factors including:\n", - "- Document sizes being inserted\n", - "- Available system resources\n", - "- Network conditions\n", - "- Concurrent workload\n", - "\n", - "Consider measuring performance with your specific workload before adjusting.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-09 12:18:40,320 - INFO - Document ingestion completed successfully.\n" - ] - } - ], - "source": [ - "batch_size = 100\n", - "\n", - "# Automatic Batch Processing\n", - "articles = [article for article in unique_news_articles if article and len(article) <= 50000]\n", - "\n", - "try:\n", - " vector_store.add_texts(\n", - " texts=articles,\n", - " batch_size=batch_size\n", - " )\n", - " logging.info(\"Document ingestion completed successfully.\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Failed to save documents to vector store: {str(e)}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "8Pn8-dQw9RfQ" - }, - "source": [ - "# Setting Up a Couchbase Cache\n", - "To further optimize our system, we set up a Couchbase-based cache. A cache is a temporary storage layer that holds data that is frequently accessed, speeding up operations by reducing the need to repeatedly retrieve the same information from the database. In our setup, the cache will help us accelerate repetitive tasks, such as looking up similar documents. By implementing a cache, we enhance the overall performance of our search engine, ensuring that it can handle high query volumes and deliver results quickly.\n", - "\n", - "Caching is particularly valuable in scenarios where users may submit similar queries multiple times or where certain pieces of information are frequently requested. By storing these in a cache, we can significantly reduce the time it takes to respond to these queries, improving the user experience.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": { - "id": "V2y7dyjf9Rid" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-09 12:18:47,269 - INFO - Successfully created cache\n" - ] - } - ], - "source": [ - "try:\n", - " cache = CouchbaseCache(\n", - " cluster=cluster,\n", - " bucket_name=CB_BUCKET_NAME,\n", - " scope_name=SCOPE_NAME,\n", - " collection_name=CACHE_COLLECTION,\n", - " )\n", - " logging.info(\"Successfully created cache\")\n", - " set_llm_cache(cache)\n", - "except Exception as e:\n", - " raise ValueError(f\"Failed to create cache: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "uehAx36o9Rlm" - }, - "source": [ - "# Using the Claude 4 Sonnet Language Model (LLM)\n", - "Language models are AI systems that are trained to understand and generate human language. We'll be using the `Claude 4 Sonnet` language model to process user queries and generate meaningful responses. This model is a key component of our semantic search engine, allowing it to go beyond simple keyword matching and truly understand the intent behind a query. 
By creating this language model, we equip our search engine with the ability to interpret complex queries, understand the nuances of language, and provide more accurate and contextually relevant responses.\n", - "\n", - "The language model's ability to understand context and generate coherent responses is what makes our search engine truly intelligent. It can not only find the right information but also present it in a way that is useful and understandable to the user.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "yRAfBRLH9RpO" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-09 12:20:36,212 - INFO - Successfully created ChatAnthropic\n" - ] - } - ], - "source": [ - "try:\n", - " llm = ChatAnthropic(temperature=0.1, anthropic_api_key=ANTHROPIC_API_KEY, model_name='claude-sonnet-4-20250514') \n", - " logging.info(\"Successfully created ChatAnthropic\")\n", - "except Exception as e:\n", - " logging.error(f\"Error creating ChatAnthropic: {str(e)}. Please check your API key and network connection.\")\n", - " raise" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "k_XDfCx19UvG" - }, - "source": [ - "# Perform Semantic Search\n", - "Semantic search in Couchbase involves converting queries and documents into vector representations using an embeddings model. These vectors capture the semantic meaning of the text and are stored directly in Couchbase. When a query is made, Couchbase performs a similarity search by comparing the query vector against the stored document vectors. The similarity metric used for this comparison is configurable, allowing flexibility in how the relevance of documents is determined. Common metrics include cosine similarity, Euclidean distance, or dot product, but other metrics can be implemented based on specific use cases. Different embedding models like BERT, Word2Vec, or GloVe can also be used depending on the application's needs, with the vectors generated by these models stored and searched within Couchbase itself.\n", - "\n", - "In the provided code, the search process begins by recording the start time, followed by executing the `similarity_search_with_score` method of the `CouchbaseQueryVectorStore`. This method searches Couchbase for the most relevant documents based on the vector similarity to the query. The search results include the document content and the distance that reflects how closely each document aligns with the query in the defined semantic space. The time taken to perform this search is then calculated and logged, and the results are displayed, showing the most relevant documents along with their similarity scores. This approach leverages Couchbase as both a storage and retrieval engine for vector data, enabling efficient and scalable semantic searches. The integration of vector storage and search capabilities within Couchbase allows for sophisticated semantic search operations without relying on external services for vector storage or comparison." 
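- , - "\n", - "A note on reading the scores: because our vector store was configured with a distance metric (cosine), lower scores indicate closer matches. As a rough sketch, cosine distance is commonly computed as 1 - cosine_similarity, so the best-matching article in the output below is the one with the smallest score."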
- ] - }, - { - "cell_type": "code", - "execution_count": 18, - "metadata": { - "id": "Pk-oFbnC9Uym" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-09 12:21:34,292 - INFO - Semantic search completed in 1.91 seconds\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Semantic Search Results (completed in 1.91 seconds):\n", - "--------------------------------------------------------------------------------\n", - "Score: 0.2502, Text: A map shown during the draw for the 2026 Fifa World Cup has been criticised by Ukraine as an \"unacceptable error\" after it appeared to exclude Crimea as part of the country. The graphic - showing countries that cannot be drawn to play each other for geopolitical reasons - highlighted Ukraine but did not include the peninsula that is internationally recognised to be part of it. Crimea has been under Russian occupation since 2014 and just a handful of countries recognise the peninsula as Russian territory. Ukraine Foreign Ministry spokesman Heorhiy Tykhy said that the nation expects \"a public apology\". Fifa said it was \"aware of an issue\" and the image had been removed.\n", - "\n", - "Writing on X, Tykhy said that Fifa had not only \"acted against international law\" but had also \"supported Russian propaganda, war crimes, and the crime of aggression against Ukraine\". He added a \"fixed\" version of the map to his post, highlighting Crimea as part of Ukraine's territory. Among the countries that cannot play each other are Ukraine and Belarus, Spain and Gibraltar and Kosovo versus either Bosnia and Herzegovina or Serbia.\n", - "\n", - "This Twitter post cannot be displayed in your browser. Please enable Javascript or try a different browser. View original content on Twitter The BBC is not responsible for the content of external sites. Skip twitter post by Heorhii Tykhyi This article contains content provided by Twitter. We ask for your permission before anything is loaded, as they may be using cookies and other technologies. You may want to read Twitter’s cookie policy, external and privacy policy, external before accepting. To view this content choose ‘accept and continue’. The BBC is not responsible for the content of external sites.\n", - "\n", - "The Ukrainian Football Association has also sent a letter to Fifa secretary-general Mathias Grafström and UEFA secretary-general Theodore Theodoridis over the matter. \"We appeal to you to express our deep concern about the infographic map [shown] on December 13, 2024,\" the letter reads. \"Taking into account a number of official decisions and resolutions adopted by the Fifa Council and the UEFA executive committee since 2014... we emphasize that today's version of the cartographic image of Ukraine... is completely unacceptable and looks like an inconsistent position of Fifa and UEFA.\" The 2026 World Cup will start on 11 June that year in Mexico City and end on 19 July in New Jersey. The expanded 48-team tournament will last a record 39 days. Ukraine were placed in Group D alongside Iceland, Azerbaijan and the yet-to-be-determined winners of France's Nations League quarter-final against Croatia.\n", - "--------------------------------------------------------------------------------\n", - "Score: 0.5698, Text: Defending champions Manchester City will face Juventus in the group stage of the Fifa Club World Cup next summer, while Chelsea meet Brazilian side Flamengo. 
Pep Guardiola's City, who beat Brazilian side Fluminense to win the tournament for the first time in 2023, begin their title defence against Morocco's Wydad and also play Al Ain of the United Arab Emirates in Group G. Chelsea, winners of the 2021 final, were also drawn alongside Mexico's Club Leon and Tunisian side Esperance Sportive de Tunisie in Group D. The revamped Fifa Club World Cup, which has been expanded to 32 teams, will take place in the United States between 15 June and 13 July next year.\n", - "\n", - "A complex and lengthy draw ceremony was held across two separate Miami locations and lasted more than 90 minutes, during which a new Club World Cup trophy was revealed. There was also a video message from incoming US president Donald Trump, whose daughter Ivanka drew the first team. Lionel Messi's Inter Miami will take on Egyptian side Al Ahly at the Hard Rock Stadium in the opening match, staged in Miami. Elsewhere, Paris St-Germain were drawn against Atletico Madrid in Group B, while Bayern Munich meet Benfica in another all-European group-stage match-up. Teams will play each other once in the group phase and the top two will progress to the knockout stage.\n", - "\n", - "This video can not be played To play this video you need to enable JavaScript in your browser. What is the Club World Cup?\n", - "\n", - "Teams from each of the six international football confederations will be represented at next summer's tournament, including 12 European clubs - the highest quota of any confederation. The European places were decided by clubs' Champions League performances over the past four seasons, with recent winners Chelsea, Manchester City and Real Madrid guaranteed places. Al Ain, the most successful club in the UAE with 14 league titles, are owned by the country's president Sheikh Mohamed bin Zayed Al Nahyan - the older brother of City owner Sheikh Mansour. Real, who lifted the Fifa Club World Cup trophy for a record-extending fifth time in 2022, will open up against Saudi Pro League champions Al-Hilal, who currently have Neymar in their ranks. One place was reserved for a club from the host nation, which Fifa controversially awarded to Inter Miami, who will contest the tournament curtain-raiser. Messi's side were winners of the regular-season MLS Supporters' Shield but beaten in the MLS play-offs, meaning they are not this season's champions.\n", - "• None How does the new Club World Cup work & why is it so controversial?\n", - "\n", - "Matches will be played across 12 venues in the US which, alongside Canada and Mexico, also host the 2026 World Cup. Fifa is facing legal action from player unions and leagues about the scheduling of the event, which begins two weeks after the Champions League final at the end of the 2024-25 European calendar and ends five weeks before the first Premier League match of the 2025-2026 season. But football's world governing body believes the dates allow sufficient rest time before the start of the domestic campaigns. The Club World Cup will now take place once every four years, when it was previously held annually and involved just seven teams. 
Streaming platform DAZN has secured exclusive rights to broadcast next summer's tournament, during which 63 matches will take place over 29 days.\n", - "--------------------------------------------------------------------------------\n", - "Score: 0.5792, Text: After Fifa awards Saudi Arabia the hosting rights for the men's 2034 World Cup, BBC analysis editor Ros Atkins looks at how we got here and the controversies surrounding the decision.\n", - "--------------------------------------------------------------------------------\n", - "Score: 0.5877, Text: FA still to decide on endorsing Saudi World Cup bid\n", - "\n", - "... (output truncated for brevity)\n" - ] - } - ], - "source": [ - "query = \"What happened with the map shown during the 2026 FIFA World Cup draw regarding Ukraine and Crimea? What was the controversy?\"\n", - "\n", - "try:\n", - " # Perform the semantic search\n", - " start_time = time.time()\n", - " search_results = vector_store.similarity_search_with_score(query, k=10)\n", - " search_elapsed_time = time.time() - start_time\n", - "\n", - " logging.info(f\"Semantic search completed in {search_elapsed_time:.2f} seconds\")\n", - "\n", - " # Display search results\n", - " print(f\"\\nSemantic Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", - " print(\"-\" * 80) # Add separator line\n", - " for doc, score in search_results:\n", - " print(f\"Score: {score:.4f}, Text: {doc.page_content}\")\n", - " print(\"-\" * 80) # Add separator between results\n", - "\n", - "except CouchbaseException as e:\n", - " raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n", - "except Exception as e:\n", - " raise RuntimeError(f\"Unexpected error: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Optimizing Vector Search with Global Secondary Index (GSI)\n", - "\n", - "While the above semantic search using similarity_search_with_score works effectively, we can significantly improve query performance by leveraging Global Secondary Index (GSI) in Couchbase.\n", - "\n", - "Couchbase offers three types of vector indexes, but for GSI-based vector search we focus on two main types:\n", - "\n", - "Hyperscale Vector Indexes (BHIVE)\n", - "- Best for pure vector searches - content discovery, recommendations, semantic search\n", - "- High performance with low memory footprint - designed to scale to billions of vectors\n", - "- Optimized for concurrent operations - supports simultaneous searches and inserts\n", - "- Use when: You primarily perform vector-only queries without complex scalar filtering\n", - "- Ideal for: Large-scale semantic search, recommendation systems, content discovery\n", - "\n", - "Composite Vector Indexes \n", - "- Best for filtered vector searches - combines vector search with scalar value filtering\n", - "- Efficient pre-filtering - scalar attributes reduce the vector comparison scope\n", - "- Use when: Your queries combine vector similarity with scalar filters that eliminate large portions of data\n", - "- Ideal for: Compliance-based filtering, user-specific searches, time-bounded queries\n", - "\n", - "Choosing the Right Index Type\n", - "- Start with Hyperscale Vector Index for pure vector searches and large datasets\n", - "- Use Composite Vector Index when scalar filters significantly reduce your search space\n", - "- Consider your dataset size: Hyperscale scales to billions, Composite works well for tens of millions to billions\n", - "\n", - "For more details, see the [Couchbase Vector Index 
documentation](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html).\n", - "\n", - "\n", - "## Understanding Index Configuration (Couchbase 8.0 Feature)\n", - "\n", - "The index_description parameter controls how Couchbase optimizes vector storage and search performance through centroids and quantization:\n", - "\n", - "Format: `IVF[<centroids>],{SQ<bits>|PQ<subquantizers>x<bits>}`\n", - "\n", - "Centroids (IVF - Inverted File):\n", - "- Controls how the dataset is subdivided for faster searches\n", - "- More centroids = faster search, slower training \n", - "- Fewer centroids = slower search, faster training\n", - "- If omitted (like IVF,SQ8), Couchbase auto-selects based on dataset size\n", - "\n", - "Quantization Options:\n", - "- SQ (Scalar Quantization): SQ4, SQ6, SQ8 (4, 6, or 8 bits per dimension)\n", - "- PQ (Product Quantization): PQ<subquantizers>x<bits> (e.g., PQ32x8)\n", - "- Higher values = better accuracy, larger index size\n", - "\n", - "Common Examples:\n", - "- IVF,SQ8 - Auto centroids, 8-bit scalar quantization (good default)\n", - "- IVF1000,SQ6 - 1000 centroids, 6-bit scalar quantization \n", - "- IVF,PQ32x8 - Auto centroids, 32 subquantizers with 8 bits\n", - "\n", - "For detailed configuration options, see the [Quantization & Centroid Settings](https://docs.couchbase.com/cloud/vector-index/hyperscale-vector-index.html#algo_settings).\n", - "\n", - "In the code below, we demonstrate creating a BHIVE index. This method takes an index type (BHIVE or COMPOSITE) and a description parameter for optimization settings. Alternatively, GSI indexes can be created manually from the Couchbase UI." - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "metadata": {}, - "outputs": [], - "source": [ - "vector_store.create_index(index_type=IndexType.BHIVE, index_name=\"claude_bhive_index\", index_description=\"IVF,SQ8\")" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-09 12:26:01,504 - INFO - Semantic search completed in 0.44 seconds\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Semantic Search Results (completed in 0.44 seconds):\n", - "--------------------------------------------------------------------------------\n", - "Score: 0.2502, Text: A map shown during the draw for the 2026 Fifa World Cup has been criticised by Ukraine as an \"unacceptable error\" after it appeared to exclude Crimea as part of the country. The graphic - showing countries that cannot be drawn to play each other for geopolitical reasons - highlighted Ukraine but did not include the peninsula that is internationally recognised to be part of it. Crimea has been under Russian occupation since 2014 and just a handful of countries recognise the peninsula as Russian territory. Ukraine Foreign Ministry spokesman Heorhiy Tykhy said that the nation expects \"a public apology\". Fifa said it was \"aware of an issue\" and the image had been removed.\n", - "\n", - "Writing on X, Tykhy said that Fifa had not only \"acted against international law\" but had also \"supported Russian propaganda, war crimes, and the crime of aggression against Ukraine\". He added a \"fixed\" version of the map to his post, highlighting Crimea as part of Ukraine's territory. Among the countries that cannot play each other are Ukraine and Belarus, Spain and Gibraltar and Kosovo versus either Bosnia and Herzegovina or Serbia.\n", - "\n", - "This Twitter post cannot be displayed in your browser. 
Please enable Javascript or try a different browser. View original content on Twitter The BBC is not responsible for the content of external sites. Skip twitter post by Heorhii Tykhyi This article contains content provided by Twitter. We ask for your permission before anything is loaded, as they may be using cookies and other technologies. You may want to read Twitter’s cookie policy, external and privacy policy, external before accepting. To view this content choose ‘accept and continue’. The BBC is not responsible for the content of external sites.\n", - "\n", - "The Ukrainian Football Association has also sent a letter to Fifa secretary-general Mathias Grafström and UEFA secretary-general Theodore Theodoridis over the matter. \"We appeal to you to express our deep concern about the infographic map [shown] on December 13, 2024,\" the letter reads. \"Taking into account a number of official decisions and resolutions adopted by the Fifa Council and the UEFA executive committee since 2014... we emphasize that today's version of the cartographic image of Ukraine... is completely unacceptable and looks like an inconsistent position of Fifa and UEFA.\" The 2026 World Cup will start on 11 June that year in Mexico City and end on 19 July in New Jersey. The expanded 48-team tournament will last a record 39 days. Ukraine were placed in Group D alongside Iceland, Azerbaijan and the yet-to-be-determined winners of France's Nations League quarter-final against Croatia.\n", - "--------------------------------------------------------------------------------\n", - "Score: 0.5698, Text: Defending champions Manchester City will face Juventus in the group stage of the Fifa Club World Cup next summer, while Chelsea meet Brazilian side Flamengo. Pep Guardiola's City, who beat Brazilian side Fluminense to win the tournament for the first time in 2023, begin their title defence against Morocco's Wydad and also play Al Ain of the United Arab Emirates in Group G. Chelsea, winners of the 2021 final, were also drawn alongside Mexico's Club Leon and Tunisian side Esperance Sportive de Tunisie in Group D. The revamped Fifa Club World Cup, which has been expanded to 32 teams, will take place in the United States between 15 June and 13 July next year.\n", - "\n", - "A complex and lengthy draw ceremony was held across two separate Miami locations and lasted more than 90 minutes, during which a new Club World Cup trophy was revealed. There was also a video message from incoming US president Donald Trump, whose daughter Ivanka drew the first team. Lionel Messi's Inter Miami will take on Egyptian side Al Ahly at the Hard Rock Stadium in the opening match, staged in Miami. Elsewhere, Paris St-Germain were drawn against Atletico Madrid in Group B, while Bayern Munich meet Benfica in another all-European group-stage match-up. Teams will play each other once in the group phase and the top two will progress to the knockout stage.\n", - "\n", - "This video can not be played To play this video you need to enable JavaScript in your browser. What is the Club World Cup?\n", - "\n", - "Teams from each of the six international football confederations will be represented at next summer's tournament, including 12 European clubs - the highest quota of any confederation. The European places were decided by clubs' Champions League performances over the past four seasons, with recent winners Chelsea, Manchester City and Real Madrid guaranteed places. 
Al Ain, the most successful club in the UAE with 14 league titles, are owned by the country's president Sheikh Mohamed bin Zayed Al Nahyan - the older brother of City owner Sheikh Mansour. Real, who lifted the Fifa Club World Cup trophy for a record-extending fifth time in 2022, will open up against Saudi Pro League champions Al-Hilal, who currently have Neymar in their ranks. One place was reserved for a club from the host nation, which Fifa controversially awarded to Inter Miami, who will contest the tournament curtain-raiser. Messi's side were winners of the regular-season MLS Supporters' Shield but beaten in the MLS play-offs, meaning they are not this season's champions.\n", - "• None How does the new Club World Cup work & why is it so controversial?\n", - "\n", - "Matches will be played across 12 venues in the US which, alongside Canada and Mexico, also host the 2026 World Cup. Fifa is facing legal action from player unions and leagues about the scheduling of the event, which begins two weeks after the Champions League final at the end of the 2024-25 European calendar and ends five weeks before the first Premier League match of the 2025-2026 season. But football's world governing body believes the dates allow sufficient rest time before the start of the domestic campaigns. The Club World Cup will now take place once every four years, when it was previously held annually and involved just seven teams. Streaming platform DAZN has secured exclusive rights to broadcast next summer's tournament, during which 63 matches will take place over 29 days.\n", - "--------------------------------------------------------------------------------\n", - "Score: 0.5792, Text: After Fifa awards Saudi Arabia the hosting rights for the men's 2034 World Cup, BBC analysis editor Ros Atkins looks at how we got here and the controversies surrounding the decision.\n", - "--------------------------------------------------------------------------------\n", - "Score: 0.5877, Text: FA still to decide on endorsing Saudi World Cup bid\n", - "\n", - "... (output truncated for brevity)\n" - ] - } - ], - "source": [ - "query = \"What happened with the map shown during the 2026 FIFA World Cup draw regarding Ukraine and Crimea? What was the controversy?\"\n", - "\n", - "try:\n", - " # Perform the semantic search\n", - " start_time = time.time()\n", - " search_results = vector_store.similarity_search_with_score(query, k=10)\n", - " search_elapsed_time = time.time() - start_time\n", - "\n", - " logging.info(f\"Semantic search completed in {search_elapsed_time:.2f} seconds\")\n", - "\n", - " # Display search results\n", - " print(f\"\\nSemantic Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", - " print(\"-\" * 80) # Add separator line\n", - " for doc, score in search_results:\n", - " print(f\"Score: {score:.4f}, Text: {doc.page_content}\")\n", - " print(\"-\" * 80) # Add separator between results\n", - "\n", - "except CouchbaseException as e:\n", - " raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n", - "except Exception as e:\n", - " raise RuntimeError(f\"Unexpected error: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Note: To create a COMPOSITE index, the below code can be used.\n", - "Choose based on your specific use case and query patterns. For this tutorial's news search scenario, either index type would work, but BHIVE might be more efficient for pure semantic search across news articles." 
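- , - "\n", - "If you re-run this notebook, note that index names must be unique, so the create_index calls can fail if the indexes already exist. The snippet below is a minimal cleanup sketch, not part of the original flow: it assumes the index names used in this tutorial and that your cluster accepts the SQL++ DROP INDEX ... ON syntax for collections:\n", - "\n", - "```python\n", - "# Sketch: drop this tutorial's vector indexes so they can be recreated\n", - "for idx in [\"claude_bhive_index\", \"claude_composite_index\"]:\n", - "    try:\n", - "        cluster.query(\n", - "            f\"DROP INDEX `{idx}` ON `{CB_BUCKET_NAME}`.`{SCOPE_NAME}`.`{COLLECTION_NAME}`\"\n", - "        ).execute()\n", - "    except Exception:\n", - "        pass  # the index may not exist yet\n", - "```"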
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vector_store.create_index(index_type=IndexType.COMPOSITE, index_name=\"claude_composite_index\", index_description=\"IVF,SQ8\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "sS0FebHI9U1l" - }, - "source": [ - "# Retrieval-Augmented Generation (RAG) with Couchbase and LangChain\n", - "Couchbase and LangChain can be seamlessly integrated to create RAG (Retrieval-Augmented Generation) chains, enhancing the process of generating contextually relevant responses. In this setup, Couchbase serves as the vector store, where embeddings of documents are stored. When a query is made, LangChain retrieves the most relevant documents from Couchbase by comparing the query’s embedding with the stored document embeddings. These documents, which provide contextual information, are then passed to a generative language model within LangChain.\n", - "\n", - "The language model, equipped with the context from the retrieved documents, generates a response that is both informed and contextually accurate. This integration allows the RAG chain to leverage Couchbase’s efficient storage and retrieval capabilities, while LangChain handles the generation of responses based on the context provided by the retrieved documents. Together, they create a powerful system that can deliver highly relevant and accurate answers by combining the strengths of both retrieval and generation." - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "metadata": { - "id": "ZGUXQQmv9ge4" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-09 12:26:10,540 - INFO - Successfully created RAG chain\n" - ] - } - ], - "source": [ - "system_template = \"You are a helpful assistant that answers questions based on the provided context.\"\n", - "system_message_prompt = SystemMessagePromptTemplate.from_template(system_template)\n", - "\n", - "human_template = \"Context: {context}\\n\\nQuestion: {question}\"\n", - "human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)\n", - "\n", - "chat_prompt = ChatPromptTemplate.from_messages([\n", - " system_message_prompt,\n", - " human_message_prompt\n", - "])\n", - "\n", - "def format_docs(docs):\n", - " return \"\\n\\n\".join(doc.page_content for doc in docs)\n", - "\n", - "rag_chain = (\n", - " {\"context\": lambda x: format_docs(vector_store.similarity_search(x)), \"question\": RunnablePassthrough()}\n", - " | chat_prompt\n", - " | llm\n", - ")\n", - "logging.info(\"Successfully created RAG chain\")" - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "metadata": { - "id": "Mia7XxM9978M" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "RAG Response: During the draw for the 2026 FIFA World Cup, a map was shown that excluded Crimea as part of Ukraine. This graphic, which was displaying countries that cannot be drawn to play each other for geopolitical reasons, highlighted Ukraine but did not include the Crimean peninsula, which is internationally recognized as Ukrainian territory.\n", - "\n", - "This omission sparked significant controversy because Crimea has been under Russian occupation since 2014, but only a handful of countries recognize it as Russian territory. The Ukrainian Foreign Ministry spokesman, Heorhiy Tykhy, called this an \"unacceptable error\" and stated that Ukraine expected \"a public apology\" from FIFA. 
He criticized FIFA for acting \"against international law\" and supporting \"Russian propaganda, war crimes, and the crime of aggression against Ukraine.\"\n", - "\n", - "The Ukrainian Football Association also sent a formal letter of complaint to FIFA and UEFA officials expressing their \"deep concern\" about the cartographic representation. FIFA acknowledged they were \"aware of an issue\" and subsequently removed the image.\n", - "RAG response generated in 8.68 seconds\n" - ] - } - ], - "source": [ - "try:\n", - " start_time = time.time()\n", - " rag_response = rag_chain.invoke(query)\n", - " rag_elapsed_time = time.time() - start_time\n", - "\n", - " print(f\"RAG Response: {rag_response.content}\")\n", - " print(f\"RAG response generated in {rag_elapsed_time:.2f} seconds\")\n", - "except AuthenticationError as e:\n", - " print(f\"Authentication error: {str(e)}\")\n", - "except InternalServerFailureException as e:\n", - " if \"query request rejected\" in str(e):\n", - " print(\"Error: Search request was rejected due to rate limiting. Please try again later.\")\n", - " else:\n", - " print(f\"Internal server error occurred: {str(e)}\")\n", - "except Exception as e:\n", - " print(f\"Unexpected error occurred: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "aIdayPzw9glT" - }, - "source": [ - "# Using Couchbase as a caching mechanism\n", - "Couchbase can be effectively used as a caching mechanism for RAG (Retrieval-Augmented Generation) responses by storing and retrieving precomputed results for specific queries. This approach enhances the system's efficiency and speed, particularly when dealing with repeated or similar queries. When a query is first processed, the RAG chain retrieves relevant documents, generates a response using the language model, and then stores this response in Couchbase, with the query serving as the key.\n", - "\n", - "For subsequent requests with the same query, the system checks Couchbase first. If a cached response is found, it is retrieved directly from Couchbase, bypassing the need to re-run the entire RAG process. This significantly reduces response time because the computationally expensive steps of document retrieval and response generation are skipped. Couchbase's role in this setup is to provide a fast and scalable storage solution for caching these responses, ensuring that frequently asked queries can be answered more quickly and efficiently.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 24, - "metadata": { - "id": "0xM2G3ef-GS2" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Query 1: What happened when Apple's AI feature generated a false BBC headline about a murder case in New York?\n", - "Response: According to the context, Apple Intelligence (an AI feature that summarizes notifications) generated a false headline that made it appear as if BBC News had published an article claiming Luigi Mangione, who was arrested for the murder of healthcare insurance CEO Brian Thompson in New York, had shot himself. 
This was completely false - Mangione had not shot himself.\n", - "\n", - "The BBC complained to Apple about this misrepresentation, with a BBC spokesperson stating they had \"contacted Apple to raise this concern and fix the problem.\" The spokesperson emphasized that it's \"essential\" that audiences can trust information published under the BBC name, including notifications.\n", - "\n", - "This wasn't an isolated incident, as the context mentions that Apple's AI feature also misrepresented a New York Times article, incorrectly summarizing it as \"Netanyahu arrested\" when the actual article was about the International Criminal Court issuing an arrest warrant for the Israeli prime minister.\n", - "Time taken: 6.22 seconds\n", - "\n", - "Query 2: What happened with the map shown during the 2026 FIFA World Cup draw regarding Ukraine and Crimea? What was the controversy?\n", - "Response: During the draw for the 2026 FIFA World Cup, a map was shown that excluded Crimea as part of Ukraine. This graphic, which was displaying countries that cannot be drawn to play each other for geopolitical reasons, highlighted Ukraine but did not include the Crimean peninsula, which is internationally recognized as Ukrainian territory.\n", - "\n", - "This omission sparked significant controversy because Crimea has been under Russian occupation since 2014, but only a handful of countries recognize it as Russian territory. The Ukrainian Foreign Ministry spokesman, Heorhiy Tykhy, called this an \"unacceptable error\" and stated that Ukraine expected \"a public apology\" from FIFA. He criticized FIFA for acting \"against international law\" and supporting \"Russian propaganda, war crimes, and the crime of aggression against Ukraine.\"\n", - "\n", - "The Ukrainian Football Association also sent a formal letter of complaint to FIFA and UEFA officials expressing their \"deep concern\" about the cartographic representation. FIFA acknowledged they were \"aware of an issue\" and subsequently removed the image.\n", - "Time taken: 0.47 seconds\n", - "\n", - "Query 3: What happened when Apple's AI feature generated a false BBC headline about a murder case in New York?\n", - "Response: According to the context, Apple Intelligence (an AI feature that summarizes notifications) generated a false headline that made it appear as if BBC News had published an article claiming Luigi Mangione, who was arrested for the murder of healthcare insurance CEO Brian Thompson in New York, had shot himself. This was completely false - Mangione had not shot himself.\n", - "\n", - "The BBC complained to Apple about this misrepresentation, with a BBC spokesperson stating they had \"contacted Apple to raise this concern and fix the problem.\" The spokesperson emphasized that it's \"essential\" that audiences can trust information published under the BBC name, including notifications.\n", - "\n", - "This wasn't an isolated incident, as the context mentions that Apple's AI feature also misrepresented a New York Times article, incorrectly summarizing it as \"Netanyahu arrested\" when the actual article was about the International Criminal Court issuing an arrest warrant for the Israeli prime minister.\n", - "Time taken: 0.46 seconds\n" - ] - } - ], - "source": [ - "try:\n", - " queries = [\n", - " \"What happened when Apple's AI feature generated a false BBC headline about a murder case in New York?\",\n", - " \"What happened with the map shown during the 2026 FIFA World Cup draw regarding Ukraine and Crimea? 
What was the controversy? \", # Repeated query\n", - " \"What happened when Apple's AI feature generated a false BBC headline about a murder case in New York?\", # Repeated query\n", - " ]\n", - "\n", - " for i, query in enumerate(queries, 1):\n", - " print(f\"\\nQuery {i}: {query}\")\n", - " start_time = time.time()\n", - "\n", - " response = rag_chain.invoke(query)\n", - " elapsed_time = time.time() - start_time\n", - " print(f\"Response: {response.content}\")\n", - " print(f\"Time taken: {elapsed_time:.2f} seconds\")\n", - "except AuthenticationError as e:\n", - " print(f\"Authentication error: {str(e)}\")\n", - "except InternalServerFailureException as e:\n", - " if \"query request rejected\" in str(e):\n", - " print(\"Error: Search request was rejected due to rate limiting. Please try again later.\")\n", - " else:\n", - " print(f\"Internal server error occurred: {str(e)}\")\n", - "except Exception as e:\n", - " print(f\"Unexpected error occurred: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "yJQ5P8E29go1" - }, - "source": [ - "## Conclusion\n", - "By following these steps, you’ll have a fully functional semantic search engine that leverages the strengths of Couchbase and Claude (by Anthropic). This guide is designed not just to show you how to build the system, but also to explain why each step is necessary, giving you a deeper understanding of the principles behind semantic search and of how GSI-based vector indexes make querying your data more efficient, which can significantly improve your RAG performance. Whether you're a newcomer to software development or an experienced developer looking to expand your skills, this guide will provide you with the knowledge and tools you need to create a powerful, AI-driven search engine." - ] - } - ], - "metadata": { - "colab": { - "provenance": [], - "toc_visible": true - }, - "kernelspec": { - "display_name": ".venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.13.3" - } - }, - "nbformat": 4, - "nbformat_minor": 0 -} diff --git a/claudeai/gsi/.env.sample b/claudeai/query_based/.env.sample similarity index 100% rename from claudeai/gsi/.env.sample rename to claudeai/query_based/.env.sample diff --git a/claudeai/query_based/RAG_with_Couchbase_and_Claude(by_Anthropic).ipynb b/claudeai/query_based/RAG_with_Couchbase_and_Claude(by_Anthropic).ipynb new file mode 100644 index 00000000..cc7b0e6f --- /dev/null +++ b/claudeai/query_based/RAG_with_Couchbase_and_Claude(by_Anthropic).ipynb @@ -0,0 +1,1089 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "kNdImxzypDlm" + }, + "source": [ + "# Introduction\n", + "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database, [OpenAI](https://openai.com/) as the AI-powered embedding provider, and [Anthropic](https://claude.ai/) as the language model provider. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. 
This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system using Couchbase Hyperscale and Composite Vector Indexes from scratch. For guidance on choosing the right vector index for your use case, see the [Couchbase documentation](https://docs.couchbase.com/server/current/vector-search/choose-the-right-vector-index.html). Alternatively, if you want to perform semantic search using a Couchbase Search Vector Index, please take a look at [this](https://developer.couchbase.com/tutorial-openai-claude-couchbase-rag-with-search-vector-index/)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# How to run this tutorial\n", + "\n", + "This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/claudeai/query_based/RAG_with_Couchbase_and_Claude(by_Anthropic).ipynb).\n", + "\n", + "You can either download the notebook file and run it on [Google Colab](https://colab.research.google.com/) or run it on your system by setting up the Python environment." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Before you start\n", + "\n", + "## Get Credentials for OpenAI and Anthropic\n", + "\n", + "* Please follow the [instructions](https://platform.openai.com/docs/quickstart) to generate the OpenAI credentials.\n", + "* Please follow the [instructions](https://docs.anthropic.com/en/api/getting-started) to generate the Anthropic credentials.\n", + "\n", + "## Create and Deploy Your Free Tier Operational cluster on Capella\n", + "\n", + "To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. This account provides you with an environment where you can explore and learn about Capella with no time constraint.\n", + "\n", + "To learn more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n", + "\n", + "Note: To run this tutorial, you will need Capella with Couchbase Server version 8.0 or above, as query-based vector search (used by Hyperscale and Composite Vector Indexes) is supported only from version 8.0.\n", + "\n", + "### Couchbase Capella Configuration\n", + "\n", + "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met.\n", + "\n", + "* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the required bucket (Read and Write) used in the application.\n", + "\n", + "* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NH2o6pqa69oG" + }, + "source": [ + "# Setting the Stage: Installing Necessary Libraries\n", + "To build our semantic search engine, we need a robust set of tools. The libraries we install handle everything from connecting to databases to performing complex machine learning tasks. Each library has a specific role: Couchbase libraries manage database operations, LangChain handles AI model integrations, and OpenAI provides advanced AI models for generating embeddings, with Claude (by Anthropic) used for understanding natural language. 
By setting up these libraries, we ensure our environment is equipped to handle the data-intensive and computationally complex tasks required for semantic search." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "DYhPj0Ta8l_A" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], + "source": [ + "%pip install --quiet datasets==3.5.0 langchain-couchbase==0.5.0 langchain-anthropic==0.3.19 langchain-openai==0.3.32 python-dotenv==1.1.1" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1pp7GtNg8mB9" + }, + "source": [ + "# Importing Necessary Libraries\n", + "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, embedding generation, and dataset loading. These libraries provide essential functions for working with data, managing database connections, and processing machine learning models." + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": { + "id": "8GzS6tfL8mFP" + }, + "outputs": [], + "source": [ + "import getpass\n", + "import json\n", + "import logging\n", + "import os\n", + "import time\n", + "from datetime import timedelta\n", + "from anthropic import AuthenticationError\n", + "\n", + "from couchbase.auth import PasswordAuthenticator\n", + "from couchbase.cluster import Cluster\n", + "from couchbase.exceptions import (CouchbaseException,\n", + " InternalServerFailureException,\n", + " QueryIndexAlreadyExistsException,\n", + " ServiceUnavailableException)\n", + "from couchbase.management.buckets import CreateBucketSettings\n", + "from couchbase.management.search import SearchIndex\n", + "from couchbase.options import ClusterOptions\n", + "from datasets import load_dataset\n", + "from dotenv import load_dotenv\n", + "from langchain_anthropic import ChatAnthropic\n", + "from langchain_core.globals import set_llm_cache\n", + "from langchain_core.prompts.chat import (ChatPromptTemplate,\n", + " HumanMessagePromptTemplate,\n", + " SystemMessagePromptTemplate)\n", + "from langchain_core.runnables import RunnablePassthrough\n", + "from langchain_couchbase.cache import CouchbaseCache\n", + "from langchain_couchbase.vectorstores import CouchbaseQueryVectorStore\n", + "from langchain_couchbase.vectorstores import DistanceStrategy\n", + "from langchain_openai import OpenAIEmbeddings\n", + "from langchain_couchbase.vectorstores import IndexType" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pBnMp5vb8mIb" + }, + "source": [ + "# Setup Logging\n", + "Logging is configured to track the progress of the script and capture any errors or warnings. This is crucial for debugging and understanding the flow of execution.
The logging output includes timestamps, log levels (e.g., INFO, ERROR), and messages that describe what is happening in the script.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "id": "Yv8kWcuf8mLx" + }, + "outputs": [], + "source": [ + "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', force=True)\n", + "\n", + "# Disable all logging except critical to prevent OpenAI API request logs\n", + "logging.getLogger(\"httpx\").setLevel(logging.CRITICAL)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K9G5a0en8mPA" + }, + "source": [ + "# Loading Sensitive Information\n", + "In this section, we prompt the user to input the essential configuration settings that are needed. These settings include sensitive information like API keys, database credentials, and specific configuration names. Instead of hardcoding these details into the script, we request the user to provide them at runtime, ensuring flexibility and security.\n", + "\n", + "The script also validates that all required inputs are provided, raising an error if any crucial information is missing. This approach ensures that your integration is both secure and correctly configured without hardcoding sensitive information, enhancing the overall security and maintainability of your code." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "id": "PFGyHll18mSe" + }, + "outputs": [], + "source": [ + "load_dotenv()\n", + "\n", + "# Load from environment variables or prompt for input in one-liners\n", + "ANTHROPIC_API_KEY = os.getenv('ANTHROPIC_API_KEY') or getpass.getpass('Enter your Anthropic API key: ')\n", + "OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or getpass.getpass('Enter your OpenAI API key: ')\n", + "CB_HOST = os.getenv('CB_HOST', 'couchbase://localhost') or input('Enter your Couchbase host (default: couchbase://localhost): ') or 'couchbase://localhost'\n", + "CB_USERNAME = os.getenv('CB_USERNAME', 'Administrator') or input('Enter your Couchbase username (default: Administrator): ') or 'Administrator'\n", + "CB_PASSWORD = os.getenv('CB_PASSWORD', 'password') or getpass.getpass('Enter your Couchbase password (default: password): ') or 'password'\n", + "CB_BUCKET_NAME = os.getenv('CB_BUCKET_NAME', 'query-vector-search-testing') or input('Enter your Couchbase bucket name (default: query-vector-search-testing): ') or 'query-vector-search-testing'\n", + "SCOPE_NAME = os.getenv('SCOPE_NAME', 'shared') or input('Enter your scope name (default: shared): ') or 'shared'\n", + "COLLECTION_NAME = os.getenv('COLLECTION_NAME', 'claude') or input('Enter your collection name (default: claude): ') or 'claude'\n", + "CACHE_COLLECTION = os.getenv('CACHE_COLLECTION', 'cache') or input('Enter your cache collection name (default: cache): ') or 'cache'\n", + "# Check if the variables are correctly loaded\n", + "if not ANTHROPIC_API_KEY:\n", + " raise ValueError(\"ANTHROPIC_API_KEY is not set in the environment.\")\n", + "if not OPENAI_API_KEY:\n", + " raise ValueError(\"OPENAI_API_KEY is not set in the environment.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qtGrYzUY8mV3" + }, + "source": [ + "# Connecting to the Couchbase Cluster\n", + "Connecting to a Couchbase cluster is the foundation of our project. Couchbase will serve as our primary data store, handling all the storage and retrieval operations required for our semantic search engine.
By establishing this connection, we enable our application to interact with the database, allowing us to perform operations such as storing embeddings, querying data, and managing collections. This connection is the gateway through which all data will flow, so ensuring it's set up correctly is paramount.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "id": "Zb3kK-7W8mZK" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-09 12:15:22,899 - INFO - Successfully connected to Couchbase\n" + ] + } + ], + "source": [ + "try:\n", + " auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)\n", + " options = ClusterOptions(auth)\n", + " cluster = Cluster(CB_HOST, options)\n", + " cluster.wait_until_ready(timedelta(seconds=5))\n", + " logging.info(\"Successfully connected to Couchbase\")\n", + "except Exception as e:\n", + " raise ConnectionError(f\"Failed to connect to Couchbase: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C_Gpy32N8mcZ" + }, + "source": [ + "## Setting Up Collections in Couchbase\n", + "\n", + "The setup_collection() function handles creating and configuring the hierarchical data organization in Couchbase:\n", + "\n", + "1. Bucket Creation:\n", + " - Checks if specified bucket exists, creates it if not\n", + " - Sets bucket properties like RAM quota (1024MB) and replication (disabled)\n", + " - Note: You will not be able to create a bucket on Capella\n", + "\n", + "\n", + "2. Scope Management: \n", + " - Verifies if requested scope exists within bucket\n", + " - Creates new scope if needed (unless it's the default \"_default\" scope)\n", + "\n", + "3. Collection Setup:\n", + " - Checks for collection existence within scope\n", + " - Creates collection if it doesn't exist\n", + " - Waits 2 seconds for collection to be ready\n", + "\n", + "Additional Tasks:\n", + "- Clears any existing documents for clean state\n", + "- Implements comprehensive error handling and logging\n", + "\n", + "The function is called twice to set up:\n", + "1. Main collection for vector embeddings\n", + "2. Cache collection for storing results\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "id": "ACZcwUnG8mf2" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-09 12:15:26,795 - INFO - Bucket 'query-vector-search-testing' exists.\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-09 12:15:26,808 - INFO - Collection 'claude' does not exist. Creating it...\n", + "2025-09-09 12:15:26,854 - INFO - Collection 'claude' created successfully.\n", + "2025-09-09 12:15:29,065 - INFO - All documents cleared from the collection.\n", + "2025-09-09 12:15:29,066 - INFO - Bucket 'query-vector-search-testing' exists.\n", + "2025-09-09 12:15:29,074 - INFO - Collection 'cache' already exists. Skipping creation.\n", + "2025-09-09 12:15:31,115 - INFO - All documents cleared from the collection.\n" + ] + }, + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def setup_collection(cluster, bucket_name, scope_name, collection_name):\n", + " try:\n", + " # Check if bucket exists, create if it doesn't\n", + " try:\n", + " bucket = cluster.bucket(bucket_name)\n", + " logging.info(f\"Bucket '{bucket_name}' exists.\")\n", + " except Exception as e:\n", + " logging.info(f\"Bucket '{bucket_name}' does not exist. 
Creating it...\")\n", + " bucket_settings = CreateBucketSettings(\n", + " name=bucket_name,\n", + " bucket_type='couchbase',\n", + " ram_quota_mb=1024,\n", + " flush_enabled=True,\n", + " num_replicas=0\n", + " )\n", + " cluster.buckets().create_bucket(bucket_settings)\n", + " time.sleep(2) # Wait for bucket creation to complete and become available\n", + " bucket = cluster.bucket(bucket_name)\n", + " logging.info(f\"Bucket '{bucket_name}' created successfully.\")\n", + "\n", + " bucket_manager = bucket.collections()\n", + "\n", + " # Check if scope exists, create if it doesn't\n", + " scopes = bucket_manager.get_all_scopes()\n", + " scope_exists = any(scope.name == scope_name for scope in scopes)\n", + " \n", + " if not scope_exists and scope_name != \"_default\":\n", + " logging.info(f\"Scope '{scope_name}' does not exist. Creating it...\")\n", + " bucket_manager.create_scope(scope_name)\n", + " logging.info(f\"Scope '{scope_name}' created successfully.\")\n", + "\n", + " # Check if collection exists, create if it doesn't\n", + " collections = bucket_manager.get_all_scopes()\n", + " collection_exists = any(\n", + " scope.name == scope_name and collection_name in [col.name for col in scope.collections]\n", + " for scope in collections\n", + " )\n", + "\n", + " if not collection_exists:\n", + " logging.info(f\"Collection '{collection_name}' does not exist. Creating it...\")\n", + " bucket_manager.create_collection(scope_name, collection_name)\n", + " logging.info(f\"Collection '{collection_name}' created successfully.\")\n", + " else:\n", + " logging.info(f\"Collection '{collection_name}' already exists. Skipping creation.\")\n", + "\n", + " # Wait for collection to be ready\n", + " collection = bucket.scope(scope_name).collection(collection_name)\n", + " time.sleep(2) # Give the collection time to be ready for queries\n", + "\n", + " # Clear all documents in the collection\n", + " try:\n", + " query = f\"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`\"\n", + " cluster.query(query).execute()\n", + " logging.info(\"All documents cleared from the collection.\")\n", + " except Exception as e:\n", + " logging.warning(f\"Error while clearing documents: {str(e)}. The collection might be empty.\")\n", + "\n", + " return collection\n", + " except Exception as e:\n", + " raise RuntimeError(f\"Error setting up collection: {str(e)}\")\n", + " \n", + "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME)\n", + "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, CACHE_COLLECTION)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7FvxRsg38m3G" + }, + "source": [ + "# Creating OpenAI Embeddings\n", + "Embeddings are at the heart of semantic search. They are numerical representations of text that capture the semantic meaning of the words and phrases. Unlike traditional keyword-based search, which looks for exact matches, embeddings allow our search engine to understand the context and nuances of language, enabling it to retrieve documents that are semantically similar to the query, even if they don't contain the exact keywords. By creating embeddings using OpenAI, we equip our search engine with the ability to understand and process natural language in a way that's much closer to how humans understand language. 
This step transforms our raw text data into a format that the search engine can use to find and rank relevant documents.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "id": "_75ZyCRh8m6m" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-09 12:15:54,388 - INFO - Successfully created OpenAIEmbeddings\n" + ] + } + ], + "source": [ + "try:\n", + " embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY, model='text-embedding-3-small')\n", + " logging.info(\"Successfully created OpenAIEmbeddings\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Error creating OpenAIEmbeddings: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8IwZMUnF8m-N" + }, + "source": [ + "# Setting Up the Couchbase Query Vector Store\n", + "A vector store is where we'll keep our embeddings. The query vector store is specifically designed to handle embeddings and perform similarity searches. When a user inputs a query, the query is converted into an embedding and compared against the embeddings stored in the vector store. This allows the engine to find documents that are semantically similar to the query, even if they don't contain the exact same words. By setting up the vector store in Couchbase, we create a powerful tool that enables us to understand and retrieve information based on the meaning and context of the query, rather than just the specific words used.\n", + "\n", + "The vector store requires a distance metric to determine how similarity between vectors is calculated. This is crucial for accurate semantic search results, as different distance metrics can yield different similarity rankings. Some of the supported distance strategies are dot, l2, euclidean, cosine, l2_squared, and euclidean_squared. In our implementation we will use cosine, which is particularly effective for text embeddings." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "id": "DwIJQjYT9RV_" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-09 12:16:02,578 - INFO - Successfully created vector store\n" + ] + } + ], + "source": [ + "try:\n", + " vector_store = CouchbaseQueryVectorStore(\n", + " cluster=cluster,\n", + " bucket_name=CB_BUCKET_NAME,\n", + " scope_name=SCOPE_NAME,\n", + " collection_name=COLLECTION_NAME,\n", + " embedding=embeddings,\n", + " distance_metric=DistanceStrategy.COSINE\n", + " )\n", + " logging.info(\"Successfully created vector store\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Failed to create vector store: {str(e)}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Load the BBC News Dataset\n", + "To build a search engine, we need data to search through. We use the BBC News dataset from RealTimeData, which provides real-world news articles. This dataset contains news articles from BBC covering various topics and time periods. Loading the dataset is a crucial step because it provides the raw material that our search engine will work with. The quality and diversity of the news articles make it an excellent choice for testing and refining our search engine, ensuring it can handle real-world news content effectively.\n", + "\n", + "The BBC News dataset allows us to work with authentic news articles, enabling us to build and test a search engine that can effectively process and retrieve relevant news content.
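Before loading the dataset, it can help to build intuition for what these embeddings and distance scores actually are. The sketch below is purely illustrative: it reuses the `embeddings` object created above, invents three sample sentences, and computes cosine distance by hand with NumPy (installed as a dependency of `datasets`). Lower distance means a closer semantic match, which is exactly how the search scores later in this tutorial should be read.

```python
# Illustrative sketch (not part of the tutorial pipeline): compute cosine
# distances between a few hand-picked sentences to see how embeddings behave.
import numpy as np

def cosine_distance(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v_query = embeddings.embed_query("FIFA World Cup draw controversy")
v_close = embeddings.embed_query("Criticism of the map shown at the World Cup draw")
v_far = embeddings.embed_query("How to bake sourdough bread")

print(len(v_query))                       # text-embedding-3-small produces 1536-dimensional vectors
print(cosine_distance(v_query, v_close))  # small distance: semantically related
print(cosine_distance(v_query, v_far))    # larger distance: unrelated topic
```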
The dataset is loaded using the Hugging Face datasets library, specifically accessing the \"RealTimeData/bbc_news_alltime\" dataset with the \"2024-12\" version." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-09 12:16:16,461 - INFO - Successfully loaded the BBC News dataset with 2687 rows.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loaded the BBC News dataset with 2687 rows\n" + ] + } + ], + "source": [ + "try:\n", + " news_dataset = load_dataset(\n", + " \"RealTimeData/bbc_news_alltime\", \"2024-12\", split=\"train\"\n", + " )\n", + " print(f\"Loaded the BBC News dataset with {len(news_dataset)} rows\")\n", + " logging.info(f\"Successfully loaded the BBC News dataset with {len(news_dataset)} rows.\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Error loading the BBC News dataset: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Cleaning up the Data\n", + "We will use the content of the news articles for our RAG system.\n", + "\n", + "The dataset contains a few duplicate records. We are removing them to avoid duplicate results in the retrieval stage of our RAG system." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "We have 1749 unique articles in our database.\n" + ] + } + ], + "source": [ + "news_articles = news_dataset[\"content\"]\n", + "unique_articles = set()\n", + "for article in news_articles:\n", + " if article:\n", + " unique_articles.add(article)\n", + "unique_news_articles = list(unique_articles)\n", + "print(f\"We have {len(unique_news_articles)} unique articles in our database.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Saving Data to the Vector Store\n", + "To efficiently handle the large number of articles, we process them in batches of articles at a time. This batch processing approach helps manage memory usage and provides better control over the ingestion process.\n", + "\n", + "We first filter out any articles that exceed 50,000 characters to avoid potential issues with token limits. Then, using the vector store's add_texts method, we add the filtered articles to our vector database. The batch_size parameter controls how many articles are processed in each iteration.\n", + "\n", + "This approach offers several benefits:\n", + "1. Memory Efficiency: Processing in smaller batches prevents memory overload\n", + "2. Progress Tracking: Easier to monitor and track the ingestion progress\n", + "3. 
Resource Management: Better control over CPU and network resource utilization\n", + "\n", + "We use a conservative batch size of 100 to ensure reliable operation.\n", + "The optimal batch size depends on many factors including:\n", + "- Document sizes being inserted\n", + "- Available system resources\n", + "- Network conditions\n", + "- Concurrent workload\n", + "\n", + "Consider measuring performance with your specific workload before adjusting.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-09 12:18:40,320 - INFO - Document ingestion completed successfully.\n" + ] + } + ], + "source": [ + "batch_size = 100\n", + "\n", + "# Automatic Batch Processing\n", + "articles = [article for article in unique_news_articles if article and len(article) <= 50000]\n", + "\n", + "try:\n", + " vector_store.add_texts(\n", + " texts=articles,\n", + " batch_size=batch_size\n", + " )\n", + " logging.info(\"Document ingestion completed successfully.\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Failed to save documents to vector store: {str(e)}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8Pn8-dQw9RfQ" + }, + "source": [ + "# Setting Up a Couchbase Cache\n", + "To further optimize our system, we set up a Couchbase-based cache. A cache is a temporary storage layer that holds data that is frequently accessed, speeding up operations by reducing the need to repeatedly retrieve the same information from the database. In our setup, the cache will help us accelerate repetitive tasks, such as looking up similar documents. By implementing a cache, we enhance the overall performance of our search engine, ensuring that it can handle high query volumes and deliver results quickly.\n", + "\n", + "Caching is particularly valuable in scenarios where users may submit similar queries multiple times or where certain pieces of information are frequently requested. By storing these in a cache, we can significantly reduce the time it takes to respond to these queries, improving the user experience.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "id": "V2y7dyjf9Rid" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-09 12:18:47,269 - INFO - Successfully created cache\n" + ] + } + ], + "source": [ + "try:\n", + " cache = CouchbaseCache(\n", + " cluster=cluster,\n", + " bucket_name=CB_BUCKET_NAME,\n", + " scope_name=SCOPE_NAME,\n", + " collection_name=CACHE_COLLECTION,\n", + " )\n", + " logging.info(\"Successfully created cache\")\n", + " set_llm_cache(cache)\n", + "except Exception as e:\n", + " raise ValueError(f\"Failed to create cache: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uehAx36o9Rlm" + }, + "source": [ + "# Using the Claude 4 Sonnet Language Model (LLM)\n", + "Language models are AI systems that are trained to understand and generate human language. We'll be using the `Claude 4 Sonnet` language model to process user queries and generate meaningful responses. This model is a key component of our semantic search engine, allowing it to go beyond simple keyword matching and truly understand the intent behind a query. 
By creating this language model, we equip our search engine with the ability to interpret complex queries, understand the nuances of language, and provide more accurate and contextually relevant responses.\n", + "\n", + "The language model's ability to understand context and generate coherent responses is what makes our search engine truly intelligent. It can not only find the right information but also present it in a way that is useful and understandable to the user.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "yRAfBRLH9RpO" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-09 12:20:36,212 - INFO - Successfully created ChatAnthropic\n" + ] + } + ], + "source": [ + "try:\n", + " llm = ChatAnthropic(temperature=0.1, anthropic_api_key=ANTHROPIC_API_KEY, model_name='claude-sonnet-4-20250514') \n", + " logging.info(\"Successfully created ChatAnthropic\")\n", + "except Exception as e:\n", + " logging.error(f\"Error creating ChatAnthropic: {str(e)}. Please check your API key and network connection.\")\n", + " raise" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "k_XDfCx19UvG" + }, + "source": [ + "# Perform Semantic Search\n", + "Semantic search in Couchbase involves converting queries and documents into vector representations using an embeddings model. These vectors capture the semantic meaning of the text and are stored directly in Couchbase. When a query is made, Couchbase performs a similarity search by comparing the query vector against the stored document vectors. The similarity metric used for this comparison is configurable, allowing flexibility in how the relevance of documents is determined. Common metrics include cosine similarity, Euclidean distance, or dot product, but other metrics can be implemented based on specific use cases. Different embedding models like BERT, Word2Vec, or GloVe can also be used depending on the application's needs, with the vectors generated by these models stored and searched within Couchbase itself.\n", + "\n", + "In the provided code, the search process begins by recording the start time, followed by executing the `similarity_search_with_score` method of the `CouchbaseQueryVectorStore`. This method searches Couchbase for the most relevant documents based on the vector similarity to the query. The search results include the document content and the distance that reflects how closely each document aligns with the query in the defined semantic space. The time taken to perform this search is then calculated and logged, and the results are displayed, showing the most relevant documents along with their similarity scores. This approach leverages Couchbase as both a storage and retrieval engine for vector data, enabling efficient and scalable semantic searches. The integration of vector storage and search capabilities within Couchbase allows for sophisticated semantic search operations without relying on external services for vector storage or comparison." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "id": "Pk-oFbnC9Uym" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-09 12:21:34,292 - INFO - Semantic search completed in 1.91 seconds\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Semantic Search Results (completed in 1.91 seconds):\n", + "--------------------------------------------------------------------------------\n", + "Score: 0.2502, Text: A map shown during the draw for the 2026 Fifa World Cup has been criticised by Ukraine as an \"unacceptable error\" after it appeared to exclude Crimea as part of the country. The graphic - showing countries that cannot be drawn to play each other for geopolitical reasons - highlighted Ukraine but did not include the peninsula that is internationally recognised to be part of it. Crimea has been under Russian occupation since 2014 and just a handful of countries recognise the peninsula as Russian territory. Ukraine Foreign Ministry spokesman Heorhiy Tykhy said that the nation expects \"a public apology\". Fifa said it was \"aware of an issue\" and the image had been removed.\n", + "\n", + "Writing on X, Tykhy said that Fifa had not only \"acted against international law\" but had also \"supported Russian propaganda, war crimes, and the crime of aggression against Ukraine\". He added a \"fixed\" version of the map to his post, highlighting Crimea as part of Ukraine's territory. Among the countries that cannot play each other are Ukraine and Belarus, Spain and Gibraltar and Kosovo versus either Bosnia and Herzegovina or Serbia.\n", + "\n", + "This Twitter post cannot be displayed in your browser. Please enable Javascript or try a different browser. View original content on Twitter The BBC is not responsible for the content of external sites. Skip twitter post by Heorhii Tykhyi This article contains content provided by Twitter. We ask for your permission before anything is loaded, as they may be using cookies and other technologies. You may want to read Twitter\u2019s cookie policy, external and privacy policy, external before accepting. To view this content choose \u2018accept and continue\u2019. The BBC is not responsible for the content of external sites.\n", + "\n", + "The Ukrainian Football Association has also sent a letter to Fifa secretary-general Mathias Grafstr\u00f6m and UEFA secretary-general Theodore Theodoridis over the matter. \"We appeal to you to express our deep concern about the infographic map [shown] on December 13, 2024,\" the letter reads. \"Taking into account a number of official decisions and resolutions adopted by the Fifa Council and the UEFA executive committee since 2014... we emphasize that today's version of the cartographic image of Ukraine... is completely unacceptable and looks like an inconsistent position of Fifa and UEFA.\" The 2026 World Cup will start on 11 June that year in Mexico City and end on 19 July in New Jersey. The expanded 48-team tournament will last a record 39 days. Ukraine were placed in Group D alongside Iceland, Azerbaijan and the yet-to-be-determined winners of France's Nations League quarter-final against Croatia.\n", + "--------------------------------------------------------------------------------\n", + "Score: 0.5698, Text: Defending champions Manchester City will face Juventus in the group stage of the Fifa Club World Cup next summer, while Chelsea meet Brazilian side Flamengo. 
Pep Guardiola's City, who beat Brazilian side Fluminense to win the tournament for the first time in 2023, begin their title defence against Morocco's Wydad and also play Al Ain of the United Arab Emirates in Group G. Chelsea, winners of the 2021 final, were also drawn alongside Mexico's Club Leon and Tunisian side Esperance Sportive de Tunisie in Group D. The revamped Fifa Club World Cup, which has been expanded to 32 teams, will take place in the United States between 15 June and 13 July next year.\n", + "\n", + "A complex and lengthy draw ceremony was held across two separate Miami locations and lasted more than 90 minutes, during which a new Club World Cup trophy was revealed. There was also a video message from incoming US president Donald Trump, whose daughter Ivanka drew the first team. Lionel Messi's Inter Miami will take on Egyptian side Al Ahly at the Hard Rock Stadium in the opening match, staged in Miami. Elsewhere, Paris St-Germain were drawn against Atletico Madrid in Group B, while Bayern Munich meet Benfica in another all-European group-stage match-up. Teams will play each other once in the group phase and the top two will progress to the knockout stage.\n", + "\n", + "This video can not be played To play this video you need to enable JavaScript in your browser. What is the Club World Cup?\n", + "\n", + "Teams from each of the six international football confederations will be represented at next summer's tournament, including 12 European clubs - the highest quota of any confederation. The European places were decided by clubs' Champions League performances over the past four seasons, with recent winners Chelsea, Manchester City and Real Madrid guaranteed places. Al Ain, the most successful club in the UAE with 14 league titles, are owned by the country's president Sheikh Mohamed bin Zayed Al Nahyan - the older brother of City owner Sheikh Mansour. Real, who lifted the Fifa Club World Cup trophy for a record-extending fifth time in 2022, will open up against Saudi Pro League champions Al-Hilal, who currently have Neymar in their ranks. One place was reserved for a club from the host nation, which Fifa controversially awarded to Inter Miami, who will contest the tournament curtain-raiser. Messi's side were winners of the regular-season MLS Supporters' Shield but beaten in the MLS play-offs, meaning they are not this season's champions.\n", + "\u2022 None How does the new Club World Cup work & why is it so controversial?\n", + "\n", + "Matches will be played across 12 venues in the US which, alongside Canada and Mexico, also host the 2026 World Cup. Fifa is facing legal action from player unions and leagues about the scheduling of the event, which begins two weeks after the Champions League final at the end of the 2024-25 European calendar and ends five weeks before the first Premier League match of the 2025-2026 season. But football's world governing body believes the dates allow sufficient rest time before the start of the domestic campaigns. The Club World Cup will now take place once every four years, when it was previously held annually and involved just seven teams. 
Streaming platform DAZN has secured exclusive rights to broadcast next summer's tournament, during which 63 matches will take place over 29 days.\n", + "--------------------------------------------------------------------------------\n", + "Score: 0.5792, Text: After Fifa awards Saudi Arabia the hosting rights for the men's 2034 World Cup, BBC analysis editor Ros Atkins looks at how we got here and the controversies surrounding the decision.\n", + "--------------------------------------------------------------------------------\n", + "Score: 0.5877, Text: FA still to decide on endorsing Saudi World Cup bid\n", + "\n", + "... (output truncated for brevity)\n" + ] + } + ], + "source": [ + "query = \"What happened with the map shown during the 2026 FIFA World Cup draw regarding Ukraine and Crimea? What was the controversy?\"\n", + "\n", + "try:\n", + " # Perform the semantic search\n", + " start_time = time.time()\n", + " search_results = vector_store.similarity_search_with_score(query, k=10)\n", + " search_elapsed_time = time.time() - start_time\n", + "\n", + " logging.info(f\"Semantic search completed in {search_elapsed_time:.2f} seconds\")\n", + "\n", + " # Display search results\n", + " print(f\"\\nSemantic Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", + " print(\"-\" * 80) # Add separator line\n", + " for doc, score in search_results:\n", + " print(f\"Score: {score:.4f}, Text: {doc.page_content}\")\n", + " print(\"-\" * 80) # Add separator between results\n", + "\n", + "except CouchbaseException as e:\n", + " raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n", + "except Exception as e:\n", + " raise RuntimeError(f\"Unexpected error: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Optimizing Vector Search with Hyperscale and Composite Vector Indexes\n", + "\n", + "While the semantic search above using similarity_search_with_score works effectively, we can significantly improve query performance by leveraging Hyperscale and Composite Vector Indexes in Couchbase.\n", + "\n", + "Couchbase offers three types of vector indexes, but for query-based vector search we focus on two main types:\n", + "\n", + "Hyperscale Vector Indexes (BHIVE)\n", + "- Best for pure vector searches - content discovery, recommendations, semantic search\n", + "- High performance with low memory footprint - designed to scale to billions of vectors\n", + "- Optimized for concurrent operations - supports simultaneous searches and inserts\n", + "- Use when: You primarily perform vector-only queries without complex scalar filtering\n", + "- Ideal for: Large-scale semantic search, recommendation systems, content discovery\n", + "\n", + "Composite Vector Indexes\n", + "- Best for filtered vector searches - combines vector search with scalar value filtering\n", + "- Efficient pre-filtering - scalar attributes reduce the vector comparison scope\n", + "- Use when: Your queries combine vector similarity with scalar filters that eliminate large portions of data\n", + "- Ideal for: Compliance-based filtering, user-specific searches, time-bounded queries\n", + "\n", + "Choosing the Right Index Type\n", + "- Start with Hyperscale Vector Index for pure vector searches and large datasets\n", + "- Use Composite Vector Index when scalar filters significantly reduce your search space\n", + "- Consider your dataset size: Hyperscale scales to billions, Composite works well for tens of millions to billions\n", + "\n", + "For more details, see the [Couchbase Vector Index documentation](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html).\n", + "\n", + "\n", + "## Understanding Index Configuration (Couchbase 8.0 Feature)\n", + "\n", + "The index_description parameter controls how Couchbase optimizes vector storage and search performance through centroids and quantization:\n", + "\n", + "Format: `'IVF[<centroids>],{PQ<subquantizers>x<bits>|SQ<bits>}'`\n", + "\n", + "Centroids (IVF - Inverted File):\n", + "- Controls how the dataset is subdivided for faster searches\n", + "- More centroids = faster search, slower training\n", + "- Fewer centroids = slower search, faster training\n", + "- If omitted (like IVF,SQ8), Couchbase auto-selects based on dataset size\n", + "\n", + "Quantization Options:\n", + "- SQ (Scalar Quantization): SQ4, SQ6, SQ8 (4, 6, or 8 bits per dimension)\n", + "- PQ (Product Quantization): PQ<subquantizers>x<bits> (e.g., PQ32x8)\n", + "- Higher values = better accuracy, larger index size\n", + "\n", + "Common Examples:\n", + "- IVF,SQ8 - Auto centroids, 8-bit scalar quantization (good default)\n", + "- IVF1000,SQ6 - 1000 centroids, 6-bit scalar quantization\n", + "- IVF,PQ32x8 - Auto centroids, 32 subquantizers with 8 bits\n", + "\n", + "For detailed configuration options, see the [Quantization & Centroid Settings](https://docs.couchbase.com/cloud/vector-index/hyperscale-vector-index.html#algo_settings).\n", + "\n", + "In the code below, we demonstrate creating a BHIVE index. This method takes an index type (BHIVE or COMPOSITE) and a description parameter for optimization settings. Alternatively, Hyperscale and Composite Vector Indexes can be created manually from the Couchbase UI." + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [], + "source": [ + "vector_store.create_index(index_type=IndexType.BHIVE, index_name=\"claude_bhive_index\", index_description=\"IVF,SQ8\")" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-09 12:26:01,504 - INFO - Semantic search completed in 0.44 seconds\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Semantic Search Results (completed in 0.44 seconds):\n", + "--------------------------------------------------------------------------------\n", + "Score: 0.2502, Text: A map shown during the draw for the 2026 Fifa World Cup has been criticised by Ukraine as an \"unacceptable error\" after it appeared to exclude Crimea as part of the country. The graphic - showing countries that cannot be drawn to play each other for geopolitical reasons - highlighted Ukraine but did not include the peninsula that is internationally recognised to be part of it. Crimea has been under Russian occupation since 2014 and just a handful of countries recognise the peninsula as Russian territory. Ukraine Foreign Ministry spokesman Heorhiy Tykhy said that the nation expects \"a public apology\". Fifa said it was \"aware of an issue\" and the image had been removed.\n", + "\n", + "Writing on X, Tykhy said that Fifa had not only \"acted against international law\" but had also \"supported Russian propaganda, war crimes, and the crime of aggression against Ukraine\". He added a \"fixed\" version of the map to his post, highlighting Crimea as part of Ukraine's territory.
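To make the configuration format above concrete, here is a small illustrative sketch that reuses the `create_index()` call from this tutorial with explicit centroid and quantization settings. The index name is hypothetical, and whether these particular settings help depends entirely on your dataset, so measure recall and latency before adopting them.

```python
# Illustrative sketch: same API as the BHIVE example above, but with an explicit
# centroid count and 6-bit scalar quantization instead of the "IVF,SQ8" default.
vector_store.create_index(
    index_type=IndexType.BHIVE,
    index_name="claude_bhive_ivf1000_sq6",  # hypothetical index name
    index_description="IVF1000,SQ6",        # 1000 centroids, 6 bits per dimension
)
```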
Among the countries that cannot play each other are Ukraine and Belarus, Spain and Gibraltar and Kosovo versus either Bosnia and Herzegovina or Serbia.\n", + "\n", + "This Twitter post cannot be displayed in your browser. Please enable Javascript or try a different browser. View original content on Twitter The BBC is not responsible for the content of external sites. Skip twitter post by Heorhii Tykhyi This article contains content provided by Twitter. We ask for your permission before anything is loaded, as they may be using cookies and other technologies. You may want to read Twitter\u2019s cookie policy, external and privacy policy, external before accepting. To view this content choose \u2018accept and continue\u2019. The BBC is not responsible for the content of external sites.\n", + "\n", + "The Ukrainian Football Association has also sent a letter to Fifa secretary-general Mathias Grafstr\u00f6m and UEFA secretary-general Theodore Theodoridis over the matter. \"We appeal to you to express our deep concern about the infographic map [shown] on December 13, 2024,\" the letter reads. \"Taking into account a number of official decisions and resolutions adopted by the Fifa Council and the UEFA executive committee since 2014... we emphasize that today's version of the cartographic image of Ukraine... is completely unacceptable and looks like an inconsistent position of Fifa and UEFA.\" The 2026 World Cup will start on 11 June that year in Mexico City and end on 19 July in New Jersey. The expanded 48-team tournament will last a record 39 days. Ukraine were placed in Group D alongside Iceland, Azerbaijan and the yet-to-be-determined winners of France's Nations League quarter-final against Croatia.\n", + "--------------------------------------------------------------------------------\n", + "Score: 0.5698, Text: Defending champions Manchester City will face Juventus in the group stage of the Fifa Club World Cup next summer, while Chelsea meet Brazilian side Flamengo. Pep Guardiola's City, who beat Brazilian side Fluminense to win the tournament for the first time in 2023, begin their title defence against Morocco's Wydad and also play Al Ain of the United Arab Emirates in Group G. Chelsea, winners of the 2021 final, were also drawn alongside Mexico's Club Leon and Tunisian side Esperance Sportive de Tunisie in Group D. The revamped Fifa Club World Cup, which has been expanded to 32 teams, will take place in the United States between 15 June and 13 July next year.\n", + "\n", + "A complex and lengthy draw ceremony was held across two separate Miami locations and lasted more than 90 minutes, during which a new Club World Cup trophy was revealed. There was also a video message from incoming US president Donald Trump, whose daughter Ivanka drew the first team. Lionel Messi's Inter Miami will take on Egyptian side Al Ahly at the Hard Rock Stadium in the opening match, staged in Miami. Elsewhere, Paris St-Germain were drawn against Atletico Madrid in Group B, while Bayern Munich meet Benfica in another all-European group-stage match-up. Teams will play each other once in the group phase and the top two will progress to the knockout stage.\n", + "\n", + "This video can not be played To play this video you need to enable JavaScript in your browser. What is the Club World Cup?\n", + "\n", + "Teams from each of the six international football confederations will be represented at next summer's tournament, including 12 European clubs - the highest quota of any confederation. 
The European places were decided by clubs' Champions League performances over the past four seasons, with recent winners Chelsea, Manchester City and Real Madrid guaranteed places. Al Ain, the most successful club in the UAE with 14 league titles, are owned by the country's president Sheikh Mohamed bin Zayed Al Nahyan - the older brother of City owner Sheikh Mansour. Real, who lifted the Fifa Club World Cup trophy for a record-extending fifth time in 2022, will open up against Saudi Pro League champions Al-Hilal, who currently have Neymar in their ranks. One place was reserved for a club from the host nation, which Fifa controversially awarded to Inter Miami, who will contest the tournament curtain-raiser. Messi's side were winners of the regular-season MLS Supporters' Shield but beaten in the MLS play-offs, meaning they are not this season's champions.\n", + "\u2022 None How does the new Club World Cup work & why is it so controversial?\n", + "\n", + "Matches will be played across 12 venues in the US which, alongside Canada and Mexico, also host the 2026 World Cup. Fifa is facing legal action from player unions and leagues about the scheduling of the event, which begins two weeks after the Champions League final at the end of the 2024-25 European calendar and ends five weeks before the first Premier League match of the 2025-2026 season. But football's world governing body believes the dates allow sufficient rest time before the start of the domestic campaigns. The Club World Cup will now take place once every four years, when it was previously held annually and involved just seven teams. Streaming platform DAZN has secured exclusive rights to broadcast next summer's tournament, during which 63 matches will take place over 29 days.\n", + "--------------------------------------------------------------------------------\n", + "Score: 0.5792, Text: After Fifa awards Saudi Arabia the hosting rights for the men's 2034 World Cup, BBC analysis editor Ros Atkins looks at how we got here and the controversies surrounding the decision.\n", + "--------------------------------------------------------------------------------\n", + "Score: 0.5877, Text: FA still to decide on endorsing Saudi World Cup bid\n", + "\n", + "... (output truncated for brevity)\n" + ] + } + ], + "source": [ + "query = \"What happened with the map shown during the 2026 FIFA World Cup draw regarding Ukraine and Crimea? What was the controversy?\"\n", + "\n", + "try:\n", + " # Perform the semantic search\n", + " start_time = time.time()\n", + " search_results = vector_store.similarity_search_with_score(query, k=10)\n", + " search_elapsed_time = time.time() - start_time\n", + "\n", + " logging.info(f\"Semantic search completed in {search_elapsed_time:.2f} seconds\")\n", + "\n", + " # Display search results\n", + " print(f\"\\nSemantic Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", + " print(\"-\" * 80) # Add separator line\n", + " for doc, score in search_results:\n", + " print(f\"Score: {score:.4f}, Text: {doc.page_content}\")\n", + " print(\"-\" * 80) # Add separator between results\n", + "\n", + "except CouchbaseException as e:\n", + " raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n", + "except Exception as e:\n", + " raise RuntimeError(f\"Unexpected error: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note: To create a COMPOSITE index, the below code can be used.\n", + "Choose based on your specific use case and query patterns. 
For this tutorial's news search scenario, either index type would work, but BHIVE might be more efficient for pure semantic search across news articles." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "vector_store.create_index(index_type=IndexType.COMPOSITE, index_name=\"claude_composite_index\", index_description=\"IVF,SQ8\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sS0FebHI9U1l" + }, + "source": [ + "# Retrieval-Augmented Generation (RAG) with Couchbase and LangChain\n", + "Couchbase and LangChain can be seamlessly integrated to create RAG (Retrieval-Augmented Generation) chains, enhancing the process of generating contextually relevant responses. In this setup, Couchbase serves as the vector store, where embeddings of documents are stored. When a query is made, LangChain retrieves the most relevant documents from Couchbase by comparing the query\u2019s embedding with the stored document embeddings. These documents, which provide contextual information, are then passed to a generative language model within LangChain.\n", + "\n", + "The language model, equipped with the context from the retrieved documents, generates a response that is both informed and contextually accurate. This integration allows the RAG chain to leverage Couchbase\u2019s efficient storage and retrieval capabilities, while LangChain handles the generation of responses based on the context provided by the retrieved documents. Together, they create a powerful system that can deliver highly relevant and accurate answers by combining the strengths of both retrieval and generation." + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": { + "id": "ZGUXQQmv9ge4" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-09 12:26:10,540 - INFO - Successfully created RAG chain\n" + ] + } + ], + "source": [ + "system_template = \"You are a helpful assistant that answers questions based on the provided context.\"\n", + "system_message_prompt = SystemMessagePromptTemplate.from_template(system_template)\n", + "\n", + "human_template = \"Context: {context}\\n\\nQuestion: {question}\"\n", + "human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)\n", + "\n", + "chat_prompt = ChatPromptTemplate.from_messages([\n", + " system_message_prompt,\n", + " human_message_prompt\n", + "])\n", + "\n", + "def format_docs(docs):\n", + " return \"\\n\\n\".join(doc.page_content for doc in docs)\n", + "\n", + "rag_chain = (\n", + " {\"context\": lambda x: format_docs(vector_store.similarity_search(x)), \"question\": RunnablePassthrough()}\n", + " | chat_prompt\n", + " | llm\n", + ")\n", + "logging.info(\"Successfully created RAG chain\")" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": { + "id": "Mia7XxM9978M" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "RAG Response: During the draw for the 2026 FIFA World Cup, a map was shown that excluded Crimea as part of Ukraine. This graphic, which was displaying countries that cannot be drawn to play each other for geopolitical reasons, highlighted Ukraine but did not include the Crimean peninsula, which is internationally recognized as Ukrainian territory.\n", + "\n", + "This omission sparked significant controversy because Crimea has been under Russian occupation since 2014, but only a handful of countries recognize it as Russian territory. 
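If the pipe syntax used to build `rag_chain` above feels opaque, the sketch below is a plain-function equivalent assembled only from objects already defined in this notebook (`vector_store`, `format_docs`, `chat_prompt`, and `llm`). It is included for readability, not as a replacement; the rest of the tutorial keeps using `rag_chain`.

```python
# Sketch: the same retrieve -> format -> prompt -> generate flow as rag_chain,
# written as an ordinary function for clarity.
def answer_question(question: str):
    docs = vector_store.similarity_search(question)  # 1. retrieve relevant documents
    context = format_docs(docs)                      # 2. join them into one context string
    messages = chat_prompt.format_messages(          # 3. fill the prompt template
        context=context, question=question
    )
    return llm.invoke(messages)                      # 4. call Claude

# Example usage:
# print(answer_question("What was the World Cup draw map controversy?").content)
```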
The Ukrainian Foreign Ministry spokesman, Heorhiy Tykhy, called this an \"unacceptable error\" and stated that Ukraine expected \"a public apology\" from FIFA. He criticized FIFA for acting \"against international law\" and supporting \"Russian propaganda, war crimes, and the crime of aggression against Ukraine.\"\n", + "\n", + "The Ukrainian Football Association also sent a formal letter of complaint to FIFA and UEFA officials expressing their \"deep concern\" about the cartographic representation. FIFA acknowledged they were \"aware of an issue\" and subsequently removed the image.\n", + "RAG response generated in 8.68 seconds\n" + ] + } + ], + "source": [ + "try:\n", + " start_time = time.time()\n", + " rag_response = rag_chain.invoke(query)\n", + " rag_elapsed_time = time.time() - start_time\n", + "\n", + " print(f\"RAG Response: {rag_response.content}\")\n", + " print(f\"RAG response generated in {rag_elapsed_time:.2f} seconds\")\n", + "except AuthenticationError as e:\n", + " print(f\"Authentication error: {str(e)}\")\n", + "except InternalServerFailureException as e:\n", + " if \"query request rejected\" in str(e):\n", + " print(\"Error: Search request was rejected due to rate limiting. Please try again later.\")\n", + " else:\n", + " print(f\"Internal server error occurred: {str(e)}\")\n", + "except Exception as e:\n", + " print(f\"Unexpected error occurred: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aIdayPzw9glT" + }, + "source": [ + "# Using Couchbase as a caching mechanism\n", + "Couchbase can be effectively used as a caching mechanism for RAG (Retrieval-Augmented Generation) responses by storing and retrieving precomputed results for specific queries. This approach enhances the system's efficiency and speed, particularly when dealing with repeated or similar queries. When a query is first processed, the RAG chain retrieves relevant documents, generates a response using the language model, and then stores this response in Couchbase, with the query serving as the key.\n", + "\n", + "For subsequent requests with the same query, the system checks Couchbase first. If a cached response is found, it is retrieved directly from Couchbase, bypassing the need to re-run the entire RAG process. This significantly reduces response time because the computationally expensive steps of document retrieval and response generation are skipped. Couchbase's role in this setup is to provide a fast and scalable storage solution for caching these responses, ensuring that frequently asked queries can be answered more quickly and efficiently.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": { + "id": "0xM2G3ef-GS2" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Query 1: What happened when Apple's AI feature generated a false BBC headline about a murder case in New York?\n", + "Response: According to the context, Apple Intelligence (an AI feature that summarizes notifications) generated a false headline that made it appear as if BBC News had published an article claiming Luigi Mangione, who was arrested for the murder of healthcare insurance CEO Brian Thompson in New York, had shot himself. 
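As an optional sanity check, you can peek at the cache collection with a small SQL++ query to confirm that responses are being written to it. This sketch lists only document IDs, because the cached document schema is internal to langchain-couchbase and its field names should not be relied upon.

```python
# Optional sketch: confirm that cached RAG responses are landing in the cache
# collection by listing a few document IDs.
result = cluster.query(
    f"SELECT META().id AS id FROM `{CB_BUCKET_NAME}`.`{SCOPE_NAME}`.`{CACHE_COLLECTION}` LIMIT 5"
)
for row in result:
    print(row["id"])
```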
This was completely false - Mangione had not shot himself.\n", + "\n", + "The BBC complained to Apple about this misrepresentation, with a BBC spokesperson stating they had \"contacted Apple to raise this concern and fix the problem.\" The spokesperson emphasized that it's \"essential\" that audiences can trust information published under the BBC name, including notifications.\n", + "\n", + "This wasn't an isolated incident, as the context mentions that Apple's AI feature also misrepresented a New York Times article, incorrectly summarizing it as \"Netanyahu arrested\" when the actual article was about the International Criminal Court issuing an arrest warrant for the Israeli prime minister.\n", + "Time taken: 6.22 seconds\n", + "\n", + "Query 2: What happened with the map shown during the 2026 FIFA World Cup draw regarding Ukraine and Crimea? What was the controversy?\n", + "Response: During the draw for the 2026 FIFA World Cup, a map was shown that excluded Crimea as part of Ukraine. This graphic, which was displaying countries that cannot be drawn to play each other for geopolitical reasons, highlighted Ukraine but did not include the Crimean peninsula, which is internationally recognized as Ukrainian territory.\n", + "\n", + "This omission sparked significant controversy because Crimea has been under Russian occupation since 2014, but only a handful of countries recognize it as Russian territory. The Ukrainian Foreign Ministry spokesman, Heorhiy Tykhy, called this an \"unacceptable error\" and stated that Ukraine expected \"a public apology\" from FIFA. He criticized FIFA for acting \"against international law\" and supporting \"Russian propaganda, war crimes, and the crime of aggression against Ukraine.\"\n", + "\n", + "The Ukrainian Football Association also sent a formal letter of complaint to FIFA and UEFA officials expressing their \"deep concern\" about the cartographic representation. FIFA acknowledged they were \"aware of an issue\" and subsequently removed the image.\n", + "Time taken: 0.47 seconds\n", + "\n", + "Query 3: What happened when Apple's AI feature generated a false BBC headline about a murder case in New York?\n", + "Response: According to the context, Apple Intelligence (an AI feature that summarizes notifications) generated a false headline that made it appear as if BBC News had published an article claiming Luigi Mangione, who was arrested for the murder of healthcare insurance CEO Brian Thompson in New York, had shot himself. This was completely false - Mangione had not shot himself.\n", + "\n", + "The BBC complained to Apple about this misrepresentation, with a BBC spokesperson stating they had \"contacted Apple to raise this concern and fix the problem.\" The spokesperson emphasized that it's \"essential\" that audiences can trust information published under the BBC name, including notifications.\n", + "\n", + "This wasn't an isolated incident, as the context mentions that Apple's AI feature also misrepresented a New York Times article, incorrectly summarizing it as \"Netanyahu arrested\" when the actual article was about the International Criminal Court issuing an arrest warrant for the Israeli prime minister.\n", + "Time taken: 0.46 seconds\n" + ] + } + ], + "source": [ + "try:\n", + " queries = [\n", + " \"What happened when Apple's AI feature generated a false BBC headline about a murder case in New York?\",\n", + " \"What happened with the map shown during the 2026 FIFA World Cup draw regarding Ukraine and Crimea? 
What was the controversy?\", # Repeated query\n", + " \"What happened when Apple's AI feature generated a false BBC headline about a murder case in New York?\", # Repeated query\n", + " ]\n", + "\n", + " for i, query in enumerate(queries, 1):\n", + " print(f\"\\nQuery {i}: {query}\")\n", + " start_time = time.time()\n", + "\n", + " response = rag_chain.invoke(query)\n", + " elapsed_time = time.time() - start_time\n", + " print(f\"Response: {response.content}\")\n", + " print(f\"Time taken: {elapsed_time:.2f} seconds\")\n", + "except AuthenticationError as e:\n", + " print(f\"Authentication error: {str(e)}\")\n", + "except InternalServerFailureException as e:\n", + " if \"query request rejected\" in str(e):\n", + " print(\"Error: Search request was rejected due to rate limiting. Please try again later.\")\n", + " else:\n", + " print(f\"Internal server error occurred: {str(e)}\")\n", + "except Exception as e:\n", + " print(f\"Unexpected error occurred: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yJQ5P8E29go1" + }, + "source": [ + "## Conclusion\n", + "By following these steps, you\u2019ll have a fully functional semantic search engine that leverages the strengths of Couchbase and Claude(by Anthropic). This guide is designed not just to show you how to build the system, but also to explain why each step is necessary, giving you a deeper understanding of the principles behind semantic search and how it improves querying data more efficiently using Hyperscale and Composite Vector Indexes which can significantly improve your RAG performance. Whether you're a newcomer to software development or an experienced developer looking to expand your skills, this guide will provide you with the knowledge and tools you need to create a powerful, AI-driven search engine." + ] + } + ], + "metadata": { + "colab": { + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.3" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/claudeai/gsi/frontmatter.md b/claudeai/query_based/frontmatter.md similarity index 100% rename from claudeai/gsi/frontmatter.md rename to claudeai/query_based/frontmatter.md diff --git a/claudeai/fts/.env.sample b/claudeai/search_based/.env.sample similarity index 100% rename from claudeai/fts/.env.sample rename to claudeai/search_based/.env.sample diff --git a/claudeai/search_based/RAG_with_Couchbase_and_Claude(by_Anthropic).ipynb b/claudeai/search_based/RAG_with_Couchbase_and_Claude(by_Anthropic).ipynb new file mode 100644 index 00000000..854a646a --- /dev/null +++ b/claudeai/search_based/RAG_with_Couchbase_and_Claude(by_Anthropic).ipynb @@ -0,0 +1,1049 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "kNdImxzypDlm" + }, + "source": [ + "# Introduction\n", + "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database, [OpenAI](https://openai.com/) as the AI-powered embedding and [Anthropic](https://claude.ai/) as the language model provider. 
Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system using a Couchbase Search Vector Index from scratch. For guidance on choosing the right vector index for your use case, see the [Couchbase documentation](https://docs.couchbase.com/server/current/vector-search/choose-the-right-vector-index.html). Alternatively, if you want to perform semantic search using Couchbase Hyperscale or Composite Vector Indexes, please take a look at [this.](https://developer.couchbase.com/tutorial-openai-claude-couchbase-rag-with-hyperscale-or-composite-vector-index/)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# How to run this tutorial\n", + "\n", + "This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/claudeai/search_based/RAG_with_Couchbase_and_Claude(by_Anthropic).ipynb).\n", + "\n", + "You can either download the notebook file and run it on [Google Colab](https://colab.research.google.com/) or run it on your system by setting up the Python environment." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Before you start\n", + "\n", + "## Get Credentials for OpenAI and Anthropic\n", + "\n", + "* Please follow the [instructions](https://platform.openai.com/docs/quickstart) to generate the OpenAI credentials.\n", + "* Please follow the [instructions](https://docs.anthropic.com/en/api/getting-started) to generate the Anthropic credentials.\n", + "\n", + "## Create and Deploy Your Free Tier Operational cluster on Capella\n", + "\n", + "To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. This account provides you with an environment where you can explore and learn about Capella with no time constraint.\n", + "\n", + "To learn more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n", + "\n", + "### Couchbase Capella Configuration\n", + "\n", + "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met.\n", + "\n", + "* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the required bucket (Read and Write) used in the application.\n", + "\n", + "* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NH2o6pqa69oG" + }, + "source": [ + "# Setting the Stage: Installing Necessary Libraries\n", + "To build our semantic search engine, we need a robust set of tools. The libraries we install handle everything from connecting to databases to performing complex machine learning tasks. Each library has a specific role: Couchbase libraries manage database operations, LangChain handles AI model integrations, and OpenAI provides advanced AI models for generating embeddings, while Claude(by Anthropic) handles natural language understanding. 
By setting up these libraries, we ensure our environment is equipped to handle the data-intensive and computationally complex tasks required for semantic search." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "id": "DYhPj0Ta8l_A" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], + "source": [ + "%pip install --quiet datasets==3.5.0 langchain-couchbase==0.3.0 langchain-anthropic==0.3.11 langchain-openai==0.3.13 python-dotenv==1.1.0" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1pp7GtNg8mB9" + }, + "source": [ + "# Importing Necessary Libraries\n", + "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, embedding generation, and dataset loading. These libraries provide essential functions for working with data, managing database connections, and processing machine learning models." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "8GzS6tfL8mFP" + }, + "outputs": [], + "source": [ + "import getpass\n", + "import json\n", + "import logging\n", + "import os\n", + "import time\n", + "from datetime import timedelta\n", + "\n", + "from anthropic import AuthenticationError  # raised on invalid Anthropic API keys\n", + "from couchbase.auth import PasswordAuthenticator\n", + "from couchbase.cluster import Cluster\n", + "from couchbase.exceptions import (CouchbaseException,\n", + "                                  InternalServerFailureException,\n", + "                                  QueryIndexAlreadyExistsException,\n", + "                                  ServiceUnavailableException)\n", + "from couchbase.management.buckets import CreateBucketSettings\n", + "from couchbase.management.search import SearchIndex\n", + "from couchbase.options import ClusterOptions\n", + "from datasets import load_dataset\n", + "from dotenv import load_dotenv\n", + "from langchain_anthropic import ChatAnthropic\n", + "from langchain_core.globals import set_llm_cache\n", + "from langchain_core.prompts.chat import (ChatPromptTemplate,\n", + "                                         HumanMessagePromptTemplate,\n", + "                                         SystemMessagePromptTemplate)\n", + "from langchain_core.runnables import RunnablePassthrough\n", + "from langchain_couchbase.cache import CouchbaseCache\n", + "from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore\n", + "from langchain_openai import OpenAIEmbeddings" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pBnMp5vb8mIb" + }, + "source": [ + "# Setup Logging\n", + "Logging is configured to track the progress of the script and capture any errors or warnings. This is crucial for debugging and understanding the flow of execution. The logging output includes timestamps, log levels (e.g., INFO, ERROR), and messages that describe what is happening in the script.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "id": "Yv8kWcuf8mLx" + }, + "outputs": [], + "source": [ + "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', force=True)\n", + "\n", + "# Raise httpx to CRITICAL to suppress the API request logs emitted by the OpenAI and Anthropic clients\n", + "logging.getLogger(\"httpx\").setLevel(logging.CRITICAL)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K9G5a0en8mPA" + }, + "source": [ + "# Loading Sensitive Information\n", + "In this section, we prompt the user to input the essential configuration settings needed for this integration. 
These settings include sensitive information like API keys, database credentials, and specific configuration names. Instead of hardcoding these details into the script, we request the user to provide them at runtime, ensuring flexibility and security.\n", + "\n", + "The script also validates that all required inputs are provided, raising an error if any crucial information is missing. This approach ensures that your integration is both secure and correctly configured without hardcoding sensitive information, enhancing the overall security and maintainability of your code." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "id": "PFGyHll18mSe" + }, + "outputs": [], + "source": [ + "load_dotenv()\n", + "\n", + "# Load from environment variables; prompt for input if unset, then fall back to the defaults\n", + "ANTHROPIC_API_KEY = os.getenv('ANTHROPIC_API_KEY') or getpass.getpass('Enter your Anthropic API key: ')\n", + "OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or getpass.getpass('Enter your OpenAI API key: ')\n", + "CB_HOST = os.getenv('CB_HOST') or input('Enter your Couchbase host (default: couchbase://localhost): ') or 'couchbase://localhost'\n", + "CB_USERNAME = os.getenv('CB_USERNAME') or input('Enter your Couchbase username (default: Administrator): ') or 'Administrator'\n", + "CB_PASSWORD = os.getenv('CB_PASSWORD') or getpass.getpass('Enter your Couchbase password (default: password): ') or 'password'\n", + "CB_BUCKET_NAME = os.getenv('CB_BUCKET_NAME') or input('Enter your Couchbase bucket name (default: vector-search-testing): ') or 'vector-search-testing'\n", + "INDEX_NAME = os.getenv('INDEX_NAME') or input('Enter your index name (default: vector_search_claude): ') or 'vector_search_claude'\n", + "SCOPE_NAME = os.getenv('SCOPE_NAME') or input('Enter your scope name (default: shared): ') or 'shared'\n", + "COLLECTION_NAME = os.getenv('COLLECTION_NAME') or input('Enter your collection name (default: claude): ') or 'claude'\n", + "CACHE_COLLECTION = os.getenv('CACHE_COLLECTION') or input('Enter your cache collection name (default: cache): ') or 'cache'\n", + "# Check if the variables are correctly loaded\n", + "if not ANTHROPIC_API_KEY:\n", + "    raise ValueError(\"ANTHROPIC_API_KEY is not set in the environment.\")\n", + "if not OPENAI_API_KEY:\n", + "    raise ValueError(\"OPENAI_API_KEY is not set in the environment.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qtGrYzUY8mV3" + }, + "source": [ + "# Connecting to the Couchbase Cluster\n", + "Connecting to a Couchbase cluster is the foundation of our project. Couchbase will serve as our primary data store, handling all the storage and retrieval operations required for our semantic search engine. By establishing this connection, we enable our application to interact with the database, allowing us to perform operations such as storing embeddings, querying data, and managing collections. 
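If you are connecting to Capella rather than a local cluster, the connection string and options differ slightly. The following is a minimal sketch, assuming a TLS (`couchbases://`) endpoint — the hostname and credentials below are placeholders, not real values — and using the SDK's `wan_development` profile to relax timeouts over a WAN link:

```python
from datetime import timedelta

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

# Placeholder endpoint and credentials: replace with your Capella values.
options = ClusterOptions(PasswordAuthenticator("your-username", "your-password"))
options.apply_profile("wan_development")  # relaxed timeouts suited to WAN/Capella links

cluster = Cluster("couchbases://cb.example.cloud.couchbase.com", options)
cluster.wait_until_ready(timedelta(seconds=10))  # fail fast if the cluster is unreachable
```
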
This connection is the gateway through which all data will flow, so ensuring it's set up correctly is paramount.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "id": "Zb3kK-7W8mZK" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-02-25 21:48:21,579 - INFO - Successfully connected to Couchbase\n" + ] + } + ], + "source": [ + "try:\n", + " auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)\n", + " options = ClusterOptions(auth)\n", + " cluster = Cluster(CB_HOST, options)\n", + " cluster.wait_until_ready(timedelta(seconds=5))\n", + " logging.info(\"Successfully connected to Couchbase\")\n", + "except Exception as e:\n", + " raise ConnectionError(f\"Failed to connect to Couchbase: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C_Gpy32N8mcZ" + }, + "source": [ + "## Setting Up Collections in Couchbase\n", + "\n", + "The setup_collection() function handles creating and configuring the hierarchical data organization in Couchbase:\n", + "\n", + "1. Bucket Creation:\n", + " - Checks if specified bucket exists, creates it if not\n", + " - Sets bucket properties like RAM quota (1024MB) and replication (disabled)\n", + " - Note: You will not be able to create a bucket on Capella\n", + "\n", + "\n", + "2. Scope Management: \n", + " - Verifies if requested scope exists within bucket\n", + " - Creates new scope if needed (unless it's the default \"_default\" scope)\n", + "\n", + "3. Collection Setup:\n", + " - Checks for collection existence within scope\n", + " - Creates collection if it doesn't exist\n", + " - Waits 2 seconds for collection to be ready\n", + "\n", + "Additional Tasks:\n", + "- Creates primary index on collection for query performance\n", + "- Clears any existing documents for clean state\n", + "- Implements comprehensive error handling and logging\n", + "\n", + "The function is called twice to set up:\n", + "1. Main collection for vector embeddings\n", + "2. Cache collection for storing results\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "id": "ACZcwUnG8mf2" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-02-25 21:48:28,237 - INFO - Bucket 'vector-search-testing' does not exist. Creating it...\n", + "2025-02-25 21:48:28,800 - INFO - Bucket 'vector-search-testing' created successfully.\n", + "2025-02-25 21:48:28,802 - INFO - Scope 'shared' does not exist. Creating it...\n", + "2025-02-25 21:48:28,851 - INFO - Scope 'shared' created successfully.\n", + "2025-02-25 21:48:28,855 - INFO - Collection 'claude' does not exist. Creating it...\n", + "2025-02-25 21:48:28,943 - INFO - Collection 'claude' created successfully.\n", + "2025-02-25 21:48:32,802 - INFO - Primary index present or created successfully.\n", + "2025-02-25 21:48:41,954 - INFO - All documents cleared from the collection.\n", + "2025-02-25 21:48:41,955 - INFO - Bucket 'vector-search-testing' exists.\n", + "2025-02-25 21:48:41,959 - INFO - Collection 'cache' does not exist. 
Creating it...\n", + "2025-02-25 21:48:42,003 - INFO - Collection 'cache' created successfully.\n", + "2025-02-25 21:48:46,902 - INFO - Primary index present or created successfully.\n", + "2025-02-25 21:48:46,904 - INFO - All documents cleared from the collection.\n" + ] + }, + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def setup_collection(cluster, bucket_name, scope_name, collection_name):\n", + " try:\n", + " # Check if bucket exists, create if it doesn't\n", + " try:\n", + " bucket = cluster.bucket(bucket_name)\n", + " logging.info(f\"Bucket '{bucket_name}' exists.\")\n", + " except Exception as e:\n", + " logging.info(f\"Bucket '{bucket_name}' does not exist. Creating it...\")\n", + " bucket_settings = CreateBucketSettings(\n", + " name=bucket_name,\n", + " bucket_type='couchbase',\n", + " ram_quota_mb=1024,\n", + " flush_enabled=True,\n", + " num_replicas=0\n", + " )\n", + " cluster.buckets().create_bucket(bucket_settings)\n", + " time.sleep(2) # Wait for bucket creation to complete and become available\n", + " bucket = cluster.bucket(bucket_name)\n", + " logging.info(f\"Bucket '{bucket_name}' created successfully.\")\n", + "\n", + " bucket_manager = bucket.collections()\n", + "\n", + " # Check if scope exists, create if it doesn't\n", + " scopes = bucket_manager.get_all_scopes()\n", + " scope_exists = any(scope.name == scope_name for scope in scopes)\n", + " \n", + " if not scope_exists and scope_name != \"_default\":\n", + " logging.info(f\"Scope '{scope_name}' does not exist. Creating it...\")\n", + " bucket_manager.create_scope(scope_name)\n", + " logging.info(f\"Scope '{scope_name}' created successfully.\")\n", + "\n", + " # Check if collection exists, create if it doesn't\n", + " collections = bucket_manager.get_all_scopes()\n", + " collection_exists = any(\n", + " scope.name == scope_name and collection_name in [col.name for col in scope.collections]\n", + " for scope in collections\n", + " )\n", + "\n", + " if not collection_exists:\n", + " logging.info(f\"Collection '{collection_name}' does not exist. Creating it...\")\n", + " bucket_manager.create_collection(scope_name, collection_name)\n", + " logging.info(f\"Collection '{collection_name}' created successfully.\")\n", + " else:\n", + " logging.info(f\"Collection '{collection_name}' already exists. Skipping creation.\")\n", + "\n", + " # Wait for collection to be ready\n", + " collection = bucket.scope(scope_name).collection(collection_name)\n", + " time.sleep(2) # Give the collection time to be ready for queries\n", + "\n", + " # Ensure primary index exists\n", + " try:\n", + " cluster.query(f\"CREATE PRIMARY INDEX IF NOT EXISTS ON `{bucket_name}`.`{scope_name}`.`{collection_name}`\").execute()\n", + " logging.info(\"Primary index present or created successfully.\")\n", + " except Exception as e:\n", + " logging.warning(f\"Error creating primary index: {str(e)}\")\n", + "\n", + " # Clear all documents in the collection\n", + " try:\n", + " query = f\"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`\"\n", + " cluster.query(query).execute()\n", + " logging.info(\"All documents cleared from the collection.\")\n", + " except Exception as e:\n", + " logging.warning(f\"Error while clearing documents: {str(e)}. 
The collection might be empty.\")\n", + "\n", + "        return collection\n", + "    except Exception as e:\n", + "        raise RuntimeError(f\"Error setting up collection: {str(e)}\")\n", + "    \n", + "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME)\n", + "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, CACHE_COLLECTION)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NMJ7RRYp8mjV" + }, + "source": [ + "# Loading Couchbase Vector Search Index\n", + "\n", + "Semantic search requires an efficient way to retrieve relevant documents based on a user's query. This is where the Couchbase **Vector Search Index** comes into play. In this step, we load the Vector Search Index definition from a JSON file, which specifies how the index should be structured. This includes the fields to be indexed, the dimensions of the vectors, and other parameters that determine how the search engine processes queries based on vector similarity.\n", + "\n", + "For more information on creating a vector search index, please follow the [instructions](https://docs.couchbase.com/cloud/vector-search/create-vector-search-index-ui.html).\n", + "\n", + "> Note: Index creation will not fail when the index definition targets the bucket used in this tutorial (vector-search-testing) rather than travel-sample.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "id": "y7xiCrOc8mmj" + }, + "outputs": [], + "source": [ + "# If you are running this script locally (not in Google Colab), uncomment the following line\n", + "# and provide the path to your index definition file.\n", + "\n", + "# index_definition_path = '/path_to_your_index_file/claude_index.json' # Local setup: specify your file path here\n", + "\n", + "# # Version for Google Colab\n", + "# def load_index_definition_colab():\n", + "#     from google.colab import files\n", + "#     print(\"Upload your index definition file\")\n", + "#     uploaded = files.upload()\n", + "#     index_definition_path = list(uploaded.keys())[0]\n", + "\n", + "#     try:\n", + "#         with open(index_definition_path, 'r') as file:\n", + "#             index_definition = json.load(file)\n", + "#         return index_definition\n", + "#     except Exception as e:\n", + "#         raise ValueError(f\"Error loading index definition from {index_definition_path}: {str(e)}\")\n", + "\n", + "# Version for Local Environment\n", + "def load_index_definition_local(index_definition_path):\n", + "    try:\n", + "        with open(index_definition_path, 'r') as file:\n", + "            index_definition = json.load(file)\n", + "        return index_definition\n", + "    except Exception as e:\n", + "        raise ValueError(f\"Error loading index definition from {index_definition_path}: {str(e)}\")\n", + "\n", + "# Usage\n", + "# Uncomment the appropriate line based on your environment\n", + "# index_definition = load_index_definition_colab()\n", + "index_definition = load_index_definition_local('claude_index.json')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "v_ddPQ_Y8mpm" + }, + "source": [ + "# Creating or Updating Search Indexes\n", + "\n", + "With the index definition loaded, the next step is to create or update the **Vector Search Index** in Couchbase. This step is crucial because it optimizes our database for vector similarity search operations, allowing us to perform searches based on the semantic content of documents rather than just keywords. 
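If you want to inspect or adapt the definition, a Search vector index follows the general JSON shape sketched below. This is an illustration of the format, not the exact contents of `claude_index.json`; the type mapping, vector dimensions (1536 matches `text-embedding-3-small`), and similarity metric are the parts you would typically adjust:

```python
# Illustrative sketch of a Search vector index definition (not the exact
# claude_index.json shipped with this tutorial); names and values are examples.
index_definition = {
    "name": "vector_search_claude",
    "type": "fulltext-index",
    "sourceType": "couchbase",
    "sourceName": "vector-search-testing",      # bucket the index reads from
    "params": {
        "doc_config": {"mode": "scope.collection.type_field"},
        "mapping": {
            "types": {
                "shared.claude": {               # scope.collection this index covers
                    "enabled": True,
                    "properties": {
                        "embedding": {
                            "fields": [{
                                "name": "embedding",
                                "type": "vector",
                                "dims": 1536,                 # text-embedding-3-small
                                "similarity": "dot_product",  # or "l2_norm"
                                "index": True,
                            }]
                        },
                        "text": {
                            "fields": [{"name": "text", "type": "text", "store": True}]
                        },
                    },
                }
            }
        },
    },
}
```
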
By creating or updating a Vector Search Index, we enable our search engine to handle complex queries that involve finding semantically similar documents using vector embeddings, which is essential for a robust semantic search engine." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "id": "bHEpUu1l8msx" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-02-25 21:48:52,980 - INFO - Creating new index 'vector_search_claude'...\n", + "2025-02-25 21:48:53,069 - INFO - Index 'vector_search_claude' successfully created/updated.\n" + ] + } + ], + "source": [ + "try:\n", + " scope_index_manager = cluster.bucket(CB_BUCKET_NAME).scope(SCOPE_NAME).search_indexes()\n", + "\n", + " # Check if index already exists\n", + " existing_indexes = scope_index_manager.get_all_indexes()\n", + " index_name = index_definition[\"name\"]\n", + "\n", + " if index_name in [index.name for index in existing_indexes]:\n", + " logging.info(f\"Index '{index_name}' found\")\n", + " else:\n", + " logging.info(f\"Creating new index '{index_name}'...\")\n", + "\n", + " # Create SearchIndex object from JSON definition\n", + " search_index = SearchIndex.from_json(index_definition)\n", + "\n", + " # Upsert the index (create if not exists, update if exists)\n", + " scope_index_manager.upsert_index(search_index)\n", + " logging.info(f\"Index '{index_name}' successfully created/updated.\")\n", + "\n", + "except QueryIndexAlreadyExistsException:\n", + " logging.info(f\"Index '{index_name}' already exists. Skipping creation/update.\")\n", + "except ServiceUnavailableException:\n", + " raise RuntimeError(\"Search service is not available. Please ensure the Search service is enabled in your Couchbase cluster.\")\n", + "except InternalServerFailureException as e:\n", + " logging.error(f\"Internal server error: {str(e)}\")\n", + " raise" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7FvxRsg38m3G" + }, + "source": [ + "# Creating OpenAI Embeddings\n", + "Embeddings are at the heart of semantic search. They are numerical representations of text that capture the semantic meaning of the words and phrases. Unlike traditional keyword-based search, which looks for exact matches, embeddings allow our search engine to understand the context and nuances of language, enabling it to retrieve documents that are semantically similar to the query, even if they don't contain the exact keywords. By creating embeddings using OpenAI, we equip our search engine with the ability to understand and process natural language in a way that's much closer to how humans understand language. This step transforms our raw text data into a format that the search engine can use to find and rank relevant documents.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "id": "_75ZyCRh8m6m" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-02-25 21:48:56,274 - INFO - Successfully created OpenAIEmbeddings\n" + ] + } + ], + "source": [ + "try:\n", + " embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY, model='text-embedding-3-small')\n", + " logging.info(\"Successfully created OpenAIEmbeddings\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Error creating OpenAIEmbeddings: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8IwZMUnF8m-N" + }, + "source": [ + "# Setting Up the Couchbase Vector Store\n", + "A vector store is where we'll keep our embeddings. 
Unlike a traditional Search (full-text) index, which is used for text-based search, the vector store is specifically designed to handle embeddings and perform similarity searches. When a user inputs a query, the search engine converts the query into an embedding and compares it against the embeddings stored in the vector store. This allows the engine to find documents that are semantically similar to the query, even if they don't contain the exact same words. By setting up the vector store in Couchbase, we create a powerful tool that enables our search engine to understand and retrieve information based on the meaning and context of the query, rather than just the specific words used." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "id": "DwIJQjYT9RV_" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-02-25 21:48:59,450 - INFO - Successfully created vector store\n" + ] + } + ], + "source": [ + "try:\n", + "    vector_store = CouchbaseSearchVectorStore(\n", + "        cluster=cluster,\n", + "        bucket_name=CB_BUCKET_NAME,\n", + "        scope_name=SCOPE_NAME,\n", + "        collection_name=COLLECTION_NAME,\n", + "        embedding=embeddings,\n", + "        index_name=INDEX_NAME,\n", + "    )\n", + "    logging.info(\"Successfully created vector store\")\n", + "except Exception as e:\n", + "    raise ValueError(f\"Failed to create vector store: {str(e)}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Load the BBC News Dataset\n", + "To build a search engine, we need data to search through. We use the BBC News dataset from RealTimeData, which provides real-world news articles. This dataset contains news articles from BBC covering various topics and time periods. Loading the dataset is a crucial step because it provides the raw material that our search engine will work with. The quality and diversity of the news articles make it an excellent choice for testing and refining our search engine, ensuring it can handle real-world news content effectively.\n", + "\n", + "The BBC News dataset allows us to work with authentic news articles, enabling us to build and test a search engine that can effectively process and retrieve relevant news content. The dataset is loaded using the Hugging Face datasets library, specifically accessing the \"RealTimeData/bbc_news_alltime\" dataset with the \"2024-12\" version." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-02-25 21:49:09,255 - INFO - Successfully loaded the BBC News dataset with 2687 rows.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loaded the BBC News dataset with 2687 rows\n" + ] + } + ], + "source": [ + "try:\n", + "    news_dataset = load_dataset(\n", + "        \"RealTimeData/bbc_news_alltime\", \"2024-12\", split=\"train\"\n", + "    )\n", + "    print(f\"Loaded the BBC News dataset with {len(news_dataset)} rows\")\n", + "    logging.info(f\"Successfully loaded the BBC News dataset with {len(news_dataset)} rows.\")\n", + "except Exception as e:\n", + "    raise ValueError(f\"Error loading the BBC News dataset: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Cleaning up the Data\n", + "We will use the content of the news articles for our RAG system.\n", + "\n", + "The dataset contains a few duplicate records. We are removing them to avoid duplicate results in the retrieval stage of our RAG system." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "We have 1749 unique articles in our database.\n" + ] + } + ], + "source": [ + "news_articles = news_dataset[\"content\"]\n", + "unique_articles = set()\n", + "for article in news_articles:\n", + "    if article:\n", + "        unique_articles.add(article)\n", + "unique_news_articles = list(unique_articles)\n", + "print(f\"We have {len(unique_news_articles)} unique articles in our database.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Saving Data to the Vector Store\n", + "To efficiently handle the large number of articles, we process them in batches of 100 articles at a time. This batch processing approach helps manage memory usage and provides better control over the ingestion process.\n", + "\n", + "We first filter out any articles that exceed 50,000 characters to avoid potential issues with token limits. Then, using the vector store's add_texts method, we add the filtered articles to our vector database. The batch_size parameter controls how many articles are processed in each iteration.\n", + "\n", + "This approach offers several benefits:\n", + "1. Memory Efficiency: Processing in smaller batches prevents memory overload\n", + "2. Progress Tracking: Easier to monitor and track the ingestion progress\n", + "3. Resource Management: Better control over CPU and network resource utilization\n", + "\n", + "We use a conservative batch size of 100 to ensure reliable operation.\n", + "The optimal batch size depends on many factors including:\n", + "- Document sizes being inserted\n", + "- Available system resources\n", + "- Network conditions\n", + "- Concurrent workload\n", + "\n", + "Consider measuring performance with your specific workload before adjusting.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-02-25 21:50:15,064 - INFO - Document ingestion completed successfully.\n" + ] + } + ], + "source": [ + "batch_size = 100\n", + "\n", + "# Automatic Batch Processing\n", + "articles = [article for article in unique_news_articles if article and len(article) <= 50000]\n", + "\n", + "try:\n", + "    vector_store.add_texts(\n", + "        texts=articles,\n", + "        batch_size=batch_size\n", + "    )\n", + "    logging.info(\"Document ingestion completed successfully.\")\n", + "except Exception as e:\n", + "    raise ValueError(f\"Failed to save documents to vector store: {str(e)}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8Pn8-dQw9RfQ" + }, + "source": [ + "# Setting Up a Couchbase Cache\n", + "To further optimize our system, we set up a Couchbase-based cache. A cache is a temporary storage layer that holds data that is frequently accessed, speeding up operations by reducing the need to repeatedly retrieve the same information from the database. In our setup, the cache will help us accelerate repetitive tasks, such as looking up similar documents. By implementing a cache, we enhance the overall performance of our search engine, ensuring that it can handle high query volumes and deliver results quickly.\n", + "\n", + "Caching is particularly valuable in scenarios where users may submit similar queries multiple times or where certain pieces of information are frequently requested. 
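Conceptually, response caching is a deterministic key lookup: hash the prompt (and model settings) into a document key, try a KV get, and fall back to the LLM on a miss. The sketch below is a hand-rolled illustration of that idea — not how `CouchbaseCache` is actually implemented — and the `generate_fn` callable is a hypothetical stand-in for the expensive LLM call:

```python
import hashlib

from couchbase.exceptions import DocumentNotFoundException

def cache_key(prompt: str, model: str) -> str:
    # Deterministic document key derived from the model name and prompt
    return hashlib.sha256(f"{model}:{prompt}".encode("utf-8")).hexdigest()

def cached_generate(prompt: str, model: str, generate_fn, collection):
    key = cache_key(prompt, model)
    try:
        return collection.get(key).content_as[dict]["response"]  # cache hit
    except DocumentNotFoundException:
        response = generate_fn(prompt)                   # cache miss: run the LLM
        collection.upsert(key, {"response": response})   # store for next time
        return response
```
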
By storing these in a cache, we can significantly reduce the time it takes to respond to these queries, improving the user experience.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "id": "V2y7dyjf9Rid" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-02-25 21:50:48,836 - INFO - Successfully created cache\n" + ] + } + ], + "source": [ + "try:\n", + " cache = CouchbaseCache(\n", + " cluster=cluster,\n", + " bucket_name=CB_BUCKET_NAME,\n", + " scope_name=SCOPE_NAME,\n", + " collection_name=CACHE_COLLECTION,\n", + " )\n", + " logging.info(\"Successfully created cache\")\n", + " set_llm_cache(cache)\n", + "except Exception as e:\n", + " raise ValueError(f\"Failed to create cache: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uehAx36o9Rlm" + }, + "source": [ + "# Using the Claude 4 Sonnet Language Model (LLM)\n", + "Language models are AI systems that are trained to understand and generate human language. We'll be using the `Claude 4 Sonnet` language model to process user queries and generate meaningful responses. This model is a key component of our semantic search engine, allowing it to go beyond simple keyword matching and truly understand the intent behind a query. By creating this language model, we equip our search engine with the ability to interpret complex queries, understand the nuances of language, and provide more accurate and contextually relevant responses.\n", + "\n", + "The language model's ability to understand context and generate coherent responses is what makes our search engine truly intelligent. It can not only find the right information but also present it in a way that is useful and understandable to the user.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "yRAfBRLH9RpO" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-02-25 21:50:52,173 - INFO - Successfully created ChatAnthropic\n" + ] + } + ], + "source": [ + "try:\n", + " llm = ChatAnthropic(temperature=0.1, anthropic_api_key=ANTHROPIC_API_KEY, model_name='claude-sonnet-4-20250514') \n", + " logging.info(\"Successfully created ChatAnthropic\")\n", + "except Exception as e:\n", + " logging.error(f\"Error creating ChatAnthropic: {str(e)}. Please check your API key and network connection.\")\n", + " raise" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "k_XDfCx19UvG" + }, + "source": [ + "# Perform Semantic Search\n", + "Semantic search in Couchbase involves converting queries and documents into vector representations using an embeddings model. These vectors capture the semantic meaning of the text and are stored directly in Couchbase. When a query is made, Couchbase performs a similarity search by comparing the query vector against the stored document vectors. The similarity metric used for this comparison is configurable, allowing flexibility in how the relevance of documents is determined. \n", + "\n", + "In the provided code, the search process begins by recording the start time, followed by executing the similarity_search_with_score method of the CouchbaseSearchVectorStore. This method searches Couchbase for the most relevant documents based on the vector similarity to the query. The search results include the document content and a similarity score that reflects how closely each document aligns with the query in the defined semantic space. 
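To make the query-side half of this concrete: the query string is embedded with the same model used at ingestion time before any comparison happens. A quick way to see this, reusing the `embeddings` object created earlier:

```python
# Peek at the query-side embedding step (text-embedding-3-small returns 1536 floats).
query = "What happened with the map shown during the 2026 FIFA World Cup draw?"
query_vector = embeddings.embed_query(query)

print(len(query_vector))   # 1536 dimensions
print(query_vector[:5])    # first few components of the query embedding
```
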
The time taken to perform this search is then calculated and logged, and the results are displayed, showing the most relevant documents along with their similarity scores. This approach leverages Couchbase as both a storage and retrieval engine for vector data, enabling efficient and scalable semantic searches. The integration of vector storage and search capabilities within Couchbase allows for sophisticated semantic search operations without relying on external services for vector storage or comparison." + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "id": "Pk-oFbnC9Uym" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-02-25 21:53:55,462 - INFO - Semantic search completed in 0.55 seconds\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Semantic Search Results (completed in 0.55 seconds):\n", + "--------------------------------------------------------------------------------\n", + "Score: 0.7498, Text: A map shown during the draw for the 2026 Fifa World Cup has been criticised by Ukraine as an \"unacceptable error\" after it appeared to exclude Crimea as part of the country. The graphic - showing countries that cannot be drawn to play each other for geopolitical reasons - highlighted Ukraine but did not include the peninsula that is internationally recognised to be part of it. Crimea has been under Russian occupation since 2014 and just a handful of countries recognise the peninsula as Russian territory. Ukraine Foreign Ministry spokesman Heorhiy Tykhy said that the nation expects \"a public apology\". Fifa said it was \"aware of an issue\" and the image had been removed.\n", + "\n", + "Writing on X, Tykhy said that Fifa had not only \"acted against international law\" but had also \"supported Russian propaganda, war crimes, and the crime of aggression against Ukraine\". He added a \"fixed\" version of the map to his post, highlighting Crimea as part of Ukraine's territory. Among the countries that cannot play each other are Ukraine and Belarus, Spain and Gibraltar and Kosovo versus either Bosnia and Herzegovina or Serbia.\n", + "\n", + "This Twitter post cannot be displayed in your browser. Please enable Javascript or try a different browser. View original content on Twitter The BBC is not responsible for the content of external sites. Skip twitter post by Heorhii Tykhyi This article contains content provided by Twitter. We ask for your permission before anything is loaded, as they may be using cookies and other technologies. You may want to read Twitter\u2019s cookie policy, external and privacy policy, external before accepting. To view this content choose \u2018accept and continue\u2019. The BBC is not responsible for the content of external sites.\n", + "\n", + "The Ukrainian Football Association has also sent a letter to Fifa secretary-general Mathias Grafstr\u00f6m and UEFA secretary-general Theodore Theodoridis over the matter. \"We appeal to you to express our deep concern about the infographic map [shown] on December 13, 2024,\" the letter reads. \"Taking into account a number of official decisions and resolutions adopted by the Fifa Council and the UEFA executive committee since 2014... we emphasize that today's version of the cartographic image of Ukraine... is completely unacceptable and looks like an inconsistent position of Fifa and UEFA.\" The 2026 World Cup will start on 11 June that year in Mexico City and end on 19 July in New Jersey. 
The expanded 48-team tournament will last a record 39 days. Ukraine were placed in Group D alongside Iceland, Azerbaijan and the yet-to-be-determined winners of France's Nations League quarter-final against Croatia.\n", + "--------------------------------------------------------------------------------\n", + "Score: 0.4302, Text: Defending champions Manchester City will face Juventus in the group stage of the Fifa Club World Cup next summer, while Chelsea meet Brazilian side Flamengo. Pep Guardiola's City, who beat Brazilian side Fluminense to win the tournament for the first time in 2023, begin their title defence against Morocco's Wydad and also play Al Ain of the United Arab Emirates in Group G. Chelsea, winners of the 2021 final, were also drawn alongside Mexico's Club Leon and Tunisian side Esperance Sportive de Tunisie in Group D. The revamped Fifa Club World Cup, which has been expanded to 32 teams, will take place in the United States between 15 June and 13 July next year.\n", + "\n", + "A complex and lengthy draw ceremony was held across two separate Miami locations and lasted more than 90 minutes, during which a new Club World Cup trophy was revealed. There was also a video message from incoming US president Donald Trump, whose daughter Ivanka drew the first team. Lionel Messi's Inter Miami will take on Egyptian side Al Ahly at the Hard Rock Stadium in the opening match, staged in Miami. Elsewhere, Paris St-Germain were drawn against Atletico Madrid in Group B, while Bayern Munich meet Benfica in another all-European group-stage match-up. Teams will play each other once in the group phase and the top two will progress to the knockout stage.\n", + "\n", + "This video can not be played To play this video you need to enable JavaScript in your browser. What is the Club World Cup?\n", + "\n", + "Teams from each of the six international football confederations will be represented at next summer's tournament, including 12 European clubs - the highest quota of any confederation. The European places were decided by clubs' Champions League performances over the past four seasons, with recent winners Chelsea, Manchester City and Real Madrid guaranteed places. Al Ain, the most successful club in the UAE with 14 league titles, are owned by the country's president Sheikh Mohamed bin Zayed Al Nahyan - the older brother of City owner Sheikh Mansour. Real, who lifted the Fifa Club World Cup trophy for a record-extending fifth time in 2022, will open up against Saudi Pro League champions Al-Hilal, who currently have Neymar in their ranks. One place was reserved for a club from the host nation, which Fifa controversially awarded to Inter Miami, who will contest the tournament curtain-raiser. Messi's side were winners of the regular-season MLS Supporters' Shield but beaten in the MLS play-offs, meaning they are not this season's champions.\n", + "\u2022 None How does the new Club World Cup work & why is it so controversial?\n", + "\n", + "Matches will be played across 12 venues in the US which, alongside Canada and Mexico, also host the 2026 World Cup. Fifa is facing legal action from player unions and leagues about the scheduling of the event, which begins two weeks after the Champions League final at the end of the 2024-25 European calendar and ends five weeks before the first Premier League match of the 2025-2026 season. But football's world governing body believes the dates allow sufficient rest time before the start of the domestic campaigns. 
The Club World Cup will now take place once every four years, when it was previously held annually and involved just seven teams. Streaming platform DAZN has secured exclusive rights to broadcast next summer's tournament, during which 63 matches will take place over 29 days.\n", + "--------------------------------------------------------------------------------\n", + "Score: 0.4207, Text: After Fifa awards Saudi Arabia the hosting rights for the men's 2034 World Cup, BBC analysis editor Ros Atkins looks at how we got here and the controversies surrounding the decision.\n", + "--------------------------------------------------------------------------------\n", + "Score: 0.4123, Text: FA still to decide on endorsing Saudi World Cup bid\n", + "... (output truncated for brevity)\n" + ] + } + ], + "source": [ + "query = \"What happened with the map shown during the 2026 FIFA World Cup draw regarding Ukraine and Crimea? What was the controversy?\"\n", + "\n", + "try:\n", + " # Perform the semantic search\n", + " start_time = time.time()\n", + " search_results = vector_store.similarity_search_with_score(query, k=10)\n", + " search_elapsed_time = time.time() - start_time\n", + "\n", + " logging.info(f\"Semantic search completed in {search_elapsed_time:.2f} seconds\")\n", + "\n", + " # Display search results\n", + " print(f\"\\nSemantic Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", + " print(\"-\" * 80) # Add separator line\n", + " for doc, score in search_results:\n", + " print(f\"Score: {score:.4f}, Text: {doc.page_content}\")\n", + " print(\"-\" * 80) # Add separator between results\n", + "\n", + "except CouchbaseException as e:\n", + " raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n", + "except Exception as e:\n", + " raise RuntimeError(f\"Unexpected error: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sS0FebHI9U1l" + }, + "source": [ + "# Retrieval-Augmented Generation (RAG) with Couchbase and LangChain\n", + "Couchbase and LangChain can be seamlessly integrated to create RAG (Retrieval-Augmented Generation) chains, enhancing the process of generating contextually relevant responses. In this setup, Couchbase serves as the vector store, where embeddings of documents are stored. When a query is made, LangChain retrieves the most relevant documents from Couchbase by comparing the query\u2019s embedding with the stored document embeddings. These documents, which provide contextual information, are then passed to a generative language model within LangChain.\n", + "\n", + "The language model, equipped with the context from the retrieved documents, generates a response that is both informed and contextually accurate. This integration allows the RAG chain to leverage Couchbase\u2019s efficient storage and retrieval capabilities, while LangChain handles the generation of responses based on the context provided by the retrieved documents. Together, they create a powerful system that can deliver highly relevant and accurate answers by combining the strengths of both retrieval and generation." 
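The next cell wires the retrieval step by calling `similarity_search` inside a lambda. An equivalent formulation uses LangChain's retriever interface; the sketch below assumes only the `vector_store` and `llm` objects created earlier, and the names `rag_chain_alt` and `join_docs`, like the choice of `k=4`, are illustrative:

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

# Retriever wrapper around the vector store; k controls how many documents are fetched.
retriever = vector_store.as_retriever(search_kwargs={"k": 4})

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant that answers questions based on the provided context."),
    ("human", "Context: {context}\n\nQuestion: {question}"),
])

def join_docs(docs):
    # Concatenate the retrieved documents into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain_alt = (
    {"context": retriever | join_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
)
```
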
+ ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "id": "ZGUXQQmv9ge4" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-02-25 21:54:00,781 - INFO - Successfully created RAG chain\n" + ] + } + ], + "source": [ + "system_template = \"You are a helpful assistant that answers questions based on the provided context.\"\n", + "system_message_prompt = SystemMessagePromptTemplate.from_template(system_template)\n", + "\n", + "human_template = \"Context: {context}\\n\\nQuestion: {question}\"\n", + "human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)\n", + "\n", + "chat_prompt = ChatPromptTemplate.from_messages([\n", + " system_message_prompt,\n", + " human_message_prompt\n", + "])\n", + "\n", + "def format_docs(docs):\n", + " return \"\\n\\n\".join(doc.page_content for doc in docs)\n", + "\n", + "rag_chain = (\n", + " {\"context\": lambda x: format_docs(vector_store.similarity_search(x)), \"question\": RunnablePassthrough()}\n", + " | chat_prompt\n", + " | llm\n", + ")\n", + "logging.info(\"Successfully created RAG chain\")" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "id": "Mia7XxM9978M" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "RAG Response: During the draw for the 2026 FIFA World Cup, a map was shown that excluded Crimea as part of Ukraine. This graphic, which was displaying countries that cannot be drawn to play each other for geopolitical reasons, highlighted Ukraine but did not include the Crimean peninsula, which is internationally recognized as Ukrainian territory.\n", + "\n", + "This omission sparked significant controversy because Crimea has been under Russian occupation since 2014, but only a handful of countries recognize it as Russian territory. The Ukrainian Foreign Ministry spokesman, Heorhiy Tykhy, called this an \"unacceptable error\" and stated that Ukraine expected \"a public apology\" from FIFA. He criticized FIFA for acting \"against international law\" and supporting \"Russian propaganda, war crimes, and the crime of aggression against Ukraine.\"\n", + "\n", + "The Ukrainian Football Association also sent a formal letter of complaint to FIFA and UEFA officials expressing their \"deep concern\" about the cartographic representation. FIFA acknowledged they were \"aware of an issue\" and subsequently removed the image.\n", + "RAG response generated in 6.58 seconds\n" + ] + } + ], + "source": [ + "try:\n", + " start_time = time.time()\n", + " rag_response = rag_chain.invoke(query)\n", + " rag_elapsed_time = time.time() - start_time\n", + "\n", + " print(f\"RAG Response: {rag_response.content}\")\n", + " print(f\"RAG response generated in {rag_elapsed_time:.2f} seconds\")\n", + "except AuthenticationError as e:\n", + " print(f\"Authentication error: {str(e)}\")\n", + "except InternalServerFailureException as e:\n", + " if \"query request rejected\" in str(e):\n", + " print(\"Error: Search request was rejected due to rate limiting. 
Please try again later.\")\n", + " else:\n", + " print(f\"Internal server error occurred: {str(e)}\")\n", + "except Exception as e:\n", + " print(f\"Unexpected error occurred: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aIdayPzw9glT" + }, + "source": [ + "# Using Couchbase as a caching mechanism\n", + "Couchbase can be effectively used as a caching mechanism for RAG (Retrieval-Augmented Generation) responses by storing and retrieving precomputed results for specific queries. This approach enhances the system's efficiency and speed, particularly when dealing with repeated or similar queries. When a query is first processed, the RAG chain retrieves relevant documents, generates a response using the language model, and then stores this response in Couchbase, with the query serving as the key.\n", + "\n", + "For subsequent requests with the same query, the system checks Couchbase first. If a cached response is found, it is retrieved directly from Couchbase, bypassing the need to re-run the entire RAG process. This significantly reduces response time because the computationally expensive steps of document retrieval and response generation are skipped. Couchbase's role in this setup is to provide a fast and scalable storage solution for caching these responses, ensuring that frequently asked queries can be answered more quickly and efficiently.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": { + "id": "0xM2G3ef-GS2" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Query 1: What happened when Apple's AI feature generated a false BBC headline about a murder case in New York?\n", + "Response: According to the context, Apple Intelligence (an AI feature that summarizes notifications) generated a false headline that made it appear as if BBC News had published an article claiming Luigi Mangione, who was arrested for the murder of healthcare insurance CEO Brian Thompson in New York, had shot himself. This was completely false - Mangione had not shot himself.\n", + "\n", + "The BBC complained to Apple about this misrepresentation, with a BBC spokesperson stating they had \"contacted Apple to raise this concern and fix the problem.\" The BBC emphasized that as \"the most trusted news media in the world,\" it's essential that audiences can trust information published in their name, including notifications.\n", + "\n", + "This wasn't an isolated incident - the context mentions that Apple's AI feature also misrepresented a New York Times article, incorrectly summarizing it as \"Netanyahu arrested\" when the actual article was about the International Criminal Court issuing an arrest warrant for the Israeli prime minister.\n", + "Time taken: 6.66 seconds\n", + "\n", + "Query 2: What happened with the map shown during the 2026 FIFA World Cup draw regarding Ukraine and Crimea? What was the controversy?\n", + "Response: During the draw for the 2026 FIFA World Cup, a map was shown that excluded Crimea as part of Ukraine. This graphic, which was displaying countries that cannot be drawn to play each other for geopolitical reasons, highlighted Ukraine but did not include the Crimean peninsula, which is internationally recognized as Ukrainian territory.\n", + "\n", + "This omission sparked significant controversy because Crimea has been under Russian occupation since 2014, but only a handful of countries recognize it as Russian territory. 
The Ukrainian Foreign Ministry spokesman, Heorhiy Tykhy, called this an \"unacceptable error\" and stated that Ukraine expected \"a public apology\" from FIFA. He criticized FIFA for acting \"against international law\" and supporting \"Russian propaganda, war crimes, and the crime of aggression against Ukraine.\"\n", + "\n", + "The Ukrainian Football Association also sent a formal letter of complaint to FIFA and UEFA officials expressing their \"deep concern\" about the cartographic representation. FIFA acknowledged they were \"aware of an issue\" and subsequently removed the image.\n", + "Time taken: 0.62 seconds\n", + "\n", + "Query 3: What happened when Apple's AI feature generated a false BBC headline about a murder case in New York?\n", + "Response: According to the context, Apple Intelligence (an AI feature that summarizes notifications) generated a false headline that made it appear as if BBC News had published an article claiming Luigi Mangione, who was arrested for the murder of healthcare insurance CEO Brian Thompson in New York, had shot himself. This was completely false - Mangione had not shot himself.\n", + "\n", + "The BBC complained to Apple about this misrepresentation, with a BBC spokesperson stating they had \"contacted Apple to raise this concern and fix the problem.\" The BBC emphasized that as \"the most trusted news media in the world,\" it's essential that audiences can trust information published in their name, including notifications.\n", + "\n", + "This wasn't an isolated incident - the context mentions that Apple's AI feature also misrepresented a New York Times article, incorrectly summarizing it as \"Netanyahu arrested\" when the actual article was about the International Criminal Court issuing an arrest warrant for the Israeli prime minister.\n", + "Time taken: 0.51 seconds\n" + ] + } + ], + "source": [ + "try:\n", + " queries = [\n", + " \"What happened when Apple's AI feature generated a false BBC headline about a murder case in New York?\",\n", + " \"What happened with the map shown during the 2026 FIFA World Cup draw regarding Ukraine and Crimea? What was the controversy?\", # Repeated query\n", + " \"What happened when Apple's AI feature generated a false BBC headline about a murder case in New York?\", # Repeated query\n", + " ]\n", + "\n", + " for i, query in enumerate(queries, 1):\n", + " print(f\"\\nQuery {i}: {query}\")\n", + " start_time = time.time()\n", + "\n", + " response = rag_chain.invoke(query)\n", + " elapsed_time = time.time() - start_time\n", + " print(f\"Response: {response.content}\")\n", + " print(f\"Time taken: {elapsed_time:.2f} seconds\")\n", + "except AuthenticationError as e:\n", + " print(f\"Authentication error: {str(e)}\")\n", + "except InternalServerFailureException as e:\n", + " if \"query request rejected\" in str(e):\n", + " print(\"Error: Search request was rejected due to rate limiting. Please try again later.\")\n", + " else:\n", + " print(f\"Internal server error occurred: {str(e)}\")\n", + "except Exception as e:\n", + " print(f\"Unexpected error occurred: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yJQ5P8E29go1" + }, + "source": [ + "## Conclusion\n", + "By following these steps, you\u2019ll have a fully functional semantic search engine that leverages the strengths of Couchbase and Claude(by Anthropic). 
This guide is designed not just to show you how to build the system, but also to explain why each step is necessary, giving you a deeper understanding of the principles behind semantic search and how to implement it effectively. Whether you\u2019re a newcomer to software development or an experienced developer looking to expand your skills, this guide will provide you with the knowledge and tools you need to create a powerful, AI-driven search engine." + ] + } + ], + "metadata": { + "colab": { + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.3" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/claudeai/fts/claude_index.json b/claudeai/search_based/claude_index.json similarity index 100% rename from claudeai/fts/claude_index.json rename to claudeai/search_based/claude_index.json diff --git a/claudeai/fts/frontmatter.md b/claudeai/search_based/frontmatter.md similarity index 100% rename from claudeai/fts/frontmatter.md rename to claudeai/search_based/frontmatter.md diff --git a/cohere/fts/RAG_with_Couchbase_and_Cohere.ipynb b/cohere/fts/RAG_with_Couchbase_and_Cohere.ipynb deleted file mode 100644 index 7f37963a..00000000 --- a/cohere/fts/RAG_with_Couchbase_and_Cohere.ipynb +++ /dev/null @@ -1,1019 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "zAPY14a2BOhq" - }, - "source": [ - "# Introduction\n", - "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database and [Cohere](https://cohere.com/)\n", - " as the AI-powered embedding and language model provider. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system using the FTS service from scratch. Alternatively if you want to perform semantic search using the GSI index, please take a look at [this.](https://developer.couchbase.com/tutorial-cohere-couchbase-rag-with-global-secondary-index/)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# How to run this tutorial\n", - "\n", - "This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/cohere/fts/RAG_with_Couchbase_and_Cohere.ipynb).\n", - "\n", - "You can either download the notebook file and run it on [Google Colab](https://colab.research.google.com/) or run it on your system by setting up the Python environment." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Before you start\n", - "\n", - "## Get Credentials for Cohere\n", - "\n", - "Please follow the [instructions](https://dashboard.cohere.com/welcome/register) to generate the Cohere credentials.\n", - "\n", - "## Create and Deploy Your Free Tier Operational cluster on Capella\n", - "\n", - "To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. This account provides you with an environment where you can explore and learn about Capella with no time constraint.\n", - "\n", - "To learn more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n", - "\n", - "### Couchbase Capella Configuration\n", - "\n", - "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met.\n", - "\n", - "* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the required bucket (Read and Write) used in the application.\n", - "* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "EYZzrd_tBdUC" - }, - "source": [ - "# Setting the Stage: Installing Necessary Libraries\n", - "To build our semantic search engine, we need a robust set of tools. The libraries we install handle everything from connecting to databases to performing complex machine learning tasks." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": { - "id": "cYUkZqeoEykk" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Note: you may need to restart the kernel to use updated packages.\n" - ] - } - ], - "source": [ - "%pip install --quiet datasets==3.5.0 langchain-couchbase==0.3.0 langchain-cohere==0.4.4 python-dotenv==1.1.0" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Dw3IL3GEJSj7" - }, - "source": [ - "# Importing Necessary Libraries\n", - "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, embedding generation, and dataset loading. These libraries provide essential functions for working with data, managing database connections, and processing machine learning models." 
- ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": { - "id": "oziN03NZJLQw" - }, - "outputs": [], - "source": [ - "import getpass\n", - "import json\n", - "import logging\n", - "import os\n", - "import time\n", - "from datetime import timedelta\n", - "from uuid import uuid4\n", - "\n", - "from couchbase.auth import PasswordAuthenticator\n", - "from couchbase.cluster import Cluster\n", - "from couchbase.exceptions import (CouchbaseException,\n", - " InternalServerFailureException,\n", - " QueryIndexAlreadyExistsException,\n", - " ServiceUnavailableException)\n", - "from couchbase.management.buckets import CreateBucketSettings\n", - "from couchbase.management.search import SearchIndex\n", - "from couchbase.options import ClusterOptions\n", - "from datasets import load_dataset\n", - "from dotenv import load_dotenv\n", - "from langchain_cohere import ChatCohere, CohereEmbeddings\n", - "from langchain_core.globals import set_llm_cache\n", - "from langchain_core.output_parsers import StrOutputParser\n", - "from langchain_core.prompts import ChatPromptTemplate\n", - "from langchain_core.runnables import RunnablePassthrough\n", - "from langchain_couchbase.cache import CouchbaseCache\n", - "from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "iXwzTRdbCLL1" - }, - "source": [ - "# Setup Logging\n", - "Logging is configured to track the progress of the script and capture any errors or warnings. This is crucial for debugging and understanding the flow of execution. The logging output includes timestamps, log levels (e.g., INFO, ERROR), and messages that describe what is happening in the script.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": { - "id": "R-SanCZrCLdm" - }, - "outputs": [], - "source": [ - "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s',force=True)\n", - "\n", - "# Supress Excessive logging\n", - "logging.getLogger('openai').setLevel(logging.WARNING)\n", - "logging.getLogger('httpx').setLevel(logging.WARNING)\n", - "logging.getLogger('langchain_cohere').setLevel(logging.ERROR)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "zOwSwRoHJLXv" - }, - "source": [ - "# Loading Sensitive Informnation\n", - "In this section, we prompt the user to input essential configuration settings needed for integrating Couchbase with Cohere's API. These settings include sensitive information like API keys, database credentials, and specific configuration names. Instead of hardcoding these details into the script, we request the user to provide them at runtime, ensuring flexibility and security.\n", - "\n", - "The script also validates that all required inputs are provided, raising an error if any crucial information is missing. This approach ensures that your integration is both secure and correctly configured without hardcoding sensitive information, enhancing the overall security and maintainability of your code." 
- ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": { - "id": "y2H9xphrJLbP" - }, - "outputs": [], - "source": [ - "load_dotenv()\n", - "\n", - "COHERE_API_KEY = os.getenv('COHERE_API_KEY') or getpass.getpass('Enter your Cohere API key: ')\n", - "CB_HOST = os.getenv('CB_HOST') or input('Enter your Couchbase host (default: couchbase://localhost): ') or 'couchbase://localhost'\n", - "CB_USERNAME = os.getenv('CB_USERNAME') or input('Enter your Couchbase username (default: Administrator): ') or 'Administrator'\n", - "CB_PASSWORD = os.getenv('CB_PASSWORD') or getpass.getpass('Enter your Couchbase password (default: password): ') or 'password'\n", - "CB_BUCKET_NAME = os.getenv('CB_BUCKET_NAME') or input('Enter your Couchbase bucket name (default: vector-search-testing): ') or 'vector-search-testing'\n", - "INDEX_NAME = os.getenv('INDEX_NAME') or input('Enter your index name (default: vector_search_cohere): ') or 'vector_search_cohere'\n", - "SCOPE_NAME = os.getenv('SCOPE_NAME') or input('Enter your scope name (default: shared): ') or 'shared'\n", - "COLLECTION_NAME = os.getenv('COLLECTION_NAME') or input('Enter your collection name (default: cohere): ') or 'cohere'\n", - "CACHE_COLLECTION = os.getenv('CACHE_COLLECTION') or input('Enter your cache collection name (default: cache): ') or 'cache'\n", - "\n", - "# Check if the variables are correctly loaded\n", - "if not COHERE_API_KEY:\n", - " raise ValueError(\"COHERE_API_KEY is not provided and is required.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "sdKdLg9pJLl5" - }, - "source": [ - "# Connect to Couchbase\n", - "The script attempts to establish a connection to the Couchbase database using the credentials retrieved from the environment variables. Couchbase is a NoSQL database known for its flexibility, scalability, and support for various data models, including document-based storage. The connection is authenticated using a username and password, and the script waits until the connection is fully established before proceeding.\n", - "\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": { - "id": "HubiGMCSJLqw" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-02-06 01:27:13,562 - INFO - Successfully connected to Couchbase\n" - ] - } - ], - "source": [ - "try:\n", - " auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)\n", - " options = ClusterOptions(auth)\n", - " cluster = Cluster(CB_HOST, options)\n", - " cluster.wait_until_ready(timedelta(seconds=5))\n", - " logging.info(\"Successfully connected to Couchbase\")\n", - "except Exception as e:\n", - " raise ConnectionError(f\"Failed to connect to Couchbase: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Setting Up Collections in Couchbase\n", - "\n", - "The setup_collection() function handles creating and configuring the hierarchical data organization in Couchbase:\n", - "\n", - "1. Bucket Creation:\n", - " - Checks if specified bucket exists, creates it if not\n", - " - Sets bucket properties like RAM quota (1024MB) and replication (disabled)\n", - " - Note: You will not be able to create a bucket on Capella\n", - "\n", - "2. Scope Management: \n", - " - Verifies if requested scope exists within bucket\n", - " - Creates new scope if needed (unless it's the default \"_default\" scope)\n", - "\n", - "3. 
Collection Setup:\n", - " - Checks for collection existence within scope\n", - " - Creates collection if it doesn't exist\n", - " - Waits 2 seconds for collection to be ready\n", - "\n", - "Additional Tasks:\n", - "- Creates primary index on collection for query performance\n", - "- Clears any existing documents for clean state\n", - "- Implements comprehensive error handling and logging\n", - "\n", - "The function is called twice to set up:\n", - "1. Main collection for vector embeddings\n", - "2. Cache collection for storing results\n" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-02-06 01:27:14,806 - INFO - Bucket 'vector-search-testing' exists.\n", - "2025-02-06 01:27:17,199 - INFO - Collection 'cohere' already exists. Skipping creation.\n", - "2025-02-06 01:27:20,585 - INFO - Primary index present or created successfully.\n", - "2025-02-06 01:27:20,888 - INFO - All documents cleared from the collection.\n", - "2025-02-06 01:27:20,889 - INFO - Bucket 'vector-search-testing' exists.\n", - "2025-02-06 01:27:23,271 - INFO - Collection 'cache' already exists. Skipping creation.\n", - "2025-02-06 01:27:26,258 - INFO - Primary index present or created successfully.\n", - "2025-02-06 01:27:26,497 - INFO - All documents cleared from the collection.\n" - ] - }, - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "def setup_collection(cluster, bucket_name, scope_name, collection_name):\n", - " try:\n", - " # Check if bucket exists, create if it doesn't\n", - " try:\n", - " bucket = cluster.bucket(bucket_name)\n", - " logging.info(f\"Bucket '{bucket_name}' exists.\")\n", - " except Exception as e:\n", - " logging.info(f\"Bucket '{bucket_name}' does not exist. Creating it...\")\n", - " bucket_settings = CreateBucketSettings(\n", - " name=bucket_name,\n", - " bucket_type='couchbase',\n", - " ram_quota_mb=1024,\n", - " flush_enabled=True,\n", - " num_replicas=0\n", - " )\n", - " cluster.buckets().create_bucket(bucket_settings)\n", - " time.sleep(2) # Wait for bucket creation to complete and become available\n", - " bucket = cluster.bucket(bucket_name)\n", - " logging.info(f\"Bucket '{bucket_name}' created successfully.\")\n", - "\n", - " bucket_manager = bucket.collections()\n", - "\n", - " # Check if scope exists, create if it doesn't\n", - " scopes = bucket_manager.get_all_scopes()\n", - " scope_exists = any(scope.name == scope_name for scope in scopes)\n", - " \n", - " if not scope_exists and scope_name != \"_default\":\n", - " logging.info(f\"Scope '{scope_name}' does not exist. Creating it...\")\n", - " bucket_manager.create_scope(scope_name)\n", - " logging.info(f\"Scope '{scope_name}' created successfully.\")\n", - "\n", - " # Check if collection exists, create if it doesn't\n", - " collections = bucket_manager.get_all_scopes()\n", - " collection_exists = any(\n", - " scope.name == scope_name and collection_name in [col.name for col in scope.collections]\n", - " for scope in collections\n", - " )\n", - "\n", - " if not collection_exists:\n", - " logging.info(f\"Collection '{collection_name}' does not exist. Creating it...\")\n", - " bucket_manager.create_collection(scope_name, collection_name)\n", - " logging.info(f\"Collection '{collection_name}' created successfully.\")\n", - " else:\n", - " logging.info(f\"Collection '{collection_name}' already exists. 
Skipping creation.\")\n", - "\n", - " # Wait for collection to be ready\n", - " collection = bucket.scope(scope_name).collection(collection_name)\n", - " time.sleep(2) # Give the collection time to be ready for queries\n", - "\n", - " # Ensure primary index exists\n", - " try:\n", - " cluster.query(f\"CREATE PRIMARY INDEX IF NOT EXISTS ON `{bucket_name}`.`{scope_name}`.`{collection_name}`\").execute()\n", - " logging.info(\"Primary index present or created successfully.\")\n", - " except Exception as e:\n", - " logging.warning(f\"Error creating primary index: {str(e)}\")\n", - "\n", - " # Clear all documents in the collection\n", - " try:\n", - " query = f\"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`\"\n", - " cluster.query(query).execute()\n", - " logging.info(\"All documents cleared from the collection.\")\n", - " except Exception as e:\n", - " logging.warning(f\"Error while clearing documents: {str(e)}. The collection might be empty.\")\n", - "\n", - " return collection\n", - " except Exception as e:\n", - " raise RuntimeError(f\"Error setting up collection: {str(e)}\")\n", - " \n", - "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME)\n", - "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, CACHE_COLLECTION)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "j4tYSkkDxS9O" - }, - "source": [ - "# Loading Couchbase Vector Search Index\n", - "\n", - "Semantic search requires an efficient way to retrieve relevant documents based on a user's query. This is where the Couchbase **Vector Search Index** comes into play. In this step, we load the Vector Search Index definition from a JSON file, which specifies how the index should be structured. This includes the fields to be indexed, the dimensions of the vectors, and other parameters that determine how the search engine processes queries based on vector similarity.\n", - "\n", - "This Cohere vector search index configuration requires specific default settings to function properly. This tutorial uses the bucket named `vector-search-testing` with the scope `shared` and collection `cohere`. The configuration is set up for vectors with exactly `1024 dimensions`, using dot product similarity and optimized for recall. 
If you want to use a different bucket, scope, or collection, you will need to modify the index configuration accordingly.\n", - "\n", - "For more information on creating a vector search index, please follow the [instructions](https://docs.couchbase.com/cloud/vector-search/create-vector-search-index-ui.html).\n" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": { - "id": "szXN-oNGxTMF" - }, - "outputs": [], - "source": [ - "# If you are running this script locally (not in Google Colab), uncomment the following line\n", - "# and provide the path to your index definition file.\n", - "\n", - "# index_definition_path = '/path_to_your_index_file/cohere_index.json' # Local setup: specify your file path here\n", - "\n", - "# # Version for Google Colab\n", - "# def load_index_definition_colab():\n", - "# from google.colab import files\n", - "# print(\"Upload your index definition file\")\n", - "# uploaded = files.upload()\n", - "# index_definition_path = list(uploaded.keys())[0]\n", - "\n", - "# try:\n", - "# with open(index_definition_path, 'r') as file:\n", - "# index_definition = json.load(file)\n", - "# return index_definition\n", - "# except Exception as e:\n", - "# raise ValueError(f\"Error loading index definition from {index_definition_path}: {str(e)}\")\n", - "\n", - "# Version for Local Environment\n", - "def load_index_definition_local(index_definition_path):\n", - " try:\n", - " with open(index_definition_path, 'r') as file:\n", - " index_definition = json.load(file)\n", - " return index_definition\n", - " except Exception as e:\n", - " raise ValueError(f\"Error loading index definition from {index_definition_path}: {str(e)}\")\n", - "\n", - "# Usage\n", - "# Uncomment the appropriate line based on your environment\n", - "# index_definition = load_index_definition_colab()\n", - "index_definition = load_index_definition_local('cohere_index.json')" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "TXGj5YokJLuU" - }, - "source": [ - "# Creating or Updating Search Indexes\n", - "\n", - "With the index definition loaded, the next step is to create or update the **Vector Search Index** in Couchbase. This step is crucial because it optimizes our database for vector similarity search operations, allowing us to perform searches based on the semantic content of documents rather than just keywords. By creating or updating a Vector Search Index, we enable our search engine to handle complex queries that involve finding semantically similar documents using vector embeddings, which is essential for a robust semantic search engine." - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": { - "id": "VHeB_AVmLJlx" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-02-06 01:27:27,729 - INFO - Index 'vector_search_cohere' found\n", - "2025-02-06 01:27:28,595 - INFO - Index 'vector_search_cohere' already exists. 
Skipping creation/update.\n" - ] - } - ], - "source": [ - "try:\n", - " scope_index_manager = cluster.bucket(CB_BUCKET_NAME).scope(SCOPE_NAME).search_indexes()\n", - "\n", - " # Check if index already exists\n", - " existing_indexes = scope_index_manager.get_all_indexes()\n", - " index_name = index_definition[\"name\"]\n", - "\n", - " if index_name in [index.name for index in existing_indexes]:\n", - " logging.info(f\"Index '{index_name}' found\")\n", - " else:\n", - " logging.info(f\"Creating new index '{index_name}'...\")\n", - "\n", - " # Create SearchIndex object from JSON definition\n", - " search_index = SearchIndex.from_json(index_definition)\n", - "\n", - " # Upsert the index (create if not exists, update if exists)\n", - " scope_index_manager.upsert_index(search_index)\n", - " logging.info(f\"Index '{index_name}' successfully created/updated.\")\n", - "\n", - "except QueryIndexAlreadyExistsException:\n", - " logging.info(f\"Index '{index_name}' already exists. Skipping creation/update.\")\n", - "except ServiceUnavailableException:\n", - " raise RuntimeError(\"Search service is not available. Please ensure the Search service is enabled in your Couchbase cluster.\")\n", - "except InternalServerFailureException as e:\n", - " logging.error(f\"Internal server error: {str(e)}\")\n", - " raise" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "LT3s8x_Mx3KG" - }, - "source": [ - "# Create Embeddings\n", - "Embeddings are created using the Cohere API. Embeddings are vectors (arrays of numbers) that represent the meaning of text in a high-dimensional space. These embeddings are crucial for tasks like semantic search, where the goal is to find text that is semantically similar to a query. The script uses a pre-trained model provided by Cohere to generate embeddings for the text in the BBC News dataset." - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": { - "id": "A6fG7Mopx3Np" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-02-06 01:27:28,613 - INFO - Successfully created CohereEmbeddings\n" - ] - } - ], - "source": [ - "try:\n", - " embeddings = CohereEmbeddings(\n", - " cohere_api_key=COHERE_API_KEY,\n", - " model=\"embed-english-v3.0\",\n", - " )\n", - " logging.info(\"Successfully created CohereEmbeddings\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Error creating CohereEmbeddings: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "iar2fABrLJjK" - }, - "source": [ - "# Set Up Vector Store\n", - "The vector store is set up to manage the embeddings created in the previous step. The vector store is essentially a database optimized for storing and retrieving high-dimensional vectors. 
In this case, the vector store is built on top of Couchbase, allowing the script to store the embeddings in a way that can be efficiently searched.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": { - "id": "cjASXR3dLJgZ" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-02-06 01:27:32,177 - INFO - Successfully created vector store\n" - ] - } - ], - "source": [ - "try:\n", - " vector_store = CouchbaseSearchVectorStore(\n", - " cluster=cluster,\n", - " bucket_name=CB_BUCKET_NAME,\n", - " scope_name=SCOPE_NAME,\n", - " collection_name=COLLECTION_NAME,\n", - " embedding=embeddings,\n", - " index_name=INDEX_NAME,\n", - " )\n", - " logging.info(\"Successfully created vector store\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Failed to create vector store: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Load the BBC News Dataset\n", - "To build a search engine, we need data to search through. We use the BBC News dataset from RealTimeData, which provides real-world news articles. This dataset contains news articles from BBC covering various topics and time periods. Loading the dataset is a crucial step because it provides the raw material that our search engine will work with. The quality and diversity of the news articles make it an excellent choice for testing and refining our search engine, ensuring it can handle real-world news content effectively.\n", - "\n", - "The BBC News dataset allows us to work with authentic news articles, enabling us to build and test a search engine that can effectively process and retrieve relevant news content. The dataset is loaded using the Hugging Face datasets library, specifically accessing the \"RealTimeData/bbc_news_alltime\" dataset with the \"2024-12\" version." - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-02-06 01:27:38,003 - INFO - Successfully loaded the BBC News dataset with 2687 rows.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Loaded the BBC News dataset with 2687 rows\n" - ] - } - ], - "source": [ - "try:\n", - " news_dataset = load_dataset(\n", - " \"RealTimeData/bbc_news_alltime\", \"2024-12\", split=\"train\"\n", - " )\n", - " print(f\"Loaded the BBC News dataset with {len(news_dataset)} rows\")\n", - " logging.info(f\"Successfully loaded the BBC News dataset with {len(news_dataset)} rows.\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Error loading the BBC News dataset: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Cleaning up the Data\n", - "We will use the content of the news articles for our RAG system.\n", - "\n", - "The dataset contains a few duplicate records. We are removing them to avoid duplicate results in the retrieval stage of our RAG system." 
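[Editor's note] One practical point before ingestion: the next steps filter out articles longer than 50,000 characters to stay within embedding token limits. Once the deduplication cell below has produced `unique_news_articles`, a quick optional check shows how many articles that cutoff actually removes (a small sketch):

```python
# Optional: run after the deduplication cell below defines unique_news_articles.
lengths = [len(article) for article in unique_news_articles]
print(f"Longest article: {max(lengths):,} characters")
print(f"Articles dropped by the 50,000-character cutoff: {sum(l > 50000 for l in lengths)}")
```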
- ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "We have 1749 unique articles in our database.\n" - ] - } - ], - "source": [ - "news_articles = news_dataset[\"content\"]\n", - "unique_articles = set()\n", - "for article in news_articles:\n", - " if article:\n", - " unique_articles.add(article)\n", - "unique_news_articles = list(unique_articles)\n", - "print(f\"We have {len(unique_news_articles)} unique articles in our database.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Saving Data to the Vector Store\n", - "To efficiently handle the large number of articles, we process them in batches of 50 articles at a time. This batch processing approach helps manage memory usage and provides better control over the ingestion process.\n", - "\n", - "We first filter out any articles that exceed 50,000 characters to avoid potential issues with token limits. Then, using the vector store's add_texts method, we add the filtered articles to our vector database. The batch_size parameter controls how many articles are processed in each iteration.\n", - "\n", - "This approach offers several benefits:\n", - "1. Memory Efficiency: Processing in smaller batches prevents memory overload\n", - "2. Progress Tracking: Easier to monitor and track the ingestion progress\n", - "3. Resource Management: Better control over CPU and network resource utilization\n", - "\n", - "We use a conservative batch size of 50 to ensure reliable operation.\n", - "The optimal batch size depends on many factors including:\n", - "- Document sizes being inserted\n", - "- Available system resources\n", - "- Network conditions\n", - "- Concurrent workload\n", - "\n", - "Consider measuring performance with your specific workload before adjusting.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-02-06 01:29:07,077 - INFO - Document ingestion completed successfully.\n" - ] - } - ], - "source": [ - "batch_size = 50\n", - "\n", - "# Automatic Batch Processing\n", - "articles = [article for article in unique_news_articles if article and len(article) <= 50000]\n", - "\n", - "try:\n", - " vector_store.add_texts(\n", - " texts=articles,\n", - " batch_size=batch_size\n", - " )\n", - " logging.info(\"Document ingestion completed successfully.\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Failed to save documents to vector store: {str(e)}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ToQ2acrSLJY7" - }, - "source": [ - "# Set Up Cache\n", - " A cache is set up using Couchbase to store intermediate results and frequently accessed data. Caching is important for improving performance, as it reduces the need to repeatedly calculate or retrieve the same data. 
The cache is linked to a specific collection in Couchbase, and it is used later in the script to store the results of language model queries.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": { - "id": "qZDXvq88LJWH" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-02-06 01:30:37,657 - INFO - Successfully created cache\n" - ] - } - ], - "source": [ - "try:\n", - " cache = CouchbaseCache(\n", - " cluster=cluster,\n", - " bucket_name=CB_BUCKET_NAME,\n", - " scope_name=SCOPE_NAME,\n", - " collection_name=CACHE_COLLECTION,\n", - " )\n", - " logging.info(\"Successfully created cache\")\n", - " set_llm_cache(cache)\n", - "except Exception as e:\n", - " raise ValueError(f\"Failed to create cache: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "GQpib0zKLJTh" - }, - "source": [ - "# Create Language Model (LLM)\n", - "The script initializes a Cohere language model (LLM) that will be used for generating responses to queries. LLMs are powerful tools for natural language understanding and generation, capable of producing human-like text based on input prompts. The model is configured with specific parameters, such as the temperature, which controls the randomness of its outputs.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "7eV1X5xILJRC" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-02-06 01:30:38,684 - INFO - Successfully created Cohere LLM with model command\n" - ] - } - ], - "source": [ - "try:\n", - " llm = ChatCohere(\n", - " cohere_api_key=COHERE_API_KEY,\n", - " model=\"command-a-03-2025\",\n", - " temperature=0\n", - " )\n", - " logging.info(\"Successfully created Cohere LLM with model command\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Error creating Cohere LLM: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "wQ0fNbphbWpu" - }, - "source": [ - "# Perform Semantic Search\n", - "Semantic search in Couchbase involves converting queries and documents into vector representations using an embeddings model. These vectors capture the semantic meaning of the text and are stored directly in Couchbase. When a query is made, Couchbase performs a similarity search by comparing the query vector against the stored document vectors. The similarity metric used for this comparison is configurable, allowing flexibility in how the relevance of documents is determined. \n", - "\n", - "In the provided code, the search process begins by recording the start time, followed by executing the similarity_search_with_score method of the CouchbaseSearchVectorStore. This method searches Couchbase for the most relevant documents based on the vector similarity to the query. The search results include the document content and a similarity score that reflects how closely each document aligns with the query in the defined semantic space. The time taken to perform this search is then calculated and logged, and the results are displayed, showing the most relevant documents along with their similarity scores. This approach leverages Couchbase as both a storage and retrieval engine for vector data, enabling efficient and scalable semantic searches. The integration of vector storage and search capabilities within Couchbase allows for sophisticated semantic search operations without relying on external services for vector storage or comparison." 
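[Editor's note] The similarity metric described above as configurable is set in the Search index definition (`cohere_index.json`), which the notebook loads but never reproduces. As an illustrative sketch only (the JSON file shipped with the tutorial is authoritative), a scoped Search vector index matching the settings described earlier, covering `shared.cohere` with 1024-dimension dot-product vectors, has roughly this shape when loaded as a Python dict:

```python
# Illustrative only: field names follow the general Couchbase Search index
# JSON layout, but defer to the cohere_index.json in the repository.
index_definition = {
    "name": "vector_search_cohere",
    "type": "fulltext-index",
    "sourceName": "vector-search-testing",  # bucket the index reads from
    "params": {
        "doc_config": {"mode": "scope.collection.type_field"},
        "mapping": {
            "types": {
                "shared.cohere": {  # scope.collection being indexed
                    "enabled": True,
                    "properties": {
                        "embedding": {
                            "fields": [{
                                "name": "embedding",
                                "type": "vector",
                                "dims": 1024,                 # embed-english-v3.0 output size
                                "similarity": "dot_product",  # the configurable metric
                            }]
                        }
                    },
                }
            }
        },
    },
}
```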
- ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": { - "id": "udcxHyloyoxE" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-02-06 01:30:43,101 - INFO - Semantic search completed in 1.89 seconds\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Semantic Search Results (completed in 1.89 seconds):\n", - "--------------------------------------------------------------------------------\n", - "Score: 0.6641, Text: Manchester City boss Pep Guardiola has won 18 trophies since he arrived at the club in 2016\n", - "\n", - "Manchester City boss Pep Guardiola says he is \"fine\" despite admitting his sleep and diet are being affected by the worst run of results in his entire managerial career. In an interview with former Italy international Luca Toni for Amazon Prime Sport before Wednesday's Champions League defeat by Juventus, Guardiola touched on the personal impact City's sudden downturn in form has had. Guardiola said his state of mind was \"ugly\", that his sleep was \"worse\" and he was eating lighter as his digestion had suffered. City go into Sunday's derby against Manchester United at Etihad Stadium having won just one of their past 10 games. The Juventus loss means there is a chance they may not even secure a play-off spot in the Champions League. Asked to elaborate on his comments to Toni, Guardiola said: \"I'm fine. \"In our jobs we always want to do our best or the best as possible. When that doesn't happen you are more uncomfortable than when the situation is going well, always that happened. \"In good moments I am happier but when I get to the next game I am still concerned about what I have to do. There is no human being that makes an activity and it doesn't matter how they do.\" Guardiola said City have to defend better and \"avoid making mistakes at both ends\". To emphasise his point, Guardiola referred back to the third game of City's current run, against a Sporting side managed by Ruben Amorim, who will be in the United dugout at the weekend. City dominated the first half in Lisbon, led thanks to Phil Foden's early effort and looked to be cruising. Instead, they conceded three times in 11 minutes either side of half-time as Sporting eventually ran out 4-1 winners. \"I would like to play the game like we played in Lisbon on Sunday, believe me,\" said Guardiola, who is facing the prospect of only having three fit defenders for the derby as Nathan Ake and Manuel Akanji try to overcome injury concerns. If there is solace for City, it comes from the knowledge United are not exactly flying. Their comeback Europa League victory against Viktoria Plzen on Thursday was their third win of Amorim's short reign so far but only one of those successes has come in the Premier League, where United have lost their past two games against Arsenal and Nottingham Forest. Nevertheless, Guardiola can see improvements already on the red side of the city. \"It's already there,\" he said. \"You see all the patterns, the movements, the runners and the pace. He will do a good job at United, I'm pretty sure of that.\"\n", - "\n", - "Guardiola says skipper Kyle Walker has been offered support by the club after the City defender highlighted the racial abuse he had received on social media in the wake of the Juventus trip. \"It's unacceptable,\" he said. \"Not because it's Kyle - for any human being. \"Unfortunately it happens many times in the real world. 
It is not necessary to say he has the support of the entire club. It is completely unacceptable and we give our support to him.\"\n", - "--------------------------------------------------------------------------------\n", - "Score: 0.6521, Text: 'We have to find a way' - Guardiola vows to end relegation form\n", - "\n", - "This video can not be played To play this video you need to enable JavaScript in your browser. 'Worrying' and 'staggering' - Why do Manchester City keep conceding?\n", - "\n", - "Manchester City are currently in relegation form and there is little sign of it ending. Saturday's 2-1 defeat at Aston Villa left them joint bottom of the form table over the past eight games with just Southampton for company. Saints, at the foot of the Premier League, have the same number of points, four, as City over their past eight matches having won one, drawn one and lost six - the same record as the floundering champions. And if Southampton - who appointed Ivan Juric as their new manager on Saturday - get at least a point at Fulham on Sunday, City will be on the worst run in the division. Even Wolves, who sacked boss Gary O'Neil last Sunday and replaced him with Vitor Pereira, have earned double the number of points during the same period having played a game fewer. They are damning statistics for Pep Guardiola, even if he does have some mitigating circumstances with injuries to Ederson, Nathan Ake and Ruben Dias - who all missed the loss at Villa Park - and the long-term loss of midfield powerhouse Rodri. Guardiola was happy with Saturday's performance, despite defeat in Birmingham, but there is little solace to take at slipping further out of the title race. He may have needed to field a half-fit Manuel Akanji and John Stones at Villa Park but that does not account for City looking a shadow of their former selves. That does not justify the error Josko Gvardiol made to gift Jhon Duran a golden chance inside the first 20 seconds, or £100m man Jack Grealish again failing to have an impact on a game. There may be legitimate reasons for City's drop off, whether that be injuries, mental fatigue or just simply a team coming to the end of its lifecycle, but their form, which has plunged off a cliff edge, would have been unthinkable as they strolled to a fourth straight title last season. \"The worrying thing is the number of goals conceded,\" said ex-England captain Alan Shearer on BBC Match of the Day. \"The number of times they were opened up because of the lack of protection and legs in midfield was staggering. There are so many things that are wrong at this moment in time.\"\n", - "\n", - "This video can not be played To play this video you need to enable JavaScript in your browser. Man City 'have to find a way' to return to form - Guardiola\n", - "\n", - "Afterwards Guardiola was calm, so much so it was difficult to hear him in the news conference, a contrast to the frustrated figure he cut on the touchline. He said: \"It depends on us. The solution is bring the players back. We have just one central defender fit, that is difficult. We are going to try next game - another opportunity and we don't think much further than that. \"Of course there are more reasons. We concede the goals we don't concede in the past, we [don't] score the goals we score in the past. Football is not just one reason. There are a lot of little factors. \"Last season we won the Premier League, but we came here and lost. We have to think positive and I have incredible trust in the guys. 
Some of them have incredible pride and desire to do it. We have to find a way, step by step, sooner or later to find a way back.\" Villa boss Unai Emery highlighted City's frailties, saying he felt Villa could seize on the visitors' lack of belief. \"Manchester City are a little bit under the confidence they have normally,\" he said. \"The second half was different, we dominated and we scored. Through those circumstances they were feeling worse than even in the first half.\"\n", - "\n", - "Erling Haaland had one touch in the Villa box\n", - "\n", - "There are chinks in the armour never seen before at City under Guardiola and Erling Haaland conceded belief within the squad is low. He told TNT after the game: \"Of course, [confidence levels are] not the best. We know how important confidence is and you can see that it affects every human being. That is how it is, we have to continue and stay positive even though it is difficult.\" Haaland, with 76 goals in 83 Premier League appearances since joining City from Borussia Dortmund in 2022, had one shot and one touch in the Villa box. His 18 touches in the whole game were the lowest of all starting players and he has been self critical, despite scoring 13 goals in the top flight this season. Over City's last eight games he has netted just twice though, but Guardiola refused to criticise his star striker. He said: \"Without him we will be even worse but I like the players feeling that way. I don't agree with Erling. He needs to have the balls delivered in the right spots but he will fight for the next one.\"\n", - "--------------------------------------------------------------------------------\n", - "Score: 0.6322, Text: 'Self-doubt, errors & big changes' - inside the crisis at Man City\n", - "\n", - "\n", - "... (output truncated for brevity)\n" - ] - } - ], - "source": [ - "query = \"What was manchester city manager pep guardiola's reaction to the team's current form?\"\n", - "\n", - "try:\n", - " # Perform the semantic search\n", - " start_time = time.time()\n", - " search_results = vector_store.similarity_search_with_score(query, k=10)\n", - " search_elapsed_time = time.time() - start_time\n", - "\n", - " logging.info(f\"Semantic search completed in {search_elapsed_time:.2f} seconds\")\n", - "\n", - " # Display search results\n", - " print(f\"\\nSemantic Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", - " print(\"-\" * 80) # Add separator line\n", - " for doc, score in search_results:\n", - " print(f\"Score: {score:.4f}, Text: {doc.page_content}\")\n", - " print(\"-\" * 80) # Add separator between results\n", - "\n", - "except CouchbaseException as e:\n", - " raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n", - "except Exception as e:\n", - " raise RuntimeError(f\"Unexpected error: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Bt44X6-bLJOb" - }, - "source": [ - "# Retrieval-Augmented Generation (RAG) with Couchbase and Langchain\n", - "Couchbase and LangChain can be seamlessly integrated to create RAG (Retrieval-Augmented Generation) chains, enhancing the process of generating contextually relevant responses. In this setup, Couchbase serves as the vector store, where embeddings of documents are stored. When a query is made, LangChain retrieves the most relevant documents from Couchbase by comparing the query’s embedding with the stored document embeddings. 
These documents, which provide contextual information, are then passed to a generative language model within LangChain.\n", - "\n", - "The language model, equipped with the context from the retrieved documents, generates a response that is both informed and contextually accurate. This integration allows the RAG chain to leverage Couchbase’s efficient storage and retrieval capabilities, while LangChain handles the generation of responses based on the context provided by the retrieved documents. Together, they create a powerful system that can deliver highly relevant and accurate answers by combining the strengths of both retrieval and generation." - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "metadata": { - "id": "6cGJfwS2LI_O" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-02-06 01:30:46,088 - INFO - Successfully created RAG chain\n" - ] - } - ], - "source": [ - "try:\n", - " template = \"\"\"You are a helpful bot. If you cannot answer based on the context provided, respond with a generic answer. Answer the question as truthfully as possible using the context below:\n", - " {context}\n", - "\n", - " Question: {question}\"\"\"\n", - " prompt = ChatPromptTemplate.from_template(template)\n", - "\n", - " rag_chain = (\n", - " {\"context\": vector_store.as_retriever(), \"question\": RunnablePassthrough()}\n", - " | prompt\n", - " | llm\n", - " | StrOutputParser()\n", - " )\n", - " logging.info(\"Successfully created RAG chain\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Error creating RAG chain: {str(e)}\")" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "metadata": { - "id": "PvuJyXPUFOux" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "RAG Response: Manchester City manager Pep Guardiola has been open about the impact the team's poor form has had on him personally. He has admitted that his sleep and diet have been affected, and that he has been feeling \"ugly\" and uncomfortable. Guardiola has also been giving a lot of thought to the reasons for the team's decline, talking to many people and trying to work out the causes. He has been very protective of his players, refusing to criticise them and instead giving them more days off to clear their heads.\n", - "\n", - "Guardiola has also been very self-critical, saying that he is \"not good enough\" and that he needs to find solutions to the team's problems. He has acknowledged that the team is not performing as well as it used to, and that there are many factors contributing to their poor form, including injuries, mental fatigue, and a lack of confidence. He has also suggested that the team needs to improve its defensive concepts and re-establish its intensity.\n", - "\n", - "Overall, Guardiola seems to be taking a very hands-on approach to the team's struggles, trying to find solutions and protect his players while also being very honest about his own role in the situation.\n", - "RAG response generated in 9.52 seconds\n" - ] - } - ], - "source": [ - "start_time = time.time()\n", - "try:\n", - " rag_response = rag_chain.invoke(query)\n", - " rag_elapsed_time = time.time() - start_time\n", - " print(f\"RAG Response: {rag_response}\")\n", - " print(f\"RAG response generated in {rag_elapsed_time:.2f} seconds\")\n", - "except InternalServerFailureException as e:\n", - " if \"query request rejected\" in str(e):\n", - " print(\"Error: Search request was rejected due to rate limiting. 
Please try again later.\")\n", - " else:\n", - " print(f\"Internal server error occurred: {str(e)}\")\n", - "except Exception as e:\n", - " print(f\"Unexpected error occurred: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "cUXEVXyxGlv2" - }, - "source": [ - "# Using Couchbase as a caching mechanism\n", - "Couchbase can be effectively used as a caching mechanism for RAG (Retrieval-Augmented Generation) responses by storing and retrieving precomputed results for specific queries. This approach enhances the system's efficiency and speed, particularly when dealing with repeated or similar queries. When a query is first processed, the RAG chain retrieves relevant documents, generates a response using the language model, and then stores this response in Couchbase, with the query serving as the key.\n", - "\n", - "For subsequent requests with the same query, the system checks Couchbase first. If a cached response is found, it is retrieved directly from Couchbase, bypassing the need to re-run the entire RAG process. This significantly reduces response time because the computationally expensive steps of document retrieval and response generation are skipped. Couchbase's role in this setup is to provide a fast and scalable storage solution for caching these responses, ensuring that frequently asked queries can be answered more quickly and efficiently." - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "metadata": { - "id": "J_PaTD2aGmGt" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Query 1: What happened in the match between Fulham and Liverpool?\n", - "Response: Liverpool and Fulham played out a thrilling 2-2 draw at Anfield. Liverpool were reduced to 10 men after Andy Robertson was sent off in the 17th minute, but they fought back twice to earn a point. The Reds dominated the match despite their numerical disadvantage, with over 60% possession and leading in several attacking metrics. Diogo Jota scored the equaliser in the 86th minute, capping off an impressive performance that showcased Liverpool's title credentials.\n", - "Time taken: 5.29 seconds\n", - "\n", - "Query 2: What was manchester city manager pep guardiola's reaction to the team's current form?\n", - "Response: Manchester City manager Pep Guardiola has been open about the impact the team's poor form has had on him personally. He has admitted that his sleep and diet have been affected, and that he has been feeling \"ugly\" and uncomfortable. Guardiola has also been giving a lot of thought to the reasons for the team's decline, talking to many people and trying to work out the causes. He has been very protective of his players, refusing to criticise them and instead giving them more days off to clear their heads.\n", - "\n", - "Guardiola has also been very self-critical, saying that he is \"not good enough\" and that he needs to find solutions to the team's problems. He has acknowledged that the team is not performing as well as it used to, and that there are many factors contributing to their poor form, including injuries, mental fatigue, and a lack of confidence. 
He has also suggested that the team needs to improve its defensive concepts and re-establish its intensity.\n", - "\n", - "Overall, Guardiola seems to be taking a very hands-on approach to the team's struggles, trying to find solutions and protect his players while also being very honest about his own role in the situation.\n", - "Time taken: 2.13 seconds\n", - "\n", - "Query 3: What happened in the match between Fulham and Liverpool?\n", - "Response: Liverpool and Fulham played out a thrilling 2-2 draw at Anfield. Liverpool were reduced to 10 men after Andy Robertson was sent off in the 17th minute, but they fought back twice to earn a point. The Reds dominated the match despite their numerical disadvantage, with over 60% possession and leading in several attacking metrics. Diogo Jota scored the equaliser in the 86th minute, capping off an impressive performance that showcased Liverpool's title credentials.\n", - "Time taken: 1.36 seconds\n" - ] - } - ], - "source": [ - "try:\n", - " queries = [\n", - " \"What happened in the match between Fulham and Liverpool?\",\n", - " \"What was manchester city manager pep guardiola's reaction to the team's current form?\", # Repeated query\n", - " \"What happened in the match between Fulham and Liverpool?\", # Repeated query\n", - " ]\n", - "\n", - " for i, query in enumerate(queries, 1):\n", - " print(f\"\\nQuery {i}: {query}\")\n", - " start_time = time.time()\n", - " response = rag_chain.invoke(query)\n", - " elapsed_time = time.time() - start_time\n", - " print(f\"Response: {response}\")\n", - " print(f\"Time taken: {elapsed_time:.2f} seconds\")\n", - "except InternalServerFailureException as e:\n", - " if \"query request rejected\" in str(e):\n", - " print(\"Error: Search request was rejected due to rate limiting. Please try again later.\")\n", - " else:\n", - " print(f\"Internal server error occurred: {str(e)}\")\n", - "except Exception as e:\n", - " print(f\"Unexpected error occurred: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Conclusion\n", - "By following these steps, you'll have a fully functional semantic search engine that leverages the strengths of Couchbase and Cohere. This guide is designed not just to show you how to build the system, but also to explain why each step is necessary, giving you a deeper understanding of the principles behind semantic search and how to implement it effectively. Whether you're a newcomer to software development or an experienced developer looking to expand your skills, this guide will provide you with the knowledge and tools you need to create a powerful, AI-driven search engine."
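[Editor's note] A closing aside on the caching section above: because `CouchbaseCache` writes into an ordinary collection, its contents can be inspected with a plain SQL++ query against the cache collection. A small sketch reusing the cluster handle and names defined earlier in the notebook; the exact document shape is an implementation detail of langchain-couchbase, so treat the per-entry interpretation as an assumption:

```python
# Peek at the cached LLM responses; relies on the primary index created by
# setup_collection() earlier in the notebook.
result = cluster.query(
    f"SELECT META().id AS id FROM `{CB_BUCKET_NAME}`.`{SCOPE_NAME}`.`{CACHE_COLLECTION}` LIMIT 3"
)
for row in result:
    print(row["id"])  # roughly one entry per distinct (prompt, model-config) pair
```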
- ] - } - ], - "metadata": { - "accelerator": "TPU", - "colab": { - "gpuType": "V28", - "provenance": [], - "toc_visible": true - }, - "kernelspec": { - "display_name": "Python 3", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.2" - } - }, - "nbformat": 4, - "nbformat_minor": 0 -} diff --git a/cohere/gsi/RAG_with_Couchbase_and_Cohere.ipynb b/cohere/gsi/RAG_with_Couchbase_and_Cohere.ipynb deleted file mode 100644 index 9b724761..00000000 --- a/cohere/gsi/RAG_with_Couchbase_and_Cohere.ipynb +++ /dev/null @@ -1,1059 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "zAPY14a2BOhq" - }, - "source": [ - "# Introduction\n", - "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database and [Cohere](https://cohere.com/)\n", - " as the AI-powered embedding and language model provider. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system using GSI (Global Secondary Index) from scratch. Alternatively, if you want to perform semantic search using the FTS index, please take a look at [this](https://developer.couchbase.com/tutorial-cohere-couchbase-rag-with-fts/)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# How to run this tutorial\n", - "\n", - "This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/cohere/gsi/RAG_with_Couchbase_and_Cohere.ipynb).\n", - "\n", - "You can either download the notebook file and run it on [Google Colab](https://colab.research.google.com/) or run it on your system by setting up the Python environment." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Before you start\n", - "\n", - "## Get Credentials for Cohere\n", - "\n", - "Please follow the [instructions](https://dashboard.cohere.com/welcome/register) to generate the Cohere credentials.\n", - "\n", - "## Create and Deploy Your Free Tier Operational cluster on Capella\n", - "\n", - "To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. 
This account provides you with an environment where you can explore and learn about Capella with no time constraint.\n", - "\n", - "To learn more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n", - "\n", - "Note: To run this tutorial, you will need Capella with Couchbase Server version 8.0 or above, as GSI vector search is supported only from version 8.0.\n", - "\n", - "### Couchbase Capella Configuration\n", - "\n", - "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met.\n", - "\n", - "* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the required bucket (Read and Write) used in the application.\n", - "* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "EYZzrd_tBdUC" - }, - "source": [ - "# Setting the Stage: Installing Necessary Libraries\n", - "To build our semantic search engine, we need a robust set of tools. The libraries we install handle everything from connecting to databases to performing complex machine learning tasks." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "cYUkZqeoEykk" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Note: you may need to restart the kernel to use updated packages.\n" - ] - } - ], - "source": [ - "%pip install --quiet datasets==3.5.0 langchain-couchbase==0.5.0 langchain-cohere==0.4.5 python-dotenv==1.1.1" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Dw3IL3GEJSj7" - }, - "source": [ - "# Importing Necessary Libraries\n", - "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, embedding generation, and dataset loading. These libraries provide essential functions for working with data, managing database connections, and processing machine learning models."
- ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": { - "id": "oziN03NZJLQw" - }, - "outputs": [], - "source": [ - "import getpass\n", - "import json\n", - "import logging\n", - "import os\n", - "import time\n", - "from datetime import timedelta\n", - "from uuid import uuid4\n", - "\n", - "from couchbase.auth import PasswordAuthenticator\n", - "from couchbase.cluster import Cluster\n", - "from couchbase.exceptions import (CouchbaseException,\n", - " InternalServerFailureException,\n", - " QueryIndexAlreadyExistsException,\n", - " ServiceUnavailableException)\n", - "from couchbase.management.buckets import CreateBucketSettings\n", - "from couchbase.management.search import SearchIndex\n", - "from couchbase.options import ClusterOptions\n", - "from datasets import load_dataset\n", - "from dotenv import load_dotenv\n", - "from langchain_cohere import ChatCohere, CohereEmbeddings\n", - "from langchain_core.globals import set_llm_cache\n", - "from langchain_core.output_parsers import StrOutputParser\n", - "from langchain_core.prompts import ChatPromptTemplate\n", - "from langchain_core.runnables import RunnablePassthrough\n", - "from langchain_couchbase.cache import CouchbaseCache\n", - "from langchain_couchbase.vectorstores import CouchbaseQueryVectorStore\n", - "from langchain_couchbase.vectorstores import DistanceStrategy\n", - "from langchain_couchbase.vectorstores import IndexType" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "iXwzTRdbCLL1" - }, - "source": [ - "# Setup Logging\n", - "Logging is configured to track the progress of the script and capture any errors or warnings. This is crucial for debugging and understanding the flow of execution. The logging output includes timestamps, log levels (e.g., INFO, ERROR), and messages that describe what is happening in the script.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": { - "id": "R-SanCZrCLdm" - }, - "outputs": [], - "source": [ - "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s',force=True)\n", - "\n", - "# Supress Excessive logging\n", - "logging.getLogger('openai').setLevel(logging.WARNING)\n", - "logging.getLogger('httpx').setLevel(logging.WARNING)\n", - "logging.getLogger('langchain_cohere').setLevel(logging.ERROR)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "zOwSwRoHJLXv" - }, - "source": [ - "# Loading Sensitive Informnation\n", - "In this section, we prompt the user to input essential configuration settings needed for integrating Couchbase with Cohere's API. These settings include sensitive information like API keys, database credentials, and specific configuration names. Instead of hardcoding these details into the script, we request the user to provide them at runtime, ensuring flexibility and security.\n", - "\n", - "The script also validates that all required inputs are provided, raising an error if any crucial information is missing. This approach ensures that your integration is both secure and correctly configured without hardcoding sensitive information, enhancing the overall security and maintainability of your code." 
- ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": { - "id": "y2H9xphrJLbP" - }, - "outputs": [], - "source": [ - "load_dotenv()\n", - "\n", - "COHERE_API_KEY = os.getenv('COHERE_API_KEY') or getpass.getpass('Enter your Cohere API key: ')\n", - "CB_HOST = os.getenv('CB_HOST') or input('Enter your Couchbase host (default: couchbase://localhost): ') or 'couchbase://localhost'\n", - "CB_USERNAME = os.getenv('CB_USERNAME') or input('Enter your Couchbase username (default: Administrator): ') or 'Administrator'\n", - "CB_PASSWORD = os.getenv('CB_PASSWORD') or getpass.getpass('Enter your Couchbase password (default: password): ') or 'password'\n", - "CB_BUCKET_NAME = os.getenv('CB_BUCKET_NAME') or input('Enter your Couchbase bucket name (default: query-vector-search-testing): ') or 'query-vector-search-testing'\n", - "SCOPE_NAME = os.getenv('SCOPE_NAME') or input('Enter your scope name (default: shared): ') or 'shared'\n", - "COLLECTION_NAME = os.getenv('COLLECTION_NAME') or input('Enter your collection name (default: cohere): ') or 'cohere'\n", - "CACHE_COLLECTION = os.getenv('CACHE_COLLECTION') or input('Enter your cache collection name (default: cache): ') or 'cache'\n", - "\n", - "# Check if the variables are correctly loaded\n", - "if not COHERE_API_KEY:\n", - " raise ValueError(\"COHERE_API_KEY is not provided and is required.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "sdKdLg9pJLl5" - }, - "source": [ - "# Connect to Couchbase\n", - "The script attempts to establish a connection to the Couchbase database using the credentials retrieved from the environment variables. Couchbase is a NoSQL database known for its flexibility, scalability, and support for various data models, including document-based storage. The connection is authenticated using a username and password, and the script waits until the connection is fully established before proceeding.\n", - "\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": { - "id": "HubiGMCSJLqw" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-22 12:56:30,972 - INFO - Successfully connected to Couchbase\n" - ] - } - ], - "source": [ - "try:\n", - " auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)\n", - " options = ClusterOptions(auth)\n", - " cluster = Cluster(CB_HOST, options)\n", - " cluster.wait_until_ready(timedelta(seconds=5))\n", - " logging.info(\"Successfully connected to Couchbase\")\n", - "except Exception as e:\n", - " raise ConnectionError(f\"Failed to connect to Couchbase: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Setting Up Collections in Couchbase\n", - "\n", - "The setup_collection() function handles creating and configuring the hierarchical data organization in Couchbase:\n", - "\n", - "1. Bucket Creation:\n", - " - Checks if specified bucket exists, creates it if not\n", - " - Sets bucket properties like RAM quota (1024MB) and replication (disabled)\n", - " - Note: You will not be able to create a bucket on Capella\n", - "\n", - "2. Scope Management: \n", - " - Verifies if requested scope exists within bucket\n", - " - Creates new scope if needed (unless it's the default \"_default\" scope)\n", - "\n", - "3. 
Collection Setup:\n", - " - Checks for collection existence within scope\n", - " - Creates collection if it doesn't exist\n", - " - Waits 2 seconds for collection to be ready\n", - "\n", - "Additional Tasks:\n", - "- Clears any existing documents for clean state\n", - "- Implements comprehensive error handling and logging\n", - "\n", - "The function is called twice to set up:\n", - "1. Main collection for vector embeddings\n", - "2. Cache collection for storing results\n" - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-15 12:43:04,085 - INFO - Bucket 'query-vector-search-testing' exists.\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-15 12:43:04,101 - INFO - Collection 'cohere' already exists. Skipping creation.\n", - "2025-09-15 12:43:06,191 - INFO - All documents cleared from the collection.\n", - "2025-09-15 12:43:06,193 - INFO - Bucket 'query-vector-search-testing' exists.\n", - "2025-09-15 12:43:06,199 - INFO - Collection 'cache' already exists. Skipping creation.\n", - "2025-09-15 12:43:08,367 - INFO - All documents cleared from the collection.\n" - ] - }, - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 23, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "def setup_collection(cluster, bucket_name, scope_name, collection_name):\n", - " try:\n", - " # Check if bucket exists, create if it doesn't\n", - " try:\n", - " bucket = cluster.bucket(bucket_name)\n", - " logging.info(f\"Bucket '{bucket_name}' exists.\")\n", - " except Exception as e:\n", - " logging.info(f\"Bucket '{bucket_name}' does not exist. Creating it...\")\n", - " bucket_settings = CreateBucketSettings(\n", - " name=bucket_name,\n", - " bucket_type='couchbase',\n", - " ram_quota_mb=1024,\n", - " flush_enabled=True,\n", - " num_replicas=0\n", - " )\n", - " cluster.buckets().create_bucket(bucket_settings)\n", - " time.sleep(2) # Wait for bucket creation to complete and become available\n", - " bucket = cluster.bucket(bucket_name)\n", - " logging.info(f\"Bucket '{bucket_name}' created successfully.\")\n", - "\n", - " bucket_manager = bucket.collections()\n", - "\n", - " # Check if scope exists, create if it doesn't\n", - " scopes = bucket_manager.get_all_scopes()\n", - " scope_exists = any(scope.name == scope_name for scope in scopes)\n", - " \n", - " if not scope_exists and scope_name != \"_default\":\n", - " logging.info(f\"Scope '{scope_name}' does not exist. Creating it...\")\n", - " bucket_manager.create_scope(scope_name)\n", - " logging.info(f\"Scope '{scope_name}' created successfully.\")\n", - "\n", - " # Check if collection exists, create if it doesn't\n", - " collections = bucket_manager.get_all_scopes()\n", - " collection_exists = any(\n", - " scope.name == scope_name and collection_name in [col.name for col in scope.collections]\n", - " for scope in collections\n", - " )\n", - "\n", - " if not collection_exists:\n", - " logging.info(f\"Collection '{collection_name}' does not exist. Creating it...\")\n", - " bucket_manager.create_collection(scope_name, collection_name)\n", - " logging.info(f\"Collection '{collection_name}' created successfully.\")\n", - " else:\n", - " logging.info(f\"Collection '{collection_name}' already exists. 
Skipping creation.\")\n", - "\n", - " # Wait for collection to be ready\n", - " collection = bucket.scope(scope_name).collection(collection_name)\n", - " time.sleep(2) # Give the collection time to be ready for queries\n", - "\n", - " # Clear all documents in the collection\n", - " try:\n", - " query = f\"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`\"\n", - " cluster.query(query).execute()\n", - " logging.info(\"All documents cleared from the collection.\")\n", - " except Exception as e:\n", - " logging.warning(f\"Error while clearing documents: {str(e)}. The collection might be empty.\")\n", - "\n", - " return collection\n", - " except Exception as e:\n", - " raise RuntimeError(f\"Error setting up collection: {str(e)}\")\n", - " \n", - "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME)\n", - "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, CACHE_COLLECTION)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "LT3s8x_Mx3KG" - }, - "source": [ - "# Create Embeddings\n", - "Embeddings are created using the Cohere API. Embeddings are vectors (arrays of numbers) that represent the meaning of text in a high-dimensional space. These embeddings are crucial for tasks like semantic search, where the goal is to find text that is semantically similar to a query. The script uses a pre-trained model provided by Cohere to generate embeddings for the text in the BBC News dataset." - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": { - "id": "A6fG7Mopx3Np" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-22 12:56:36,813 - INFO - Successfully created CohereEmbeddings\n" - ] - } - ], - "source": [ - "try:\n", - " embeddings = CohereEmbeddings(\n", - " cohere_api_key=COHERE_API_KEY,\n", - " model=\"embed-english-v3.0\",\n", - " )\n", - " logging.info(\"Successfully created CohereEmbeddings\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Error creating CohereEmbeddings: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "iar2fABrLJjK" - }, - "source": [ - "# Set Up Vector Store\n", - "The vector store is set up to manage the embeddings created in the previous step. The vector store is essentially a database optimized for storing and retrieving high-dimensional vectors. In this case, the vector store is built on top of Couchbase, allowing the script to store the embeddings in a way that can be efficiently searched.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": { - "id": "cjASXR3dLJgZ" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-22 12:56:39,259 - INFO - Successfully created vector store\n" - ] - } - ], - "source": [ - "try:\n", - " vector_store = CouchbaseQueryVectorStore(\n", - " cluster=cluster,\n", - " bucket_name=CB_BUCKET_NAME,\n", - " scope_name=SCOPE_NAME,\n", - " collection_name=COLLECTION_NAME,\n", - " embedding = embeddings,\n", - " distance_metric=DistanceStrategy.COSINE\n", - " )\n", - " logging.info(\"Successfully created vector store\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Failed to create vector store: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Load the BBC News Dataset\n", - "To build a search engine, we need data to search through. We use the BBC News dataset from RealTimeData, which provides real-world news articles. 
This dataset contains news articles from BBC covering various topics and time periods. Loading the dataset is a crucial step because it provides the raw material that our search engine will work with. The quality and diversity of the news articles make it an excellent choice for testing and refining our search engine, ensuring it can handle real-world news content effectively.\n", - "\n", - "The BBC News dataset allows us to work with authentic news articles, enabling us to build and test a search engine that can effectively process and retrieve relevant news content. The dataset is loaded using the Hugging Face datasets library, specifically accessing the \"RealTimeData/bbc_news_alltime\" dataset with the \"2024-12\" version." - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-15 12:43:32,383 - INFO - Successfully loaded the BBC News dataset with 2687 rows.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Loaded the BBC News dataset with 2687 rows\n" - ] - } - ], - "source": [ - "try:\n", - " news_dataset = load_dataset(\n", - " \"RealTimeData/bbc_news_alltime\", \"2024-12\", split=\"train\"\n", - " )\n", - " print(f\"Loaded the BBC News dataset with {len(news_dataset)} rows\")\n", - " logging.info(f\"Successfully loaded the BBC News dataset with {len(news_dataset)} rows.\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Error loading the BBC News dataset: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Cleaning up the Data\n", - "We will use the content of the news articles for our RAG system.\n", - "\n", - "The dataset contains a few duplicate records. We are removing them to avoid duplicate results in the retrieval stage of our RAG system." - ] - }, - { - "cell_type": "code", - "execution_count": 27, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "We have 1749 unique articles in our database.\n" - ] - } - ], - "source": [ - "news_articles = news_dataset[\"content\"]\n", - "unique_articles = set()\n", - "for article in news_articles:\n", - " if article:\n", - " unique_articles.add(article)\n", - "unique_news_articles = list(unique_articles)\n", - "print(f\"We have {len(unique_news_articles)} unique articles in our database.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Saving Data to the Vector Store\n", - "To efficiently handle the large number of articles, we process them in batches of 50 articles at a time. This batch processing approach helps manage memory usage and provides better control over the ingestion process.\n", - "\n", - "We first filter out any articles that exceed 50,000 characters to avoid potential issues with token limits. Then, using the vector store's add_texts method, we add the filtered articles to our vector database. The batch_size parameter controls how many articles are processed in each iteration.\n", - "\n", - "This approach offers several benefits:\n", - "1. Memory Efficiency: Processing in smaller batches prevents memory overload\n", - "2. Progress Tracking: Easier to monitor and track the ingestion progress\n", - "3. 
Resource Management: Better control over CPU and network resource utilization\n", - "\n", - "We use a conservative batch size of 50 to ensure reliable operation.\n", - "The optimal batch size depends on many factors including:\n", - "- Document sizes being inserted\n", - "- Available system resources\n", - "- Network conditions\n", - "- Concurrent workload\n", - "\n", - "Consider measuring performance with your specific workload before adjusting.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-15 12:45:26,834 - INFO - Document ingestion completed successfully.\n" - ] - } - ], - "source": [ - "batch_size = 50\n", - "\n", - "# Automatic Batch Processing\n", - "articles = [article for article in unique_news_articles if article and len(article) <= 50000]\n", - "\n", - "try:\n", - " vector_store.add_texts(\n", - " texts=articles,\n", - " batch_size=batch_size\n", - " )\n", - " logging.info(\"Document ingestion completed successfully.\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Failed to save documents to vector store: {str(e)}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "GQpib0zKLJTh" - }, - "source": [ - "# Create Language Model (LLM)\n", - "The script initializes a Cohere language model (LLM) that will be used for generating responses to queries. LLMs are powerful tools for natural language understanding and generation, capable of producing human-like text based on input prompts. The model is configured with specific parameters, such as the temperature, which controls the randomness of its outputs.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": { - "id": "7eV1X5xILJRC" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-22 12:58:23,399 - INFO - Successfully created Cohere LLM with model command\n" - ] - } - ], - "source": [ - "try:\n", - " llm = ChatCohere(\n", - " cohere_api_key=COHERE_API_KEY,\n", - " model=\"command-a-03-2025\",\n", - " temperature=0\n", - " )\n", - " logging.info(\"Successfully created Cohere LLM with model command\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Error creating Cohere LLM: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "wQ0fNbphbWpu" - }, - "source": [ - "# Perform Semantic Search\n", - "Semantic search in Couchbase involves converting queries and documents into vector representations using an embeddings model. These vectors capture the semantic meaning of the text and are stored directly in Couchbase. When a query is made, Couchbase performs a similarity search by comparing the query vector against the stored document vectors. The similarity metric used for this comparison is configurable, allowing flexibility in how the relevance of documents is determined. Common metrics include cosine similarity, Euclidean distance, or dot product, but other metrics can be implemented based on specific use cases. Different embedding models like BERT, Word2Vec, or GloVe can also be used depending on the application's needs, with the vectors generated by these models stored and searched within Couchbase itself.\n", - "\n", - "In the provided code, the search process begins by recording the start time, followed by executing the `similarity_search_with_score` method of the `CouchbaseQueryVectorStore`. 
This method searches Couchbase for the most relevant documents based on the vector similarity to the query. The search results include the document content and the distance that reflects how closely each document aligns with the query in the defined semantic space. The time taken to perform this search is then calculated and logged, and the results are displayed, showing the most relevant documents along with their similarity scores. This approach leverages Couchbase as both a storage and retrieval engine for vector data, enabling efficient and scalable semantic searches. The integration of vector storage and search capabilities within Couchbase allows for sophisticated semantic search operations without relying on external services for vector storage or comparison." - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": { - "id": "udcxHyloyoxE" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-22 12:59:03,622 - INFO - Semantic search completed in 1.18 seconds\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Semantic Search Results (completed in 1.18 seconds):\n", - "--------------------------------------------------------------------------------\n", - "Distance: 0.3359, Text: Manchester City boss Pep Guardiola has won 18 trophies since he arrived at the club in 2016\n", - "\n", - "Manchester City boss Pep Guardiola says he is \"fine\" despite admitting his sleep and diet are being affected by the worst run of results in his entire managerial career. In an interview with former Italy international Luca Toni for Amazon Prime Sport before Wednesday's Champions League defeat by Juventus, Guardiola touched on the personal impact City's sudden downturn in form has had. Guardiola said his state of mind was \"ugly\", that his sleep was \"worse\" and he was eating lighter as his digestion had suffered. City go into Sunday's derby against Manchester United at Etihad Stadium having won just one of their past 10 games. The Juventus loss means there is a chance they may not even secure a play-off spot in the Champions League. Asked to elaborate on his comments to Toni, Guardiola said: \"I'm fine. \"In our jobs we always want to do our best or the best as possible. When that doesn't happen you are more uncomfortable than when the situation is going well, always that happened. \"In good moments I am happier but when I get to the next game I am still concerned about what I have to do. There is no human being that makes an activity and it doesn't matter how they do.\" Guardiola said City have to defend better and \"avoid making mistakes at both ends\". To emphasise his point, Guardiola referred back to the third game of City's current run, against a Sporting side managed by Ruben Amorim, who will be in the United dugout at the weekend. City dominated the first half in Lisbon, led thanks to Phil Foden's early effort and looked to be cruising. Instead, they conceded three times in 11 minutes either side of half-time as Sporting eventually ran out 4-1 winners. \"I would like to play the game like we played in Lisbon on Sunday, believe me,\" said Guardiola, who is facing the prospect of only having three fit defenders for the derby as Nathan Ake and Manuel Akanji try to overcome injury concerns. If there is solace for City, it comes from the knowledge United are not exactly flying. 
Their comeback Europa League victory against Viktoria Plzen on Thursday was their third win of Amorim's short reign so far but only one of those successes has come in the Premier League, where United have lost their past two games against Arsenal and Nottingham Forest. Nevertheless, Guardiola can see improvements already on the red side of the city. \"It's already there,\" he said. \"You see all the patterns, the movements, the runners and the pace. He will do a good job at United, I'm pretty sure of that.\"\n", - "\n", - "Guardiola says skipper Kyle Walker has been offered support by the club after the City defender highlighted the racial abuse he had received on social media in the wake of the Juventus trip. \"It's unacceptable,\" he said. \"Not because it's Kyle - for any human being. \"Unfortunately it happens many times in the real world. It is not necessary to say he has the support of the entire club. It is completely unacceptable and we give our support to him.\"\n", - "--------------------------------------------------------------------------------\n", - "Distance: 0.3477, Text: 'We have to find a way' - Guardiola vows to end relegation form\n", - "\n", - "This video can not be played To play this video you need to enable JavaScript in your browser. 'Worrying' and 'staggering' - Why do Manchester City keep conceding?\n", - "\n", - "Manchester City are currently in relegation form and there is little sign of it ending. Saturday's 2-1 defeat at Aston Villa left them joint bottom of the form table over the past eight games with just Southampton for company. Saints, at the foot of the Premier League, have the same number of points, four, as City over their past eight matches having won one, drawn one and lost six - the same record as the floundering champions. And if Southampton - who appointed Ivan Juric as their new manager on Saturday - get at least a point at Fulham on Sunday, City will be on the worst run in the division. Even Wolves, who sacked boss Gary O'Neil last Sunday and replaced him with Vitor Pereira, have earned double the number of points during the same period having played a game fewer. They are damning statistics for Pep Guardiola, even if he does have some mitigating circumstances with injuries to Ederson, Nathan Ake and Ruben Dias - who all missed the loss at Villa Park - and the long-term loss of midfield powerhouse Rodri. Guardiola was happy with Saturday's performance, despite defeat in Birmingham, but there is little solace to take at slipping further out of the title race. He may have needed to field a half-fit Manuel Akanji and John Stones at Villa Park but that does not account for City looking a shadow of their former selves. That does not justify the error Josko Gvardiol made to gift Jhon Duran a golden chance inside the first 20 seconds, or £100m man Jack Grealish again failing to have an impact on a game. There may be legitimate reasons for City's drop off, whether that be injuries, mental fatigue or just simply a team coming to the end of its lifecycle, but their form, which has plunged off a cliff edge, would have been unthinkable as they strolled to a fourth straight title last season. \"The worrying thing is the number of goals conceded,\" said ex-England captain Alan Shearer on BBC Match of the Day. \"The number of times they were opened up because of the lack of protection and legs in midfield was staggering. 
There are so many things that are wrong at this moment in time.\"\n", - "\n", - "This video can not be played To play this video you need to enable JavaScript in your browser. Man City 'have to find a way' to return to form - Guardiola\n", - "\n", - "Afterwards Guardiola was calm, so much so it was difficult to hear him in the news conference, a contrast to the frustrated figure he cut on the touchline. He said: \"It depends on us. The solution is bring the players back. We have just one central defender fit, that is difficult. We are going to try next game - another opportunity and we don't think much further than that. \"Of course there are more reasons. We concede the goals we don't concede in the past, we [don't] score the goals we score in the past. Football is not just one reason. There are a lot of little factors. \"Last season we won the Premier League, but we came here and lost. We have to think positive and I have incredible trust in the guys. Some of them have incredible pride and desire to do it. We have to find a way, step by step, sooner or later to find a way back.\" Villa boss Unai Emery highlighted City's frailties, saying he felt Villa could seize on the visitors' lack of belief. \"Manchester City are a little bit under the confidence they have normally,\" he said. \"The second half was different, we dominated and we scored. Through those circumstances they were feeling worse than even in the first half.\"\n", - "\n", - "Erling Haaland had one touch in the Villa box\n", - "\n", - "There are chinks in the armour never seen before at City under Guardiola and Erling Haaland conceded belief within the squad is low. He told TNT after the game: \"Of course, [confidence levels are] not the best. We know how important confidence is and you can see that it affects every human being. That is how it is, we have to continue and stay positive even though it is difficult.\" Haaland, with 76 goals in 83 Premier League appearances since joining City from Borussia Dortmund in 2022, had one shot and one touch in the Villa box. His 18 touches in the whole game were the lowest of all starting players and he has been self critical, despite scoring 13 goals in the top flight this season. Over City's last eight games he has netted just twice though, but Guardiola refused to criticise his star striker. He said: \"Without him we will be even worse but I like the players feeling that way. I don't agree with Erling. He needs to have the balls delivered in the right spots but he will fight for the next one.\"\n", - "--------------------------------------------------------------------------------\n", - "Distance: 0.3677, Text: 'Self-doubt, errors & big changes' - inside the crisis at Man City\n", - "\n", - "\n", - "... 
(output truncated for brevity)\n" - ] - } - ], - "source": [ - "query = \"What was manchester city manager pep guardiola's reaction to the team's current form?\"\n", - "\n", - "try:\n", - " # Perform the semantic search\n", - " start_time = time.time()\n", - " search_results = vector_store.similarity_search_with_score(query, k=10)\n", - " search_elapsed_time = time.time() - start_time\n", - "\n", - " logging.info(f\"Semantic search completed in {search_elapsed_time:.2f} seconds\")\n", - "\n", - " # Display search results\n", - " print(f\"\\nSemantic Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", - " print(\"-\" * 80) # Add separator line\n", - " for doc, score in search_results:\n", - " print(f\"Distance: {score:.4f}, Text: {doc.page_content}\")\n", - " print(\"-\" * 80) # Add separator between results\n", - "\n", - "except CouchbaseException as e:\n", - " raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n", - "except Exception as e:\n", - " raise RuntimeError(f\"Unexpected error: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Optimizing Vector Search with Global Secondary Index (GSI)\n", - "\n", - "While the above semantic search using similarity_search_with_score works effectively, we can significantly improve query performance by leveraging Global Secondary Index (GSI) in Couchbase.\n", - "\n", - "Couchbase offers three types of vector indexes, but for GSI-based vector search we focus on two main types:\n", - "\n", - "Hyperscale Vector Indexes (BHIVE)\n", - "- Best for pure vector searches - content discovery, recommendations, semantic search\n", - "- High performance with low memory footprint - designed to scale to billions of vectors\n", - "- Optimized for concurrent operations - supports simultaneous searches and inserts\n", - "- Use when: You primarily perform vector-only queries without complex scalar filtering\n", - "- Ideal for: Large-scale semantic search, recommendation systems, content discovery\n", - "\n", - "Composite Vector Indexes \n", - "- Best for filtered vector searches - combines vector search with scalar value filtering\n", - "- Efficient pre-filtering - scalar attributes reduce the vector comparison scope\n", - "- Use when: Your queries combine vector similarity with scalar filters that eliminate large portions of data\n", - "- Ideal for: Compliance-based filtering, user-specific searches, time-bounded queries\n", - "\n", - "Choosing the Right Index Type\n", - "- Start with Hyperscale Vector Index for pure vector searches and large datasets\n", - "- Use Composite Vector Index when scalar filters significantly reduce your search space\n", - "- Consider your dataset size: Hyperscale scales to billions, Composite works well for tens of millions to billions\n", - "\n", - "For more details, see the [Couchbase Vector Index documentation](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html).\n", - "\n", - "\n", - "## Understanding Index Configuration (Couchbase 8.0 Feature)\n", - "\n", - "The index_description parameter controls how Couchbase optimizes vector storage and search performance through centroids and quantization:\n", - "\n", - "Format: `'IVF[<centroids>],{PQ|SQ}'`\n", - "\n", - "Centroids (IVF - Inverted File):\n", - "- Controls how the dataset is subdivided for faster searches\n", - "- More centroids = faster search, slower training \n", - "- Fewer centroids = slower search, faster training\n", - "- If omitted (like IVF,SQ8), Couchbase auto-selects based on dataset 
size\n", - "\n", - "Quantization Options:\n", - "- SQ (Scalar Quantization): SQ4, SQ6, SQ8 (4, 6, or 8 bits per dimension)\n", - "- PQ (Product Quantization): PQ<subquantizers>x<bits> (e.g., PQ32x8)\n", - "- Higher values = better accuracy, larger index size\n", - "\n", - "Common Examples:\n", - "- IVF,SQ8 - Auto centroids, 8-bit scalar quantization (good default)\n", - "- IVF1000,SQ6 - 1000 centroids, 6-bit scalar quantization \n", - "- IVF,PQ32x8 - Auto centroids, 32 subquantizers with 8 bits\n", - "\n", - "For detailed configuration options, see the [Quantization & Centroid Settings](https://docs.couchbase.com/cloud/vector-index/hyperscale-vector-index.html#algo_settings).\n", - "\n", - "In the code below, we demonstrate creating a BHIVE index. This method takes an index type (BHIVE or COMPOSITE) and description parameter for optimization settings. Alternatively, GSI indexes can be created manually from the Couchbase UI." - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [], - "source": [ - "vector_store.create_index(index_type=IndexType.BHIVE, index_name=\"cohere_bhive_index\",index_description=\"IVF,SQ8\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The example below shows running the same similarity search, but now using the BHIVE GSI index we created above. You'll notice improved performance as the index efficiently retrieves data.\n", - "\n", - "**Important**: When using Composite indexes, scalar filters take precedence over vector similarity, which can improve performance for filtered searches but may miss some semantically relevant results that don't match the scalar criteria.\n", - "\n", - "Note: In GSI vector search, the distance represents the vector distance between the query and document embeddings. A lower distance indicates higher similarity, while a higher distance indicates lower similarity." - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-22 12:59:26,949 - INFO - Semantic search completed in 0.38 seconds\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Semantic Search Results (completed in 0.38 seconds):\n", - "--------------------------------------------------------------------------------\n", - "Distance: 0.3359, Text: Manchester City boss Pep Guardiola has won 18 trophies since he arrived at the club in 2016\n", - "\n", - "Manchester City boss Pep Guardiola says he is \"fine\" despite admitting his sleep and diet are being affected by the worst run of results in his entire managerial career. In an interview with former Italy international Luca Toni for Amazon Prime Sport before Wednesday's Champions League defeat by Juventus, Guardiola touched on the personal impact City's sudden downturn in form has had. Guardiola said his state of mind was \"ugly\", that his sleep was \"worse\" and he was eating lighter as his digestion had suffered. City go into Sunday's derby against Manchester United at Etihad Stadium having won just one of their past 10 games. The Juventus loss means there is a chance they may not even secure a play-off spot in the Champions League. Asked to elaborate on his comments to Toni, Guardiola said: \"I'm fine. \"In our jobs we always want to do our best or the best as possible. When that doesn't happen you are more uncomfortable than when the situation is going well, always that happened. 
\"In good moments I am happier but when I get to the next game I am still concerned about what I have to do. There is no human being that makes an activity and it doesn't matter how they do.\" Guardiola said City have to defend better and \"avoid making mistakes at both ends\". To emphasise his point, Guardiola referred back to the third game of City's current run, against a Sporting side managed by Ruben Amorim, who will be in the United dugout at the weekend. City dominated the first half in Lisbon, led thanks to Phil Foden's early effort and looked to be cruising. Instead, they conceded three times in 11 minutes either side of half-time as Sporting eventually ran out 4-1 winners. \"I would like to play the game like we played in Lisbon on Sunday, believe me,\" said Guardiola, who is facing the prospect of only having three fit defenders for the derby as Nathan Ake and Manuel Akanji try to overcome injury concerns. If there is solace for City, it comes from the knowledge United are not exactly flying. Their comeback Europa League victory against Viktoria Plzen on Thursday was their third win of Amorim's short reign so far but only one of those successes has come in the Premier League, where United have lost their past two games against Arsenal and Nottingham Forest. Nevertheless, Guardiola can see improvements already on the red side of the city. \"It's already there,\" he said. \"You see all the patterns, the movements, the runners and the pace. He will do a good job at United, I'm pretty sure of that.\"\n", - "\n", - "Guardiola says skipper Kyle Walker has been offered support by the club after the City defender highlighted the racial abuse he had received on social media in the wake of the Juventus trip. \"It's unacceptable,\" he said. \"Not because it's Kyle - for any human being. \"Unfortunately it happens many times in the real world. It is not necessary to say he has the support of the entire club. It is completely unacceptable and we give our support to him.\"\n", - "--------------------------------------------------------------------------------\n", - "Distance: 0.3477, Text: 'We have to find a way' - Guardiola vows to end relegation form\n", - "\n", - "This video can not be played To play this video you need to enable JavaScript in your browser. 'Worrying' and 'staggering' - Why do Manchester City keep conceding?\n", - "\n", - "Manchester City are currently in relegation form and there is little sign of it ending. Saturday's 2-1 defeat at Aston Villa left them joint bottom of the form table over the past eight games with just Southampton for company. Saints, at the foot of the Premier League, have the same number of points, four, as City over their past eight matches having won one, drawn one and lost six - the same record as the floundering champions. And if Southampton - who appointed Ivan Juric as their new manager on Saturday - get at least a point at Fulham on Sunday, City will be on the worst run in the division. Even Wolves, who sacked boss Gary O'Neil last Sunday and replaced him with Vitor Pereira, have earned double the number of points during the same period having played a game fewer. They are damning statistics for Pep Guardiola, even if he does have some mitigating circumstances with injuries to Ederson, Nathan Ake and Ruben Dias - who all missed the loss at Villa Park - and the long-term loss of midfield powerhouse Rodri. 
Guardiola was happy with Saturday's performance, despite defeat in Birmingham, but there is little solace to take at slipping further out of the title race. He may have needed to field a half-fit Manuel Akanji and John Stones at Villa Park but that does not account for City looking a shadow of their former selves. That does not justify the error Josko Gvardiol made to gift Jhon Duran a golden chance inside the first 20 seconds, or £100m man Jack Grealish again failing to have an impact on a game. There may be legitimate reasons for City's drop off, whether that be injuries, mental fatigue or just simply a team coming to the end of its lifecycle, but their form, which has plunged off a cliff edge, would have been unthinkable as they strolled to a fourth straight title last season. \"The worrying thing is the number of goals conceded,\" said ex-England captain Alan Shearer on BBC Match of the Day. \"The number of times they were opened up because of the lack of protection and legs in midfield was staggering. There are so many things that are wrong at this moment in time.\"\n", - "\n", - "This video can not be played To play this video you need to enable JavaScript in your browser. Man City 'have to find a way' to return to form - Guardiola\n", - "\n", - "Afterwards Guardiola was calm, so much so it was difficult to hear him in the news conference, a contrast to the frustrated figure he cut on the touchline. He said: \"It depends on us. The solution is bring the players back. We have just one central defender fit, that is difficult. We are going to try next game - another opportunity and we don't think much further than that. \"Of course there are more reasons. We concede the goals we don't concede in the past, we [don't] score the goals we score in the past. Football is not just one reason. There are a lot of little factors. \"Last season we won the Premier League, but we came here and lost. We have to think positive and I have incredible trust in the guys. Some of them have incredible pride and desire to do it. We have to find a way, step by step, sooner or later to find a way back.\" Villa boss Unai Emery highlighted City's frailties, saying he felt Villa could seize on the visitors' lack of belief. \"Manchester City are a little bit under the confidence they have normally,\" he said. \"The second half was different, we dominated and we scored. Through those circumstances they were feeling worse than even in the first half.\"\n", - "\n", - "Erling Haaland had one touch in the Villa box\n", - "\n", - "There are chinks in the armour never seen before at City under Guardiola and Erling Haaland conceded belief within the squad is low. He told TNT after the game: \"Of course, [confidence levels are] not the best. We know how important confidence is and you can see that it affects every human being. That is how it is, we have to continue and stay positive even though it is difficult.\" Haaland, with 76 goals in 83 Premier League appearances since joining City from Borussia Dortmund in 2022, had one shot and one touch in the Villa box. His 18 touches in the whole game were the lowest of all starting players and he has been self critical, despite scoring 13 goals in the top flight this season. Over City's last eight games he has netted just twice though, but Guardiola refused to criticise his star striker. He said: \"Without him we will be even worse but I like the players feeling that way. I don't agree with Erling. 
He needs to have the balls delivered in the right spots but he will fight for the next one.\"\n", - "--------------------------------------------------------------------------------\n", - "Distance: 0.3677, Text: 'Self-doubt, errors & big changes' - inside the crisis at Man City\n", - "\n", - "\n", - "... (output truncated for brevity)\n" - ] - } - ], - "source": [ - "query = \"What was manchester city manager pep guardiola's reaction to the team's current form?\"\n", - "\n", - "try:\n", - " # Perform the semantic search\n", - " start_time = time.time()\n", - " search_results = vector_store.similarity_search_with_score(query, k=10)\n", - " search_elapsed_time = time.time() - start_time\n", - "\n", - " logging.info(f\"Semantic search completed in {search_elapsed_time:.2f} seconds\")\n", - "\n", - " # Display search results\n", - " print(f\"\\nSemantic Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", - " print(\"-\" * 80) # Add separator line\n", - " for doc, score in search_results:\n", - " print(f\"Distance: {score:.4f}, Text: {doc.page_content}\")\n", - " print(\"-\" * 80) # Add separator between results\n", - "\n", - "except CouchbaseException as e:\n", - " raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n", - "except Exception as e:\n", - " raise RuntimeError(f\"Unexpected error: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Note: to create a COMPOSITE index, you can use the code below.\n", - "Choose based on your specific use case and query patterns. For this tutorial's news search scenario, either index type would work, but BHIVE might be more efficient for pure semantic search across news articles." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vector_store.create_index(index_type=IndexType.COMPOSITE, index_name=\"cohere_composite_index\", index_description=\"IVF,SQ8\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Set Up Cache\n", - "A cache is set up using Couchbase to store intermediate results and frequently accessed data. Caching is important for improving performance, as it reduces the need to repeatedly calculate or retrieve the same data. The cache is linked to a specific collection in Couchbase, and it is used later in the script to store the results of language model queries.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-22 12:59:40,381 - INFO - Successfully created cache\n" - ] - } - ], - "source": [ - "try:\n", - " cache = CouchbaseCache(\n", - " cluster=cluster,\n", - " bucket_name=CB_BUCKET_NAME,\n", - " scope_name=SCOPE_NAME,\n", - " collection_name=CACHE_COLLECTION,\n", - " )\n", - " logging.info(\"Successfully created cache\")\n", - " set_llm_cache(cache)\n", - "except Exception as e:\n", - " raise ValueError(f\"Failed to create cache: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Bt44X6-bLJOb" - }, - "source": [ - "# Retrieval-Augmented Generation (RAG) with Couchbase and Langchain\n", - "Couchbase and LangChain can be seamlessly integrated to create RAG (Retrieval-Augmented Generation) chains, enhancing the process of generating contextually relevant responses. In this setup, Couchbase serves as the vector store, where embeddings of documents are stored. 
When a query is made, LangChain retrieves the most relevant documents from Couchbase by comparing the query’s embedding with the stored document embeddings. These documents, which provide contextual information, are then passed to a generative language model within LangChain.\n", - "\n", - "The language model, equipped with the context from the retrieved documents, generates a response that is both informed and contextually accurate. This integration allows the RAG chain to leverage Couchbase’s efficient storage and retrieval capabilities, while LangChain handles the generation of responses based on the context provided by the retrieved documents. Together, they create a powerful system that can deliver highly relevant and accurate answers by combining the strengths of both retrieval and generation." - ] - }, - { - "cell_type": "code", - "execution_count": 38, - "metadata": { - "id": "6cGJfwS2LI_O" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-15 12:53:46,979 - INFO - Successfully created RAG chain\n" - ] - } - ], - "source": [ - "try:\n", - " template = \"\"\"You are a helpful bot. If you cannot answer based on the context provided, respond with a generic answer. Answer the question as truthfully as possible using the context below:\n", - " {context}\n", - "\n", - " Question: {question}\"\"\"\n", - " prompt = ChatPromptTemplate.from_template(template)\n", - "\n", - " rag_chain = (\n", - " {\"context\": vector_store.as_retriever(), \"question\": RunnablePassthrough()}\n", - " | prompt\n", - " | llm\n", - " | StrOutputParser()\n", - " )\n", - " logging.info(\"Successfully created RAG chain\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Error creating RAG chain: {str(e)}\")" - ] - }, - { - "cell_type": "code", - "execution_count": 39, - "metadata": { - "id": "PvuJyXPUFOux" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "RAG Response: Manchester City manager Pep Guardiola has expressed concern and frustration over the team's recent form, describing it as the \"worst run of results\" in his managerial career. He has admitted that the situation has affected his sleep and diet, stating that his state of mind is \"ugly\" and his sleep is \"worse.\" Guardiola has also acknowledged the need for the team to defend better and avoid making mistakes at both ends of the pitch. Despite the challenges, he remains focused on finding solutions and has emphasized the importance of bringing injured players back to the squad. Guardiola has also highlighted the need for the team to recover its essence by improving defensive concepts and re-establishing the intensity they are known for. He has taken a self-critical approach, stating that he is \"not good enough\" to resolve the situation with the current group of players and has vowed to find solutions to turn the team's form around.\n", - "RAG response generated in 4.09 seconds\n" - ] - } - ], - "source": [ - "start_time = time.time()\n", - "try:\n", - " rag_response = rag_chain.invoke(query)\n", - " rag_elapsed_time = time.time() - start_time\n", - " print(f\"RAG Response: {rag_response}\")\n", - " print(f\"RAG response generated in {rag_elapsed_time:.2f} seconds\")\n", - "except InternalServerFailureException as e:\n", - " if \"query request rejected\" in str(e):\n", - " print(\"Error: Search request was rejected due to rate limiting. 
Please try again later.\")\n", - " else:\n", - " print(f\"Internal server error occurred: {str(e)}\")\n", - "except Exception as e:\n", - " print(f\"Unexpected error occurred: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "cUXEVXyxGlv2" - }, - "source": [ - "# Using Couchbase as a caching mechanism\n", - "Couchbase can be effectively used as a caching mechanism for RAG (Retrieval-Augmented Generation) responses by storing and retrieving precomputed results for specific queries. This approach enhances the system's efficiency and speed, particularly when dealing with repeated or similar queries. When a query is first processed, the RAG chain retrieves relevant documents, generates a response using the language model, and then stores this response in Couchbase, with the query serving as the key.\n", - "\n", - "For subsequent requests with the same query, the system checks Couchbase first. If a cached response is found, it is retrieved directly from Couchbase, bypassing the need to re-run the entire RAG process. This significantly reduces response time because the computationally expensive steps of document retrieval and response generation are skipped. Couchbase's role in this setup is to provide a fast and scalable storage solution for caching these responses, ensuring that frequently asked queries can be answered more quickly and efficiently." - ] - }, - { - "cell_type": "code", - "execution_count": 40, - "metadata": { - "id": "J_PaTD2aGmGt" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Query 1: What happened in the match between Fullham and Liverpool?\n", - "Response: In the match between Fulham and Liverpool, Liverpool played with 10 men for 89 minutes after Andy Robertson received a red card in the 17th minute. Despite this numerical disadvantage, Liverpool managed to secure a 2-2 draw at Anfield. Fulham took the lead twice, but Liverpool responded both times, with Diogo Jota scoring an 86th-minute equalizer. The performance highlighted Liverpool's resilience and title credentials, with Fulham's Antonee Robinson praising Liverpool for not seeming like they were a man down. Liverpool maintained over 60% possession and dominated attacking metrics, showcasing their ability to fight back under adversity.\n", - "Time taken: 2.12 seconds\n", - "\n", - "Query 2: What was manchester city manager pep guardiola's reaction to the team's current form?\n", - "Response: Manchester City manager Pep Guardiola has expressed concern and frustration over the team's recent form, describing it as the \"worst run of results\" in his managerial career. He has admitted that the situation has affected his sleep and diet, stating that his state of mind is \"ugly\" and his sleep is \"worse.\" Guardiola has also acknowledged the need for the team to defend better and avoid making mistakes at both ends of the pitch. Despite the challenges, he remains focused on finding solutions and has emphasized the importance of bringing injured players back to the squad. Guardiola has also highlighted the need for the team to recover its essence by improving defensive concepts and re-establishing the intensity they are known for. 
He has taken a self-critical approach, stating that he is \"not good enough\" to resolve the situation with the current group of players and has vowed to find solutions to turn the team's form around.\n", - "Time taken: 0.35 seconds\n", - "\n", - "Query 3: What happened in the match between Fullham and Liverpool?\n", - "Response: In the match between Fulham and Liverpool, Liverpool played with 10 men for 89 minutes after Andy Robertson received a red card in the 17th minute. Despite this numerical disadvantage, Liverpool managed to secure a 2-2 draw at Anfield. Fulham took the lead twice, but Liverpool responded both times, with Diogo Jota scoring an 86th-minute equalizer. The performance highlighted Liverpool's resilience and title credentials, with Fulham's Antonee Robinson praising Liverpool for not seeming like they were a man down. Liverpool maintained over 60% possession and dominated attacking metrics, showcasing their ability to fight back under adversity.\n", - "Time taken: 0.35 seconds\n" - ] - } - ], - "source": [ - "try:\n", - " queries = [\n", - " \"What happened in the match between Fullham and Liverpool?\",\n", - " \"What was manchester city manager pep guardiola's reaction to the team's current form?\", # Repeated query\n", - " \"What happened in the match between Fullham and Liverpool?\", # Repeated query\n", - " ]\n", - "\n", - " for i, query in enumerate(queries, 1):\n", - " print(f\"\\nQuery {i}: {query}\")\n", - " start_time = time.time()\n", - " response = rag_chain.invoke(query)\n", - " elapsed_time = time.time() - start_time\n", - " print(f\"Response: {response}\")\n", - " print(f\"Time taken: {elapsed_time:.2f} seconds\")\n", - "except InternalServerFailureException as e:\n", - " if \"query request rejected\" in str(e):\n", - " print(\"Error: Search request was rejected due to rate limiting. Please try again later.\")\n", - " else:\n", - " print(f\"Internal server error occurred: {str(e)}\")\n", - "except Exception as e:\n", - " print(f\"Unexpected error occurred: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Conclusion\n", - "By following these steps, you'll have a fully functional semantic search engine that leverages the strengths of Couchbase and Cohere. This guide is designed not just to show you how to build the system, but also to explain why each step is necessary, giving you a deeper understanding of the principles behind semantic search, and of how querying data more efficiently with GSI vector indexes can significantly improve your RAG performance. Whether you're a newcomer to software development or an experienced developer looking to expand your skills, this guide will provide you with the knowledge and tools you need to create a powerful, AI-driven search engine." 
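If you plan to re-run the notebook from a clean slate, you may also want to drop the two GSI vector indexes created earlier. A minimal cleanup sketch, not part of the original walkthrough, assuming the index names used above and the standard SQL++ `DROP INDEX ... ON keyspace` form:

```python
# Illustrative cleanup: drop the vector indexes created in this tutorial.
# Assumes `cluster`, CB_BUCKET_NAME, SCOPE_NAME, and COLLECTION_NAME from earlier cells.
for index_name in ("cohere_bhive_index", "cohere_composite_index"):
    try:
        cluster.query(
            f"DROP INDEX `{index_name}` ON `{CB_BUCKET_NAME}`.`{SCOPE_NAME}`.`{COLLECTION_NAME}`"
        ).execute()
        logging.info(f"Dropped index '{index_name}'")
    except Exception as e:
        logging.warning(f"Could not drop index '{index_name}': {e}")
```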
- ] - } - ], - "metadata": { - "accelerator": "TPU", - "colab": { - "gpuType": "V28", - "provenance": [], - "toc_visible": true - }, - "kernelspec": { - "display_name": ".venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.13.3" - } - }, - "nbformat": 4, - "nbformat_minor": 0 -} diff --git a/cohere/gsi/.env.sample b/cohere/query_based/.env.sample similarity index 100% rename from cohere/gsi/.env.sample rename to cohere/query_based/.env.sample diff --git a/cohere/query_based/RAG_with_Couchbase_and_Cohere.ipynb b/cohere/query_based/RAG_with_Couchbase_and_Cohere.ipynb new file mode 100644 index 00000000..94ee1897 --- /dev/null +++ b/cohere/query_based/RAG_with_Couchbase_and_Cohere.ipynb @@ -0,0 +1,1059 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "zAPY14a2BOhq" + }, + "source": [ + "# Introduction\n", + "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database and [Cohere](https://cohere.com/)\n", + " as the AI-powered embedding and language model provider. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system using Couchbase Hyperscale and Composite Vector Indexes from scratch. For guidance on choosing the right vector index for your use case, see the [Couchbase documentation](https://docs.couchbase.com/server/current/vector-search/choose-the-right-vector-index.html). Alternatively, if you want to perform semantic search using a Couchbase Search Vector Index, please take a look at [this tutorial](https://developer.couchbase.com/tutorial-cohere-couchbase-rag-with-search-vector-index/)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# How to run this tutorial\n", + "\n", + "This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/cohere/query_based/RAG_with_Couchbase_and_Cohere.ipynb).\n", + "\n", + "You can either download the notebook file and run it on [Google Colab](https://colab.research.google.com/) or run it on your system by setting up the Python environment." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Before you start\n", + "\n", + "## Get Credentials for Cohere\n", + "\n", + "Please follow the [instructions](https://dashboard.cohere.com/welcome/register) to generate the Cohere credentials.\n", + "\n", + "## Create and Deploy Your Free Tier Operational Cluster on Capella\n", + "\n", + "To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. 
This account provides you with an environment where you can explore and learn about Capella with no time constraint.\n", + "\n", + "To learn more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n", + "\n", + "Note: To run this tutorial, you will need Capella with Couchbase Server version 8.0 or above, as GSI vector search is supported only from version 8.0.\n", + "\n", + "### Couchbase Capella Configuration\n", + "\n", + "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met.\n", + "\n", + "* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the required bucket (Read and Write) used in the application.\n", + "* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EYZzrd_tBdUC" + }, + "source": [ + "# Setting the Stage: Installing Necessary Libraries\n", + "To build our semantic search engine, we need a robust set of tools. The libraries we install handle everything from connecting to databases to performing complex machine learning tasks." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "cYUkZqeoEykk" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], + "source": [ + "%pip install --quiet datasets==3.5.0 langchain-couchbase==0.5.0 langchain-cohere==0.4.5 python-dotenv==1.1.1" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Dw3IL3GEJSj7" + }, + "source": [ + "# Importing Necessary Libraries\n", + "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, embedding generation, and dataset loading. These libraries provide essential functions for working with data, managing database connections, and processing machine learning models." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "id": "oziN03NZJLQw" + }, + "outputs": [], + "source": [ + "import getpass\n", + "import json\n", + "import logging\n", + "import os\n", + "import time\n", + "from datetime import timedelta\n", + "from uuid import uuid4\n", + "\n", + "from couchbase.auth import PasswordAuthenticator\n", + "from couchbase.cluster import Cluster\n", + "from couchbase.exceptions import (CouchbaseException,\n", + "                                  InternalServerFailureException,\n", + "                                  QueryIndexAlreadyExistsException,\n", + "                                  ServiceUnavailableException)\n", + "from couchbase.management.buckets import CreateBucketSettings\n", + "from couchbase.management.search import SearchIndex\n", + "from couchbase.options import ClusterOptions\n", + "from datasets import load_dataset\n", + "from dotenv import load_dotenv\n", + "from langchain_cohere import ChatCohere, CohereEmbeddings\n", + "from langchain_core.globals import set_llm_cache\n", + "from langchain_core.output_parsers import StrOutputParser\n", + "from langchain_core.prompts import ChatPromptTemplate\n", + "from langchain_core.runnables import RunnablePassthrough\n", + "from langchain_couchbase.cache import CouchbaseCache\n", + "from langchain_couchbase.vectorstores import (CouchbaseQueryVectorStore,\n", + "                                              DistanceStrategy,\n", + "                                              IndexType)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iXwzTRdbCLL1" + }, + "source": [ + "# Setup Logging\n", + "Logging is configured to track the progress of the script and capture any errors or warnings. This is crucial for debugging and understanding the flow of execution. The logging output includes timestamps, log levels (e.g., INFO, ERROR), and messages that describe what is happening in the script.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "id": "R-SanCZrCLdm" + }, + "outputs": [], + "source": [ + "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', force=True)\n", + "\n", + "# Suppress excessive logging from third-party libraries\n", + "logging.getLogger('openai').setLevel(logging.WARNING)\n", + "logging.getLogger('httpx').setLevel(logging.WARNING)\n", + "logging.getLogger('langchain_cohere').setLevel(logging.ERROR)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zOwSwRoHJLXv" + }, + "source": [ + "# Loading Sensitive Information\n", + "In this section, we prompt the user to input essential configuration settings needed for integrating Couchbase with Cohere's API. These settings include sensitive information like API keys, database credentials, and specific configuration names. Instead of hardcoding these details into the script, we request the user to provide them at runtime, ensuring flexibility and security.\n", + "\n", + "The script also validates that all required inputs are provided, raising an error if any crucial information is missing. This approach ensures that your integration is both secure and correctly configured without hardcoding sensitive information, enhancing the overall security and maintainability of your code."
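+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As a reference, a minimal `.env` file for this tutorial might look like the sketch below (every value is a placeholder; substitute your own credentials and names). The variable names match the ones read in the next cell:\n", + "\n", + "```\n", + "COHERE_API_KEY=your-cohere-api-key\n", + "CB_HOST=couchbase://localhost\n", + "CB_USERNAME=Administrator\n", + "CB_PASSWORD=your-couchbase-password\n", + "CB_BUCKET_NAME=query-vector-search-testing\n", + "SCOPE_NAME=shared\n", + "COLLECTION_NAME=cohere\n", + "CACHE_COLLECTION=cache\n", + "```"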
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "id": "y2H9xphrJLbP" + }, + "outputs": [], + "source": [ + "load_dotenv()\n", + "\n", + "COHERE_API_KEY = os.getenv('COHERE_API_KEY') or getpass.getpass('Enter your Cohere API key: ')\n", + "CB_HOST = os.getenv('CB_HOST') or input('Enter your Couchbase host (default: couchbase://localhost): ') or 'couchbase://localhost'\n", + "CB_USERNAME = os.getenv('CB_USERNAME') or input('Enter your Couchbase username (default: Administrator): ') or 'Administrator'\n", + "CB_PASSWORD = os.getenv('CB_PASSWORD') or getpass.getpass('Enter your Couchbase password (default: password): ') or 'password'\n", + "CB_BUCKET_NAME = os.getenv('CB_BUCKET_NAME') or input('Enter your Couchbase bucket name (default: query-vector-search-testing): ') or 'query-vector-search-testing'\n", + "SCOPE_NAME = os.getenv('SCOPE_NAME') or input('Enter your scope name (default: shared): ') or 'shared'\n", + "COLLECTION_NAME = os.getenv('COLLECTION_NAME') or input('Enter your collection name (default: cohere): ') or 'cohere'\n", + "CACHE_COLLECTION = os.getenv('CACHE_COLLECTION') or input('Enter your cache collection name (default: cache): ') or 'cache'\n", + "\n", + "# Check if the variables are correctly loaded\n", + "if not COHERE_API_KEY:\n", + " raise ValueError(\"COHERE_API_KEY is not provided and is required.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sdKdLg9pJLl5" + }, + "source": [ + "# Connect to Couchbase\n", + "The script attempts to establish a connection to the Couchbase database using the credentials retrieved from the environment variables. Couchbase is a NoSQL database known for its flexibility, scalability, and support for various data models, including document-based storage. The connection is authenticated using a username and password, and the script waits until the connection is fully established before proceeding.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "id": "HubiGMCSJLqw" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-22 12:56:30,972 - INFO - Successfully connected to Couchbase\n" + ] + } + ], + "source": [ + "try:\n", + " auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)\n", + " options = ClusterOptions(auth)\n", + " cluster = Cluster(CB_HOST, options)\n", + " cluster.wait_until_ready(timedelta(seconds=5))\n", + " logging.info(\"Successfully connected to Couchbase\")\n", + "except Exception as e:\n", + " raise ConnectionError(f\"Failed to connect to Couchbase: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setting Up Collections in Couchbase\n", + "\n", + "The setup_collection() function handles creating and configuring the hierarchical data organization in Couchbase:\n", + "\n", + "1. Bucket Creation:\n", + " - Checks if specified bucket exists, creates it if not\n", + " - Sets bucket properties like RAM quota (1024MB) and replication (disabled)\n", + " - Note: You will not be able to create a bucket on Capella\n", + "\n", + "2. Scope Management: \n", + " - Verifies if requested scope exists within bucket\n", + " - Creates new scope if needed (unless it's the default \"_default\" scope)\n", + "\n", + "3. 
Collection Setup:\n", + " - Checks for collection existence within scope\n", + " - Creates collection if it doesn't exist\n", + " - Waits 2 seconds for collection to be ready\n", + "\n", + "Additional Tasks:\n", + "- Clears any existing documents for clean state\n", + "- Implements comprehensive error handling and logging\n", + "\n", + "The function is called twice to set up:\n", + "1. Main collection for vector embeddings\n", + "2. Cache collection for storing results\n" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-15 12:43:04,085 - INFO - Bucket 'query-vector-search-testing' exists.\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-15 12:43:04,101 - INFO - Collection 'cohere' already exists. Skipping creation.\n", + "2025-09-15 12:43:06,191 - INFO - All documents cleared from the collection.\n", + "2025-09-15 12:43:06,193 - INFO - Bucket 'query-vector-search-testing' exists.\n", + "2025-09-15 12:43:06,199 - INFO - Collection 'cache' already exists. Skipping creation.\n", + "2025-09-15 12:43:08,367 - INFO - All documents cleared from the collection.\n" + ] + }, + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def setup_collection(cluster, bucket_name, scope_name, collection_name):\n", + " try:\n", + " # Check if bucket exists, create if it doesn't\n", + " try:\n", + " bucket = cluster.bucket(bucket_name)\n", + " logging.info(f\"Bucket '{bucket_name}' exists.\")\n", + " except Exception as e:\n", + " logging.info(f\"Bucket '{bucket_name}' does not exist. Creating it...\")\n", + " bucket_settings = CreateBucketSettings(\n", + " name=bucket_name,\n", + " bucket_type='couchbase',\n", + " ram_quota_mb=1024,\n", + " flush_enabled=True,\n", + " num_replicas=0\n", + " )\n", + " cluster.buckets().create_bucket(bucket_settings)\n", + " time.sleep(2) # Wait for bucket creation to complete and become available\n", + " bucket = cluster.bucket(bucket_name)\n", + " logging.info(f\"Bucket '{bucket_name}' created successfully.\")\n", + "\n", + " bucket_manager = bucket.collections()\n", + "\n", + " # Check if scope exists, create if it doesn't\n", + " scopes = bucket_manager.get_all_scopes()\n", + " scope_exists = any(scope.name == scope_name for scope in scopes)\n", + " \n", + " if not scope_exists and scope_name != \"_default\":\n", + " logging.info(f\"Scope '{scope_name}' does not exist. Creating it...\")\n", + " bucket_manager.create_scope(scope_name)\n", + " logging.info(f\"Scope '{scope_name}' created successfully.\")\n", + "\n", + " # Check if collection exists, create if it doesn't\n", + " collections = bucket_manager.get_all_scopes()\n", + " collection_exists = any(\n", + " scope.name == scope_name and collection_name in [col.name for col in scope.collections]\n", + " for scope in collections\n", + " )\n", + "\n", + " if not collection_exists:\n", + " logging.info(f\"Collection '{collection_name}' does not exist. Creating it...\")\n", + " bucket_manager.create_collection(scope_name, collection_name)\n", + " logging.info(f\"Collection '{collection_name}' created successfully.\")\n", + " else:\n", + " logging.info(f\"Collection '{collection_name}' already exists. 
Skipping creation.\")\n", + "\n", + " # Wait for collection to be ready\n", + " collection = bucket.scope(scope_name).collection(collection_name)\n", + " time.sleep(2) # Give the collection time to be ready for queries\n", + "\n", + " # Clear all documents in the collection\n", + " try:\n", + " query = f\"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`\"\n", + " cluster.query(query).execute()\n", + " logging.info(\"All documents cleared from the collection.\")\n", + " except Exception as e:\n", + " logging.warning(f\"Error while clearing documents: {str(e)}. The collection might be empty.\")\n", + "\n", + " return collection\n", + " except Exception as e:\n", + " raise RuntimeError(f\"Error setting up collection: {str(e)}\")\n", + " \n", + "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME)\n", + "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, CACHE_COLLECTION)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LT3s8x_Mx3KG" + }, + "source": [ + "# Create Embeddings\n", + "Embeddings are created using the Cohere API. Embeddings are vectors (arrays of numbers) that represent the meaning of text in a high-dimensional space. These embeddings are crucial for tasks like semantic search, where the goal is to find text that is semantically similar to a query. The script uses a pre-trained model provided by Cohere to generate embeddings for the text in the TREC dataset." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "id": "A6fG7Mopx3Np" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-22 12:56:36,813 - INFO - Successfully created CohereEmbeddings\n" + ] + } + ], + "source": [ + "try:\n", + " embeddings = CohereEmbeddings(\n", + " cohere_api_key=COHERE_API_KEY,\n", + " model=\"embed-english-v3.0\",\n", + " )\n", + " logging.info(\"Successfully created CohereEmbeddings\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Error creating CohereEmbeddings: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iar2fABrLJjK" + }, + "source": [ + "# Set Up Vector Store\n", + "The vector store is set up to manage the embeddings created in the previous step. The vector store is essentially a database optimized for storing and retrieving high-dimensional vectors. In this case, the vector store is built on top of Couchbase, allowing the script to store the embeddings in a way that can be efficiently searched.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "id": "cjASXR3dLJgZ" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-22 12:56:39,259 - INFO - Successfully created vector store\n" + ] + } + ], + "source": [ + "try:\n", + " vector_store = CouchbaseQueryVectorStore(\n", + " cluster=cluster,\n", + " bucket_name=CB_BUCKET_NAME,\n", + " scope_name=SCOPE_NAME,\n", + " collection_name=COLLECTION_NAME,\n", + " embedding = embeddings,\n", + " distance_metric=DistanceStrategy.COSINE\n", + " )\n", + " logging.info(\"Successfully created vector store\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Failed to create vector store: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Load the BBC News Dataset\n", + "To build a search engine, we need data to search through. We use the BBC News dataset from RealTimeData, which provides real-world news articles. 
This dataset contains news articles from BBC covering various topics and time periods. Loading the dataset is a crucial step because it provides the raw material that our search engine will work with. The quality and diversity of the news articles make it an excellent choice for testing and refining our search engine, ensuring it can handle real-world news content effectively.\n", + "\n", + "The BBC News dataset allows us to work with authentic news articles, enabling us to build and test a search engine that can effectively process and retrieve relevant news content. The dataset is loaded using the Hugging Face datasets library, specifically accessing the \"RealTimeData/bbc_news_alltime\" dataset with the \"2024-12\" version." + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-15 12:43:32,383 - INFO - Successfully loaded the BBC News dataset with 2687 rows.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loaded the BBC News dataset with 2687 rows\n" + ] + } + ], + "source": [ + "try:\n", + " news_dataset = load_dataset(\n", + " \"RealTimeData/bbc_news_alltime\", \"2024-12\", split=\"train\"\n", + " )\n", + " print(f\"Loaded the BBC News dataset with {len(news_dataset)} rows\")\n", + " logging.info(f\"Successfully loaded the BBC News dataset with {len(news_dataset)} rows.\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Error loading the BBC News dataset: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Cleaning up the Data\n", + "We will use the content of the news articles for our RAG system.\n", + "\n", + "The dataset contains a few duplicate records. We are removing them to avoid duplicate results in the retrieval stage of our RAG system." + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "We have 1749 unique articles in our database.\n" + ] + } + ], + "source": [ + "news_articles = news_dataset[\"content\"]\n", + "unique_articles = set()\n", + "for article in news_articles:\n", + " if article:\n", + " unique_articles.add(article)\n", + "unique_news_articles = list(unique_articles)\n", + "print(f\"We have {len(unique_news_articles)} unique articles in our database.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Saving Data to the Vector Store\n", + "To efficiently handle the large number of articles, we process them in batches of 50 articles at a time. This batch processing approach helps manage memory usage and provides better control over the ingestion process.\n", + "\n", + "We first filter out any articles that exceed 50,000 characters to avoid potential issues with token limits. Then, using the vector store's add_texts method, we add the filtered articles to our vector database. The batch_size parameter controls how many articles are processed in each iteration.\n", + "\n", + "This approach offers several benefits:\n", + "1. Memory Efficiency: Processing in smaller batches prevents memory overload\n", + "2. Progress Tracking: Easier to monitor and track the ingestion progress\n", + "3. 
Resource Management: Better control over CPU and network resource utilization\n", + "\n", + "We use a conservative batch size of 50 to ensure reliable operation.\n", + "The optimal batch size depends on many factors including:\n", + "- Document sizes being inserted\n", + "- Available system resources\n", + "- Network conditions\n", + "- Concurrent workload\n", + "\n", + "Consider measuring performance with your specific workload before adjusting.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-15 12:45:26,834 - INFO - Document ingestion completed successfully.\n" + ] + } + ], + "source": [ + "batch_size = 50\n", + "\n", + "# Automatic Batch Processing\n", + "articles = [article for article in unique_news_articles if article and len(article) <= 50000]\n", + "\n", + "try:\n", + " vector_store.add_texts(\n", + " texts=articles,\n", + " batch_size=batch_size\n", + " )\n", + " logging.info(\"Document ingestion completed successfully.\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Failed to save documents to vector store: {str(e)}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GQpib0zKLJTh" + }, + "source": [ + "# Create Language Model (LLM)\n", + "The script initializes a Cohere language model (LLM) that will be used for generating responses to queries. LLMs are powerful tools for natural language understanding and generation, capable of producing human-like text based on input prompts. The model is configured with specific parameters, such as the temperature, which controls the randomness of its outputs.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "id": "7eV1X5xILJRC" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-22 12:58:23,399 - INFO - Successfully created Cohere LLM with model command\n" + ] + } + ], + "source": [ + "try:\n", + " llm = ChatCohere(\n", + " cohere_api_key=COHERE_API_KEY,\n", + " model=\"command-a-03-2025\",\n", + " temperature=0\n", + " )\n", + " logging.info(\"Successfully created Cohere LLM with model command\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Error creating Cohere LLM: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wQ0fNbphbWpu" + }, + "source": [ + "# Perform Semantic Search\n", + "Semantic search in Couchbase involves converting queries and documents into vector representations using an embeddings model. These vectors capture the semantic meaning of the text and are stored directly in Couchbase. When a query is made, Couchbase performs a similarity search by comparing the query vector against the stored document vectors. The similarity metric used for this comparison is configurable, allowing flexibility in how the relevance of documents is determined. Common metrics include cosine similarity, Euclidean distance, or dot product, but other metrics can be implemented based on specific use cases. Different embedding models like BERT, Word2Vec, or GloVe can also be used depending on the application's needs, with the vectors generated by these models stored and searched within Couchbase itself.\n", + "\n", + "In the provided code, the search process begins by recording the start time, followed by executing the `similarity_search_with_score` method of the `CouchbaseQueryVectorStore`. 
This method searches Couchbase for the most relevant documents based on the vector similarity to the query. The search results include the document content and the distance that reflects how closely each document aligns with the query in the defined semantic space. The time taken to perform this search is then calculated and logged, and the results are displayed, showing the most relevant documents along with their similarity scores. This approach leverages Couchbase as both a storage and retrieval engine for vector data, enabling efficient and scalable semantic searches. The integration of vector storage and search capabilities within Couchbase allows for sophisticated semantic search operations without relying on external services for vector storage or comparison." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "id": "udcxHyloyoxE" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-22 12:59:03,622 - INFO - Semantic search completed in 1.18 seconds\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Semantic Search Results (completed in 1.18 seconds):\n", + "--------------------------------------------------------------------------------\n", + "Distance: 0.3359, Text: Manchester City boss Pep Guardiola has won 18 trophies since he arrived at the club in 2016\n", + "\n", + "Manchester City boss Pep Guardiola says he is \"fine\" despite admitting his sleep and diet are being affected by the worst run of results in his entire managerial career. In an interview with former Italy international Luca Toni for Amazon Prime Sport before Wednesday's Champions League defeat by Juventus, Guardiola touched on the personal impact City's sudden downturn in form has had. Guardiola said his state of mind was \"ugly\", that his sleep was \"worse\" and he was eating lighter as his digestion had suffered. City go into Sunday's derby against Manchester United at Etihad Stadium having won just one of their past 10 games. The Juventus loss means there is a chance they may not even secure a play-off spot in the Champions League. Asked to elaborate on his comments to Toni, Guardiola said: \"I'm fine. \"In our jobs we always want to do our best or the best as possible. When that doesn't happen you are more uncomfortable than when the situation is going well, always that happened. \"In good moments I am happier but when I get to the next game I am still concerned about what I have to do. There is no human being that makes an activity and it doesn't matter how they do.\" Guardiola said City have to defend better and \"avoid making mistakes at both ends\". To emphasise his point, Guardiola referred back to the third game of City's current run, against a Sporting side managed by Ruben Amorim, who will be in the United dugout at the weekend. City dominated the first half in Lisbon, led thanks to Phil Foden's early effort and looked to be cruising. Instead, they conceded three times in 11 minutes either side of half-time as Sporting eventually ran out 4-1 winners. \"I would like to play the game like we played in Lisbon on Sunday, believe me,\" said Guardiola, who is facing the prospect of only having three fit defenders for the derby as Nathan Ake and Manuel Akanji try to overcome injury concerns. If there is solace for City, it comes from the knowledge United are not exactly flying. 
Their comeback Europa League victory against Viktoria Plzen on Thursday was their third win of Amorim's short reign so far but only one of those successes has come in the Premier League, where United have lost their past two games against Arsenal and Nottingham Forest. Nevertheless, Guardiola can see improvements already on the red side of the city. \"It's already there,\" he said. \"You see all the patterns, the movements, the runners and the pace. He will do a good job at United, I'm pretty sure of that.\"\n", + "\n", + "Guardiola says skipper Kyle Walker has been offered support by the club after the City defender highlighted the racial abuse he had received on social media in the wake of the Juventus trip. \"It's unacceptable,\" he said. \"Not because it's Kyle - for any human being. \"Unfortunately it happens many times in the real world. It is not necessary to say he has the support of the entire club. It is completely unacceptable and we give our support to him.\"\n", + "--------------------------------------------------------------------------------\n", + "Distance: 0.3477, Text: 'We have to find a way' - Guardiola vows to end relegation form\n", + "\n", + "This video can not be played To play this video you need to enable JavaScript in your browser. 'Worrying' and 'staggering' - Why do Manchester City keep conceding?\n", + "\n", + "Manchester City are currently in relegation form and there is little sign of it ending. Saturday's 2-1 defeat at Aston Villa left them joint bottom of the form table over the past eight games with just Southampton for company. Saints, at the foot of the Premier League, have the same number of points, four, as City over their past eight matches having won one, drawn one and lost six - the same record as the floundering champions. And if Southampton - who appointed Ivan Juric as their new manager on Saturday - get at least a point at Fulham on Sunday, City will be on the worst run in the division. Even Wolves, who sacked boss Gary O'Neil last Sunday and replaced him with Vitor Pereira, have earned double the number of points during the same period having played a game fewer. They are damning statistics for Pep Guardiola, even if he does have some mitigating circumstances with injuries to Ederson, Nathan Ake and Ruben Dias - who all missed the loss at Villa Park - and the long-term loss of midfield powerhouse Rodri. Guardiola was happy with Saturday's performance, despite defeat in Birmingham, but there is little solace to take at slipping further out of the title race. He may have needed to field a half-fit Manuel Akanji and John Stones at Villa Park but that does not account for City looking a shadow of their former selves. That does not justify the error Josko Gvardiol made to gift Jhon Duran a golden chance inside the first 20 seconds, or \u00a3100m man Jack Grealish again failing to have an impact on a game. There may be legitimate reasons for City's drop off, whether that be injuries, mental fatigue or just simply a team coming to the end of its lifecycle, but their form, which has plunged off a cliff edge, would have been unthinkable as they strolled to a fourth straight title last season. \"The worrying thing is the number of goals conceded,\" said ex-England captain Alan Shearer on BBC Match of the Day. \"The number of times they were opened up because of the lack of protection and legs in midfield was staggering. 
There are so many things that are wrong at this moment in time.\"\n", + "\n", + "This video can not be played To play this video you need to enable JavaScript in your browser. Man City 'have to find a way' to return to form - Guardiola\n", + "\n", + "Afterwards Guardiola was calm, so much so it was difficult to hear him in the news conference, a contrast to the frustrated figure he cut on the touchline. He said: \"It depends on us. The solution is bring the players back. We have just one central defender fit, that is difficult. We are going to try next game - another opportunity and we don't think much further than that. \"Of course there are more reasons. We concede the goals we don't concede in the past, we [don't] score the goals we score in the past. Football is not just one reason. There are a lot of little factors. \"Last season we won the Premier League, but we came here and lost. We have to think positive and I have incredible trust in the guys. Some of them have incredible pride and desire to do it. We have to find a way, step by step, sooner or later to find a way back.\" Villa boss Unai Emery highlighted City's frailties, saying he felt Villa could seize on the visitors' lack of belief. \"Manchester City are a little bit under the confidence they have normally,\" he said. \"The second half was different, we dominated and we scored. Through those circumstances they were feeling worse than even in the first half.\"\n", + "\n", + "Erling Haaland had one touch in the Villa box\n", + "\n", + "There are chinks in the armour never seen before at City under Guardiola and Erling Haaland conceded belief within the squad is low. He told TNT after the game: \"Of course, [confidence levels are] not the best. We know how important confidence is and you can see that it affects every human being. That is how it is, we have to continue and stay positive even though it is difficult.\" Haaland, with 76 goals in 83 Premier League appearances since joining City from Borussia Dortmund in 2022, had one shot and one touch in the Villa box. His 18 touches in the whole game were the lowest of all starting players and he has been self critical, despite scoring 13 goals in the top flight this season. Over City's last eight games he has netted just twice though, but Guardiola refused to criticise his star striker. He said: \"Without him we will be even worse but I like the players feeling that way. I don't agree with Erling. He needs to have the balls delivered in the right spots but he will fight for the next one.\"\n", + "--------------------------------------------------------------------------------\n", + "Distance: 0.3677, Text: 'Self-doubt, errors & big changes' - inside the crisis at Man City\n", + "\n", + "\n", + "... 
(output truncated for brevity)\n" + ] + } + ], + "source": [ + "query = \"What was manchester city manager pep guardiola's reaction to the team's current form?\"\n", + "\n", + "try:\n", + "    # Perform the semantic search\n", + "    start_time = time.time()\n", + "    search_results = vector_store.similarity_search_with_score(query, k=10)\n", + "    search_elapsed_time = time.time() - start_time\n", + "\n", + "    logging.info(f\"Semantic search completed in {search_elapsed_time:.2f} seconds\")\n", + "\n", + "    # Display search results\n", + "    print(f\"\\nSemantic Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", + "    print(\"-\" * 80)  # Add separator line\n", + "    for doc, score in search_results:\n", + "        print(f\"Distance: {score:.4f}, Text: {doc.page_content}\")\n", + "        print(\"-\" * 80)  # Add separator between results\n", + "\n", + "except CouchbaseException as e:\n", + "    raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n", + "except Exception as e:\n", + "    raise RuntimeError(f\"Unexpected error: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Optimizing Vector Search with Hyperscale and Composite Vector Indexes\n", + "\n", + "While the semantic search above using `similarity_search_with_score` works effectively, we can significantly improve query performance by leveraging Hyperscale and Composite Vector Indexes in Couchbase.\n", + "\n", + "Couchbase offers three types of vector indexes, but for query-based (GSI) vector search we focus on two main types:\n", + "\n", + "Hyperscale Vector Indexes (BHIVE)\n", + "- Best for pure vector searches - content discovery, recommendations, semantic search\n", + "- High performance with low memory footprint - designed to scale to billions of vectors\n", + "- Optimized for concurrent operations - supports simultaneous searches and inserts\n", + "- Use when: You primarily perform vector-only queries without complex scalar filtering\n", + "- Ideal for: Large-scale semantic search, recommendation systems, content discovery\n", + "\n", + "Composite Vector Indexes\n", + "- Best for filtered vector searches - combines vector search with scalar value filtering\n", + "- Efficient pre-filtering - scalar attributes reduce the vector comparison scope\n", + "- Use when: Your queries combine vector similarity with scalar filters that eliminate large portions of data\n", + "- Ideal for: Compliance-based filtering, user-specific searches, time-bounded queries\n", + "\n", + "Choosing the Right Index Type\n", + "- Start with a Hyperscale Vector Index for pure vector searches and large datasets\n", + "- Use a Composite Vector Index when scalar filters significantly reduce your search space\n", + "- Consider your dataset size: Hyperscale scales to billions, Composite works well for tens of millions to billions\n", + "\n", + "For more details, see the [Couchbase Vector Index documentation](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html).\n", + "\n", + "\n", + "## Understanding Index Configuration (Couchbase 8.0 Feature)\n", + "\n", + "The `index_description` parameter controls how Couchbase optimizes vector storage and search performance through centroids and quantization:\n", + "\n", + "Format: `IVF[<centroids>],{SQ<bits>|PQ<subquantizers>x<bits>}`\n", + "\n", + "Centroids (IVF - Inverted File):\n", + "- Controls how the dataset is subdivided for faster searches\n", + "- More centroids = faster search, slower training\n", + "- Fewer centroids = slower search, faster training\n", + "- If omitted (like IVF,SQ8), Couchbase auto-selects based on dataset size\n", + "\n", + "Quantization Options:\n", + "- SQ (Scalar Quantization): SQ4, SQ6, SQ8 (4, 6, or 8 bits per dimension)\n", + "- PQ (Product Quantization): PQ<subquantizers>x<bits> (e.g., PQ32x8)\n", + "- Higher values = better accuracy, larger index size\n", + "\n", + "Common Examples:\n", + "- IVF,SQ8 - Auto centroids, 8-bit scalar quantization (good default)\n", + "- IVF1000,SQ6 - 1000 centroids, 6-bit scalar quantization\n", + "- IVF,PQ32x8 - Auto centroids, 32 subquantizers with 8 bits\n", + "\n", + "For detailed configuration options, see the [Quantization & Centroid Settings](https://docs.couchbase.com/cloud/vector-index/hyperscale-vector-index.html#algo_settings).\n", + "\n", + "In the code below, we demonstrate creating a BHIVE index. The `create_index` method takes an index type (BHIVE or COMPOSITE), an index name, and a description parameter for optimization settings. Alternatively, Hyperscale and Composite Vector Indexes can be created manually from the Couchbase UI." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "vector_store.create_index(index_type=IndexType.BHIVE, index_name=\"cohere_bhive_index\", index_description=\"IVF,SQ8\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The example below runs the same similarity search, but now uses the BHIVE (Hyperscale) index we created above. You'll notice improved performance as the index efficiently retrieves data.\n", + "\n", + "**Important**: When using Composite indexes, scalar filters take precedence over vector similarity, which can improve performance for filtered searches but may miss some semantically relevant results that don't match the scalar criteria.\n", + "\n", + "Note: In query-based vector search, the score represents the vector distance between the query and document embeddings. A lower distance indicates higher similarity, while a higher distance indicates lower similarity." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-22 12:59:26,949 - INFO - Semantic search completed in 0.38 seconds\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Semantic Search Results (completed in 0.38 seconds):\n", + "--------------------------------------------------------------------------------\n", + "Distance: 0.3359, Text: Manchester City boss Pep Guardiola has won 18 trophies since he arrived at the club in 2016\n", + "\n", + "Manchester City boss Pep Guardiola says he is \"fine\" despite admitting his sleep and diet are being affected by the worst run of results in his entire managerial career. In an interview with former Italy international Luca Toni for Amazon Prime Sport before Wednesday's Champions League defeat by Juventus, Guardiola touched on the personal impact City's sudden downturn in form has had. Guardiola said his state of mind was \"ugly\", that his sleep was \"worse\" and he was eating lighter as his digestion had suffered. City go into Sunday's derby against Manchester United at Etihad Stadium having won just one of their past 10 games. The Juventus loss means there is a chance they may not even secure a play-off spot in the Champions League. Asked to elaborate on his comments to Toni, Guardiola said: \"I'm fine. \"In our jobs we always want to do our best or the best as possible. When that doesn't happen you are more uncomfortable than when the situation is going well, always that happened. 
\"In good moments I am happier but when I get to the next game I am still concerned about what I have to do. There is no human being that makes an activity and it doesn't matter how they do.\" Guardiola said City have to defend better and \"avoid making mistakes at both ends\". To emphasise his point, Guardiola referred back to the third game of City's current run, against a Sporting side managed by Ruben Amorim, who will be in the United dugout at the weekend. City dominated the first half in Lisbon, led thanks to Phil Foden's early effort and looked to be cruising. Instead, they conceded three times in 11 minutes either side of half-time as Sporting eventually ran out 4-1 winners. \"I would like to play the game like we played in Lisbon on Sunday, believe me,\" said Guardiola, who is facing the prospect of only having three fit defenders for the derby as Nathan Ake and Manuel Akanji try to overcome injury concerns. If there is solace for City, it comes from the knowledge United are not exactly flying. Their comeback Europa League victory against Viktoria Plzen on Thursday was their third win of Amorim's short reign so far but only one of those successes has come in the Premier League, where United have lost their past two games against Arsenal and Nottingham Forest. Nevertheless, Guardiola can see improvements already on the red side of the city. \"It's already there,\" he said. \"You see all the patterns, the movements, the runners and the pace. He will do a good job at United, I'm pretty sure of that.\"\n", + "\n", + "Guardiola says skipper Kyle Walker has been offered support by the club after the City defender highlighted the racial abuse he had received on social media in the wake of the Juventus trip. \"It's unacceptable,\" he said. \"Not because it's Kyle - for any human being. \"Unfortunately it happens many times in the real world. It is not necessary to say he has the support of the entire club. It is completely unacceptable and we give our support to him.\"\n", + "--------------------------------------------------------------------------------\n", + "Distance: 0.3477, Text: 'We have to find a way' - Guardiola vows to end relegation form\n", + "\n", + "This video can not be played To play this video you need to enable JavaScript in your browser. 'Worrying' and 'staggering' - Why do Manchester City keep conceding?\n", + "\n", + "Manchester City are currently in relegation form and there is little sign of it ending. Saturday's 2-1 defeat at Aston Villa left them joint bottom of the form table over the past eight games with just Southampton for company. Saints, at the foot of the Premier League, have the same number of points, four, as City over their past eight matches having won one, drawn one and lost six - the same record as the floundering champions. And if Southampton - who appointed Ivan Juric as their new manager on Saturday - get at least a point at Fulham on Sunday, City will be on the worst run in the division. Even Wolves, who sacked boss Gary O'Neil last Sunday and replaced him with Vitor Pereira, have earned double the number of points during the same period having played a game fewer. They are damning statistics for Pep Guardiola, even if he does have some mitigating circumstances with injuries to Ederson, Nathan Ake and Ruben Dias - who all missed the loss at Villa Park - and the long-term loss of midfield powerhouse Rodri. 
Guardiola was happy with Saturday's performance, despite defeat in Birmingham, but there is little solace to take at slipping further out of the title race. He may have needed to field a half-fit Manuel Akanji and John Stones at Villa Park but that does not account for City looking a shadow of their former selves. That does not justify the error Josko Gvardiol made to gift Jhon Duran a golden chance inside the first 20 seconds, or \u00a3100m man Jack Grealish again failing to have an impact on a game. There may be legitimate reasons for City's drop off, whether that be injuries, mental fatigue or just simply a team coming to the end of its lifecycle, but their form, which has plunged off a cliff edge, would have been unthinkable as they strolled to a fourth straight title last season. \"The worrying thing is the number of goals conceded,\" said ex-England captain Alan Shearer on BBC Match of the Day. \"The number of times they were opened up because of the lack of protection and legs in midfield was staggering. There are so many things that are wrong at this moment in time.\"\n", + "\n", + "This video can not be played To play this video you need to enable JavaScript in your browser. Man City 'have to find a way' to return to form - Guardiola\n", + "\n", + "Afterwards Guardiola was calm, so much so it was difficult to hear him in the news conference, a contrast to the frustrated figure he cut on the touchline. He said: \"It depends on us. The solution is bring the players back. We have just one central defender fit, that is difficult. We are going to try next game - another opportunity and we don't think much further than that. \"Of course there are more reasons. We concede the goals we don't concede in the past, we [don't] score the goals we score in the past. Football is not just one reason. There are a lot of little factors. \"Last season we won the Premier League, but we came here and lost. We have to think positive and I have incredible trust in the guys. Some of them have incredible pride and desire to do it. We have to find a way, step by step, sooner or later to find a way back.\" Villa boss Unai Emery highlighted City's frailties, saying he felt Villa could seize on the visitors' lack of belief. \"Manchester City are a little bit under the confidence they have normally,\" he said. \"The second half was different, we dominated and we scored. Through those circumstances they were feeling worse than even in the first half.\"\n", + "\n", + "Erling Haaland had one touch in the Villa box\n", + "\n", + "There are chinks in the armour never seen before at City under Guardiola and Erling Haaland conceded belief within the squad is low. He told TNT after the game: \"Of course, [confidence levels are] not the best. We know how important confidence is and you can see that it affects every human being. That is how it is, we have to continue and stay positive even though it is difficult.\" Haaland, with 76 goals in 83 Premier League appearances since joining City from Borussia Dortmund in 2022, had one shot and one touch in the Villa box. His 18 touches in the whole game were the lowest of all starting players and he has been self critical, despite scoring 13 goals in the top flight this season. Over City's last eight games he has netted just twice though, but Guardiola refused to criticise his star striker. He said: \"Without him we will be even worse but I like the players feeling that way. I don't agree with Erling. 
He needs to have the balls delivered in the right spots but he will fight for the next one.\"\n", + "--------------------------------------------------------------------------------\n", + "Distance: 0.3677, Text: 'Self-doubt, errors & big changes' - inside the crisis at Man City\n", + "\n", + "\n", + "... (output truncated for brevity)\n" + ] + } + ], + "source": [ + "query = \"What was manchester city manager pep guardiola's reaction to the team's current form?\"\n", + "\n", + "try:\n", + " # Perform the semantic search\n", + " start_time = time.time()\n", + " search_results = vector_store.similarity_search_with_score(query, k=10)\n", + " search_elapsed_time = time.time() - start_time\n", + "\n", + " logging.info(f\"Semantic search completed in {search_elapsed_time:.2f} seconds\")\n", + "\n", + " # Display search results\n", + " print(f\"\\nSemantic Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", + " print(\"-\" * 80) # Add separator line\n", + " for doc, score in search_results:\n", + " print(f\"Distance: {score:.4f}, Text: {doc.page_content}\")\n", + " print(\"-\" * 80) # Add separator between results\n", + "\n", + "except CouchbaseException as e:\n", + " raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n", + "except Exception as e:\n", + " raise RuntimeError(f\"Unexpected error: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note: To create a COMPOSITE index, the below code can be used.\n", + "Choose based on your specific use case and query patterns. For this tutorial's news search scenario, either index type would work, but BHIVE might be more efficient for pure semantic search across news articles." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "vector_store.create_index(index_type=IndexType.COMPOSITE, index_name=\"cohere_composite_index\", index_description=\"IVF,SQ8\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Set Up Cache\n", + " A cache is set up using Couchbase to store intermediate results and frequently accessed data. Caching is important for improving performance, as it reduces the need to repeatedly calculate or retrieve the same data. The cache is linked to a specific collection in Couchbase, and it is used later in the script to store the results of language model queries.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-22 12:59:40,381 - INFO - Successfully created cache\n" + ] + } + ], + "source": [ + "try:\n", + " cache = CouchbaseCache(\n", + " cluster=cluster,\n", + " bucket_name=CB_BUCKET_NAME,\n", + " scope_name=SCOPE_NAME,\n", + " collection_name=CACHE_COLLECTION,\n", + " )\n", + " logging.info(\"Successfully created cache\")\n", + " set_llm_cache(cache)\n", + "except Exception as e:\n", + " raise ValueError(f\"Failed to create cache: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Bt44X6-bLJOb" + }, + "source": [ + "# Retrieval-Augmented Generation (RAG) with Couchbase and Langchain\n", + "Couchbase and LangChain can be seamlessly integrated to create RAG (Retrieval-Augmented Generation) chains, enhancing the process of generating contextually relevant responses. In this setup, Couchbase serves as the vector store, where embeddings of documents are stored. 
When a query is made, LangChain retrieves the most relevant documents from Couchbase by comparing the query\u2019s embedding with the stored document embeddings. These documents, which provide contextual information, are then passed to a generative language model within LangChain.\n", + "\n", + "The language model, equipped with the context from the retrieved documents, generates a response that is both informed and contextually accurate. This integration allows the RAG chain to leverage Couchbase\u2019s efficient storage and retrieval capabilities, while LangChain handles the generation of responses based on the context provided by the retrieved documents. Together, they create a powerful system that can deliver highly relevant and accurate answers by combining the strengths of both retrieval and generation." + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": { + "id": "6cGJfwS2LI_O" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-15 12:53:46,979 - INFO - Successfully created RAG chain\n" + ] + } + ], + "source": [ + "try:\n", + " template = \"\"\"You are a helpful bot. If you cannot answer based on the context provided, respond with a generic answer. Answer the question as truthfully as possible using the context below:\n", + " {context}\n", + "\n", + " Question: {question}\"\"\"\n", + " prompt = ChatPromptTemplate.from_template(template)\n", + "\n", + " rag_chain = (\n", + " {\"context\": vector_store.as_retriever(), \"question\": RunnablePassthrough()}\n", + " | prompt\n", + " | llm\n", + " | StrOutputParser()\n", + " )\n", + " logging.info(\"Successfully created RAG chain\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Error creating RAG chain: {str(e)}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": { + "id": "PvuJyXPUFOux" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "RAG Response: Manchester City manager Pep Guardiola has expressed concern and frustration over the team's recent form, describing it as the \"worst run of results\" in his managerial career. He has admitted that the situation has affected his sleep and diet, stating that his state of mind is \"ugly\" and his sleep is \"worse.\" Guardiola has also acknowledged the need for the team to defend better and avoid making mistakes at both ends of the pitch. Despite the challenges, he remains focused on finding solutions and has emphasized the importance of bringing injured players back to the squad. Guardiola has also highlighted the need for the team to recover its essence by improving defensive concepts and re-establishing the intensity they are known for. He has taken a self-critical approach, stating that he is \"not good enough\" to resolve the situation with the current group of players and has vowed to find solutions to turn the team's form around.\n", + "RAG response generated in 4.09 seconds\n" + ] + } + ], + "source": [ + "start_time = time.time()\n", + "try:\n", + " rag_response = rag_chain.invoke(query)\n", + " rag_elapsed_time = time.time() - start_time\n", + " print(f\"RAG Response: {rag_response}\")\n", + " print(f\"RAG response generated in {rag_elapsed_time:.2f} seconds\")\n", + "except InternalServerFailureException as e:\n", + " if \"query request rejected\" in str(e):\n", + " print(\"Error: Search request was rejected due to rate limiting. 
Please try again later.\")\n", + "    else:\n", + "        print(f\"Internal server error occurred: {str(e)}\")\n", + "except Exception as e:\n", + "    print(f\"Unexpected error occurred: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cUXEVXyxGlv2" + }, + "source": [ + "# Using Couchbase as a caching mechanism\n", + "Couchbase can be effectively used as a caching mechanism for RAG (Retrieval-Augmented Generation) responses by storing and retrieving precomputed results for specific queries. This approach enhances the system's efficiency and speed, particularly when dealing with repeated or similar queries. When a query is first processed, the RAG chain retrieves relevant documents, generates a response using the language model, and then stores this response in Couchbase, with the query serving as the key.\n", + "\n", + "For subsequent requests with the same query, the system checks Couchbase first. If a cached response is found, it is retrieved directly from Couchbase, bypassing the need to re-run the entire RAG process. This significantly reduces response time because the computationally expensive steps of document retrieval and response generation are skipped. Couchbase's role in this setup is to provide a fast and scalable storage solution for caching these responses, ensuring that frequently asked queries can be answered more quickly and efficiently." + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": { + "id": "J_PaTD2aGmGt" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Query 1: What happened in the match between Fulham and Liverpool?\n", + "Response: In the match between Fulham and Liverpool, Liverpool played with 10 men for 89 minutes after Andy Robertson received a red card in the 17th minute. Despite this numerical disadvantage, Liverpool managed to secure a 2-2 draw at Anfield. Fulham took the lead twice, but Liverpool responded both times, with Diogo Jota scoring an 86th-minute equalizer. The performance highlighted Liverpool's resilience and title credentials, with Fulham's Antonee Robinson praising Liverpool for not seeming like they were a man down. Liverpool maintained over 60% possession and dominated attacking metrics, showcasing their ability to fight back under adversity.\n", + "Time taken: 2.12 seconds\n", + "\n", + "Query 2: What was manchester city manager pep guardiola's reaction to the team's current form?\n", + "Response: Manchester City manager Pep Guardiola has expressed concern and frustration over the team's recent form, describing it as the \"worst run of results\" in his managerial career. He has admitted that the situation has affected his sleep and diet, stating that his state of mind is \"ugly\" and his sleep is \"worse.\" Guardiola has also acknowledged the need for the team to defend better and avoid making mistakes at both ends of the pitch. Despite the challenges, he remains focused on finding solutions and has emphasized the importance of bringing injured players back to the squad. Guardiola has also highlighted the need for the team to recover its essence by improving defensive concepts and re-establishing the intensity they are known for. 
He has taken a self-critical approach, stating that he is \"not good enough\" to resolve the situation with the current group of players and has vowed to find solutions to turn the team's form around.\n", + "Time taken: 0.35 seconds\n", + "\n", + "Query 3: What happened in the match between Fulham and Liverpool?\n", + "Response: In the match between Fulham and Liverpool, Liverpool played with 10 men for 89 minutes after Andy Robertson received a red card in the 17th minute. Despite this numerical disadvantage, Liverpool managed to secure a 2-2 draw at Anfield. Fulham took the lead twice, but Liverpool responded both times, with Diogo Jota scoring an 86th-minute equalizer. The performance highlighted Liverpool's resilience and title credentials, with Fulham's Antonee Robinson praising Liverpool for not seeming like they were a man down. Liverpool maintained over 60% possession and dominated attacking metrics, showcasing their ability to fight back under adversity.\n", + "Time taken: 0.35 seconds\n" + ] + } + ], + "source": [ + "try:\n", + "    queries = [\n", + "        \"What happened in the match between Fulham and Liverpool?\",\n", + "        \"What was manchester city manager pep guardiola's reaction to the team's current form?\",  # Repeated query\n", + "        \"What happened in the match between Fulham and Liverpool?\",  # Repeated query\n", + "    ]\n", + "\n", + "    for i, query in enumerate(queries, 1):\n", + "        print(f\"\\nQuery {i}: {query}\")\n", + "        start_time = time.time()\n", + "        response = rag_chain.invoke(query)\n", + "        elapsed_time = time.time() - start_time\n", + "        print(f\"Response: {response}\")\n", + "        print(f\"Time taken: {elapsed_time:.2f} seconds\")\n", + "except InternalServerFailureException as e:\n", + "    if \"query request rejected\" in str(e):\n", + "        print(\"Error: Search request was rejected due to rate limiting. Please try again later.\")\n", + "    else:\n", + "        print(f\"Internal server error occurred: {str(e)}\")\n", + "except Exception as e:\n", + "    print(f\"Unexpected error occurred: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusion\n", + "By following these steps, you'll have a fully functional semantic search engine that leverages the strengths of Couchbase and Cohere. This guide is designed not just to show you how to build the system, but also to explain why each step is necessary, giving you a deeper understanding of the principles behind semantic search and of how Hyperscale and Composite Vector Indexes make querying data more efficient, which can significantly improve your RAG performance. Whether you're a newcomer to software development or an experienced developer looking to expand your skills, this guide will provide you with the knowledge and tools you need to create a powerful, AI-driven search engine."
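+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Appendix: Verifying the Vector Indexes\n", + "As an optional sanity check, the sketch below queries Couchbase's `system:indexes` keyspace to confirm that the vector indexes created earlier are online. It assumes the index names used above (`cohere_bhive_index` and, if you created it, `cohere_composite_index`); adjust the filter if you chose different names." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Minimal sanity check: list the vector indexes created in this tutorial\n", + "# and print their state (they should report as 'online' once built).\n", + "result = cluster.query(\"SELECT name, state FROM system:indexes WHERE name LIKE 'cohere%'\")\n", + "for row in result:\n", + "    print(row)"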
+ ] + } + ], + "metadata": { + "accelerator": "TPU", + "colab": { + "gpuType": "V28", + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.3" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/cohere/gsi/frontmatter.md b/cohere/query_based/frontmatter.md similarity index 100% rename from cohere/gsi/frontmatter.md rename to cohere/query_based/frontmatter.md diff --git a/cohere/fts/.env.sample b/cohere/search_based/.env.sample similarity index 100% rename from cohere/fts/.env.sample rename to cohere/search_based/.env.sample diff --git a/cohere/search_based/RAG_with_Couchbase_and_Cohere.ipynb b/cohere/search_based/RAG_with_Couchbase_and_Cohere.ipynb new file mode 100644 index 00000000..6ce8b4c0 --- /dev/null +++ b/cohere/search_based/RAG_with_Couchbase_and_Cohere.ipynb @@ -0,0 +1,1019 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "zAPY14a2BOhq" + }, + "source": [ + "# Introduction\n", + "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database and [Cohere](https://cohere.com/) as the AI-powered embedding and language model provider. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system using the Couchbase Search Vector Index from scratch. For guidance on choosing the right vector index for your use case, see the [Couchbase documentation](https://docs.couchbase.com/server/current/vector-search/choose-the-right-vector-index.html). Alternatively, if you want to perform semantic search using Couchbase Hyperscale or Composite Vector Indexes, please take a look at [this tutorial](https://developer.couchbase.com/tutorial-cohere-couchbase-rag-with-hyperscale-or-composite-vector-index/)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# How to run this tutorial\n", + "\n", + "This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/cohere/search_based/RAG_with_Couchbase_and_Cohere.ipynb).\n", + "\n", + "You can either download the notebook file and run it on [Google Colab](https://colab.research.google.com/) or run it on your system by setting up the Python environment." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Before you start\n", + "\n", + "## Get Credentials for Cohere\n", + "\n", + "Please follow the [instructions](https://dashboard.cohere.com/welcome/register) to generate the Cohere credentials.\n", + "\n", + "## Create and Deploy Your Free Tier Operational cluster on Capella\n", + "\n", + "To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. 
This account provides you with an environment where you can explore and learn about Capella with no time constraint.\n", + "\n", + "To learn more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n", + "\n", + "### Couchbase Capella Configuration\n", + "\n", + "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met.\n", + "\n", + "* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the required bucket (Read and Write) used in the application.\n", + "* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EYZzrd_tBdUC" + }, + "source": [ + "# Setting the Stage: Installing Necessary Libraries\n", + "To build our semantic search engine, we need a robust set of tools. The libraries we install handle everything from connecting to databases to performing complex machine learning tasks." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "id": "cYUkZqeoEykk" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], + "source": [ + "%pip install --quiet datasets==3.5.0 langchain-couchbase==0.3.0 langchain-cohere==0.4.4 python-dotenv==1.1.0" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Dw3IL3GEJSj7" + }, + "source": [ + "# Importing Necessary Libraries\n", + "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, embedding generation, and dataset loading. These libraries provide essential functions for working with data, managing database connections, and processing machine learning models." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "oziN03NZJLQw" + }, + "outputs": [], + "source": [ + "import getpass\n", + "import json\n", + "import logging\n", + "import os\n", + "import time\n", + "from datetime import timedelta\n", + "from uuid import uuid4\n", + "\n", + "from couchbase.auth import PasswordAuthenticator\n", + "from couchbase.cluster import Cluster\n", + "from couchbase.exceptions import (CouchbaseException,\n", + " InternalServerFailureException,\n", + " QueryIndexAlreadyExistsException,\n", + " ServiceUnavailableException)\n", + "from couchbase.management.buckets import CreateBucketSettings\n", + "from couchbase.management.search import SearchIndex\n", + "from couchbase.options import ClusterOptions\n", + "from datasets import load_dataset\n", + "from dotenv import load_dotenv\n", + "from langchain_cohere import ChatCohere, CohereEmbeddings\n", + "from langchain_core.globals import set_llm_cache\n", + "from langchain_core.output_parsers import StrOutputParser\n", + "from langchain_core.prompts import ChatPromptTemplate\n", + "from langchain_core.runnables import RunnablePassthrough\n", + "from langchain_couchbase.cache import CouchbaseCache\n", + "from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iXwzTRdbCLL1" + }, + "source": [ + "# Setup Logging\n", + "Logging is configured to track the progress of the script and capture any errors or warnings. 
This is crucial for debugging and understanding the flow of execution. The logging output includes timestamps, log levels (e.g., INFO, ERROR), and messages that describe what is happening in the script.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "id": "R-SanCZrCLdm" + }, + "outputs": [], + "source": [ + "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s',force=True)\n", + "\n", + "# Suppress excessive logging\n", + "logging.getLogger('openai').setLevel(logging.WARNING)\n", + "logging.getLogger('httpx').setLevel(logging.WARNING)\n", + "logging.getLogger('langchain_cohere').setLevel(logging.ERROR)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zOwSwRoHJLXv" + }, + "source": [ + "# Loading Sensitive Information\n", + "In this section, we prompt the user to input essential configuration settings needed for integrating Couchbase with Cohere's API. These settings include sensitive information like API keys, database credentials, and specific configuration names. Instead of hardcoding these details into the script, we request the user to provide them at runtime, ensuring flexibility and security.\n", + "\n", + "The script also validates that all required inputs are provided, raising an error if any crucial information is missing. This keeps the integration correctly configured while avoiding hardcoded secrets, which improves both the security and the maintainability of your code." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "id": "y2H9xphrJLbP" + }, + "outputs": [], + "source": [ + "load_dotenv()\n", + "\n", + "COHERE_API_KEY = os.getenv('COHERE_API_KEY') or getpass.getpass('Enter your Cohere API key: ')\n", + "CB_HOST = os.getenv('CB_HOST') or input('Enter your Couchbase host (default: couchbase://localhost): ') or 'couchbase://localhost'\n", + "CB_USERNAME = os.getenv('CB_USERNAME') or input('Enter your Couchbase username (default: Administrator): ') or 'Administrator'\n", + "CB_PASSWORD = os.getenv('CB_PASSWORD') or getpass.getpass('Enter your Couchbase password (default: password): ') or 'password'\n", + "CB_BUCKET_NAME = os.getenv('CB_BUCKET_NAME') or input('Enter your Couchbase bucket name (default: vector-search-testing): ') or 'vector-search-testing'\n", + "INDEX_NAME = os.getenv('INDEX_NAME') or input('Enter your index name (default: vector_search_cohere): ') or 'vector_search_cohere'\n", + "SCOPE_NAME = os.getenv('SCOPE_NAME') or input('Enter your scope name (default: shared): ') or 'shared'\n", + "COLLECTION_NAME = os.getenv('COLLECTION_NAME') or input('Enter your collection name (default: cohere): ') or 'cohere'\n", + "CACHE_COLLECTION = os.getenv('CACHE_COLLECTION') or input('Enter your cache collection name (default: cache): ') or 'cache'\n", + "\n", + "# Check if the variables are correctly loaded\n", + "if not COHERE_API_KEY:\n", + " raise ValueError(\"COHERE_API_KEY is not provided and is required.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sdKdLg9pJLl5" + }, + "source": [ + "# Connect to Couchbase\n", + "The script attempts to establish a connection to the Couchbase database using the credentials retrieved from the environment variables. Couchbase is a NoSQL database known for its flexibility, scalability, and support for various data models, including document-based storage. 
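\n", + "\n", + "If you are connecting to a Capella cluster, note that the connection string uses the TLS-enabled `couchbases://` scheme rather than plain `couchbase://`. A minimal sketch (the endpoint below is a placeholder; substitute the connection string shown for your cluster in the Capella UI):\n", + "\n", + "```python\n", + "CB_HOST = \"couchbases://cb.<your-endpoint>.cloud.couchbase.com\"\n", + "```\n", + "\n", + "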
The connection is authenticated using a username and password, and the script waits until the connection is fully established before proceeding.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "id": "HubiGMCSJLqw" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-02-06 01:27:13,562 - INFO - Successfully connected to Couchbase\n" + ] + } + ], + "source": [ + "try:\n", + " auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)\n", + " options = ClusterOptions(auth)\n", + " cluster = Cluster(CB_HOST, options)\n", + " cluster.wait_until_ready(timedelta(seconds=5))\n", + " logging.info(\"Successfully connected to Couchbase\")\n", + "except Exception as e:\n", + " raise ConnectionError(f\"Failed to connect to Couchbase: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setting Up Collections in Couchbase\n", + "\n", + "The setup_collection() function handles creating and configuring the hierarchical data organization in Couchbase:\n", + "\n", + "1. Bucket Creation:\n", + " - Checks if specified bucket exists, creates it if not\n", + " - Sets bucket properties like RAM quota (1024MB) and replication (disabled)\n", + " - Note: You will not be able to create a bucket on Capella\n", + "\n", + "2. Scope Management: \n", + " - Verifies if requested scope exists within bucket\n", + " - Creates new scope if needed (unless it's the default \"_default\" scope)\n", + "\n", + "3. Collection Setup:\n", + " - Checks for collection existence within scope\n", + " - Creates collection if it doesn't exist\n", + " - Waits 2 seconds for collection to be ready\n", + "\n", + "Additional Tasks:\n", + "- Creates primary index on collection for query performance\n", + "- Clears any existing documents for clean state\n", + "- Implements comprehensive error handling and logging\n", + "\n", + "The function is called twice to set up:\n", + "1. Main collection for vector embeddings\n", + "2. Cache collection for storing results\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-02-06 01:27:14,806 - INFO - Bucket 'vector-search-testing' exists.\n", + "2025-02-06 01:27:17,199 - INFO - Collection 'cohere' already exists. Skipping creation.\n", + "2025-02-06 01:27:20,585 - INFO - Primary index present or created successfully.\n", + "2025-02-06 01:27:20,888 - INFO - All documents cleared from the collection.\n", + "2025-02-06 01:27:20,889 - INFO - Bucket 'vector-search-testing' exists.\n", + "2025-02-06 01:27:23,271 - INFO - Collection 'cache' already exists. Skipping creation.\n", + "2025-02-06 01:27:26,258 - INFO - Primary index present or created successfully.\n", + "2025-02-06 01:27:26,497 - INFO - All documents cleared from the collection.\n" + ] + }, + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def setup_collection(cluster, bucket_name, scope_name, collection_name):\n", + " try:\n", + " # Check if bucket exists, create if it doesn't\n", + " try:\n", + " bucket = cluster.bucket(bucket_name)\n", + " logging.info(f\"Bucket '{bucket_name}' exists.\")\n", + " except Exception as e:\n", + " logging.info(f\"Bucket '{bucket_name}' does not exist. 
Creating it...\")\n", + " bucket_settings = CreateBucketSettings(\n", + " name=bucket_name,\n", + " bucket_type='couchbase',\n", + " ram_quota_mb=1024,\n", + " flush_enabled=True,\n", + " num_replicas=0\n", + " )\n", + " cluster.buckets().create_bucket(bucket_settings)\n", + " time.sleep(2) # Wait for bucket creation to complete and become available\n", + " bucket = cluster.bucket(bucket_name)\n", + " logging.info(f\"Bucket '{bucket_name}' created successfully.\")\n", + "\n", + " bucket_manager = bucket.collections()\n", + "\n", + " # Check if scope exists, create if it doesn't\n", + " scopes = bucket_manager.get_all_scopes()\n", + " scope_exists = any(scope.name == scope_name for scope in scopes)\n", + " \n", + " if not scope_exists and scope_name != \"_default\":\n", + " logging.info(f\"Scope '{scope_name}' does not exist. Creating it...\")\n", + " bucket_manager.create_scope(scope_name)\n", + " logging.info(f\"Scope '{scope_name}' created successfully.\")\n", + "\n", + " # Check if collection exists, create if it doesn't\n", + " collections = bucket_manager.get_all_scopes()\n", + " collection_exists = any(\n", + " scope.name == scope_name and collection_name in [col.name for col in scope.collections]\n", + " for scope in collections\n", + " )\n", + "\n", + " if not collection_exists:\n", + " logging.info(f\"Collection '{collection_name}' does not exist. Creating it...\")\n", + " bucket_manager.create_collection(scope_name, collection_name)\n", + " logging.info(f\"Collection '{collection_name}' created successfully.\")\n", + " else:\n", + " logging.info(f\"Collection '{collection_name}' already exists. Skipping creation.\")\n", + "\n", + " # Wait for collection to be ready\n", + " collection = bucket.scope(scope_name).collection(collection_name)\n", + " time.sleep(2) # Give the collection time to be ready for queries\n", + "\n", + " # Ensure primary index exists\n", + " try:\n", + " cluster.query(f\"CREATE PRIMARY INDEX IF NOT EXISTS ON `{bucket_name}`.`{scope_name}`.`{collection_name}`\").execute()\n", + " logging.info(\"Primary index present or created successfully.\")\n", + " except Exception as e:\n", + " logging.warning(f\"Error creating primary index: {str(e)}\")\n", + "\n", + " # Clear all documents in the collection\n", + " try:\n", + " query = f\"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`\"\n", + " cluster.query(query).execute()\n", + " logging.info(\"All documents cleared from the collection.\")\n", + " except Exception as e:\n", + " logging.warning(f\"Error while clearing documents: {str(e)}. The collection might be empty.\")\n", + "\n", + " return collection\n", + " except Exception as e:\n", + " raise RuntimeError(f\"Error setting up collection: {str(e)}\")\n", + " \n", + "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME)\n", + "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, CACHE_COLLECTION)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "j4tYSkkDxS9O" + }, + "source": [ + "# Loading Couchbase Vector Search Index\n", + "\n", + "Semantic search requires an efficient way to retrieve relevant documents based on a user's query. This is where the Couchbase **Vector Search Index** comes into play. In this step, we load the Vector Search Index definition from a JSON file, which specifies how the index should be structured. 
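\n", + "\n", + "For orientation, here is an abridged sketch of what such a definition contains (the `cohere_index.json` file that ships with this tutorial is the source of truth; only the vector-related parts are shown here):\n", + "\n", + "```json\n", + "{\n", + "  \"name\": \"vector_search_cohere\",\n", + "  \"type\": \"fulltext-index\",\n", + "  \"sourceName\": \"vector-search-testing\",\n", + "  \"params\": {\n", + "    \"mapping\": {\n", + "      \"types\": {\n", + "        \"shared.cohere\": {\n", + "          \"properties\": {\n", + "            \"embedding\": {\n", + "              \"fields\": [{\n", + "                \"name\": \"embedding\",\n", + "                \"type\": \"vector\",\n", + "                \"dims\": 1024,\n", + "                \"similarity\": \"dot_product\",\n", + "                \"vector_index_optimized_for\": \"recall\"\n", + "              }]\n", + "            }\n", + "          }\n", + "        }\n", + "      }\n", + "    }\n", + "  }\n", + "}\n", + "```\n", + "\n", + "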
This includes the fields to be indexed, the dimensions of the vectors, and other parameters that determine how the search engine processes queries based on vector similarity.\n", + "\n", + "This Cohere vector search index configuration requires specific default settings to function properly. This tutorial uses the bucket named `vector-search-testing` with the scope `shared` and collection `cohere`. The configuration is set up for vectors with exactly `1024 dimensions`, using dot product similarity and optimized for recall. If you want to use a different bucket, scope, or collection, you will need to modify the index configuration accordingly.\n", + "\n", + "For more information on creating a vector search index, please follow the [instructions](https://docs.couchbase.com/cloud/vector-search/create-vector-search-index-ui.html).\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "id": "szXN-oNGxTMF" + }, + "outputs": [], + "source": [ + "# If you are running this script locally (not in Google Colab), uncomment the following line\n", + "# and provide the path to your index definition file.\n", + "\n", + "# index_definition_path = '/path_to_your_index_file/cohere_index.json' # Local setup: specify your file path here\n", + "\n", + "# # Version for Google Colab\n", + "# def load_index_definition_colab():\n", + "# from google.colab import files\n", + "# print(\"Upload your index definition file\")\n", + "# uploaded = files.upload()\n", + "# index_definition_path = list(uploaded.keys())[0]\n", + "\n", + "# try:\n", + "# with open(index_definition_path, 'r') as file:\n", + "# index_definition = json.load(file)\n", + "# return index_definition\n", + "# except Exception as e:\n", + "# raise ValueError(f\"Error loading index definition from {index_definition_path}: {str(e)}\")\n", + "\n", + "# Version for Local Environment\n", + "def load_index_definition_local(index_definition_path):\n", + " try:\n", + " with open(index_definition_path, 'r') as file:\n", + " index_definition = json.load(file)\n", + " return index_definition\n", + " except Exception as e:\n", + " raise ValueError(f\"Error loading index definition from {index_definition_path}: {str(e)}\")\n", + "\n", + "# Usage\n", + "# Uncomment the appropriate line based on your environment\n", + "# index_definition = load_index_definition_colab()\n", + "index_definition = load_index_definition_local('cohere_index.json')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TXGj5YokJLuU" + }, + "source": [ + "# Creating or Updating Search Indexes\n", + "\n", + "With the index definition loaded, the next step is to create or update the **Vector Search Index** in Couchbase. This step is crucial because it optimizes our database for vector similarity search operations, allowing us to perform searches based on the semantic content of documents rather than just keywords. By creating or updating a Vector Search Index, we enable our search engine to handle complex queries that involve finding semantically similar documents using vector embeddings, which is essential for a robust semantic search engine." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "id": "VHeB_AVmLJlx" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-02-06 01:27:27,729 - INFO - Index 'vector_search_cohere' found\n", + "2025-02-06 01:27:28,595 - INFO - Index 'vector_search_cohere' already exists. 
Skipping creation/update.\n" + ] + } + ], + "source": [ + "try:\n", + " scope_index_manager = cluster.bucket(CB_BUCKET_NAME).scope(SCOPE_NAME).search_indexes()\n", + "\n", + " # Check if index already exists\n", + " existing_indexes = scope_index_manager.get_all_indexes()\n", + " index_name = index_definition[\"name\"]\n", + "\n", + " if index_name in [index.name for index in existing_indexes]:\n", + " logging.info(f\"Index '{index_name}' found\")\n", + " else:\n", + " logging.info(f\"Creating new index '{index_name}'...\")\n", + "\n", + " # Create SearchIndex object from JSON definition\n", + " search_index = SearchIndex.from_json(index_definition)\n", + "\n", + " # Upsert the index (create if not exists, update if exists)\n", + " scope_index_manager.upsert_index(search_index)\n", + " logging.info(f\"Index '{index_name}' successfully created/updated.\")\n", + "\n", + "except QueryIndexAlreadyExistsException:\n", + " logging.info(f\"Index '{index_name}' already exists. Skipping creation/update.\")\n", + "except ServiceUnavailableException:\n", + " raise RuntimeError(\"Search service is not available. Please ensure the Search service is enabled in your Couchbase cluster.\")\n", + "except InternalServerFailureException as e:\n", + " logging.error(f\"Internal server error: {str(e)}\")\n", + " raise" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LT3s8x_Mx3KG" + }, + "source": [ + "# Create Embeddings\n", + "Embeddings are created using the Cohere API. Embeddings are vectors (arrays of numbers) that represent the meaning of text in a high-dimensional space. These embeddings are crucial for tasks like semantic search, where the goal is to find text that is semantically similar to a query. The script uses a pre-trained model provided by Cohere to generate embeddings for the text in the TREC dataset." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "id": "A6fG7Mopx3Np" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-02-06 01:27:28,613 - INFO - Successfully created CohereEmbeddings\n" + ] + } + ], + "source": [ + "try:\n", + " embeddings = CohereEmbeddings(\n", + " cohere_api_key=COHERE_API_KEY,\n", + " model=\"embed-english-v3.0\",\n", + " )\n", + " logging.info(\"Successfully created CohereEmbeddings\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Error creating CohereEmbeddings: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iar2fABrLJjK" + }, + "source": [ + "# Set Up Vector Store\n", + "The vector store is set up to manage the embeddings created in the previous step. The vector store is essentially a database optimized for storing and retrieving high-dimensional vectors. 
In this case, the vector store is built on top of Couchbase, allowing the script to store the embeddings in a way that can be efficiently searched.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "id": "cjASXR3dLJgZ" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-02-06 01:27:32,177 - INFO - Successfully created vector store\n" + ] + } + ], + "source": [ + "try:\n", + " vector_store = CouchbaseSearchVectorStore(\n", + " cluster=cluster,\n", + " bucket_name=CB_BUCKET_NAME,\n", + " scope_name=SCOPE_NAME,\n", + " collection_name=COLLECTION_NAME,\n", + " embedding=embeddings,\n", + " index_name=INDEX_NAME,\n", + " )\n", + " logging.info(\"Successfully created vector store\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Failed to create vector store: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Load the BBC News Dataset\n", + "To build a search engine, we need data to search through. We use the BBC News dataset from RealTimeData, which provides real-world news articles. This dataset contains news articles from BBC covering various topics and time periods. Loading the dataset is a crucial step because it provides the raw material that our search engine will work with. The quality and diversity of the news articles make it an excellent choice for testing and refining our search engine, ensuring it can handle real-world news content effectively.\n", + "\n", + "The BBC News dataset allows us to work with authentic news articles, enabling us to build and test a search engine that can effectively process and retrieve relevant news content. The dataset is loaded using the Hugging Face datasets library, specifically accessing the \"RealTimeData/bbc_news_alltime\" dataset with the \"2024-12\" version." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-02-06 01:27:38,003 - INFO - Successfully loaded the BBC News dataset with 2687 rows.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loaded the BBC News dataset with 2687 rows\n" + ] + } + ], + "source": [ + "try:\n", + " news_dataset = load_dataset(\n", + " \"RealTimeData/bbc_news_alltime\", \"2024-12\", split=\"train\"\n", + " )\n", + " print(f\"Loaded the BBC News dataset with {len(news_dataset)} rows\")\n", + " logging.info(f\"Successfully loaded the BBC News dataset with {len(news_dataset)} rows.\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Error loading the BBC News dataset: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Cleaning up the Data\n", + "We will use the content of the news articles for our RAG system.\n", + "\n", + "The dataset contains a few duplicate records. We are removing them to avoid duplicate results in the retrieval stage of our RAG system." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "We have 1749 unique articles in our database.\n" + ] + } + ], + "source": [ + "news_articles = news_dataset[\"content\"]\n", + "unique_articles = set()\n", + "for article in news_articles:\n", + " if article:\n", + " unique_articles.add(article)\n", + "unique_news_articles = list(unique_articles)\n", + "print(f\"We have {len(unique_news_articles)} unique articles in our database.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Saving Data to the Vector Store\n", + "To efficiently handle the large number of articles, we process them in batches of 50 articles at a time. This batch processing approach helps manage memory usage and provides better control over the ingestion process.\n", + "\n", + "We first filter out any articles that exceed 50,000 characters to avoid potential issues with token limits. Then, using the vector store's add_texts method, we add the filtered articles to our vector database. The batch_size parameter controls how many articles are processed in each iteration.\n", + "\n", + "This approach offers several benefits:\n", + "1. Memory Efficiency: Processing in smaller batches prevents memory overload\n", + "2. Progress Tracking: Easier to monitor and track the ingestion progress\n", + "3. Resource Management: Better control over CPU and network resource utilization\n", + "\n", + "We use a conservative batch size of 50 to ensure reliable operation.\n", + "The optimal batch size depends on many factors including:\n", + "- Document sizes being inserted\n", + "- Available system resources\n", + "- Network conditions\n", + "- Concurrent workload\n", + "\n", + "Consider measuring performance with your specific workload before adjusting.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-02-06 01:29:07,077 - INFO - Document ingestion completed successfully.\n" + ] + } + ], + "source": [ + "batch_size = 50\n", + "\n", + "# Automatic Batch Processing\n", + "articles = [article for article in unique_news_articles if article and len(article) <= 50000]\n", + "\n", + "try:\n", + " vector_store.add_texts(\n", + " texts=articles,\n", + " batch_size=batch_size\n", + " )\n", + " logging.info(\"Document ingestion completed successfully.\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Failed to save documents to vector store: {str(e)}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ToQ2acrSLJY7" + }, + "source": [ + "# Set Up Cache\n", + " A cache is set up using Couchbase to store intermediate results and frequently accessed data. Caching is important for improving performance, as it reduces the need to repeatedly calculate or retrieve the same data. 
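\n", + "\n", + "Once a few responses have been cached, you can verify what is stored with a short SQL++ query (a sketch, reusing the cluster connection and the configuration variables defined earlier):\n", + "\n", + "```python\n", + "# Peek at the cached entries; each cached LLM result is stored as a document\n", + "rows = cluster.query(\n", + "    f\"SELECT META().id FROM `{CB_BUCKET_NAME}`.`{SCOPE_NAME}`.`{CACHE_COLLECTION}` LIMIT 5\"\n", + ")\n", + "for row in rows:\n", + "    print(row)\n", + "```\n", + "\n", + "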
The cache is linked to a specific collection in Couchbase, and it is used later in the script to store the results of language model queries.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "id": "qZDXvq88LJWH" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-02-06 01:30:37,657 - INFO - Successfully created cache\n" + ] + } + ], + "source": [ + "try:\n", + " cache = CouchbaseCache(\n", + " cluster=cluster,\n", + " bucket_name=CB_BUCKET_NAME,\n", + " scope_name=SCOPE_NAME,\n", + " collection_name=CACHE_COLLECTION,\n", + " )\n", + " logging.info(\"Successfully created cache\")\n", + " set_llm_cache(cache)\n", + "except Exception as e:\n", + " raise ValueError(f\"Failed to create cache: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GQpib0zKLJTh" + }, + "source": [ + "# Create Language Model (LLM)\n", + "The script initializes a Cohere language model (LLM) that will be used for generating responses to queries. LLMs are powerful tools for natural language understanding and generation, capable of producing human-like text based on input prompts. The model is configured with specific parameters, such as the temperature, which controls the randomness of its outputs.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "7eV1X5xILJRC" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-02-06 01:30:38,684 - INFO - Successfully created Cohere LLM with model command\n" + ] + } + ], + "source": [ + "try:\n", + " llm = ChatCohere(\n", + " cohere_api_key=COHERE_API_KEY,\n", + " model=\"command-a-03-2025\",\n", + " temperature=0\n", + " )\n", + " logging.info(\"Successfully created Cohere LLM with model command\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Error creating Cohere LLM: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wQ0fNbphbWpu" + }, + "source": [ + "# Perform Semantic Search\n", + "Semantic search in Couchbase involves converting queries and documents into vector representations using an embeddings model. These vectors capture the semantic meaning of the text and are stored directly in Couchbase. When a query is made, Couchbase performs a similarity search by comparing the query vector against the stored document vectors. The similarity metric used for this comparison is configurable, allowing flexibility in how the relevance of documents is determined. \n", + "\n", + "In the provided code, the search process begins by recording the start time, followed by executing the similarity_search_with_score method of the CouchbaseSearchVectorStore. This method searches Couchbase for the most relevant documents based on the vector similarity to the query. The search results include the document content and a similarity score that reflects how closely each document aligns with the query in the defined semantic space. The time taken to perform this search is then calculated and logged, and the results are displayed, showing the most relevant documents along with their similarity scores. This approach leverages Couchbase as both a storage and retrieval engine for vector data, enabling efficient and scalable semantic searches. The integration of vector storage and search capabilities within Couchbase allows for sophisticated semantic search operations without relying on external services for vector storage or comparison." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "id": "udcxHyloyoxE" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-02-06 01:30:43,101 - INFO - Semantic search completed in 1.89 seconds\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Semantic Search Results (completed in 1.89 seconds):\n", + "--------------------------------------------------------------------------------\n", + "Score: 0.6641, Text: Manchester City boss Pep Guardiola has won 18 trophies since he arrived at the club in 2016\n", + "\n", + "Manchester City boss Pep Guardiola says he is \"fine\" despite admitting his sleep and diet are being affected by the worst run of results in his entire managerial career. In an interview with former Italy international Luca Toni for Amazon Prime Sport before Wednesday's Champions League defeat by Juventus, Guardiola touched on the personal impact City's sudden downturn in form has had. Guardiola said his state of mind was \"ugly\", that his sleep was \"worse\" and he was eating lighter as his digestion had suffered. City go into Sunday's derby against Manchester United at Etihad Stadium having won just one of their past 10 games. The Juventus loss means there is a chance they may not even secure a play-off spot in the Champions League. Asked to elaborate on his comments to Toni, Guardiola said: \"I'm fine. \"In our jobs we always want to do our best or the best as possible. When that doesn't happen you are more uncomfortable than when the situation is going well, always that happened. \"In good moments I am happier but when I get to the next game I am still concerned about what I have to do. There is no human being that makes an activity and it doesn't matter how they do.\" Guardiola said City have to defend better and \"avoid making mistakes at both ends\". To emphasise his point, Guardiola referred back to the third game of City's current run, against a Sporting side managed by Ruben Amorim, who will be in the United dugout at the weekend. City dominated the first half in Lisbon, led thanks to Phil Foden's early effort and looked to be cruising. Instead, they conceded three times in 11 minutes either side of half-time as Sporting eventually ran out 4-1 winners. \"I would like to play the game like we played in Lisbon on Sunday, believe me,\" said Guardiola, who is facing the prospect of only having three fit defenders for the derby as Nathan Ake and Manuel Akanji try to overcome injury concerns. If there is solace for City, it comes from the knowledge United are not exactly flying. Their comeback Europa League victory against Viktoria Plzen on Thursday was their third win of Amorim's short reign so far but only one of those successes has come in the Premier League, where United have lost their past two games against Arsenal and Nottingham Forest. Nevertheless, Guardiola can see improvements already on the red side of the city. \"It's already there,\" he said. \"You see all the patterns, the movements, the runners and the pace. He will do a good job at United, I'm pretty sure of that.\"\n", + "\n", + "Guardiola says skipper Kyle Walker has been offered support by the club after the City defender highlighted the racial abuse he had received on social media in the wake of the Juventus trip. \"It's unacceptable,\" he said. \"Not because it's Kyle - for any human being. \"Unfortunately it happens many times in the real world. 
It is not necessary to say he has the support of the entire club. It is completely unacceptable and we give our support to him.\"\n", + "--------------------------------------------------------------------------------\n", + "Score: 0.6521, Text: 'We have to find a way' - Guardiola vows to end relegation form\n", + "\n", + "This video can not be played To play this video you need to enable JavaScript in your browser. 'Worrying' and 'staggering' - Why do Manchester City keep conceding?\n", + "\n", + "Manchester City are currently in relegation form and there is little sign of it ending. Saturday's 2-1 defeat at Aston Villa left them joint bottom of the form table over the past eight games with just Southampton for company. Saints, at the foot of the Premier League, have the same number of points, four, as City over their past eight matches having won one, drawn one and lost six - the same record as the floundering champions. And if Southampton - who appointed Ivan Juric as their new manager on Saturday - get at least a point at Fulham on Sunday, City will be on the worst run in the division. Even Wolves, who sacked boss Gary O'Neil last Sunday and replaced him with Vitor Pereira, have earned double the number of points during the same period having played a game fewer. They are damning statistics for Pep Guardiola, even if he does have some mitigating circumstances with injuries to Ederson, Nathan Ake and Ruben Dias - who all missed the loss at Villa Park - and the long-term loss of midfield powerhouse Rodri. Guardiola was happy with Saturday's performance, despite defeat in Birmingham, but there is little solace to take at slipping further out of the title race. He may have needed to field a half-fit Manuel Akanji and John Stones at Villa Park but that does not account for City looking a shadow of their former selves. That does not justify the error Josko Gvardiol made to gift Jhon Duran a golden chance inside the first 20 seconds, or \u00a3100m man Jack Grealish again failing to have an impact on a game. There may be legitimate reasons for City's drop off, whether that be injuries, mental fatigue or just simply a team coming to the end of its lifecycle, but their form, which has plunged off a cliff edge, would have been unthinkable as they strolled to a fourth straight title last season. \"The worrying thing is the number of goals conceded,\" said ex-England captain Alan Shearer on BBC Match of the Day. \"The number of times they were opened up because of the lack of protection and legs in midfield was staggering. There are so many things that are wrong at this moment in time.\"\n", + "\n", + "This video can not be played To play this video you need to enable JavaScript in your browser. Man City 'have to find a way' to return to form - Guardiola\n", + "\n", + "Afterwards Guardiola was calm, so much so it was difficult to hear him in the news conference, a contrast to the frustrated figure he cut on the touchline. He said: \"It depends on us. The solution is bring the players back. We have just one central defender fit, that is difficult. We are going to try next game - another opportunity and we don't think much further than that. \"Of course there are more reasons. We concede the goals we don't concede in the past, we [don't] score the goals we score in the past. Football is not just one reason. There are a lot of little factors. \"Last season we won the Premier League, but we came here and lost. We have to think positive and I have incredible trust in the guys. 
Some of them have incredible pride and desire to do it. We have to find a way, step by step, sooner or later to find a way back.\" Villa boss Unai Emery highlighted City's frailties, saying he felt Villa could seize on the visitors' lack of belief. \"Manchester City are a little bit under the confidence they have normally,\" he said. \"The second half was different, we dominated and we scored. Through those circumstances they were feeling worse than even in the first half.\"\n", + "\n", + "Erling Haaland had one touch in the Villa box\n", + "\n", + "There are chinks in the armour never seen before at City under Guardiola and Erling Haaland conceded belief within the squad is low. He told TNT after the game: \"Of course, [confidence levels are] not the best. We know how important confidence is and you can see that it affects every human being. That is how it is, we have to continue and stay positive even though it is difficult.\" Haaland, with 76 goals in 83 Premier League appearances since joining City from Borussia Dortmund in 2022, had one shot and one touch in the Villa box. His 18 touches in the whole game were the lowest of all starting players and he has been self critical, despite scoring 13 goals in the top flight this season. Over City's last eight games he has netted just twice though, but Guardiola refused to criticise his star striker. He said: \"Without him we will be even worse but I like the players feeling that way. I don't agree with Erling. He needs to have the balls delivered in the right spots but he will fight for the next one.\"\n", + "--------------------------------------------------------------------------------\n", + "Score: 0.6322, Text: 'Self-doubt, errors & big changes' - inside the crisis at Man City\n", + "\n", + "\n", + "... (output truncated for brevity)\n" + ] + } + ], + "source": [ + "query = \"What was manchester city manager pep guardiola's reaction to the team's current form?\"\n", + "\n", + "try:\n", + " # Perform the semantic search\n", + " start_time = time.time()\n", + " search_results = vector_store.similarity_search_with_score(query, k=10)\n", + " search_elapsed_time = time.time() - start_time\n", + "\n", + " logging.info(f\"Semantic search completed in {search_elapsed_time:.2f} seconds\")\n", + "\n", + " # Display search results\n", + " print(f\"\\nSemantic Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", + " print(\"-\" * 80) # Add separator line\n", + " for doc, score in search_results:\n", + " print(f\"Score: {score:.4f}, Text: {doc.page_content}\")\n", + " print(\"-\" * 80) # Add separator between results\n", + "\n", + "except CouchbaseException as e:\n", + " raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n", + "except Exception as e:\n", + " raise RuntimeError(f\"Unexpected error: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Bt44X6-bLJOb" + }, + "source": [ + "# Retrieval-Augmented Generation (RAG) with Couchbase and Langchain\n", + "Couchbase and LangChain can be seamlessly integrated to create RAG (Retrieval-Augmented Generation) chains, enhancing the process of generating contextually relevant responses. In this setup, Couchbase serves as the vector store, where embeddings of documents are stored. When a query is made, LangChain retrieves the most relevant documents from Couchbase by comparing the query\u2019s embedding with the stored document embeddings. 
These documents, which provide contextual information, are then passed to a generative language model within LangChain.\n", + "\n", + "The language model, equipped with the context from the retrieved documents, generates a response that is both informed and contextually accurate. This integration allows the RAG chain to leverage Couchbase\u2019s efficient storage and retrieval capabilities, while LangChain handles the generation of responses based on the context provided by the retrieved documents. Together, they create a powerful system that can deliver highly relevant and accurate answers by combining the strengths of both retrieval and generation." + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "id": "6cGJfwS2LI_O" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-02-06 01:30:46,088 - INFO - Successfully created RAG chain\n" + ] + } + ], + "source": [ + "try:\n", + " template = \"\"\"You are a helpful bot. If you cannot answer based on the context provided, respond with a generic answer. Answer the question as truthfully as possible using the context below:\n", + " {context}\n", + "\n", + " Question: {question}\"\"\"\n", + " prompt = ChatPromptTemplate.from_template(template)\n", + "\n", + " rag_chain = (\n", + " {\"context\": vector_store.as_retriever(), \"question\": RunnablePassthrough()}\n", + " | prompt\n", + " | llm\n", + " | StrOutputParser()\n", + " )\n", + " logging.info(\"Successfully created RAG chain\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Error creating RAG chain: {str(e)}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": { + "id": "PvuJyXPUFOux" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "RAG Response: Manchester City manager Pep Guardiola has been open about the impact the team's poor form has had on him personally. He has admitted that his sleep and diet have been affected, and that he has been feeling \"ugly\" and uncomfortable. Guardiola has also been giving a lot of thought to the reasons for the team's decline, talking to many people and trying to work out the causes. He has been very protective of his players, refusing to criticise them and instead giving them more days off to clear their heads.\n", + "\n", + "Guardiola has also been very self-critical, saying that he is \"not good enough\" and that he needs to find solutions to the team's problems. He has acknowledged that the team is not performing as well as it used to, and that there are many factors contributing to their poor form, including injuries, mental fatigue, and a lack of confidence. He has also suggested that the team needs to improve its defensive concepts and re-establish its intensity.\n", + "\n", + "Overall, Guardiola seems to be taking a very hands-on approach to the team's struggles, trying to find solutions and protect his players while also being very honest about his own role in the situation.\n", + "RAG response generated in 9.52 seconds\n" + ] + } + ], + "source": [ + "start_time = time.time()\n", + "try:\n", + " rag_response = rag_chain.invoke(query)\n", + " rag_elapsed_time = time.time() - start_time\n", + " print(f\"RAG Response: {rag_response}\")\n", + " print(f\"RAG response generated in {rag_elapsed_time:.2f} seconds\")\n", + "except InternalServerFailureException as e:\n", + " if \"query request rejected\" in str(e):\n", + " print(\"Error: Search request was rejected due to rate limiting. 
Please try again later.\")\n", + " else:\n", + " print(f\"Internal server error occurred: {str(e)}\")\n", + "except Exception as e:\n", + " print(f\"Unexpected error occurred: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cUXEVXyxGlv2" + }, + "source": [ + "# Using Couchbase as a caching mechanism\n", + "Couchbase can be effectively used as a caching mechanism for RAG (Retrieval-Augmented Generation) responses by storing and retrieving precomputed results for specific queries. This approach enhances the system's efficiency and speed, particularly when dealing with repeated or similar queries. When a query is first processed, the RAG chain retrieves relevant documents, generates a response using the language model, and then stores this response in Couchbase, with the query serving as the key.\n", + "\n", + "For subsequent requests with the same query, the system checks Couchbase first. If a cached response is found, it is retrieved directly from Couchbase, bypassing the need to re-run the entire RAG process. This significantly reduces response time because the computationally expensive steps of document retrieval and response generation are skipped. Couchbase's role in this setup is to provide a fast and scalable storage solution for caching these responses, ensuring that frequently asked queries can be answered more quickly and efficiently." + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": { + "id": "J_PaTD2aGmGt" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Query 1: What happened in the match between Fullham and Liverpool?\n", + "Response: Liverpool and Fulham played out a thrilling 2-2 draw at Anfield. Liverpool were reduced to 10 men after Andy Robertson was sent off in the 17th minute, but they fought back twice to earn a point. The Reds dominated the match despite their numerical disadvantage, with over 60% possession and leading in several attacking metrics. Diogo Jota scored the equaliser in the 86th minute, capping off an impressive performance that showcased Liverpool's title credentials.\n", + "Time taken: 5.29 seconds\n", + "\n", + "Query 2: What was manchester city manager pep guardiola's reaction to the team's current form?\n", + "Response: Manchester City manager Pep Guardiola has been open about the impact the team's poor form has had on him personally. He has admitted that his sleep and diet have been affected, and that he has been feeling \"ugly\" and uncomfortable. Guardiola has also been giving a lot of thought to the reasons for the team's decline, talking to many people and trying to work out the causes. He has been very protective of his players, refusing to criticise them and instead giving them more days off to clear their heads.\n", + "\n", + "Guardiola has also been very self-critical, saying that he is \"not good enough\" and that he needs to find solutions to the team's problems. He has acknowledged that the team is not performing as well as it used to, and that there are many factors contributing to their poor form, including injuries, mental fatigue, and a lack of confidence. 
He has also suggested that the team needs to improve its defensive concepts and re-establish its intensity.\n", + "\n", + "Overall, Guardiola seems to be taking a very hands-on approach to the team's struggles, trying to find solutions and protect his players while also being very honest about his own role in the situation.\n", + "Time taken: 2.13 seconds\n", + "\n", + "Query 3: What happened in the match between Fullham and Liverpool?\n", + "Response: Liverpool and Fulham played out a thrilling 2-2 draw at Anfield. Liverpool were reduced to 10 men after Andy Robertson was sent off in the 17th minute, but they fought back twice to earn a point. The Reds dominated the match despite their numerical disadvantage, with over 60% possession and leading in several attacking metrics. Diogo Jota scored the equaliser in the 86th minute, capping off an impressive performance that showcased Liverpool's title credentials.\n", + "Time taken: 1.36 seconds\n" + ] + } + ], + "source": [ + "try:\n", + " queries = [\n", + " \"What happened in the match between Fullham and Liverpool?\",\n", + " \"What was manchester city manager pep guardiola's reaction to the team's current form?\", # Repeated query\n", + " \"What happened in the match between Fullham and Liverpool?\", # Repeated query\n", + " ]\n", + "\n", + " for i, query in enumerate(queries, 1):\n", + " print(f\"\\nQuery {i}: {query}\")\n", + " start_time = time.time()\n", + " response = rag_chain.invoke(query)\n", + " elapsed_time = time.time() - start_time\n", + " print(f\"Response: {response}\")\n", + " print(f\"Time taken: {elapsed_time:.2f} seconds\")\n", + "except InternalServerFailureException as e:\n", + " if \"query request rejected\" in str(e):\n", + " print(\"Error: Search request was rejected due to rate limiting. Please try again later.\")\n", + " else:\n", + " print(f\"Internal server error occurred: {str(e)}\")\n", + "except Exception as e:\n", + " print(f\"Unexpected error occurred: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusion\n", + "By following these steps, you'll have a fully functional semantic search engine that leverages the strengths of Couchbase and Cohere. This guide is designed not just to show you how to build the system, but also to explain why each step is necessary, giving you a deeper understanding of the principles behind semantic search and how to implement it effectively. Whether you're a newcomer to software development or an experienced developer looking to expand your skills, this guide will provide you with the knowledge and tools you need to create a powerful, AI-driven search engine." 
+ ] + } + ], + "metadata": { + "accelerator": "TPU", + "colab": { + "gpuType": "V28", + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.2" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/cohere/fts/cohere_index.json b/cohere/search_based/cohere_index.json similarity index 100% rename from cohere/fts/cohere_index.json rename to cohere/search_based/cohere_index.json diff --git a/cohere/fts/frontmatter.md b/cohere/search_based/frontmatter.md similarity index 100% rename from cohere/fts/frontmatter.md rename to cohere/search_based/frontmatter.md diff --git a/crewai-short-term-memory/fts/CouchbaseStorage_Demo.ipynb b/crewai-short-term-memory/fts/CouchbaseStorage_Demo.ipynb deleted file mode 100644 index de6676af..00000000 --- a/crewai-short-term-memory/fts/CouchbaseStorage_Demo.ipynb +++ /dev/null @@ -1,1212 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# CrewAI with Couchbase Short-Term Memory\n", - "\n", - "This notebook demonstrates how to implement a custom storage backend for CrewAI's memory system using Couchbase and vector search. Alternatively if you want to perform semantic search using the GSI index, please take a look at [this.](https://developer.couchbase.com/tutorial-crewai-short-term-memory-couchbase-with-global-secondary-index)\n", - "\n", - "Here's a breakdown of each section:\n", - "\n", - "How to run this tutorial\n", - "----------------------\n", - "This tutorial is available as a Jupyter Notebook (.ipynb file) that you can run \n", - "interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/crewai-short-term-memory/fts/CouchbaseStorage_Demo.ipynb).\n", - "\n", - "You can either:\n", - "- Download the notebook file and run it on [Google Colab](https://colab.research.google.com)\n", - "- Run it on your system by setting up the Python environment\n", - "\n", - "Before you start\n", - "---------------\n", - "\n", - "1. Create and Deploy Your Free Tier Operational cluster on [Capella](https://cloud.couchbase.com/sign-up)\n", - " - To get started with [Couchbase Capella](https://cloud.couchbase.com), create an account and use it to deploy \n", - " a forever free tier operational cluster\n", - " - This account provides you with an environment where you can explore and learn \n", - " about Capella with no time constraint\n", - " - To learn more, please follow the [Getting Started Guide](https://docs.couchbase.com/cloud/get-started/create-account.html)\n", - "\n", - "2. 
Couchbase Capella Configuration\n", - " When running Couchbase using Capella, the following prerequisites need to be met:\n", - " - Create the database credentials to access the required bucket (Read and Write) \n", - " used in the application\n", - " - Allow access to the Cluster from the IP on which the application is running by following the [Network Security documentation](https://docs.couchbase.com/cloud/security/security.html#public-access)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Memory in AI Agents\n", - "\n", - "Memory in AI agents is a crucial capability that allows them to retain and utilize information across interactions, making them more effective and contextually aware. Without memory, agents would be limited to processing only the immediate input, lacking the ability to build upon past experiences or maintain continuity in conversations.\n", - "\n", - "> Note: This section on memory types and functionality is adapted from the CrewAI documentation.\n", - "\n", - "## Types of Memory in AI Agents\n", - "\n", - "### Short-term Memory\n", - "- Retains recent interactions and context\n", - "- Typically spans the current conversation or session \n", - "- Helps maintain coherence within a single interaction flow\n", - "- In CrewAI, this is what we're implementing with the Couchbase storage\n", - "\n", - "### Long-term Memory\n", - "- Stores persistent knowledge across multiple sessions\n", - "- Enables agents to recall past interactions even after long periods\n", - "- Helps build cumulative knowledge about users, preferences, and past decisions\n", - "- While this implementation is labeled as \"short-term memory\", the Couchbase storage backend can be effectively used for long-term memory as well, thanks to Couchbase's persistent storage capabilities and enterprise-grade durability features\n", - "\n", - "\n", - "\n", - "## How Memory Works in Agents\n", - "Memory in AI agents typically involves:\n", - "- Storage: Information is encoded and stored in a database (like Couchbase, ChromaDB, or other vector stores)\n", - "- Retrieval: Relevant memories are fetched based on semantic similarity to current context\n", - "- Integration: Retrieved memories are incorporated into the agent's reasoning process\n", - "\n", - "In the CrewAI example, the CouchbaseStorage class implements:\n", - "- save(): Stores new memories with metadata\n", - "- search(): Retrieves relevant memories based on semantic similarity\n", - "- reset(): Clears stored memories when needed\n", - "\n", - "## Benefits of Memory in AI Agents\n", - "- Contextual Understanding: Agents can refer to previous parts of a conversation\n", - "- Personalization: Remembering user preferences and past interactions\n", - "- Learning and Adaptation: Building knowledge over time to improve responses\n", - "- Task Continuity: Resuming complex tasks across multiple interactions\n", - "- Collaboration: In multi-agent systems like CrewAI, memory enables agents to build on each other's work\n", - "\n", - "## Memory in CrewAI Specifically\n", - "In CrewAI, memory serves several important functions:\n", - "- Agent Specialization: Each agent can maintain its own memory relevant to its expertise\n", - "- Knowledge Transfer: Agents can share insights through memory when collaborating on tasks\n", - "- Process Continuity: In sequential processes, later agents can access the work of earlier agents\n", - "- Contextual Awareness: Agents can reference previous findings when making decisions\n", - "\n", - "The 
vector-based approach (using embeddings) is particularly powerful because it allows for semantic search - finding memories that are conceptually related to the current context, not just exact keyword matches.\n", - "\n", - "By implementing custom storage like Couchbase, you gain additional benefits like persistence, scalability, and the ability to leverage enterprise-grade database features for your agent memory systems." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Install Required Libraries\n", - "\n", - "This section installs the necessary Python packages:\n", - "- `crewai`: The main CrewAI framework\n", - "- `langchain-couchbase`: LangChain integration for Couchbase\n", - "- `langchain-openai`: LangChain integration for OpenAI\n", - "- `python-dotenv`: For loading environment variables" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Note: you may need to restart the kernel to use updated packages.\n" - ] - } - ], - "source": [ - "%pip install --quiet crewai==0.186.1 langchain-couchbase==0.4.0 langchain-openai==0.3.33 python-dotenv==1.1.1" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Importing Necessary Libraries\n", - "\n", - "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, embedding generation, and dataset loading." - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "from typing import Any, Dict, List, Optional\n", - "import os\n", - "import logging\n", - "from datetime import timedelta\n", - "from dotenv import load_dotenv\n", - "from crewai.memory.storage.rag_storage import RAGStorage\n", - "from crewai.memory.short_term.short_term_memory import ShortTermMemory\n", - "from crewai import Agent, Crew, Task, Process\n", - "from couchbase.cluster import Cluster\n", - "from couchbase.options import ClusterOptions\n", - "from couchbase.auth import PasswordAuthenticator\n", - "from couchbase.diagnostics import PingState, ServiceType\n", - "from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore\n", - "from langchain_openai import OpenAIEmbeddings, ChatOpenAI\n", - "import time\n", - "import json\n", - "import uuid\n", - "\n", - "# Configure logging (disabled)\n", - "logging.basicConfig(level=logging.CRITICAL)\n", - "logger = logging.getLogger(__name__)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Loading Sensitive Information\n", - "\n", - "In this section, we prompt the user to input essential configuration settings needed. These settings include sensitive information like database credentials, and specific configuration names. Instead of hardcoding these details into the script, we request the user to provide them at runtime, ensuring flexibility and security.\n", - "\n", - "The script uses environment variables to store sensitive information, enhancing the overall security and maintainability of your code by avoiding hardcoded values.\n", - "\n", - "### Setting Up Environment Variables\n", - "\n", - "> **Note:** This implementation reads configuration parameters from environment variables. 
Before running this notebook, you need to set the following environment variables:\n", - ">\n", - "> - `OPENAI_API_KEY`: Your OpenAI API key for generating embeddings\n", - "> - `CB_HOST`: Couchbase cluster connection string (e.g., \"couchbases://cb.example.com\")\n", - "> - `CB_USERNAME`: Username for Couchbase authentication\n", - "> - `CB_PASSWORD`: Password for Couchbase authentication\n", - "> - `CB_BUCKET_NAME` (optional): Bucket name (defaults to \"vector-search-testing\")\n", - "> - `SCOPE_NAME` (optional): Scope name (defaults to \"shared\")\n", - "> - `COLLECTION_NAME` (optional): Collection name (defaults to \"crew\")\n", - "> - `INDEX_NAME` (optional): Vector search index name (defaults to \"vector_search_crew\")\n", - ">\n", - "> You can set these variables in a `.env` file in the same directory as this notebook, or set them directly in your environment." - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [], - "source": [ - "load_dotenv(\"./.env\")\n", - "\n", - "# Verify environment variables\n", - "required_vars = ['OPENAI_API_KEY', 'CB_HOST', 'CB_USERNAME', 'CB_PASSWORD']\n", - "for var in required_vars:\n", - " if not os.getenv(var):\n", - " raise ValueError(f\"{var} environment variable is required\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Implement CouchbaseStorage\n", - "\n", - "This section demonstrates the implementation of a custom vector storage solution using Couchbase:\n", - "\n", - "> **Note on Implementation:** This example uses the LangChain Couchbase integration (`langchain_couchbase`) for simplicity and to demonstrate integration with the broader LangChain ecosystem. In production environments, you may want to use the Couchbase SDK directly for better performance and more control.\n", - "\n", - "> For more information on using the Couchbase SDK directly, refer to:\n", - "> - [Couchbase Python SDK Documentation](https://docs.couchbase.com/python-sdk/current/howtos/full-text-searching-with-sdk.html#single-vector-query)" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [], - "source": [ - "class CouchbaseStorage(RAGStorage):\n", - " \"\"\"\n", - " Extends RAGStorage to handle embeddings for memory entries using Couchbase.\n", - " \"\"\"\n", - "\n", - " def __init__(self, type: str, allow_reset: bool = True, embedder_config: Optional[Dict[str, Any]] = None, crew: Optional[Any] = None):\n", - " \"\"\"Initialize CouchbaseStorage with configuration.\"\"\"\n", - " super().__init__(type, allow_reset, embedder_config, crew)\n", - " self._initialize_app()\n", - "\n", - " def search(\n", - " self,\n", - " query: str,\n", - " limit: int = 3,\n", - " filter: Optional[dict] = None,\n", - " score_threshold: float = 0,\n", - " ) -> List[Dict[str, Any]]:\n", - " \"\"\"\n", - " Search memory entries using vector similarity.\n", - " \"\"\"\n", - " try:\n", - " # Add type filter\n", - " search_filter = {\"memory_type\": self.type}\n", - " if filter:\n", - " search_filter.update(filter)\n", - "\n", - " # Execute search\n", - " results = self.vector_store.similarity_search_with_score(\n", - " query,\n", - " k=limit,\n", - " filter=search_filter\n", - " )\n", - " \n", - " # Format results and deduplicate by content\n", - " seen_contents = set()\n", - " formatted_results = []\n", - " \n", - " for i, (doc, score) in enumerate(results):\n", - " if score >= score_threshold:\n", - " content = doc.page_content\n", - " if content not in seen_contents:\n", - " 
seen_contents.add(content)\n", - " formatted_results.append({\n", - " \"id\": doc.metadata.get(\"memory_id\", str(i)),\n", - " \"metadata\": doc.metadata,\n", - " \"context\": content,\n", - " \"score\": float(score)\n", - " })\n", - " \n", - " logger.info(f\"Found {len(formatted_results)} unique results for query: {query}\")\n", - " return formatted_results\n", - "\n", - " except Exception as e:\n", - " logger.error(f\"Search failed: {str(e)}\")\n", - " return []\n", - "\n", - " def save(self, value: Any, metadata: Dict[str, Any]) -> None:\n", - " \"\"\"\n", - " Save a memory entry with metadata.\n", - " \"\"\"\n", - " try:\n", - " # Generate unique ID\n", - " memory_id = str(uuid.uuid4())\n", - " timestamp = int(time.time() * 1000)\n", - " \n", - " # Prepare metadata (create a copy to avoid modifying references)\n", - " if not metadata:\n", - " metadata = {}\n", - " else:\n", - " metadata = metadata.copy() # Create a copy to avoid modifying references\n", - " \n", - " # Process agent-specific information if present\n", - " agent_name = metadata.get('agent', 'unknown')\n", - " \n", - " # Clean up value if it has the typical LLM response format\n", - " value_str = str(value)\n", - " if \"Final Answer:\" in value_str:\n", - " # Extract just the actual content - everything after \"Final Answer:\"\n", - " parts = value_str.split(\"Final Answer:\", 1)\n", - " if len(parts) > 1:\n", - " value = parts[1].strip()\n", - " logger.info(f\"Cleaned up response format for agent: {agent_name}\")\n", - " elif value_str.startswith(\"Thought:\"):\n", - " # Handle thought/final answer format\n", - " if \"Final Answer:\" in value_str:\n", - " parts = value_str.split(\"Final Answer:\", 1)\n", - " if len(parts) > 1:\n", - " value = parts[1].strip()\n", - " logger.info(f\"Cleaned up thought process format for agent: {agent_name}\")\n", - " \n", - " # Update metadata\n", - " metadata.update({\n", - " \"memory_id\": memory_id,\n", - " \"memory_type\": self.type,\n", - " \"timestamp\": timestamp,\n", - " \"source\": \"crewai\"\n", - " })\n", - "\n", - " # Log memory information for debugging\n", - " value_preview = str(value)[:100] + \"...\" if len(str(value)) > 100 else str(value)\n", - " metadata_preview = {k: v for k, v in metadata.items() if k != \"embedding\"}\n", - " logger.info(f\"Saving memory for Agent: {agent_name}\")\n", - " logger.info(f\"Memory value preview: {value_preview}\")\n", - " logger.info(f\"Memory metadata: {metadata_preview}\")\n", - " \n", - " # Convert value to string if needed\n", - " if isinstance(value, (dict, list)):\n", - " value = json.dumps(value)\n", - " elif not isinstance(value, str):\n", - " value = str(value)\n", - "\n", - " # Save to vector store\n", - " self.vector_store.add_texts(\n", - " texts=[value],\n", - " metadatas=[metadata],\n", - " ids=[memory_id]\n", - " )\n", - " logger.info(f\"Saved memory {memory_id}: {value[:100]}...\")\n", - "\n", - " except Exception as e:\n", - " logger.error(f\"Save failed: {str(e)}\")\n", - " raise\n", - "\n", - " def reset(self) -> None:\n", - " \"\"\"Reset the memory storage if allowed.\"\"\"\n", - " if not self.allow_reset:\n", - " return\n", - "\n", - " try:\n", - " # Delete documents of this memory type\n", - " self.cluster.query(\n", - " f\"DELETE FROM `{self.bucket_name}`.`{self.scope_name}`.`{self.collection_name}` WHERE memory_type = $type\",\n", - " type=self.type\n", - " ).execute()\n", - " logger.info(f\"Reset memory type: {self.type}\")\n", - " except Exception as e:\n", - " logger.error(f\"Reset failed: {str(e)}\")\n", - " 
raise\n", - "\n", - " def _initialize_app(self):\n", - " \"\"\"Initialize Couchbase connection and vector store.\"\"\"\n", - " try:\n", - " # Initialize embeddings\n", - " if self.embedder_config and self.embedder_config.get(\"provider\") == \"openai\":\n", - " self.embeddings = OpenAIEmbeddings(\n", - " openai_api_key=os.getenv('OPENAI_API_KEY'),\n", - " model=self.embedder_config.get(\"config\", {}).get(\"model\", \"text-embedding-3-small\")\n", - " )\n", - " else:\n", - " self.embeddings = OpenAIEmbeddings(\n", - " openai_api_key=os.getenv('OPENAI_API_KEY'),\n", - " model=\"text-embedding-3-small\"\n", - " )\n", - "\n", - " # Connect to Couchbase\n", - " auth = PasswordAuthenticator(\n", - " os.getenv('CB_USERNAME', ''),\n", - " os.getenv('CB_PASSWORD', '')\n", - " )\n", - " options = ClusterOptions(auth)\n", - " \n", - " # Initialize cluster connection\n", - " self.cluster = Cluster(os.getenv('CB_HOST', ''), options)\n", - " self.cluster.wait_until_ready(timedelta(seconds=5))\n", - "\n", - " # Check search service\n", - " ping_result = self.cluster.ping()\n", - " search_available = False\n", - " for service_type, endpoints in ping_result.endpoints.items():\n", - " if service_type == ServiceType.Search:\n", - " for endpoint in endpoints:\n", - " if endpoint.state == PingState.OK:\n", - " search_available = True\n", - " logger.info(f\"Search service is responding at: {endpoint.remote}\")\n", - " break\n", - " break\n", - " if not search_available:\n", - " raise RuntimeError(\"Search/FTS service not found or not responding\")\n", - " \n", - " # Set up storage configuration\n", - " self.bucket_name = os.getenv('CB_BUCKET_NAME', 'vector-search-testing')\n", - " self.scope_name = os.getenv('SCOPE_NAME', 'shared')\n", - " self.collection_name = os.getenv('COLLECTION_NAME', 'crew')\n", - " self.index_name = os.getenv('INDEX_NAME', 'vector_search_crew')\n", - "\n", - " # Initialize vector store\n", - " self.vector_store = CouchbaseSearchVectorStore(\n", - " cluster=self.cluster,\n", - " bucket_name=self.bucket_name,\n", - " scope_name=self.scope_name,\n", - " collection_name=self.collection_name,\n", - " embedding=self.embeddings,\n", - " index_name=self.index_name\n", - " )\n", - " logger.info(f\"Initialized CouchbaseStorage for type: {self.type}\")\n", - "\n", - " except Exception as e:\n", - " logger.error(f\"Initialization failed: {str(e)}\")\n", - " raise" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Test Basic Storage\n", - "\n", - "Test storing and retrieving a simple memory:" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [], - "source": [ - "# Initialize storage\n", - "storage = CouchbaseStorage(\n", - " type=\"short_term\",\n", - " embedder_config={\n", - " \"provider\": \"openai\",\n", - " \"config\": {\"model\": \"text-embedding-3-small\"}\n", - " }\n", - ")\n", - "\n", - "# Reset storage\n", - "storage.reset()\n", - "\n", - "# Test storage\n", - "test_memory = \"Pep Guardiola praised Manchester City's current form, saying 'The team is playing well, we are in a good moment. 
The way we are training, the way we are playing - I am really pleased.'\"\n", - "test_metadata = {\"category\": \"sports\", \"test\": \"initial_memory\"}\n", - "storage.save(test_memory, test_metadata)\n", - "\n", - "# Test search\n", - "results = storage.search(\"What did Guardiola say about Manchester City?\", limit=1)\n", - "for result in results:\n", - " print(f\"Found: {result['context']}\\nScore: {result['score']}\\nMetadata: {result['metadata']}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Test CrewAI Integration\n", - "\n", - "Create agents and tasks to test memory retention:" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
╭──────────────────────────────────────────── Crew Execution Started ─────────────────────────────────────────────╮\n",
-              "                                                                                                                 \n",
-              "  Crew Execution Started                                                                                         \n",
-              "  Name: crew                                                                                                     \n",
-              "  ID: 7ac56ae1-b62f-4b07-952c-104a7243edb0                                                                       \n",
-              "  Tool Args:                                                                                                     \n",
-              "                                                                                                                 \n",
-              "                                                                                                                 \n",
-              "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
-              "
\n" - ], - "text/plain": [ - "\u001b[36m╭─\u001b[0m\u001b[36m───────────────────────────────────────────\u001b[0m\u001b[36m Crew Execution Started \u001b[0m\u001b[36m────────────────────────────────────────────\u001b[0m\u001b[36m─╮\u001b[0m\n", - "\u001b[36m│\u001b[0m \u001b[36m│\u001b[0m\n", - "\u001b[36m│\u001b[0m \u001b[1;36mCrew Execution Started\u001b[0m \u001b[36m│\u001b[0m\n", - "\u001b[36m│\u001b[0m \u001b[37mName: \u001b[0m\u001b[36mcrew\u001b[0m \u001b[36m│\u001b[0m\n", - "\u001b[36m│\u001b[0m \u001b[37mID: \u001b[0m\u001b[36m7ac56ae1-b62f-4b07-952c-104a7243edb0\u001b[0m \u001b[36m│\u001b[0m\n", - "\u001b[36m│\u001b[0m \u001b[37mTool Args: \u001b[0m \u001b[36m│\u001b[0m\n", - "\u001b[36m│\u001b[0m \u001b[36m│\u001b[0m\n", - "\u001b[36m│\u001b[0m \u001b[36m│\u001b[0m\n", - "\u001b[36m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n",
-              "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n",
-              "\"ipywidgets\" for Jupyter support\n",
-              "  warnings.warn('install \"ipywidgets\" for Jupyter support')\n",
-              "
\n" - ], - "text/plain": [ - "/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n", - "\"ipywidgets\" for Jupyter support\n", - " warnings.warn('install \"ipywidgets\" for Jupyter support')\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n"
-            ],
-            "text/plain": []
-          },
-          "metadata": {},
-          "output_type": "display_data"
-        },
-        {
-          "data": {
-            "text/html": [
-              "
╭────────────────────────────────────────────── 🧠 Retrieved Memory ──────────────────────────────────────────────╮\n",
-              "                                                                                                                 \n",
-              "  Historical Data:                                                                                               \n",
-              "  - Ensure that the analysis contains specific examples or statistics to support the claims made about team      \n",
-              "  performance.                                                                                                   \n",
-              "  - Include insights from other sources or viewpoints to provide a well-rounded analysis.                        \n",
-              "  - Provide a comparison with past performance to highlight improvements or consistencies.                       \n",
-              "  - Include player-specific analysis if individual performance is hinted at in the comments.                     \n",
-              "  Entities:                                                                                                      \n",
-              "  - Pep Guardiola(Football Manager): The current manager of Manchester City, known fo...                         \n",
-              "                                                                                                                 \n",
-              "╰─────────────────────────────────────────── Retrieval Time: 1384.18ms ───────────────────────────────────────────╯\n",
-              "
\n" - ], - "text/plain": [ - "\u001b[32m╭─\u001b[0m\u001b[32m─────────────────────────────────────────────\u001b[0m\u001b[32m 🧠 Retrieved Memory \u001b[0m\u001b[32m─────────────────────────────────────────────\u001b[0m\u001b[32m─╮\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37mHistorical Data:\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37m- Ensure that the analysis contains specific examples or statistics to support the claims made about team \u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37mperformance.\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37m- Include insights from other sources or viewpoints to provide a well-rounded analysis.\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37m- Provide a comparison with past performance to highlight improvements or consistencies.\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37m- Include player-specific analysis if individual performance is hinted at in the comments.\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37mEntities:\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37m- Pep Guardiola(Football Manager): The current manager of Manchester City, known fo...\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m╰─\u001b[0m\u001b[32m──────────────────────────────────────────\u001b[0m\u001b[32m Retrieval Time: 1384.18ms \u001b[0m\u001b[32m──────────────────────────────────────────\u001b[0m\u001b[32m─╯\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n",
-              "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
╭─────────────────────────────────────────────── 🤖 Agent Started ────────────────────────────────────────────────╮\n",
-              "                                                                                                                 \n",
-              "  Agent: Sports Analyst                                                                                          \n",
-              "                                                                                                                 \n",
-              "  Task: Analyze Manchester City's recent performance based on Pep Guardiola's comments: \"The team is playing     \n",
-              "  well, we are in a good moment. The way we are training, the way we are playing - I am really pleased.\"         \n",
-              "                                                                                                                 \n",
-              "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
-              "
\n" - ], - "text/plain": [ - "\u001b[35m╭─\u001b[0m\u001b[35m──────────────────────────────────────────────\u001b[0m\u001b[35m 🤖 Agent Started \u001b[0m\u001b[35m───────────────────────────────────────────────\u001b[0m\u001b[35m─╮\u001b[0m\n", - "\u001b[35m│\u001b[0m \u001b[35m│\u001b[0m\n", - "\u001b[35m│\u001b[0m \u001b[37mAgent: \u001b[0m\u001b[1;92mSports Analyst\u001b[0m \u001b[35m│\u001b[0m\n", - "\u001b[35m│\u001b[0m \u001b[35m│\u001b[0m\n", - "\u001b[35m│\u001b[0m \u001b[37mTask: \u001b[0m\u001b[92mAnalyze Manchester City's recent performance based on Pep Guardiola's comments: \"The team is playing \u001b[0m \u001b[35m│\u001b[0m\n", - "\u001b[35m│\u001b[0m \u001b[92mwell, we are in a good moment. The way we are training, the way we are playing - I am really pleased.\"\u001b[0m \u001b[35m│\u001b[0m\n", - "\u001b[35m│\u001b[0m \u001b[35m│\u001b[0m\n", - "\u001b[35m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n",
-              "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n"
-            ],
-            "text/plain": []
-          },
-          "metadata": {},
-          "output_type": "display_data"
-        },
-        {
-          "data": {
-            "text/html": [
-              "
\n",
-              "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n",
-              "\"ipywidgets\" for Jupyter support\n",
-              "  warnings.warn('install \"ipywidgets\" for Jupyter support')\n",
-              "
\n" - ], - "text/plain": [ - "/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n", - "\"ipywidgets\" for Jupyter support\n", - " warnings.warn('install \"ipywidgets\" for Jupyter support')\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n",
-              "\"ipywidgets\" for Jupyter support\n",
-              "  warnings.warn('install \"ipywidgets\" for Jupyter support')\n",
-              "
\n" - ], - "text/plain": [ - "/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n", - "\"ipywidgets\" for Jupyter support\n", - " warnings.warn('install \"ipywidgets\" for Jupyter support')\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n",
-              "\"ipywidgets\" for Jupyter support\n",
-              "  warnings.warn('install \"ipywidgets\" for Jupyter support')\n",
-              "
\n" - ], - "text/plain": [ - "/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n", - "\"ipywidgets\" for Jupyter support\n", - " warnings.warn('install \"ipywidgets\" for Jupyter support')\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n"
-            ],
-            "text/plain": []
-          },
-          "metadata": {},
-          "output_type": "display_data"
-        },
-        {
-          "data": {
-            "text/html": [
-              "
╭──────────────────────────────────────────────── Task Completion ────────────────────────────────────────────────╮\n",
-              "                                                                                                                 \n",
-              "  Task Completed                                                                                                 \n",
-              "  Name: 721d99b2-ac47-4976-8862-364bb668075e                                                                     \n",
-              "  Agent: Sports Analyst                                                                                          \n",
-              "  Tool Args:                                                                                                     \n",
-              "                                                                                                                 \n",
-              "                                                                                                                 \n",
-              "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
-              "
\n" - ], - "text/plain": [ - "\u001b[32m╭─\u001b[0m\u001b[32m───────────────────────────────────────────────\u001b[0m\u001b[32m Task Completion \u001b[0m\u001b[32m───────────────────────────────────────────────\u001b[0m\u001b[32m─╮\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[1;32mTask Completed\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37mName: \u001b[0m\u001b[32m721d99b2-ac47-4976-8862-364bb668075e\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37mAgent: \u001b[0m\u001b[32mSports Analyst\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37mTool Args: \u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n",
-              "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n",
-              "\"ipywidgets\" for Jupyter support\n",
-              "  warnings.warn('install \"ipywidgets\" for Jupyter support')\n",
-              "
\n" - ], - "text/plain": [ - "/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n", - "\"ipywidgets\" for Jupyter support\n", - " warnings.warn('install \"ipywidgets\" for Jupyter support')\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n"
-            ],
-            "text/plain": []
-          },
-          "metadata": {},
-          "output_type": "display_data"
-        },
-        {
-          "data": {
-            "text/html": [
-              "
╭────────────────────────────────────────────── 🧠 Retrieved Memory ──────────────────────────────────────────────╮\n",
-              "                                                                                                                 \n",
-              "  Historical Data:                                                                                               \n",
-              "  - Include specific quotes from Guardiola to enhance credibility.                                               \n",
-              "  - Incorporate statistical data or match results to provide more depth.                                         \n",
-              "  - Discuss recent matches or events in more detail.                                                             \n",
-              "  - Add perspectives from players or other analysts for a more rounded view.                                     \n",
-              "  - Include potential future challenges for Manchester City.                                                     \n",
-              "  Entities:                                                                                                      \n",
-              "  - Pep Guardiola(Individual): The manager of Manchester City, known for his tactical acumen and positive        \n",
-              "  remarks about the team's performance.                                                                          \n",
-              "  - Manch...                                                                                                     \n",
-              "                                                                                                                 \n",
-              "╰─────────────────────────────────────────── Retrieval Time: 991.13ms ────────────────────────────────────────────╯\n",
-              "
\n" - ], - "text/plain": [ - "\u001b[32m╭─\u001b[0m\u001b[32m─────────────────────────────────────────────\u001b[0m\u001b[32m 🧠 Retrieved Memory \u001b[0m\u001b[32m─────────────────────────────────────────────\u001b[0m\u001b[32m─╮\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37mHistorical Data:\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37m- Include specific quotes from Guardiola to enhance credibility.\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37m- Incorporate statistical data or match results to provide more depth.\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37m- Discuss recent matches or events in more detail.\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37m- Add perspectives from players or other analysts for a more rounded view.\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37m- Include potential future challenges for Manchester City.\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37mEntities:\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37m- Pep Guardiola(Individual): The manager of Manchester City, known for his tactical acumen and positive \u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37mremarks about the team's performance.\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37m- Manch...\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m╰─\u001b[0m\u001b[32m──────────────────────────────────────────\u001b[0m\u001b[32m Retrieval Time: 991.13ms \u001b[0m\u001b[32m───────────────────────────────────────────\u001b[0m\u001b[32m─╯\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n",
-              "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
╭─────────────────────────────────────────────── 🤖 Agent Started ────────────────────────────────────────────────╮\n",
-              "                                                                                                                 \n",
-              "  Agent: Sports Journalist                                                                                       \n",
-              "                                                                                                                 \n",
-              "  Task: Write a sports article about Manchester City's form using the analysis and Guardiola's comments.         \n",
-              "                                                                                                                 \n",
-              "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
-              "
\n" - ], - "text/plain": [ - "\u001b[35m╭─\u001b[0m\u001b[35m──────────────────────────────────────────────\u001b[0m\u001b[35m 🤖 Agent Started \u001b[0m\u001b[35m───────────────────────────────────────────────\u001b[0m\u001b[35m─╮\u001b[0m\n", - "\u001b[35m│\u001b[0m \u001b[35m│\u001b[0m\n", - "\u001b[35m│\u001b[0m \u001b[37mAgent: \u001b[0m\u001b[1;92mSports Journalist\u001b[0m \u001b[35m│\u001b[0m\n", - "\u001b[35m│\u001b[0m \u001b[35m│\u001b[0m\n", - "\u001b[35m│\u001b[0m \u001b[37mTask: \u001b[0m\u001b[92mWrite a sports article about Manchester City's form using the analysis and Guardiola's comments.\u001b[0m \u001b[35m│\u001b[0m\n", - "\u001b[35m│\u001b[0m \u001b[35m│\u001b[0m\n", - "\u001b[35m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n",
-              "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n"
-            ],
-            "text/plain": []
-          },
-          "metadata": {},
-          "output_type": "display_data"
-        },
-        {
-          "data": {
-            "text/html": [
-              "
\n",
-              "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n",
-              "\"ipywidgets\" for Jupyter support\n",
-              "  warnings.warn('install \"ipywidgets\" for Jupyter support')\n",
-              "
\n" - ], - "text/plain": [ - "/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n", - "\"ipywidgets\" for Jupyter support\n", - " warnings.warn('install \"ipywidgets\" for Jupyter support')\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n",
-              "\"ipywidgets\" for Jupyter support\n",
-              "  warnings.warn('install \"ipywidgets\" for Jupyter support')\n",
-              "
\n" - ], - "text/plain": [ - "/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n", - "\"ipywidgets\" for Jupyter support\n", - " warnings.warn('install \"ipywidgets\" for Jupyter support')\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n",
-              "\"ipywidgets\" for Jupyter support\n",
-              "  warnings.warn('install \"ipywidgets\" for Jupyter support')\n",
-              "
\n" - ], - "text/plain": [ - "/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n", - "\"ipywidgets\" for Jupyter support\n", - " warnings.warn('install \"ipywidgets\" for Jupyter support')\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n"
-            ],
-            "text/plain": []
-          },
-          "metadata": {},
-          "output_type": "display_data"
-        },
-        {
-          "data": {
-            "text/html": [
-              "
╭──────────────────────────────────────────────── Task Completion ────────────────────────────────────────────────╮\n",
-              "                                                                                                                 \n",
-              "  Task Completed                                                                                                 \n",
-              "  Name: 4fac1a2b-0fd1-484e-afe6-a4d4af236bd4                                                                     \n",
-              "  Agent: Sports Journalist                                                                                       \n",
-              "  Tool Args:                                                                                                     \n",
-              "                                                                                                                 \n",
-              "                                                                                                                 \n",
-              "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
-              "
\n" - ], - "text/plain": [ - "\u001b[32m╭─\u001b[0m\u001b[32m───────────────────────────────────────────────\u001b[0m\u001b[32m Task Completion \u001b[0m\u001b[32m───────────────────────────────────────────────\u001b[0m\u001b[32m─╮\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[1;32mTask Completed\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37mName: \u001b[0m\u001b[32m4fac1a2b-0fd1-484e-afe6-a4d4af236bd4\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37mAgent: \u001b[0m\u001b[32mSports Journalist\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37mTool Args: \u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n",
-              "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n",
-              "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Crew Result:\n", - "--------------------------------------------------------------------------------\n", - "**Manchester City's Resilient Form Under Guardiola: A Symphony of Strategy and Skill**\n", - "\n", - "In the ever-competitive landscape of the Premier League, Manchester City continues to set the benchmark for excellence, guided by the strategic genius of Pep Guardiola. Reflecting on their current form, Guardiola's satisfaction is palpable: \"The team is playing well, we are in a good moment. The way we are training, the way we are playing - I am really pleased.\" These words not only highlight the team's current high morale but also underline the effectiveness of their training routines and the cohesive unit that Guardiola has meticulously crafted.\n", - "\n", - "Historically, Manchester City has been a juggernaut in English football, and their recent performances are a testament to their sustained dominance. Their consistency in maintaining high possession rates and crafting scoring opportunities is unparalleled. Statistically, City often leads in metrics such as ball possession and pass accuracy, with figures regularly surpassing 60% possession in matches, illustrating their control and domination on the pitch.\n", - "\n", - "Key to their success has been the stellar performances of individual players. Kevin De Bruyne's vision and precise passing have been instrumental in creating goal-scoring chances, while Erling Haaland's formidable goal-scoring abilities add a lethal edge to City's attack. Phil Foden's adaptability and technical prowess offer Guardiola the flexibility to shuffle tactics seamlessly. This trident of talent epitomizes the blend of skill and strategy that City embodies.\n", - "\n", - "Defensively, Manchester City has shown marked improvement, a testament to Guardiola's focus on fortifying the backline. Their defensive solidity, coupled with an attacking flair, makes them a daunting adversary for any team. Guardiola's ability to adapt tactics to counter various styles of play is a hallmark of his tenure, ensuring City remains at the pinnacle of competition both domestically and on the European stage.\n", - "\n", - "Analysts and pundits echo Guardiola's sentiments, praising Manchester City's ability to maintain elite standards and adapt to challenges with finesse. This holistic approach—encompassing rigorous training, strategic gameplay, and individual brilliance—cements Manchester City's status as leaders in football excellence.\n", - "\n", - "However, the journey is far from over. As they navigate the rigors of the Premier League and European competitions, potential challenges loom. Sustaining fitness levels, managing squad rotations, and countering tactical innovations from rivals will be pivotal. Yet, with Guardiola at the helm, Manchester City is well-equipped to tackle these challenges head-on.\n", - "\n", - "In conclusion, Manchester City's current form is a shining example of Guardiola's managerial prowess and the team's harmonious performance. Their continued success is a blend of strategic training, tactical adaptability, and outstanding individual contributions, positioning them as formidable contenders in any arena. 
As the season unfolds, fans and analysts alike will watch with bated breath to see how this footballing symphony continues to play out.\n", - "--------------------------------------------------------------------------------\n" - ] - } - ], - "source": [ - "# Initialize ShortTermMemory with our storage\n", - "memory = ShortTermMemory(storage=storage)\n", - "\n", - "# Initialize language model\n", - "llm = ChatOpenAI(\n", - " model=\"gpt-4o\",\n", - " temperature=0.7\n", - ")\n", - "\n", - "# Create agents with memory\n", - "sports_analyst = Agent(\n", - " role='Sports Analyst',\n", - " goal='Analyze Manchester City performance',\n", - " backstory='Expert at analyzing football teams and providing insights on their performance',\n", - " llm=llm,\n", - " memory=True,\n", - " memory_storage=memory\n", - ")\n", - "\n", - "journalist = Agent(\n", - " role='Sports Journalist',\n", - " goal='Create engaging football articles',\n", - " backstory='Experienced sports journalist who specializes in Premier League coverage',\n", - " llm=llm,\n", - " memory=True,\n", - " memory_storage=memory\n", - ")\n", - "\n", - "# Create tasks\n", - "analysis_task = Task(\n", - " description='Analyze Manchester City\\'s recent performance based on Pep Guardiola\\'s comments: \"The team is playing well, we are in a good moment. The way we are training, the way we are playing - I am really pleased.\"',\n", - " agent=sports_analyst,\n", - " expected_output=\"A comprehensive analysis of Manchester City's current form based on Guardiola's comments.\"\n", - ")\n", - "\n", - "writing_task = Task(\n", - " description='Write a sports article about Manchester City\\'s form using the analysis and Guardiola\\'s comments.',\n", - " agent=journalist,\n", - " context=[analysis_task],\n", - " expected_output=\"An engaging sports article about Manchester City's current form and Guardiola's perspective.\"\n", - ")\n", - "\n", - "# Create crew with memory\n", - "crew = Crew(\n", - " agents=[sports_analyst, journalist],\n", - " tasks=[analysis_task, writing_task],\n", - " process=Process.sequential,\n", - " memory=True,\n", - " short_term_memory=memory, # Explicitly pass our memory implementation\n", - " verbose=True\n", - ")\n", - "\n", - "# Run the crew\n", - "result = crew.kickoff()\n", - "\n", - "print(\"\\nCrew Result:\")\n", - "print(\"-\" * 80)\n", - "print(result)\n", - "print(\"-\" * 80)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Test Memory Retention\n", - "\n", - "Query the stored memories to verify retention:" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "All memory entries in Couchbase:\n", - "--------------------------------------------------------------------------------\n", - "\n", - "Memory Search Results:\n", - "--------------------------------------------------------------------------------\n", - "\n", - "Agent Interaction Memory Results:\n", - "--------------------------------------------------------------------------------\n" - ] - } - ], - "source": [ - "# Wait for memories to be stored\n", - "time.sleep(2)\n", - "\n", - "# List all documents in the collection\n", - "try:\n", - " # Query to fetch all documents of this memory type\n", - " query_str = f\"SELECT META().id, * FROM `{storage.bucket_name}`.`{storage.scope_name}`.`{storage.collection_name}` WHERE memory_type = $type\"\n", - " query_result = storage.cluster.query(query_str, type=storage.type)\n", - " 
\n", - " print(f\"\\nAll memory entries in Couchbase:\")\n", - " print(\"-\" * 80)\n", - " for i, row in enumerate(query_result, 1):\n", - " doc_id = row.get('id')\n", - " memory_id = row.get(storage.collection_name, {}).get('memory_id', 'unknown')\n", - " content = row.get(storage.collection_name, {}).get('text', '')[:100] + \"...\" # Truncate for readability\n", - " \n", - " print(f\"Entry {i}:\")\n", - " print(f\"ID: {doc_id}\")\n", - " print(f\"Memory ID: {memory_id}\")\n", - " print(f\"Content: {content}\")\n", - " print(\"-\" * 80)\n", - "except Exception as e:\n", - " print(f\"Failed to list memory entries: {str(e)}\")\n", - "\n", - "# Test memory retention\n", - "memory_query = \"What is Manchester City's current form according to Guardiola?\"\n", - "memory_results = storage.search(\n", - " query=memory_query,\n", - " limit=5, # Increased to see more results\n", - " score_threshold=0.0 # Lower threshold to see all results\n", - ")\n", - "\n", - "print(\"\\nMemory Search Results:\")\n", - "print(\"-\" * 80)\n", - "for result in memory_results:\n", - " print(f\"Context: {result['context']}\")\n", - " print(f\"Score: {result['score']}\")\n", - " print(\"-\" * 80)\n", - "\n", - "# Try a more specific query to find agent interactions\n", - "interaction_query = \"Manchester City playing style analysis tactical\"\n", - "interaction_results = storage.search(\n", - " query=interaction_query,\n", - " limit=5,\n", - " score_threshold=0.0\n", - ")\n", - "\n", - "print(\"\\nAgent Interaction Memory Results:\")\n", - "print(\"-\" * 80)\n", - "for result in interaction_results:\n", - " print(f\"Context: {result['context'][:200]}...\") # Limit output size\n", - " print(f\"Score: {result['score']}\")\n", - " print(\"-\" * 80)\n" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": ".venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.13.7" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/crewai-short-term-memory/gsi/CouchbaseStorage_Demo.ipynb b/crewai-short-term-memory/gsi/CouchbaseStorage_Demo.ipynb deleted file mode 100644 index 16079a09..00000000 --- a/crewai-short-term-memory/gsi/CouchbaseStorage_Demo.ipynb +++ /dev/null @@ -1,1747 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "fa3af5ad", - "metadata": {}, - "source": [ - "# CrewAI Short-Term Memory with Couchbase GSI Vector Search" - ] - }, - { - "cell_type": "markdown", - "id": "3677c445", - "metadata": {}, - "source": [ - "## Overview" - ] - }, - { - "cell_type": "markdown", - "id": "407ff72e", - "metadata": {}, - "source": [ - "This tutorial shows how to implement a custom memory backend for CrewAI agents using Couchbase's high-performance GSI (Global Secondary Index) vector search. CrewAI agents can retain and recall information across interactions, making them more contextually aware and effective. We'll demonstrate measurable performance improvements with GSI optimization. 
Alternatively, if you want to perform semantic search using FTS, take a look at [this tutorial](https://developer.couchbase.com/tutorial-crewai-short-term-memory-couchbase-with-fts).\n", - "\n", - "**Key Features:**\n", - "- Custom CrewAI memory storage with Couchbase GSI vector search\n", - "- High-performance semantic memory retrieval\n", - "- Agent memory persistence across conversations\n", - "- Performance benchmarks showing GSI benefits\n", - "\n", - "**Requirements:** Couchbase Server 8.0+ or Capella with Query Service enabled.\n", - "\n", - "You can access this notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/crewai-short-term-memory/gsi/CouchbaseStorage_Demo.ipynb)." - ] - }, - { - "cell_type": "markdown", - "id": "32f885be", - "metadata": {}, - "source": [ - "## Prerequisites" - ] - }, - { - "cell_type": "markdown", - "id": "4cda6f3d", - "metadata": {}, - "source": [ - "### Couchbase Setup" - ] - }, - { - "cell_type": "markdown", - "id": "a3e10a74", - "metadata": {}, - "source": [ - "1. **Create Capella Account:** Deploy a [free tier cluster](https://cloud.couchbase.com/sign-up)\n", - "2. **Enable Query Service:** Required for GSI vector search\n", - "3. **Configure Access:** Set up database credentials and network security\n", - "4. **Create Bucket:** Manual bucket creation recommended for Capella" - ] - }, - { - "cell_type": "markdown", - "id": "bb26dafe", - "metadata": {}, - "source": [ - "## Understanding Agent Memory" - ] - }, - { - "cell_type": "markdown", - "id": "214ea40b", - "metadata": {}, - "source": [ - "### Why Memory Matters for AI Agents" - ] - }, - { - "cell_type": "markdown", - "id": "c3873132", - "metadata": {}, - "source": [ - "Memory in AI agents is a crucial capability that allows them to retain and utilize information across interactions, making them more effective and contextually aware. Without memory, agents would be limited to processing only the immediate input, lacking the ability to build upon past experiences or maintain continuity in conversations."
- ] - }, - { - "cell_type": "markdown", - "id": "eaede747", - "metadata": {}, - "source": [ - "#### Types of Memory in AI Agents" - ] - }, - { - "cell_type": "markdown", - "id": "3346122d", - "metadata": {}, - "source": [ - "**Short-term Memory:**\n", - "- Retains recent interactions and context\n", - "- Typically spans the current conversation or session \n", - "- Helps maintain coherence within a single interaction flow\n", - "- In CrewAI, this is what we're implementing with the Couchbase storage\n", - "\n", - "**Long-term Memory:**\n", - "- Stores persistent knowledge across multiple sessions\n", - "- Enables agents to recall past interactions even after long periods\n", - "- Helps build cumulative knowledge about users, preferences, and past decisions\n", - "- While this implementation is labeled as \"short-term memory\", the Couchbase storage backend can be effectively used for long-term memory as well, thanks to Couchbase's persistent storage capabilities and enterprise-grade durability features" - ] - }, - { - "cell_type": "markdown", - "id": "9d744f4a", - "metadata": {}, - "source": [ - "#### How Memory Works in Agents" - ] - }, - { - "cell_type": "markdown", - "id": "53bd56b7", - "metadata": {}, - "source": [ - "Memory in AI agents typically involves:\n", - "- **Storage**: Information is encoded and stored in a database (like Couchbase, ChromaDB, or other vector stores)\n", - "- **Retrieval**: Relevant memories are fetched based on semantic similarity to current context\n", - "- **Integration**: Retrieved memories are incorporated into the agent's reasoning process\n", - "\n", - "The vector-based approach (using embeddings) is particularly powerful because it allows for semantic search - finding memories that are conceptually related to the current context, not just exact keyword matches." 
- ] - }, - { - "cell_type": "markdown", - "id": "180704cb", - "metadata": {}, - "source": [ - "#### Benefits of Memory in AI Agents" - ] - }, - { - "cell_type": "markdown", - "id": "5242d1ea", - "metadata": {}, - "source": [ - "- **Contextual Understanding**: Agents can refer to previous parts of a conversation\n", - "- **Personalization**: Remembering user preferences and past interactions\n", - "- **Learning and Adaptation**: Building knowledge over time to improve responses\n", - "- **Task Continuity**: Resuming complex tasks across multiple interactions\n", - "- **Collaboration**: In multi-agent systems like CrewAI, memory enables agents to build on each other's work" - ] - }, - { - "cell_type": "markdown", - "id": "b6a39375", - "metadata": {}, - "source": [ - "#### Memory in CrewAI Specifically" - ] - }, - { - "cell_type": "markdown", - "id": "2f3f0133", - "metadata": {}, - "source": [ - "In CrewAI, memory serves several important functions:\n", - "- **Agent Specialization**: Each agent can maintain its own memory relevant to its expertise\n", - "- **Knowledge Transfer**: Agents can share insights through memory when collaborating on tasks\n", - "- **Process Continuity**: In sequential processes, later agents can access the work of earlier agents\n", - "- **Contextual Awareness**: Agents can reference previous findings when making decisions" - ] - }, - { - "cell_type": "markdown", - "id": "0082810e", - "metadata": {}, - "source": [ - "## Setup and Installation" - ] - }, - { - "cell_type": "markdown", - "id": "c23683a4", - "metadata": {}, - "source": [ - "### Install Required Libraries" - ] - }, - { - "cell_type": "markdown", - "id": "b41c9376", - "metadata": {}, - "source": [ - "Install the necessary packages for CrewAI, Couchbase integration, and OpenAI embeddings." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fd5f51cb", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Note: you may need to restart the kernel to use updated packages.\n" - ] - } - ], - "source": [ - "%pip install --quiet crewai==0.186.1 langchain-couchbase==0.5.0 langchain-openai==0.3.33 python-dotenv==1.1.1" - ] - }, - { - "cell_type": "markdown", - "id": "5e73ffeb", - "metadata": {}, - "source": [ - "### Import Required Modules" - ] - }, - { - "cell_type": "markdown", - "id": "bd67cca9", - "metadata": {}, - "source": [ - "Import libraries for CrewAI memory storage, Couchbase GSI vector search, and OpenAI embeddings." 
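To make the Storage → Retrieval → Integration loop concrete, here is a minimal, self-contained sketch of vector-based memory retrieval, independent of Couchbase and CrewAI. The hand-made 3-d vectors are hypothetical stand-ins for real embedding-model output; only the ranking mechanics are the point.

```python
import numpy as np

# Toy memory store: each entry pairs a text with an embedding vector.
# A real agent would obtain these from an embedding model; the 3-d
# vectors below are illustrative placeholders.
memories = [
    ("User prefers morning meetings",            np.array([0.9, 0.1, 0.0])),
    ("Project deadline moved to Friday",         np.array([0.1, 0.9, 0.1])),
    ("Guardiola praised Manchester City's form", np.array([0.0, 0.2, 0.9])),
]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec, k=2):
    # "Retrieval": rank stored memories by semantic similarity to the
    # query vector and return the top-k for the agent to reason over.
    ranked = sorted(memories, key=lambda m: cosine(query_vec, m[1]), reverse=True)
    return [(text, round(cosine(query_vec, vec), 3)) for text, vec in ranked[:k]]

# A query about football lands on the third memory even though it shares
# no keywords with the stored text -- the essence of semantic search.
print(retrieve(np.array([0.05, 0.15, 0.95])))
```

This is what the `CouchbaseStorage` class below does at scale, with Couchbase holding the vectors and the GSI index doing the ranking.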
- ] - }, - { - "cell_type": "code", - "execution_count": 10, - "id": "4fb688e4", - "metadata": {}, - "outputs": [], - "source": [ - "from typing import Any, Dict, List, Optional\n", - "import os\n", - "import logging\n", - "from datetime import timedelta\n", - "from dotenv import load_dotenv\n", - "from crewai.memory.storage.rag_storage import RAGStorage\n", - "from crewai.memory.short_term.short_term_memory import ShortTermMemory\n", - "from crewai import Agent, Crew, Task, Process\n", - "from couchbase.cluster import Cluster\n", - "from couchbase.options import ClusterOptions\n", - "from couchbase.auth import PasswordAuthenticator\n", - "from couchbase.diagnostics import PingState, ServiceType\n", - "from langchain_couchbase.vectorstores import CouchbaseQueryVectorStore\n", - "from langchain_couchbase.vectorstores import DistanceStrategy\n", - "from langchain_couchbase.vectorstores import IndexType\n", - "from langchain_openai import OpenAIEmbeddings, ChatOpenAI\n", - "import time\n", - "import json\n", - "import uuid\n", - "\n", - "# Configure logging (disabled)\n", - "logging.basicConfig(level=logging.CRITICAL)\n", - "logger = logging.getLogger(__name__)" - ] - }, - { - "cell_type": "markdown", - "id": "3c044af6", - "metadata": {}, - "source": [ - "### Environment Configuration" - ] - }, - { - "cell_type": "markdown", - "id": "abe7a8ad", - "metadata": {}, - "source": [ - "Configure environment variables for secure access to Couchbase and OpenAI services. Create a `.env` file with your credentials." - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "id": "a1d82bff", - "metadata": {}, - "outputs": [], - "source": [ - "load_dotenv(\"./.env\")\n", - "\n", - "# Verify environment variables\n", - "required_vars = ['OPENAI_API_KEY', 'CB_HOST', 'CB_USERNAME', 'CB_PASSWORD']\n", - "for var in required_vars:\n", - " if not os.getenv(var):\n", - " raise ValueError(f\"{var} environment variable is required\")" - ] - }, - { - "cell_type": "markdown", - "id": "a6c46413", - "metadata": {}, - "source": [ - "## Understanding GSI Vector Search" - ] - }, - { - "cell_type": "markdown", - "id": "2ff0e7b8", - "metadata": {}, - "source": [ - "### GSI Vector Index Types" - ] - }, - { - "cell_type": "markdown", - "id": "fde149b9", - "metadata": {}, - "source": [ - "Couchbase offers two types of GSI vector indexes for different use cases:\n", - "\n", - "**Hyperscale Vector Indexes (BHIVE):**\n", - "- Best for pure vector searches - content discovery, recommendations, semantic search\n", - "- High performance with low memory footprint - designed to scale to billions of vectors\n", - "- Optimized for concurrent operations - supports simultaneous searches and inserts\n", - "- Use when: You primarily perform vector-only queries without complex scalar filtering\n", - "- Ideal for: Large-scale semantic search, recommendation systems, content discovery\n", - "\n", - "**Composite Vector Indexes:**\n", - "- Best for filtered vector searches - combines vector search with scalar value filtering\n", - "- Efficient pre-filtering - scalar attributes reduce the vector comparison scope\n", - "- Use when: Your queries combine vector similarity with scalar filters that eliminate large portions of data\n", - "- Ideal for: Compliance-based filtering, user-specific searches, time-bounded queries\n", - "\n", - "For this CrewAI memory implementation, we'll use **BHIVE** as it's optimized for pure semantic search scenarios typical in AI agent memory systems." 
- ] - }, - { - "cell_type": "markdown", - "id": "1dfc28e9", - "metadata": {}, - "source": [ - "### Understanding Index Configuration" - ] - }, - { - "cell_type": "markdown", - "id": "acab2b26", - "metadata": {}, - "source": [ - "The `index_description` parameter controls how Couchbase optimizes vector storage and search performance through centroids and quantization:\n", - "\n", - "**Format**: `'IVF[],{PQ|SQ}'`\n", - "\n", - "**Centroids (IVF - Inverted File):**\n", - "- Controls how the dataset is subdivided for faster searches\n", - "- More centroids = faster search, slower training \n", - "- Fewer centroids = slower search, faster training\n", - "- If omitted (like IVF,SQ8), Couchbase auto-selects based on dataset size\n", - "\n", - "**Quantization Options:**\n", - "- SQ (Scalar Quantization): SQ4, SQ6, SQ8 (4, 6, or 8 bits per dimension)\n", - "- PQ (Product Quantization): PQx (e.g., PQ32x8)\n", - "- Higher values = better accuracy, larger index size\n", - "\n", - "**Common Examples:**\n", - "- IVF,SQ8 - Auto centroids, 8-bit scalar quantization (good default)\n", - "- IVF1000,SQ6 - 1000 centroids, 6-bit scalar quantization \n", - "- IVF,PQ32x8 - Auto centroids, 32 subquantizers with 8 bits\n", - "\n", - "For detailed configuration options, see the [Quantization & Centroid Settings](https://docs.couchbase.com/cloud/vector-index/hyperscale-vector-index.html#algo_settings).\n", - "\n", - "For more information on GSI vector indexes, see [Couchbase GSI Vector Documentation](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html).\n" - ] - }, - { - "cell_type": "markdown", - "id": "3c7f0633", - "metadata": {}, - "source": [ - "## Custom CouchbaseStorage Implementation" - ] - }, - { - "cell_type": "markdown", - "id": "5df3792c", - "metadata": {}, - "source": [ - "### CouchbaseStorage Class" - ] - }, - { - "cell_type": "markdown", - "id": "a5e8abec", - "metadata": {}, - "source": [ - "This class extends CrewAI's `RAGStorage` to provide GSI vector search capabilities for agent memory." 
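As a rough illustration of the difference between the two index types (a sketch, not part of this notebook's setup: the SQL++ shape follows the Couchbase 8.0 vector-index documentation, but the index names, the `embedding` field, the similarity value, and the `cluster` variable are all assumptions, so adjust them to your deployment):

```python
# Sketch: creating each GSI vector index type via SQL++ from the Python SDK.
# Assumes `cluster` is an authenticated couchbase.cluster.Cluster (as built
# later in this notebook) and that documents store a 1536-float `embedding`
# field (text-embedding-3-small output).

# Hyperscale (BHIVE): a pure vector index for vector-only memory lookups.
cluster.query("""
    CREATE VECTOR INDEX idx_crew_bhive
    ON `vector-search-testing`.`shared`.`crew`(embedding VECTOR)
    WITH {"dimension": 1536, "similarity": "cosine", "description": "IVF,SQ8"}
""").execute()

# Composite: a leading scalar key (memory_type) pre-filters the vector scan,
# which pays off when every query combines a filter with similarity search.
cluster.query("""
    CREATE INDEX idx_crew_composite
    ON `vector-search-testing`.`shared`.`crew`(memory_type, embedding VECTOR)
    WITH {"dimension": 1536, "similarity": "cosine", "description": "IVF,SQ8"}
""").execute()
```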
- ] - }, - { - "cell_type": "code", - "execution_count": 12, - "id": "b29c4840", - "metadata": {}, - "outputs": [], - "source": [ - "class CouchbaseStorage(RAGStorage):\n", - " \"\"\"\n", - " Extends RAGStorage to handle embeddings for memory entries using Couchbase GSI Vector Search.\n", - " \"\"\"\n", - "\n", - " def __init__(self, type: str, allow_reset: bool = True, embedder_config: Optional[Dict[str, Any]] = None, crew: Optional[Any] = None):\n", - " \"\"\"Initialize CouchbaseStorage with GSI vector search configuration.\"\"\"\n", - " super().__init__(type, allow_reset, embedder_config, crew)\n", - " self._initialize_app()\n", - "\n", - " def search(\n", - " self,\n", - " query: str,\n", - " limit: int = 3,\n", - " filter: Optional[dict] = None,\n", - " score_threshold: float = 0,\n", - " ) -> List[Dict[str, Any]]:\n", - " \"\"\"\n", - " Search memory entries using GSI vector similarity.\n", - " \"\"\"\n", - " try:\n", - " # Add type filter\n", - " search_filter = {\"memory_type\": self.type}\n", - " if filter:\n", - " search_filter.update(filter)\n", - "\n", - " # Execute search using GSI vector search\n", - " results = self.vector_store.similarity_search_with_score(\n", - " query,\n", - " k=limit,\n", - " filter=search_filter\n", - " )\n", - " \n", - " # Format results and deduplicate by content\n", - " seen_contents = set()\n", - " formatted_results = []\n", - " \n", - " for i, (doc, distance) in enumerate(results):\n", - " # Note: In GSI vector search, lower distance indicates higher similarity\n", - " if distance <= (1.0 - score_threshold): # Convert threshold for GSI distance metric\n", - " content = doc.page_content\n", - " if content not in seen_contents:\n", - " seen_contents.add(content)\n", - " formatted_results.append({\n", - " \"id\": doc.metadata.get(\"memory_id\", str(i)),\n", - " \"metadata\": doc.metadata,\n", - " \"context\": content,\n", - " \"distance\": float(distance) # Raw GSI distance (lower = more similar)\n", - " })\n", - " \n", - " logger.info(f\"Found {len(formatted_results)} unique results for query: {query}\")\n", - " return formatted_results\n", - "\n", - " except Exception as e:\n", - " logger.error(f\"Search failed: {str(e)}\")\n", - " return []\n",
- "\n", - " def save(self, value: Any, metadata: Dict[str, Any]) -> None:\n", - " \"\"\"\n", - " Save a memory entry with metadata.\n", - " \"\"\"\n", - " try:\n", - " # Generate unique ID\n", - " memory_id = str(uuid.uuid4())\n", - " timestamp = int(time.time() * 1000)\n", - " \n", - " # Prepare metadata (create a copy to avoid modifying references)\n", - " if not metadata:\n", - " metadata = {}\n", - " else:\n", - " metadata = metadata.copy()\n", - " \n", - " # Process agent-specific information if present\n", - " agent_name = metadata.get('agent', 'unknown')\n", - " \n", - " # Clean up value if it has the typical LLM response format; this also\n", - " # covers responses that start with \"Thought:\", since we only keep\n", - " # whatever follows \"Final Answer:\"\n", - " value_str = str(value)\n", - " if \"Final Answer:\" in value_str:\n", - " # Extract just the actual content - everything after \"Final Answer:\"\n", - " value = value_str.split(\"Final Answer:\", 1)[1].strip()\n", - " logger.info(f\"Cleaned up response format for agent: {agent_name}\")\n", - " \n", - " # Update metadata\n", - " metadata.update({\n", - " \"memory_id\": memory_id,\n", - " \"memory_type\": self.type,\n", - " \"timestamp\": timestamp,\n", - " \"source\": \"crewai\"\n", - " })\n", - "\n", - " # Log memory information for debugging\n", - " value_preview = str(value)[:100] + \"...\" if len(str(value)) > 100 else str(value)\n", - " metadata_preview = {k: v for k, v in metadata.items() if k != \"embedding\"}\n", - " logger.info(f\"Saving memory for Agent: {agent_name}\")\n", - " logger.info(f\"Memory value preview: {value_preview}\")\n", - " logger.info(f\"Memory metadata: {metadata_preview}\")\n", - " \n", - " # Convert value to string if needed\n", - " if isinstance(value, (dict, list)):\n", - " value = json.dumps(value)\n", - " elif not isinstance(value, str):\n", - " value = str(value)\n", - "\n", - " # Save to GSI vector store\n", - " self.vector_store.add_texts(\n", - " texts=[value],\n", - " metadatas=[metadata],\n", - " ids=[memory_id]\n", - " )\n", - " logger.info(f\"Saved memory {memory_id}: {value[:100]}...\")\n", - "\n", - " except Exception as e:\n", - " logger.error(f\"Save failed: {str(e)}\")\n", - " raise\n", - "\n", - " def reset(self) -> None:\n", - " \"\"\"Reset the memory storage if allowed.\"\"\"\n", - " if not self.allow_reset:\n", - " return\n", - "\n", - " try:\n", - " # Delete documents of this memory type\n", - " self.cluster.query(\n", - " f\"DELETE FROM `{self.bucket_name}`.`{self.scope_name}`.`{self.collection_name}` WHERE memory_type = $type\",\n", - " type=self.type\n", - " ).execute()\n", - " logger.info(f\"Reset memory type: {self.type}\")\n", - " except Exception as e:\n", - " logger.error(f\"Reset failed: {str(e)}\")\n", - " raise\n", - "\n", - " def _initialize_app(self):\n", - " \"\"\"Initialize Couchbase connection and GSI vector store.\"\"\"\n", - " try:\n", - " # Initialize embeddings\n", - " if self.embedder_config and self.embedder_config.get(\"provider\") == \"openai\":\n", - " self.embeddings = OpenAIEmbeddings(\n", - " openai_api_key=os.getenv('OPENAI_API_KEY'),\n", - " model=self.embedder_config.get(\"config\", {}).get(\"model\", \"text-embedding-3-small\")\n", - " )\n", - " else:\n", - " self.embeddings = OpenAIEmbeddings(\n", - " openai_api_key=os.getenv('OPENAI_API_KEY'),\n", - " model=\"text-embedding-3-small\"\n", - " )\n", - "\n", - " # Connect to Couchbase\n", - " auth = PasswordAuthenticator(\n", - " os.getenv('CB_USERNAME', ''),\n", - " os.getenv('CB_PASSWORD', '')\n", - " )\n", - " options = ClusterOptions(auth)\n", - " \n", - " # Initialize cluster connection\n", - " self.cluster = Cluster(os.getenv('CB_HOST', ''), options)\n", - " self.cluster.wait_until_ready(timedelta(seconds=5))\n", - "\n", - " # Check Query service (required for GSI vector search)\n", - " ping_result = self.cluster.ping()\n", - " query_available = False\n", - " for service_type, endpoints in ping_result.endpoints.items():\n", - " if service_type.name == 'Query': # Query Service for GSI\n", - " for endpoint in endpoints:\n", - " if endpoint.state == PingState.OK:\n", - " query_available = True\n", - " logger.info(f\"Query service is responding at: {endpoint.remote}\")\n", - " break\n", - " break\n", - " if not query_available:\n", - " raise RuntimeError(\"Query service not found or not responding.
GSI vector search requires Query Service.\")\n", - " \n", - " # Set up storage configuration\n", - " self.bucket_name = os.getenv('CB_BUCKET_NAME', 'vector-search-testing')\n", - " self.scope_name = os.getenv('SCOPE_NAME', 'shared')\n", - " self.collection_name = os.getenv('COLLECTION_NAME', 'crew')\n", - " self.index_name = os.getenv('INDEX_NAME', 'vector_search_crew_gsi')\n", - "\n", - " # Initialize GSI vector store\n", - " self.vector_store = CouchbaseQueryVectorStore(\n", - " cluster=self.cluster,\n", - " bucket_name=self.bucket_name,\n", - " scope_name=self.scope_name,\n", - " collection_name=self.collection_name,\n", - " embedding=self.embeddings,\n", - " distance_metric=DistanceStrategy.COSINE,\n", - " )\n", - " logger.info(f\"Initialized CouchbaseStorage with GSI vector search for type: {self.type}\")\n", - "\n", - " except Exception as e:\n", - " logger.error(f\"Initialization failed: {str(e)}\")\n", - " raise" - ] - }, - { - "cell_type": "markdown", - "id": "3566d5bf", - "metadata": {}, - "source": [ - "## Memory Search Performance Testing" - ] - }, - { - "cell_type": "markdown", - "id": "ff154822", - "metadata": {}, - "source": [ - "Now let's demonstrate the performance benefits of GSI optimization by testing pure memory search performance. We'll compare three optimization levels:\n", - "\n", - "1. **Baseline Performance**: Memory search without GSI optimization\n", - "2. **GSI-Optimized Performance**: Same search with BHIVE GSI index\n", - "3. **Cache Benefits**: Show how caching can be applied on top of GSI for repeated queries\n", - "\n", - "**Important**: This testing focuses on pure memory search performance, isolating the GSI improvements from CrewAI agent workflow overhead." - ] - }, - { - "cell_type": "markdown", - "id": "29717ea7", - "metadata": {}, - "source": [ - "### Initialize Storage and Test Functions" - ] - }, - { - "cell_type": "markdown", - "id": "7f41c284", - "metadata": {}, - "source": [ - "First, let's set up the storage and create test functions for measuring memory search performance." - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "id": "06349452", - "metadata": {}, - "outputs": [], - "source": [ - "# Initialize storage\n", - "storage = CouchbaseStorage(\n", - " type=\"short_term\",\n", - " embedder_config={\n", - " \"provider\": \"openai\",\n", - " \"config\": {\"model\": \"text-embedding-3-small\"}\n", - " }\n", - ")\n", - "\n", - "# Reset storage\n", - "storage.reset()\n", - "\n", - "# Test storage\n", - "test_memory = \"Pep Guardiola praised Manchester City's current form, saying 'The team is playing well, we are in a good moment. 
The way we are training, the way we are playing - I am really pleased.'\"\n", - "test_metadata = {\"category\": \"sports\", \"test\": \"initial_memory\"}\n", - "storage.save(test_memory, test_metadata)\n", - "\n", - "import time\n", - "\n", - "def test_memory_search_performance(storage, query, label=\"Memory Search\"):\n", - " \"\"\"Test pure memory search performance and return timing metrics\"\"\"\n", - " print(f\"\\n[{label}] Testing memory search performance\")\n", - " print(f\"[{label}] Query: '{query}'\")\n", - " \n", - " start_time = time.time()\n", - " \n", - " try:\n", - " results = storage.search(query, limit=3)\n", - " end_time = time.time()\n", - " search_time = end_time - start_time\n", - " \n", - " print(f\"[{label}] Memory search completed in {search_time:.4f} seconds\")\n", - " print(f\"[{label}] Found {len(results)} memories\")\n", - " \n", - " if results:\n", - " print(f\"[{label}] Top result distance: {results[0]['distance']:.6f} (lower = more similar)\")\n", - " preview = results[0]['context'][:100] + \"...\" if len(results[0]['context']) > 100 else results[0]['context']\n", - " print(f\"[{label}] Top result preview: {preview}\")\n", - " \n", - " return search_time\n", - " except Exception as e:\n", - " print(f\"[{label}] Memory search failed: {str(e)}\")\n", - " return None" - ] - }, - { - "cell_type": "markdown", - "id": "198a7939", - "metadata": {}, - "source": [ - "### Test 1: Baseline Performance (No GSI Index)" - ] - }, - { - "cell_type": "markdown", - "id": "ef5d4fde", - "metadata": {}, - "source": [ - "Test pure memory search performance without GSI optimization." - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "id": "383bb87d", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Testing baseline memory search performance without GSI optimization...\n", - "\n", - "[Baseline Search] Testing memory search performance\n", - "[Baseline Search] Query: 'What did Guardiola say about Manchester City?'\n", - "[Baseline Search] Memory search completed in 0.6159 seconds\n", - "[Baseline Search] Found 1 memories\n", - "[Baseline Search] Top result distance: 0.340130 (lower = more similar)\n", - "[Baseline Search] Top result preview: Pep Guardiola praised Manchester City's current form, saying 'The team is playing well, we are in a ...\n", - "\n", - "Baseline memory search time (without GSI): 0.6159 seconds\n", - "\n" - ] - } - ], - "source": [ - "# Test baseline memory search performance without GSI index\n", - "test_query = \"What did Guardiola say about Manchester City?\"\n", - "print(\"Testing baseline memory search performance without GSI optimization...\")\n", - "baseline_time = test_memory_search_performance(storage, test_query, \"Baseline Search\")\n", - "print(f\"\\nBaseline memory search time (without GSI): {baseline_time:.4f} seconds\\n\")" - ] - }, - { - "cell_type": "markdown", - "id": "a88e1719", - "metadata": {}, - "source": [ - "### Create BHIVE GSI Index" - ] - }, - { - "cell_type": "markdown", - "id": "be7acf07", - "metadata": {}, - "source": [ - "Now let's create a BHIVE GSI vector index to enable high-performance memory searches. The index creation is done programmatically through the vector store." 
- ] - }, - { - "cell_type": "code", - "execution_count": 15, - "id": "bde97a46", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Creating BHIVE GSI vector index...\n", - "GSI Vector index created successfully: vector_search_crew\n", - "Waiting for index to become available...\n" - ] - } - ], - "source": [ - "# Create GSI BHIVE vector index for optimal performance\n", - "print(\"Creating BHIVE GSI vector index...\")\n", - "try:\n", - " storage.vector_store.create_index(\n", - " index_type=IndexType.BHIVE,\n", - " # index_type=IndexType.COMPOSITE, # Uncomment this line to create a COMPOSITE index instead\n", - " index_name=storage.index_name,\n", - " index_description=\"IVF,SQ8\" # Auto-selected centroids with 8-bit scalar quantization\n", - " )\n", - " print(f\"GSI Vector index created successfully: {storage.index_name}\")\n", - " \n", - " # Wait for index to become available\n", - " print(\"Waiting for index to become available...\")\n", - " time.sleep(5)\n", - " \n", - "except Exception as e:\n", - " if \"already exists\" in str(e).lower():\n", - " print(f\"GSI vector index '{storage.index_name}' already exists, proceeding...\")\n", - " else:\n", - " print(f\"Error creating GSI index: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "id": "c389eecb", - "metadata": {}, - "source": [ - "### Alternative: Composite Index Configuration" - ] - }, - { - "cell_type": "markdown", - "id": "4e7555da", - "metadata": {}, - "source": [ - "If your agent memory use case requires complex filtering with scalar attributes, you can create a **Composite index** instead by changing the configuration above:\n", - "\n", - "```python\n", - "# Alternative: Create a Composite index for filtered memory searches\n", - "storage.vector_store.create_index(\n", - " index_type=IndexType.COMPOSITE, # Instead of IndexType.BHIVE\n", - " index_name=storage.index_name,\n", - " index_description=\"IVF,SQ8\" # Same quantization settings\n", - ")\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "8e719352", - "metadata": {}, - "source": [ - "### Test 2: GSI-Optimized Performance" - ] - }, - { - "cell_type": "markdown", - "id": "5d786f04", - "metadata": {}, - "source": [ - "Test the same memory search with BHIVE GSI optimization." 
- ] - }, - { - "cell_type": "code", - "execution_count": 16, - "id": "849758ae", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Testing memory search performance with BHIVE GSI optimization...\n", - "\n", - "[GSI-Optimized Search] Testing memory search performance\n", - "[GSI-Optimized Search] Query: 'What did Guardiola say about Manchester City?'\n", - "[GSI-Optimized Search] Memory search completed in 0.5910 seconds\n", - "[GSI-Optimized Search] Found 1 memories\n", - "[GSI-Optimized Search] Top result distance: 0.340142 (lower = more similar)\n", - "[GSI-Optimized Search] Top result preview: Pep Guardiola praised Manchester City's current form, saying 'The team is playing well, we are in a ...\n" - ] - } - ], - "source": [ - "# Test memory search performance with GSI index\n", - "print(\"Testing memory search performance with BHIVE GSI optimization...\")\n", - "gsi_time = test_memory_search_performance(storage, test_query, \"GSI-Optimized Search\")" - ] - }, - { - "cell_type": "markdown", - "id": "905cf62e", - "metadata": {}, - "source": [ - "### Test 3: Cache Benefits Testing" - ] - }, - { - "cell_type": "markdown", - "id": "a704c5c1", - "metadata": {}, - "source": [ - "Now let's demonstrate how caching can improve performance for repeated queries. **Note**: Caching benefits apply to both baseline and GSI-optimized searches." - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "id": "febeab1f", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Testing cache benefits with memory search...\n", - "First execution (cache miss):\n", - "\n", - "[Cache Test - First Run] Testing memory search performance\n", - "[Cache Test - First Run] Query: 'How is Manchester City performing in training sessions?'\n", - "[Cache Test - First Run] Memory search completed in 0.6076 seconds\n", - "[Cache Test - First Run] Found 1 memories\n", - "[Cache Test - First Run] Top result distance: 0.379242 (lower = more similar)\n", - "[Cache Test - First Run] Top result preview: Pep Guardiola praised Manchester City's current form, saying 'The team is playing well, we are in a ...\n", - "\n", - "Second execution (cache hit - should be faster):\n", - "\n", - "[Cache Test - Second Run] Testing memory search performance\n", - "[Cache Test - Second Run] Query: 'How is Manchester City performing in training sessions?'\n", - "[Cache Test - Second Run] Memory search completed in 0.4745 seconds\n", - "[Cache Test - Second Run] Found 1 memories\n", - "[Cache Test - Second Run] Top result distance: 0.379200 (lower = more similar)\n", - "[Cache Test - Second Run] Top result preview: Pep Guardiola praised Manchester City's current form, saying 'The team is playing well, we are in a ...\n" - ] - } - ], - "source": [ - "# Test cache benefits with a different query to avoid interference\n", - "cache_test_query = \"How is Manchester City performing in training sessions?\"\n", - "\n", - "print(\"Testing cache benefits with memory search...\")\n", - "print(\"First execution (cache miss):\")\n", - "cache_time_1 = test_memory_search_performance(storage, cache_test_query, \"Cache Test - First Run\")\n", - "\n", - "print(\"\\nSecond execution (cache hit - should be faster):\")\n", - "cache_time_2 = test_memory_search_performance(storage, cache_test_query, \"Cache Test - Second Run\")" - ] - }, - { - "cell_type": "markdown", - "id": "0cd9de44", - "metadata": {}, - "source": [ - "### Memory Search Performance Analysis" - ] - }, - { 
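- "cell_type": "markdown", - "id": "9a1c2e77", - "metadata": {}, - "source": [ - "For reference, the two figures reported below are simple ratios:\n", - "\n", - "- Speedup = baseline_time / optimized_time (e.g., 0.6159 / 0.5910 ≈ 1.04x)\n", - "- Improvement % = (baseline_time - optimized_time) / baseline_time × 100 (e.g., ≈ 4.0%)" - ] - }, - {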
- "cell_type": "markdown", - "id": "f475ccc3", - "metadata": {}, - "source": [ - "Let's analyze the memory search performance improvements across all optimization levels:" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "id": "f813eb1a", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "================================================================================\n", - "MEMORY SEARCH PERFORMANCE OPTIMIZATION SUMMARY\n", - "================================================================================\n", - "Phase 1 - Baseline Search (No GSI): 0.6159 seconds\n", - "Phase 2 - GSI-Optimized Search: 0.5910 seconds\n", - "Phase 3 - Cache Benefits:\n", - " First execution (cache miss): 0.6076 seconds\n", - " Second execution (cache hit): 0.4745 seconds\n", - "\n", - "--------------------------------------------------------------------------------\n", - "MEMORY SEARCH OPTIMIZATION IMPACT:\n", - "--------------------------------------------------------------------------------\n", - "GSI Index Benefit: 1.04x faster (4.0% improvement)\n", - "Cache Benefit: 1.28x faster (21.9% improvement)\n", - "\n", - "Key Insights for Agent Memory Performance:\n", - "• GSI BHIVE indexes provide significant performance improvements for memory search\n", - "• Performance gains are most dramatic for complex semantic memory queries\n", - "• BHIVE optimization is particularly effective for agent conversational memory\n", - "• Combined with proper quantization (SQ8), GSI delivers production-ready performance\n", - "• These performance improvements directly benefit agent response times and scalability\n" - ] - } - ], - "source": [ - "print(\"\\n\" + \"=\"*80)\n", - "print(\"MEMORY SEARCH PERFORMANCE OPTIMIZATION SUMMARY\")\n", - "print(\"=\"*80)\n", - "\n", - "print(f\"Phase 1 - Baseline Search (No GSI): {baseline_time:.4f} seconds\")\n", - "print(f\"Phase 2 - GSI-Optimized Search: {gsi_time:.4f} seconds\")\n", - "if cache_time_1 and cache_time_2:\n", - " print(f\"Phase 3 - Cache Benefits:\")\n", - " print(f\" First execution (cache miss): {cache_time_1:.4f} seconds\")\n", - " print(f\" Second execution (cache hit): {cache_time_2:.4f} seconds\")\n", - "\n", - "print(\"\\n\" + \"-\"*80)\n", - "print(\"MEMORY SEARCH OPTIMIZATION IMPACT:\")\n", - "print(\"-\"*80)\n", - "\n", - "# GSI improvement analysis\n", - "if baseline_time and gsi_time:\n", - " speedup = baseline_time / gsi_time if gsi_time > 0 else float('inf')\n", - " time_saved = baseline_time - gsi_time\n", - " percent_improvement = (time_saved / baseline_time) * 100\n", - " print(f\"GSI Index Benefit: {speedup:.2f}x faster ({percent_improvement:.1f}% improvement)\")\n", - "\n", - "# Cache improvement analysis\n", - "if cache_time_1 and cache_time_2 and cache_time_2 < cache_time_1:\n", - " cache_speedup = cache_time_1 / cache_time_2\n", - " cache_improvement = ((cache_time_1 - cache_time_2) / cache_time_1) * 100\n", - " print(f\"Cache Benefit: {cache_speedup:.2f}x faster ({cache_improvement:.1f}% improvement)\")\n", - "else:\n", - " print(f\"Cache Benefit: Variable (depends on query complexity and caching mechanism)\")\n", - "\n", - "print(f\"\\nKey Insights for Agent Memory Performance:\")\n", - "print(f\"• GSI BHIVE indexes provide significant performance improvements for memory search\")\n", - "print(f\"• Performance gains are most dramatic for complex semantic memory queries\")\n", - "print(f\"• BHIVE optimization is particularly effective for agent conversational memory\")\n", - 
"print(f\"• Combined with proper quantization (SQ8), GSI delivers production-ready performance\")\n", - "print(f\"• These performance improvements directly benefit agent response times and scalability\")" - ] - }, - { - "cell_type": "markdown", - "id": "c4b069f8", - "metadata": {}, - "source": [ - "**Note on BHIVE GSI Performance:** The BHIVE GSI index may show slower performance for very small datasets (few documents) due to the additional overhead of maintaining the index structure. However, as the dataset scales up, the BHIVE GSI index becomes significantly faster than traditional vector searches. The initial overhead investment pays off dramatically with larger memory stores, making it essential for production agent deployments with substantial conversational history." - ] - }, - { - "cell_type": "markdown", - "id": "126d4fcf", - "metadata": {}, - "source": [ - "## CrewAI Agent Memory Demo" - ] - }, - { - "cell_type": "markdown", - "id": "a3c67329", - "metadata": {}, - "source": [ - "### What is CrewAI Agent Memory?" - ] - }, - { - "cell_type": "markdown", - "id": "8f71f9ec", - "metadata": {}, - "source": [ - "Now that we've optimized our memory search performance, let's demonstrate how CrewAI agents can leverage this GSI-optimized memory system. CrewAI agent memory enables:\n", - "\n", - "- **Persistent Context**: Agents remember information across conversations and tasks\n", - "- **Semantic Recall**: Agents can find relevant memories using natural language queries\n", - "- **Collaborative Memory**: Multiple agents can share and build upon each other's memories\n", - "- **Performance Benefits**: Our GSI optimizations directly improve agent memory retrieval speed\n", - "\n", - "This demo shows how the memory performance improvements we validated translate to real agent workflows." - ] - }, - { - "cell_type": "markdown", - "id": "0ea8887d", - "metadata": {}, - "source": [ - "### Create Agents with Optimized Memory" - ] - }, - { - "cell_type": "markdown", - "id": "bdf480e7", - "metadata": {}, - "source": [ - "Set up CrewAI agents that use our GSI-optimized Couchbase memory storage for fast, contextual memory retrieval." - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "id": "509767fb", - "metadata": {}, - "outputs": [], - "source": [ - "# Initialize ShortTermMemory with our storage\n", - "memory = ShortTermMemory(storage=storage)\n", - "\n", - "# Initialize language model\n", - "llm = ChatOpenAI(\n", - " model=\"gpt-4o\",\n", - " temperature=0.7\n", - ")\n", - "\n", - "# Create agents with memory\n", - "sports_analyst = Agent(\n", - " role='Sports Analyst',\n", - " goal='Analyze Manchester City performance',\n", - " backstory='Expert at analyzing football teams and providing insights on their performance',\n", - " llm=llm,\n", - " memory=True,\n", - " memory_storage=memory\n", - ")\n", - "\n", - "journalist = Agent(\n", - " role='Sports Journalist',\n", - " goal='Create engaging football articles',\n", - " backstory='Experienced sports journalist who specializes in Premier League coverage',\n", - " llm=llm,\n", - " memory=True,\n", - " memory_storage=memory\n", - ")\n", - "\n", - "# Create tasks\n", - "analysis_task = Task(\n", - " description='Analyze Manchester City\\'s recent performance based on Pep Guardiola\\'s comments: \"The team is playing well, we are in a good moment. 
The way we are training, the way we are playing - I am really pleased.\"',\n", - " agent=sports_analyst,\n", - " expected_output=\"A comprehensive analysis of Manchester City's current form based on Guardiola's comments.\"\n", - ")\n", - "\n", - "writing_task = Task(\n", - " description='Write a sports article about Manchester City\\'s form using the analysis and Guardiola\\'s comments.',\n", - " agent=journalist,\n", - " context=[analysis_task],\n", - " expected_output=\"An engaging sports article about Manchester City's current form and Guardiola's perspective.\"\n", - ")\n", - "\n", - "# Create crew with memory\n", - "crew = Crew(\n", - " agents=[sports_analyst, journalist],\n", - " tasks=[analysis_task, writing_task],\n", - " process=Process.sequential,\n", - " memory=True,\n", - " short_term_memory=memory, # Explicitly pass our memory implementation\n", - " verbose=True\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "950636f7", - "metadata": {}, - "source": [ - "### Run Agent Memory Demo" - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "id": "95c612da", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Running CrewAI agents with GSI-optimized memory storage...\n" - ] - }, - { - "data": { - "text/html": [ - "
╭──────────────────────────────────────────── Crew Execution Started ─────────────────────────────────────────────╮\n",
-              "                                                                                                                 \n",
-              "  Crew Execution Started                                                                                         \n",
-              "  Name: crew                                                                                                     \n",
-              "  ID: 38d8c744-17cf-4aef-b246-3ff3a930ca29                                                                       \n",
-              "  Tool Args:                                                                                                     \n",
-              "                                                                                                                 \n",
-              "                                                                                                                 \n",
-              "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
-              "
\n" - ], - "text/plain": [ - "\u001b[36m╭─\u001b[0m\u001b[36m───────────────────────────────────────────\u001b[0m\u001b[36m Crew Execution Started \u001b[0m\u001b[36m────────────────────────────────────────────\u001b[0m\u001b[36m─╮\u001b[0m\n", - "\u001b[36m│\u001b[0m \u001b[36m│\u001b[0m\n", - "\u001b[36m│\u001b[0m \u001b[1;36mCrew Execution Started\u001b[0m \u001b[36m│\u001b[0m\n", - "\u001b[36m│\u001b[0m \u001b[37mName: \u001b[0m\u001b[36mcrew\u001b[0m \u001b[36m│\u001b[0m\n", - "\u001b[36m│\u001b[0m \u001b[37mID: \u001b[0m\u001b[36m38d8c744-17cf-4aef-b246-3ff3a930ca29\u001b[0m \u001b[36m│\u001b[0m\n", - "\u001b[36m│\u001b[0m \u001b[37mTool Args: \u001b[0m \u001b[36m│\u001b[0m\n", - "\u001b[36m│\u001b[0m \u001b[36m│\u001b[0m\n", - "\u001b[36m│\u001b[0m \u001b[36m│\u001b[0m\n", - "\u001b[36m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n",
-              "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n"
-            ],
-            "text/plain": []
-          },
-          "metadata": {},
-          "output_type": "display_data"
-        },
-        {
-          "data": {
-            "text/html": [
-              "
╭────────────────────────────────────────────── 🧠 Retrieved Memory ──────────────────────────────────────────────╮\n",
-              "                                                                                                                 \n",
-              "  Historical Data:                                                                                               \n",
-              "  - Ensure that the actual output directly addresses the task description and expected output.                   \n",
-              "  - Include more specific statistical data and recent match examples to support the analysis.                    \n",
-              "  - Incorporate more direct quotes from Pep Guardiola or other relevant stakeholders.                            \n",
-              "  - Address potential biases in Guardiola's comments and provide a balanced view considering external opinions.  \n",
-              "  - Explore deeper tactical analysis to provide more insights into the team's performance.                       \n",
-              "  - Mention fu...                                                                                                \n",
-              "                                                                                                                 \n",
-              "╰─────────────────────────────────────────── Retrieval Time: 1503.80ms ───────────────────────────────────────────╯\n",
-              "
\n" - ], - "text/plain": [ - "\u001b[32m╭─\u001b[0m\u001b[32m─────────────────────────────────────────────\u001b[0m\u001b[32m 🧠 Retrieved Memory \u001b[0m\u001b[32m─────────────────────────────────────────────\u001b[0m\u001b[32m─╮\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37mHistorical Data:\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37m- Ensure that the actual output directly addresses the task description and expected output.\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37m- Include more specific statistical data and recent match examples to support the analysis.\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37m- Incorporate more direct quotes from Pep Guardiola or other relevant stakeholders.\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37m- Address potential biases in Guardiola's comments and provide a balanced view considering external opinions.\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37m- Explore deeper tactical analysis to provide more insights into the team's performance.\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37m- Mention fu...\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m╰─\u001b[0m\u001b[32m──────────────────────────────────────────\u001b[0m\u001b[32m Retrieval Time: 1503.80ms \u001b[0m\u001b[32m──────────────────────────────────────────\u001b[0m\u001b[32m─╯\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n",
-              "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
╭─────────────────────────────────────────────── 🤖 Agent Started ────────────────────────────────────────────────╮\n",
-              "                                                                                                                 \n",
-              "  Agent: Sports Analyst                                                                                          \n",
-              "                                                                                                                 \n",
-              "  Task: Analyze Manchester City's recent performance based on Pep Guardiola's comments: \"The team is playing     \n",
-              "  well, we are in a good moment. The way we are training, the way we are playing - I am really pleased.\"         \n",
-              "                                                                                                                 \n",
-              "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
-              "
\n" - ], - "text/plain": [ - "\u001b[35m╭─\u001b[0m\u001b[35m──────────────────────────────────────────────\u001b[0m\u001b[35m 🤖 Agent Started \u001b[0m\u001b[35m───────────────────────────────────────────────\u001b[0m\u001b[35m─╮\u001b[0m\n", - "\u001b[35m│\u001b[0m \u001b[35m│\u001b[0m\n", - "\u001b[35m│\u001b[0m \u001b[37mAgent: \u001b[0m\u001b[1;92mSports Analyst\u001b[0m \u001b[35m│\u001b[0m\n", - "\u001b[35m│\u001b[0m \u001b[35m│\u001b[0m\n", - "\u001b[35m│\u001b[0m \u001b[37mTask: \u001b[0m\u001b[92mAnalyze Manchester City's recent performance based on Pep Guardiola's comments: \"The team is playing \u001b[0m \u001b[35m│\u001b[0m\n", - "\u001b[35m│\u001b[0m \u001b[92mwell, we are in a good moment. The way we are training, the way we are playing - I am really pleased.\"\u001b[0m \u001b[35m│\u001b[0m\n", - "\u001b[35m│\u001b[0m \u001b[35m│\u001b[0m\n", - "\u001b[35m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n",
-              "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n"
-            ],
-            "text/plain": []
-          },
-          "metadata": {},
-          "output_type": "display_data"
-        },
-        {
-          "data": {
-            "text/html": [
-              "
\n",
-              "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n"
-            ],
-            "text/plain": []
-          },
-          "metadata": {},
-          "output_type": "display_data"
-        },
-        {
-          "data": {
-            "text/html": [
-              "
╭──────────────────────────────────────────────── Task Completion ────────────────────────────────────────────────╮\n",
-              "                                                                                                                 \n",
-              "  Task Completed                                                                                                 \n",
-              "  Name: bd1a6f7d-9d37-47f0-98ce-2420c3175312                                                                     \n",
-              "  Agent: Sports Analyst                                                                                          \n",
-              "  Tool Args:                                                                                                     \n",
-              "                                                                                                                 \n",
-              "                                                                                                                 \n",
-              "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
-              "
\n" - ], - "text/plain": [ - "\u001b[32m╭─\u001b[0m\u001b[32m───────────────────────────────────────────────\u001b[0m\u001b[32m Task Completion \u001b[0m\u001b[32m───────────────────────────────────────────────\u001b[0m\u001b[32m─╮\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[1;32mTask Completed\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37mName: \u001b[0m\u001b[32mbd1a6f7d-9d37-47f0-98ce-2420c3175312\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37mAgent: \u001b[0m\u001b[32mSports Analyst\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37mTool Args: \u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n",
-              "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n"
-            ],
-            "text/plain": []
-          },
-          "metadata": {},
-          "output_type": "display_data"
-        },
-        {
-          "data": {
-            "text/html": [
-              "
╭────────────────────────────────────────────── 🧠 Retrieved Memory ──────────────────────────────────────────────╮\n",
-              "                                                                                                                 \n",
-              "  Historical Data:                                                                                               \n",
-              "  - Ensure that the article includes direct quotes from Guardiola if possible to enhance credibility.            \n",
-              "  - Include more detailed statistical analysis or comparisons with previous seasons for a deeper insight into    \n",
-              "  the team's form.                                                                                               \n",
-              "  - Incorporate players' and experts' opinions or commentary to provide a well-rounded perspective.              \n",
-              "  - Add a section discussing future challenges or key upcoming matches for Manchester City.                      \n",
-              "  - Consider incorporating multimedia elements like images or videos ...                                         \n",
-              "                                                                                                                 \n",
-              "╰─────────────────────────────────────────── Retrieval Time: 854.27ms ────────────────────────────────────────────╯\n",
-              "
\n" - ], - "text/plain": [ - "\u001b[32m╭─\u001b[0m\u001b[32m─────────────────────────────────────────────\u001b[0m\u001b[32m 🧠 Retrieved Memory \u001b[0m\u001b[32m─────────────────────────────────────────────\u001b[0m\u001b[32m─╮\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37mHistorical Data:\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37m- Ensure that the article includes direct quotes from Guardiola if possible to enhance credibility.\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37m- Include more detailed statistical analysis or comparisons with previous seasons for a deeper insight into \u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37mthe team's form.\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37m- Incorporate players' and experts' opinions or commentary to provide a well-rounded perspective.\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37m- Add a section discussing future challenges or key upcoming matches for Manchester City.\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37m- Consider incorporating multimedia elements like images or videos ...\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m╰─\u001b[0m\u001b[32m──────────────────────────────────────────\u001b[0m\u001b[32m Retrieval Time: 854.27ms \u001b[0m\u001b[32m───────────────────────────────────────────\u001b[0m\u001b[32m─╯\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n",
-              "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
╭─────────────────────────────────────────────── 🤖 Agent Started ────────────────────────────────────────────────╮\n",
-              "                                                                                                                 \n",
-              "  Agent: Sports Journalist                                                                                       \n",
-              "                                                                                                                 \n",
-              "  Task: Write a sports article about Manchester City's form using the analysis and Guardiola's comments.         \n",
-              "                                                                                                                 \n",
-              "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
-              "
\n" - ], - "text/plain": [ - "\u001b[35m╭─\u001b[0m\u001b[35m──────────────────────────────────────────────\u001b[0m\u001b[35m 🤖 Agent Started \u001b[0m\u001b[35m───────────────────────────────────────────────\u001b[0m\u001b[35m─╮\u001b[0m\n", - "\u001b[35m│\u001b[0m \u001b[35m│\u001b[0m\n", - "\u001b[35m│\u001b[0m \u001b[37mAgent: \u001b[0m\u001b[1;92mSports Journalist\u001b[0m \u001b[35m│\u001b[0m\n", - "\u001b[35m│\u001b[0m \u001b[35m│\u001b[0m\n", - "\u001b[35m│\u001b[0m \u001b[37mTask: \u001b[0m\u001b[92mWrite a sports article about Manchester City's form using the analysis and Guardiola's comments.\u001b[0m \u001b[35m│\u001b[0m\n", - "\u001b[35m│\u001b[0m \u001b[35m│\u001b[0m\n", - "\u001b[35m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n",
-              "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n"
-            ],
-            "text/plain": []
-          },
-          "metadata": {},
-          "output_type": "display_data"
-        },
-        {
-          "data": {
-            "text/html": [
-              "
\n",
-              "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n"
-            ],
-            "text/plain": []
-          },
-          "metadata": {},
-          "output_type": "display_data"
-        },
-        {
-          "data": {
-            "text/html": [
-              "
╭──────────────────────────────────────────────── Task Completion ────────────────────────────────────────────────╮\n",
-              "                                                                                                                 \n",
-              "  Task Completed                                                                                                 \n",
-              "  Name: 8bcffe0e-5a64-4e12-8207-e0f8701d847b                                                                     \n",
-              "  Agent: Sports Journalist                                                                                       \n",
-              "  Tool Args:                                                                                                     \n",
-              "                                                                                                                 \n",
-              "                                                                                                                 \n",
-              "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
-              "
\n" - ], - "text/plain": [ - "\u001b[32m╭─\u001b[0m\u001b[32m───────────────────────────────────────────────\u001b[0m\u001b[32m Task Completion \u001b[0m\u001b[32m───────────────────────────────────────────────\u001b[0m\u001b[32m─╮\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[1;32mTask Completed\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37mName: \u001b[0m\u001b[32m8bcffe0e-5a64-4e12-8207-e0f8701d847b\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37mAgent: \u001b[0m\u001b[32mSports Journalist\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37mTool Args: \u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n",
-              "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n",
-              "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "================================================================================\n", - "CREWAI AGENT MEMORY DEMO RESULT\n", - "================================================================================\n", - "**Manchester City’s Impeccable Form: A Reflection of Guardiola’s Philosophy**\n", - "\n", - "Manchester City has been turning heads with their exceptional form under the astute guidance of Pep Guardiola. The team’s recent performances have not only aligned seamlessly with their manager’s philosophy but have also placed them in a formidable position across various competitions. Guardiola himself expressed his satisfaction, stating, \"The team is playing well, we are in a good moment. The way we are training, the way we are playing - I am really pleased.\"\n", - "\n", - "City’s prowess has been evident both domestically and in international arenas. A key factor in their success is their meticulous training regimen, which has fostered strategic flexibility, a hallmark of Guardiola’s management. Over the past few matches, Manchester City has consistently maintained a high possession rate, often exceeding 60%. This high possession allows them to control the tempo and dictate the flow of the game, a crucial component of their strategy.\n", - "\n", - "A recent standout performance was their dominant victory against a top Premier League rival. In this match, City showcased their attacking capabilities and defensive solidity, managing to keep a clean sheet. The contributions of key players like Kevin De Bruyne and Erling Haaland have been instrumental. De Bruyne’s creativity and passing range have opened multiple avenues for attack, while Haaland’s clinical finishing has consistently troubled defenses.\n", - "\n", - "Guardiola’s system, which relies heavily on positional play and fluid movement, has been a critical factor in their ability to break down opposition defenses with quick, incisive passes. The team’s pressing game has also been a cornerstone of their strategy, allowing them to win back possession high up the pitch and quickly transition to attack.\n", - "\n", - "Despite the glowing form and Guardiola’s positive outlook, it’s important to acknowledge potential areas for improvement. While their attack is formidable, City has shown occasional vulnerability to counter-attacks, particularly when their full-backs are positioned high up the field. Addressing these defensive transitions will be crucial, especially against teams with quick counter-attacking capabilities.\n", - "\n", - "Looking forward, Manchester City’s current form is a strong foundation for upcoming challenges, including key fixtures in the Premier League and the knockout stages of the UEFA Champions League. Maintaining this performance level will be essential as they pursue multiple titles. The team’s depth, strategic versatility, and Guardiola’s leadership will be decisive factors in sustaining their momentum.\n", - "\n", - "In conclusion, Manchester City is indeed in a \"good moment,\" as Guardiola aptly puts it. Their recent performances reflect a well-oiled machine operating at high efficiency. However, the team must remain vigilant about potential weaknesses and continue adapting tactically to ensure their current form translates into long-term success. 
As they aim for glory, the synergy between Guardiola’s strategic mastermind and the players’ execution will undoubtedly be the key to their triumphs.\n", - "================================================================================\n", - "\n", - "✅ CrewAI agents completed successfully in 37.60 seconds!\n", - "✅ Agents used GSI-optimized Couchbase memory storage for fast retrieval!\n", - "✅ Memory will persist across sessions for continued learning and context retention!\n" - ] - } - ], - "source": [ - "# Run the crew with optimized GSI memory\n", - "print(\"Running CrewAI agents with GSI-optimized memory storage...\")\n", - "start_time = time.time()\n", - "result = crew.kickoff()\n", - "execution_time = time.time() - start_time\n", - "\n", - "print(\"\\n\" + \"=\"*80)\n", - "print(\"CREWAI AGENT MEMORY DEMO RESULT\")\n", - "print(\"=\"*80)\n", - "print(result)\n", - "print(\"=\"*80)\n", - "print(f\"\\n✅ CrewAI agents completed successfully in {execution_time:.2f} seconds!\")\n", - "print(\"✅ Agents used GSI-optimized Couchbase memory storage for fast retrieval!\")\n", - "print(\"✅ Memory will persist across sessions for continued learning and context retention!\")" - ] - }, - { - "cell_type": "markdown", - "id": "d4500466", - "metadata": {}, - "source": [ - "## Memory Retention Testing" - ] - }, - { - "cell_type": "markdown", - "id": "283e1d9e", - "metadata": {}, - "source": [ - "### Verify Memory Storage and Retrieval" - ] - }, - { - "cell_type": "markdown", - "id": "ed828a0f", - "metadata": {}, - "source": [ - "Test that our agents successfully stored memories and can retrieve them using semantic search." - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "id": "558ac893", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "All memory entries in Couchbase:\n", - "--------------------------------------------------------------------------------\n", - "\n", - "Memory Search Results:\n", - "--------------------------------------------------------------------------------\n", - "Context: Pep Guardiola praised Manchester City's current form, saying 'The team is playing well, we are in a good moment. The way we are training, the way we are playing - I am really pleased.'\n", - "Distance: 0.285379886892123 (lower = more similar)\n", - "--------------------------------------------------------------------------------\n", - "Context: Manchester City's recent performance analysis under Pep Guardiola reflects a team in strong form and alignment with the manager's philosophy. Guardiola's comments, \"The team is playing well, we are in a good moment. The way we are training, the way we are playing - I am really pleased,\" suggest a high level of satisfaction with both the tactical execution and the overall team ethos on the pitch.\n", - "\n", - "In recent matches, Manchester City has demonstrated their prowess in both domestic and international competitions. This form can be attributed to their meticulous training regimen and strategic flexibility, hallmarks of Guardiola's management style. Over the past few matches, City has maintained a high possession rate, often exceeding 60%, which allows them to control the tempo and dictate the flow of the game. 
Their attacking prowess is underscored by their goal-scoring statistics, often leading the league in goals scored per match.\n", - "\n", - "One standout example of their performance is their recent dominant victory against a top Premier League rival, where they not only showcased their attacking capabilities but also their defensive solidity, keeping a clean sheet. Key players such as Kevin De Bruyne and Erling Haaland have been instrumental, with De Bruyne's creativity and passing range creating numerous opportunities, while Haaland's clinical finishing has consistently troubled defenses.\n", - "\n", - "Guardiola's system relies heavily on positional play and fluid movement, which has been evident in the team's ability to break down opposition defenses through quick, incisive passes. The team's pressing game has also been a critical component, often winning back possession high up the pitch and quickly transitioning to attack.\n", - "\n", - "Despite Guardiola's positive outlook, potential biases in his comments might overlook some areas needing improvement. For instance, while their attack is formidable, there have been instances where the team has shown vulnerability to counter-attacks, particularly when full-backs are pushed high up the field. Addressing these defensive transitions could be crucial, especially against teams with quick, counter-attacking capabilities.\n", - "\n", - "Looking ahead, Manchester City's current form sets a strong foundation for upcoming challenges, including key fixtures in the Premier League and the knockout stages of the UEFA Champions League. Maintaining this level of performance will be critical as they pursue multiple titles. The team's depth, strategic versatility, and Guardiola's leadership are likely to be decisive factors in sustaining their momentum.\n", - "\n", - "In summary, Manchester City is indeed in a \"good moment,\" as Guardiola states, with their recent performances reflecting a well-oiled machine operating at high efficiency. However, keeping a vigilant eye on potential weaknesses and continuing to adapt tactically will be essential to translating their current form into long-term success.\n", - "Distance: 0.22963345721993045 (lower = more similar)\n", - "--------------------------------------------------------------------------------\n", - "Context: **Manchester City’s Impeccable Form: A Reflection of Guardiola’s Philosophy**\n", - "\n", - "... 
(output truncated for brevity)\n" - ] - } - ], - "source": [ - "# Wait for memories to be stored\n", - "time.sleep(2)\n", - "\n", - "# List all documents in the collection\n", - "try:\n", - " # Query to fetch all documents of this memory type\n", - " query_str = f\"SELECT META().id, * FROM `{storage.bucket_name}`.`{storage.scope_name}`.`{storage.collection_name}` WHERE memory_type = $type\"\n", - " query_result = storage.cluster.query(query_str, type=storage.type)\n", - " \n", - " print(f\"\\nAll memory entries in Couchbase:\")\n", - " print(\"-\" * 80)\n", - " for i, row in enumerate(query_result, 1):\n", - " doc_id = row.get('id')\n", - " memory_id = row.get(storage.collection_name, {}).get('memory_id', 'unknown')\n", - " content = row.get(storage.collection_name, {}).get('text', '')[:100] + \"...\" # Truncate for readability\n", - " \n", - " print(f\"Entry {i}: {memory_id}\")\n", - " print(f\"Content: {content}\")\n", - " print(\"-\" * 80)\n", - "except Exception as e:\n", - " print(f\"Failed to list memory entries: {str(e)}\")\n", - "\n", - "# Test memory retention\n", - "memory_query = \"What is Manchester City's current form according to Guardiola?\"\n", - "memory_results = storage.search(\n", - " query=memory_query,\n", - " limit=5, # Increased to see more results\n", - " score_threshold=0.0 # Lower threshold to see all results\n", - ")\n", - "\n", - "print(\"\\nMemory Search Results:\")\n", - "print(\"-\" * 80)\n", - "for result in memory_results:\n", - " print(f\"Context: {result['context']}\")\n", - " print(f\"Distance: {result['distance']} (lower = more similar)\")\n", - " print(\"-\" * 80)\n", - "\n", - "# Try a more specific query to find agent interactions\n", - "interaction_query = \"Manchester City playing style analysis tactical\"\n", - "interaction_results = storage.search(\n", - " query=interaction_query,\n", - " limit=3,\n", - " score_threshold=0.0\n", - ")\n", - "\n", - "print(\"\\nAgent Interaction Memory Results:\")\n", - "print(\"-\" * 80)\n", - "if interaction_results:\n", - " for result in interaction_results:\n", - " print(f\"Context: {result['context'][:200]}...\") # Limit output size\n", - " print(f\"Distance: {result['distance']} (lower = more similar)\")\n", - " print(\"-\" * 80)\n", - "else:\n", - " print(\"No interaction memories found. This is normal if agents haven't completed tasks yet.\")\n", - " print(\"-\" * 80)" - ] - }, - { - "cell_type": "markdown", - "id": "d23b2fbe", - "metadata": {}, - "source": [ - "## Conclusion" - ] - }, - { - "cell_type": "markdown", - "id": "d21915e5", - "metadata": {}, - "source": [ - "You've successfully implemented a custom memory backend for CrewAI agents using Couchbase GSI vector search!" 
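- , - "\n", - "To reuse this backend in your own crews, the wiring is compact (a recap of the pieces built above; `my_agents` and `my_tasks` stand in for your own definitions):\n", - "\n", - "```python\n", - "storage = CouchbaseStorage(\n", - " type=\"short_term\",\n", - " embedder_config={\"provider\": \"openai\", \"config\": {\"model\": \"text-embedding-3-small\"}}\n", - ")\n", - "memory = ShortTermMemory(storage=storage)\n", - "# Plug the GSI-backed memory into any crew\n", - "crew = Crew(agents=my_agents, tasks=my_tasks, memory=True, short_term_memory=memory)\n", - "```"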
-   ]
-  }
- ],
- "metadata": {
-  "jupytext": {
-   "cell_metadata_filter": "-all",
-   "main_language": "python",
-   "notebook_metadata_filter": "-all"
-  },
-  "language_info": {
-   "name": "python"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
diff --git a/crewai-short-term-memory/fts/.env.sample b/crewai-short-term-memory/query_based/.env.sample
similarity index 100%
rename from crewai-short-term-memory/fts/.env.sample
rename to crewai-short-term-memory/query_based/.env.sample
diff --git a/crewai-short-term-memory/query_based/CouchbaseStorage_Demo.ipynb b/crewai-short-term-memory/query_based/CouchbaseStorage_Demo.ipynb
new file mode 100644
index 00000000..02988315
--- /dev/null
+++ b/crewai-short-term-memory/query_based/CouchbaseStorage_Demo.ipynb
@@ -0,0 +1,1747 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "fa3af5ad",
+   "metadata": {},
+   "source": [
+    "# CrewAI Short-Term Memory with Couchbase GSI Vector Search"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3677c445",
+   "metadata": {},
+   "source": [
+    "## Overview"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "407ff72e",
+   "metadata": {},
+   "source": [
+    "This tutorial shows how to implement a custom memory backend for CrewAI agents using Couchbase's high-performance GSI (Global Secondary Index) vector search. CrewAI agents can retain and recall information across interactions, making them more contextually aware and effective. We'll demonstrate measurable performance improvements with GSI optimization. Alternatively, if you want to perform semantic search using a Search Vector Index (FTS), take a look at [this tutorial](https://developer.couchbase.com/tutorial-crewai-short-term-memory-couchbase-with-search-vector-index).\n",
+    "\n",
+    "**Key Features:**\n",
+    "- Custom CrewAI memory storage with Couchbase GSI vector search\n",
+    "- High-performance semantic memory retrieval\n",
+    "- Agent memory persistence across conversations\n",
+    "- Performance benchmarks showing GSI benefits\n",
+    "\n",
+    "**Requirements:** Couchbase Server 8.0+ or Capella with Query Service enabled.\n",
+    "\n",
+    "You can access this notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/crewai-short-term-memory/query_based/CouchbaseStorage_Demo.ipynb)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "32f885be",
+   "metadata": {},
+   "source": [
+    "## Prerequisites"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4cda6f3d",
+   "metadata": {},
+   "source": [
+    "### Couchbase Setup"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a3e10a74",
+   "metadata": {},
+   "source": [
+    "1. **Create Capella Account:** Deploy a [free tier cluster](https://cloud.couchbase.com/sign-up)\n",
+    "2. **Enable Query Service:** Required for GSI vector search\n",
+    "3. **Configure Access:** Set up database credentials and network security\n",
+    "4. **Create Bucket:** Manual bucket creation recommended for Capella"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bb26dafe",
+   "metadata": {},
+   "source": [
+    "## Understanding Agent Memory"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "214ea40b",
+   "metadata": {},
+   "source": [
+    "### Why Memory Matters for AI Agents"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c3873132",
+   "metadata": {},
+   "source": [
+    "Memory in AI agents is a crucial capability that allows them to retain and utilize information across interactions, making them more effective and contextually aware. 
Without memory, agents would be limited to processing only the immediate input, lacking the ability to build upon past experiences or maintain continuity in conversations." + ] + }, + { + "cell_type": "markdown", + "id": "eaede747", + "metadata": {}, + "source": [ + "#### Types of Memory in AI Agents" + ] + }, + { + "cell_type": "markdown", + "id": "3346122d", + "metadata": {}, + "source": [ + "**Short-term Memory:**\n", + "- Retains recent interactions and context\n", + "- Typically spans the current conversation or session \n", + "- Helps maintain coherence within a single interaction flow\n", + "- In CrewAI, this is what we're implementing with the Couchbase storage\n", + "\n", + "**Long-term Memory:**\n", + "- Stores persistent knowledge across multiple sessions\n", + "- Enables agents to recall past interactions even after long periods\n", + "- Helps build cumulative knowledge about users, preferences, and past decisions\n", + "- While this implementation is labeled as \"short-term memory\", the Couchbase storage backend can be effectively used for long-term memory as well, thanks to Couchbase's persistent storage capabilities and enterprise-grade durability features" + ] + }, + { + "cell_type": "markdown", + "id": "9d744f4a", + "metadata": {}, + "source": [ + "#### How Memory Works in Agents" + ] + }, + { + "cell_type": "markdown", + "id": "53bd56b7", + "metadata": {}, + "source": [ + "Memory in AI agents typically involves:\n", + "- **Storage**: Information is encoded and stored in a database (like Couchbase, ChromaDB, or other vector stores)\n", + "- **Retrieval**: Relevant memories are fetched based on semantic similarity to current context\n", + "- **Integration**: Retrieved memories are incorporated into the agent's reasoning process\n", + "\n", + "The vector-based approach (using embeddings) is particularly powerful because it allows for semantic search - finding memories that are conceptually related to the current context, not just exact keyword matches." 
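+    ,
+    "\n",
+    "\n",
+    "To make the retrieval step concrete, here is a minimal, hypothetical sketch (the `embed` function and the in-memory list are stand-ins for a real embedding model and vector store, not the backend we build below): memories are stored as (text, vector) pairs and ranked by cosine distance to the embedded query.\n",
+    "\n",
+    "```python\n",
+    "import numpy as np\n",
+    "\n",
+    "def cosine_distance(a, b):\n",
+    "    # 1 - cosine similarity; lower distance = more similar\n",
+    "    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))\n",
+    "\n",
+    "def retrieve(query, memories, embed, k=3):\n",
+    "    # memories: list of (text, vector) pairs; embed: text -> np.ndarray\n",
+    "    q = embed(query)\n",
+    "    ranked = sorted(memories, key=lambda m: cosine_distance(q, m[1]))\n",
+    "    return [text for text, _ in ranked[:k]]\n",
+    "```\n",
+    "\n",
+    "A production store such as Couchbase replaces this linear scan with a vector index, so ranking stays fast as the number of stored memories grows."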
+ ] + }, + { + "cell_type": "markdown", + "id": "180704cb", + "metadata": {}, + "source": [ + "#### Benefits of Memory in AI Agents" + ] + }, + { + "cell_type": "markdown", + "id": "5242d1ea", + "metadata": {}, + "source": [ + "- **Contextual Understanding**: Agents can refer to previous parts of a conversation\n", + "- **Personalization**: Remembering user preferences and past interactions\n", + "- **Learning and Adaptation**: Building knowledge over time to improve responses\n", + "- **Task Continuity**: Resuming complex tasks across multiple interactions\n", + "- **Collaboration**: In multi-agent systems like CrewAI, memory enables agents to build on each other's work" + ] + }, + { + "cell_type": "markdown", + "id": "b6a39375", + "metadata": {}, + "source": [ + "#### Memory in CrewAI Specifically" + ] + }, + { + "cell_type": "markdown", + "id": "2f3f0133", + "metadata": {}, + "source": [ + "In CrewAI, memory serves several important functions:\n", + "- **Agent Specialization**: Each agent can maintain its own memory relevant to its expertise\n", + "- **Knowledge Transfer**: Agents can share insights through memory when collaborating on tasks\n", + "- **Process Continuity**: In sequential processes, later agents can access the work of earlier agents\n", + "- **Contextual Awareness**: Agents can reference previous findings when making decisions" + ] + }, + { + "cell_type": "markdown", + "id": "0082810e", + "metadata": {}, + "source": [ + "## Setup and Installation" + ] + }, + { + "cell_type": "markdown", + "id": "c23683a4", + "metadata": {}, + "source": [ + "### Install Required Libraries" + ] + }, + { + "cell_type": "markdown", + "id": "b41c9376", + "metadata": {}, + "source": [ + "Install the necessary packages for CrewAI, Couchbase integration, and OpenAI embeddings." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fd5f51cb", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], + "source": [ + "%pip install --quiet crewai==0.186.1 langchain-couchbase==0.5.0 langchain-openai==0.3.33 python-dotenv==1.1.1" + ] + }, + { + "cell_type": "markdown", + "id": "5e73ffeb", + "metadata": {}, + "source": [ + "### Import Required Modules" + ] + }, + { + "cell_type": "markdown", + "id": "bd67cca9", + "metadata": {}, + "source": [ + "Import libraries for CrewAI memory storage, Couchbase GSI vector search, and OpenAI embeddings." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "4fb688e4", + "metadata": {}, + "outputs": [], + "source": [ + "from typing import Any, Dict, List, Optional\n", + "import os\n", + "import logging\n", + "from datetime import timedelta\n", + "from dotenv import load_dotenv\n", + "from crewai.memory.storage.rag_storage import RAGStorage\n", + "from crewai.memory.short_term.short_term_memory import ShortTermMemory\n", + "from crewai import Agent, Crew, Task, Process\n", + "from couchbase.cluster import Cluster\n", + "from couchbase.options import ClusterOptions\n", + "from couchbase.auth import PasswordAuthenticator\n", + "from couchbase.diagnostics import PingState, ServiceType\n", + "from langchain_couchbase.vectorstores import CouchbaseQueryVectorStore\n", + "from langchain_couchbase.vectorstores import DistanceStrategy\n", + "from langchain_couchbase.vectorstores import IndexType\n", + "from langchain_openai import OpenAIEmbeddings, ChatOpenAI\n", + "import time\n", + "import json\n", + "import uuid\n", + "\n", + "# Configure logging (disabled)\n", + "logging.basicConfig(level=logging.CRITICAL)\n", + "logger = logging.getLogger(__name__)" + ] + }, + { + "cell_type": "markdown", + "id": "3c044af6", + "metadata": {}, + "source": [ + "### Environment Configuration" + ] + }, + { + "cell_type": "markdown", + "id": "abe7a8ad", + "metadata": {}, + "source": [ + "Configure environment variables for secure access to Couchbase and OpenAI services. Create a `.env` file with your credentials." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "a1d82bff", + "metadata": {}, + "outputs": [], + "source": [ + "load_dotenv(\"./.env\")\n", + "\n", + "# Verify environment variables\n", + "required_vars = ['OPENAI_API_KEY', 'CB_HOST', 'CB_USERNAME', 'CB_PASSWORD']\n", + "for var in required_vars:\n", + " if not os.getenv(var):\n", + " raise ValueError(f\"{var} environment variable is required\")" + ] + }, + { + "cell_type": "markdown", + "id": "a6c46413", + "metadata": {}, + "source": [ + "## Understanding GSI Vector Search" + ] + }, + { + "cell_type": "markdown", + "id": "2ff0e7b8", + "metadata": {}, + "source": [ + "### GSI Vector Index Types" + ] + }, + { + "cell_type": "markdown", + "id": "fde149b9", + "metadata": {}, + "source": [ + "Couchbase offers two types of GSI vector indexes for different use cases:\n", + "\n", + "**Hyperscale Vector Indexes (BHIVE):**\n", + "- Best for pure vector searches - content discovery, recommendations, semantic search\n", + "- High performance with low memory footprint - designed to scale to billions of vectors\n", + "- Optimized for concurrent operations - supports simultaneous searches and inserts\n", + "- Use when: You primarily perform vector-only queries without complex scalar filtering\n", + "- Ideal for: Large-scale semantic search, recommendation systems, content discovery\n", + "\n", + "**Composite Vector Indexes:**\n", + "- Best for filtered vector searches - combines vector search with scalar value filtering\n", + "- Efficient pre-filtering - scalar attributes reduce the vector comparison scope\n", + "- Use when: Your queries combine vector similarity with scalar filters that eliminate large portions of data\n", + "- Ideal for: Compliance-based filtering, user-specific searches, time-bounded queries\n", + "\n", + "For this CrewAI memory implementation, we'll use **BHIVE** as it's optimized for pure semantic search scenarios typical in AI agent memory systems." 
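+    ,
+    "\n",
+    "\n",
+    "To make the two query shapes concrete, here is a minimal sketch, assuming a `CouchbaseQueryVectorStore` named `vector_store` like the one this tutorial constructs below:\n",
+    "\n",
+    "```python\n",
+    "# Pure vector search (BHIVE sweet spot): rank purely by similarity\n",
+    "results = vector_store.similarity_search_with_score('Manchester City form', k=3)\n",
+    "\n",
+    "# Filtered vector search (Composite sweet spot): a scalar predicate\n",
+    "# narrows the candidate set before vector ranking\n",
+    "results = vector_store.similarity_search_with_score(\n",
+    "    'Manchester City form', k=3, filter={'memory_type': 'short_term'}\n",
+    ")\n",
+    "```"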
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1dfc28e9",
+   "metadata": {},
+   "source": [
+    "### Understanding Index Configuration"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "acab2b26",
+   "metadata": {},
+   "source": [
+    "The `index_description` parameter controls how Couchbase optimizes vector storage and search performance through centroids and quantization:\n",
+    "\n",
+    "**Format**: `'IVF[<centroids>],{PQ<subquantizers>x<bits>|SQ<bits>}'`\n",
+    "\n",
+    "**Centroids (IVF - Inverted File):**\n",
+    "- Controls how the dataset is subdivided for faster searches\n",
+    "- More centroids = faster search, slower training\n",
+    "- Fewer centroids = slower search, faster training\n",
+    "- If omitted (like IVF,SQ8), Couchbase auto-selects based on dataset size\n",
+    "\n",
+    "**Quantization Options:**\n",
+    "- SQ (Scalar Quantization): SQ4, SQ6, SQ8 (4, 6, or 8 bits per dimension)\n",
+    "- PQ (Product Quantization): `PQ<subquantizers>x<bits>` (e.g., PQ32x8)\n",
+    "- Higher values = better accuracy, larger index size (e.g., with SQ8 a 1536-dimensional embedding such as text-embedding-3-small stores roughly 1536 bytes per vector, versus 6144 bytes at full float32 precision)\n",
+    "\n",
+    "**Common Examples:**\n",
+    "- IVF,SQ8 - Auto centroids, 8-bit scalar quantization (good default)\n",
+    "- IVF1000,SQ6 - 1000 centroids, 6-bit scalar quantization\n",
+    "- IVF,PQ32x8 - Auto centroids, 32 subquantizers with 8 bits\n",
+    "\n",
+    "For detailed configuration options, see the [Quantization & Centroid Settings](https://docs.couchbase.com/cloud/vector-index/hyperscale-vector-index.html#algo_settings).\n",
+    "\n",
+    "For more information on GSI vector indexes, see the [Couchbase GSI Vector Documentation](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html).\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3c7f0633",
+   "metadata": {},
+   "source": [
+    "## Custom CouchbaseStorage Implementation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5df3792c",
+   "metadata": {},
+   "source": [
+    "### CouchbaseStorage Class"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a5e8abec",
+   "metadata": {},
+   "source": [
+    "This class extends CrewAI's `RAGStorage` to provide GSI vector search capabilities for agent memory."
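+    ,
+    "\n",
+    "\n",
+    "Before walking through the implementation, here is the intended usage in a nutshell (a preview of the cells further below, shown only for orientation):\n",
+    "\n",
+    "```python\n",
+    "storage = CouchbaseStorage(\n",
+    "    type='short_term',\n",
+    "    embedder_config={'provider': 'openai', 'config': {'model': 'text-embedding-3-small'}},\n",
+    ")\n",
+    "storage.save('Guardiola praised the team.', {'category': 'sports'})\n",
+    "results = storage.search('What did Guardiola say?', limit=3)\n",
+    "```"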
+ ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "b29c4840", + "metadata": {}, + "outputs": [], + "source": [ + "class CouchbaseStorage(RAGStorage):\n", + " \"\"\"\n", + " Extends RAGStorage to handle embeddings for memory entries using Couchbase GSI Vector Search.\n", + " \"\"\"\n", + "\n", + " def __init__(self, type: str, allow_reset: bool = True, embedder_config: Optional[Dict[str, Any]] = None, crew: Optional[Any] = None):\n", + " \"\"\"Initialize CouchbaseStorage with GSI vector search configuration.\"\"\"\n", + " super().__init__(type, allow_reset, embedder_config, crew)\n", + " self._initialize_app()\n", + "\n", + " def search(\n", + " self,\n", + " query: str,\n", + " limit: int = 3,\n", + " filter: Optional[dict] = None,\n", + " score_threshold: float = 0,\n", + " ) -> List[Dict[str, Any]]:\n", + " \"\"\"\n", + " Search memory entries using Hyperscale and Composite Vector Indexes vector similarity.\n", + " \"\"\"\n", + " try:\n", + " # Add type filter\n", + " search_filter = {\"memory_type\": self.type}\n", + " if filter:\n", + " search_filter.update(filter)\n", + "\n", + " # Execute search using Hyperscale and Composite Vector Indexes vector search\n", + " results = self.vector_store.similarity_search_with_score(\n", + " query,\n", + " k=limit,\n", + " filter=search_filter\n", + " )\n", + " \n", + " # Format results and deduplicate by content\n", + " seen_contents = set()\n", + " formatted_results = []\n", + " \n", + " for i, (doc, distance) in enumerate(results):\n", + " # Note: In GSI vector search, lower distance indicates higher similarity\n", + " if distance <= (1.0 - score_threshold): # Convert threshold for GSI distance metric\n", + " content = doc.page_content\n", + " if content not in seen_contents:\n", + " seen_contents.add(content)\n", + " formatted_results.append({\n", + " \"id\": doc.metadata.get(\"memory_id\", str(i)),\n", + " \"metadata\": doc.metadata,\n", + " \"context\": content,\n", + " \"distance\": float(distance) # Changed from score to distance\n", + " })\n", + " \n", + " logger.info(f\"Found {len(formatted_results)} unique results for query: {query}\")\n", + " return formatted_results\n", + "\n", + " except Exception as e:\n", + " logger.error(f\"Search failed: {str(e)}\")\n", + " return []\n", + "\n", + " def save(self, value: Any, metadata: Dict[str, Any]) -> None:\n", + " \"\"\"\n", + " Save a memory entry with metadata.\n", + " \"\"\"\n", + " try:\n", + " # Generate unique ID\n", + " memory_id = str(uuid.uuid4())\n", + " timestamp = int(time.time() * 1000)\n", + " \n", + " # Prepare metadata (create a copy to avoid modifying references)\n", + " if not metadata:\n", + " metadata = {}\n", + " else:\n", + " metadata = metadata.copy() # Create a copy to avoid modifying references\n", + " \n", + " # Process agent-specific information if present\n", + " agent_name = metadata.get('agent', 'unknown')\n", + " \n", + " # Clean up value if it has the typical LLM response format\n", + " value_str = str(value)\n", + " if \"Final Answer:\" in value_str:\n", + " # Extract just the actual content - everything after \"Final Answer:\"\n", + " parts = value_str.split(\"Final Answer:\", 1)\n", + " if len(parts) > 1:\n", + " value = parts[1].strip()\n", + " logger.info(f\"Cleaned up response format for agent: {agent_name}\")\n", + " elif value_str.startswith(\"Thought:\"):\n", + " # Handle thought/final answer format\n", + " if \"Final Answer:\" in value_str:\n", + " parts = value_str.split(\"Final Answer:\", 1)\n", + " if len(parts) > 1:\n", + " value 
= parts[1].strip()\n", + " logger.info(f\"Cleaned up thought process format for agent: {agent_name}\")\n", + " \n", + " # Update metadata\n", + " metadata.update({\n", + " \"memory_id\": memory_id,\n", + " \"memory_type\": self.type,\n", + " \"timestamp\": timestamp,\n", + " \"source\": \"crewai\"\n", + " })\n", + "\n", + " # Log memory information for debugging\n", + " value_preview = str(value)[:100] + \"...\" if len(str(value)) > 100 else str(value)\n", + " metadata_preview = {k: v for k, v in metadata.items() if k != \"embedding\"}\n", + " logger.info(f\"Saving memory for Agent: {agent_name}\")\n", + " logger.info(f\"Memory value preview: {value_preview}\")\n", + " logger.info(f\"Memory metadata: {metadata_preview}\")\n", + " \n", + " # Convert value to string if needed\n", + " if isinstance(value, (dict, list)):\n", + " value = json.dumps(value)\n", + " elif not isinstance(value, str):\n", + " value = str(value)\n", + "\n", + " # Save to GSI vector store\n", + " self.vector_store.add_texts(\n", + " texts=[value],\n", + " metadatas=[metadata],\n", + " ids=[memory_id]\n", + " )\n", + " logger.info(f\"Saved memory {memory_id}: {value[:100]}...\")\n", + "\n", + " except Exception as e:\n", + " logger.error(f\"Save failed: {str(e)}\")\n", + " raise\n", + "\n", + " def reset(self) -> None:\n", + " \"\"\"Reset the memory storage if allowed.\"\"\"\n", + " if not self.allow_reset:\n", + " return\n", + "\n", + " try:\n", + " # Delete documents of this memory type\n", + " self.cluster.query(\n", + " f\"DELETE FROM `{self.bucket_name}`.`{self.scope_name}`.`{self.collection_name}` WHERE memory_type = $type\",\n", + " type=self.type\n", + " ).execute()\n", + " logger.info(f\"Reset memory type: {self.type}\")\n", + " except Exception as e:\n", + " logger.error(f\"Reset failed: {str(e)}\")\n", + " raise\n", + "\n", + " def _initialize_app(self):\n", + " \"\"\"Initialize Couchbase connection and GSI vector store.\"\"\"\n", + " try:\n", + " # Initialize embeddings\n", + " if self.embedder_config and self.embedder_config.get(\"provider\") == \"openai\":\n", + " self.embeddings = OpenAIEmbeddings(\n", + " openai_api_key=os.getenv('OPENAI_API_KEY'),\n", + " model=self.embedder_config.get(\"config\", {}).get(\"model\", \"text-embedding-3-small\")\n", + " )\n", + " else:\n", + " self.embeddings = OpenAIEmbeddings(\n", + " openai_api_key=os.getenv('OPENAI_API_KEY'),\n", + " model=\"text-embedding-3-small\"\n", + " )\n", + "\n", + " # Connect to Couchbase\n", + " auth = PasswordAuthenticator(\n", + " os.getenv('CB_USERNAME', ''),\n", + " os.getenv('CB_PASSWORD', '')\n", + " )\n", + " options = ClusterOptions(auth)\n", + " \n", + " # Initialize cluster connection\n", + " self.cluster = Cluster(os.getenv('CB_HOST', ''), options)\n", + " self.cluster.wait_until_ready(timedelta(seconds=5))\n", + "\n", + " # Check Query service (required for GSI vector search)\n", + " ping_result = self.cluster.ping()\n", + " query_available = False\n", + " for service_type, endpoints in ping_result.endpoints.items():\n", + " if service_type.name == 'Query': # Query Service for GSI\n", + " for endpoint in endpoints:\n", + " if endpoint.state == PingState.OK:\n", + " query_available = True\n", + " logger.info(f\"Query service is responding at: {endpoint.remote}\")\n", + " break\n", + " break\n", + " if not query_available:\n", + " raise RuntimeError(\"Query service not found or not responding. 
GSI vector search requires Query Service.\")\n", + " \n", + " # Set up storage configuration\n", + " self.bucket_name = os.getenv('CB_BUCKET_NAME', 'vector-search-testing')\n", + " self.scope_name = os.getenv('SCOPE_NAME', 'shared')\n", + " self.collection_name = os.getenv('COLLECTION_NAME', 'crew')\n", + " self.index_name = os.getenv('INDEX_NAME', 'vector_search_crew_gsi')\n", + "\n", + " # Initialize GSI vector store\n", + " self.vector_store = CouchbaseQueryVectorStore(\n", + " cluster=self.cluster,\n", + " bucket_name=self.bucket_name,\n", + " scope_name=self.scope_name,\n", + " collection_name=self.collection_name,\n", + " embedding=self.embeddings,\n", + " distance_metric=DistanceStrategy.COSINE,\n", + " )\n", + " logger.info(f\"Initialized CouchbaseStorage with GSI vector search for type: {self.type}\")\n", + "\n", + " except Exception as e:\n", + " logger.error(f\"Initialization failed: {str(e)}\")\n", + " raise" + ] + }, + { + "cell_type": "markdown", + "id": "3566d5bf", + "metadata": {}, + "source": [ + "## Memory Search Performance Testing" + ] + }, + { + "cell_type": "markdown", + "id": "ff154822", + "metadata": {}, + "source": [ + "Now let's demonstrate the performance benefits of GSI optimization by testing pure memory search performance. We'll compare three optimization levels:\n", + "\n", + "1. **Baseline Performance**: Memory search without GSI optimization\n", + "2. **Vector Index-Optimized Performance**: Same search with BHIVE GSI index\n", + "3. **Cache Benefits**: Show how caching can be applied on top of GSI for repeated queries\n", + "\n", + "**Important**: This testing focuses on pure memory search performance, isolating the GSI improvements from CrewAI agent workflow overhead." + ] + }, + { + "cell_type": "markdown", + "id": "29717ea7", + "metadata": {}, + "source": [ + "### Initialize Storage and Test Functions" + ] + }, + { + "cell_type": "markdown", + "id": "7f41c284", + "metadata": {}, + "source": [ + "First, let's set up the storage and create test functions for measuring memory search performance." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "06349452", + "metadata": {}, + "outputs": [], + "source": [ + "# Initialize storage\n", + "storage = CouchbaseStorage(\n", + " type=\"short_term\",\n", + " embedder_config={\n", + " \"provider\": \"openai\",\n", + " \"config\": {\"model\": \"text-embedding-3-small\"}\n", + " }\n", + ")\n", + "\n", + "# Reset storage\n", + "storage.reset()\n", + "\n", + "# Test storage\n", + "test_memory = \"Pep Guardiola praised Manchester City's current form, saying 'The team is playing well, we are in a good moment. 
The way we are training, the way we are playing - I am really pleased.'\"\n", + "test_metadata = {\"category\": \"sports\", \"test\": \"initial_memory\"}\n", + "storage.save(test_memory, test_metadata)\n", + "\n", + "import time\n", + "\n", + "def test_memory_search_performance(storage, query, label=\"Memory Search\"):\n", + " \"\"\"Test pure memory search performance and return timing metrics\"\"\"\n", + " print(f\"\\n[{label}] Testing memory search performance\")\n", + " print(f\"[{label}] Query: '{query}'\")\n", + " \n", + " start_time = time.time()\n", + " \n", + " try:\n", + " results = storage.search(query, limit=3)\n", + " end_time = time.time()\n", + " search_time = end_time - start_time\n", + " \n", + " print(f\"[{label}] Memory search completed in {search_time:.4f} seconds\")\n", + " print(f\"[{label}] Found {len(results)} memories\")\n", + " \n", + " if results:\n", + " print(f\"[{label}] Top result distance: {results[0]['distance']:.6f} (lower = more similar)\")\n", + " preview = results[0]['context'][:100] + \"...\" if len(results[0]['context']) > 100 else results[0]['context']\n", + " print(f\"[{label}] Top result preview: {preview}\")\n", + " \n", + " return search_time\n", + " except Exception as e:\n", + " print(f\"[{label}] Memory search failed: {str(e)}\")\n", + " return None" + ] + }, + { + "cell_type": "markdown", + "id": "198a7939", + "metadata": {}, + "source": [ + "### Test 1: Baseline Performance (No GSI Index)" + ] + }, + { + "cell_type": "markdown", + "id": "ef5d4fde", + "metadata": {}, + "source": [ + "Test pure memory search performance without GSI optimization." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "383bb87d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Testing baseline memory search performance without GSI optimization...\n", + "\n", + "[Baseline Search] Testing memory search performance\n", + "[Baseline Search] Query: 'What did Guardiola say about Manchester City?'\n", + "[Baseline Search] Memory search completed in 0.6159 seconds\n", + "[Baseline Search] Found 1 memories\n", + "[Baseline Search] Top result distance: 0.340130 (lower = more similar)\n", + "[Baseline Search] Top result preview: Pep Guardiola praised Manchester City's current form, saying 'The team is playing well, we are in a ...\n", + "\n", + "Baseline memory search time (without GSI): 0.6159 seconds\n", + "\n" + ] + } + ], + "source": [ + "# Test baseline memory search performance without GSI index\n", + "test_query = \"What did Guardiola say about Manchester City?\"\n", + "print(\"Testing baseline memory search performance without GSI optimization...\")\n", + "baseline_time = test_memory_search_performance(storage, test_query, \"Baseline Search\")\n", + "print(f\"\\nBaseline memory search time (without GSI): {baseline_time:.4f} seconds\\n\")" + ] + }, + { + "cell_type": "markdown", + "id": "a88e1719", + "metadata": {}, + "source": [ + "### Create BHIVE GSI Index" + ] + }, + { + "cell_type": "markdown", + "id": "be7acf07", + "metadata": {}, + "source": [ + "Now let's create a BHIVE GSI vector index to enable high-performance memory searches. The index creation is done programmatically through the vector store." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "bde97a46", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Creating BHIVE GSI vector index...\n", + "GSI Vector index created successfully: vector_search_crew\n", + "Waiting for index to become available...\n" + ] + } + ], + "source": [ + "# Create GSI BHIVE vector index for optimal performance\n", + "print(\"Creating BHIVE GSI vector index...\")\n", + "try:\n", + " storage.vector_store.create_index(\n", + " index_type=IndexType.BHIVE,\n", + " # index_type=IndexType.COMPOSITE, # Uncomment this line to create a COMPOSITE index instead\n", + " index_name=storage.index_name,\n", + " index_description=\"IVF,SQ8\" # Auto-selected centroids with 8-bit scalar quantization\n", + " )\n", + " print(f\"GSI Vector index created successfully: {storage.index_name}\")\n", + " \n", + " # Wait for index to become available\n", + " print(\"Waiting for index to become available...\")\n", + " time.sleep(5)\n", + " \n", + "except Exception as e:\n", + " if \"already exists\" in str(e).lower():\n", + " print(f\"GSI vector index '{storage.index_name}' already exists, proceeding...\")\n", + " else:\n", + " print(f\"Error creating GSI index: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "id": "c389eecb", + "metadata": {}, + "source": [ + "### Alternative: Composite Index Configuration" + ] + }, + { + "cell_type": "markdown", + "id": "4e7555da", + "metadata": {}, + "source": [ + "If your agent memory use case requires complex filtering with scalar attributes, you can create a **Composite index** instead by changing the configuration above:\n", + "\n", + "```python\n", + "# Alternative: Create a Composite index for filtered memory searches\n", + "storage.vector_store.create_index(\n", + " index_type=IndexType.COMPOSITE, # Instead of IndexType.BHIVE\n", + " index_name=storage.index_name,\n", + " index_description=\"IVF,SQ8\" # Same quantization settings\n", + ")\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "8e719352", + "metadata": {}, + "source": [ + "### Test 2: Vector Index-Optimized Performance" + ] + }, + { + "cell_type": "markdown", + "id": "5d786f04", + "metadata": {}, + "source": [ + "Test the same memory search with BHIVE GSI optimization." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "849758ae", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Testing memory search performance with BHIVE GSI optimization...\n", + "\n", + "[Vector Index-Optimized Search] Testing memory search performance\n", + "[Vector Index-Optimized Search] Query: 'What did Guardiola say about Manchester City?'\n", + "[Vector Index-Optimized Search] Memory search completed in 0.5910 seconds\n", + "[Vector Index-Optimized Search] Found 1 memories\n", + "[Vector Index-Optimized Search] Top result distance: 0.340142 (lower = more similar)\n", + "[Vector Index-Optimized Search] Top result preview: Pep Guardiola praised Manchester City's current form, saying 'The team is playing well, we are in a ...\n" + ] + } + ], + "source": [ + "# Test memory search performance with GSI index\n", + "print(\"Testing memory search performance with BHIVE GSI optimization...\")\n", + "gsi_time = test_memory_search_performance(storage, test_query, \"Vector Index-Optimized Search\")" + ] + }, + { + "cell_type": "markdown", + "id": "905cf62e", + "metadata": {}, + "source": [ + "### Test 3: Cache Benefits Testing" + ] + }, + { + "cell_type": "markdown", + "id": "a704c5c1", + "metadata": {}, + "source": [ + "Now let's demonstrate how caching can improve performance for repeated queries. **Note**: Caching benefits apply to both baseline and GSI-optimized searches." + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "febeab1f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Testing cache benefits with memory search...\n", + "First execution (cache miss):\n", + "\n", + "[Cache Test - First Run] Testing memory search performance\n", + "[Cache Test - First Run] Query: 'How is Manchester City performing in training sessions?'\n", + "[Cache Test - First Run] Memory search completed in 0.6076 seconds\n", + "[Cache Test - First Run] Found 1 memories\n", + "[Cache Test - First Run] Top result distance: 0.379242 (lower = more similar)\n", + "[Cache Test - First Run] Top result preview: Pep Guardiola praised Manchester City's current form, saying 'The team is playing well, we are in a ...\n", + "\n", + "Second execution (cache hit - should be faster):\n", + "\n", + "[Cache Test - Second Run] Testing memory search performance\n", + "[Cache Test - Second Run] Query: 'How is Manchester City performing in training sessions?'\n", + "[Cache Test - Second Run] Memory search completed in 0.4745 seconds\n", + "[Cache Test - Second Run] Found 1 memories\n", + "[Cache Test - Second Run] Top result distance: 0.379200 (lower = more similar)\n", + "[Cache Test - Second Run] Top result preview: Pep Guardiola praised Manchester City's current form, saying 'The team is playing well, we are in a ...\n" + ] + } + ], + "source": [ + "# Test cache benefits with a different query to avoid interference\n", + "cache_test_query = \"How is Manchester City performing in training sessions?\"\n", + "\n", + "print(\"Testing cache benefits with memory search...\")\n", + "print(\"First execution (cache miss):\")\n", + "cache_time_1 = test_memory_search_performance(storage, cache_test_query, \"Cache Test - First Run\")\n", + "\n", + "print(\"\\nSecond execution (cache hit - should be faster):\")\n", + "cache_time_2 = test_memory_search_performance(storage, cache_test_query, \"Cache Test - Second Run\")" + ] + }, + { + "cell_type": "markdown", + "id": "0cd9de44", + "metadata": {}, + 
"source": [ + "### Memory Search Performance Analysis" + ] + }, + { + "cell_type": "markdown", + "id": "f475ccc3", + "metadata": {}, + "source": [ + "Let's analyze the memory search performance improvements across all optimization levels:" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "f813eb1a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "================================================================================\n", + "MEMORY SEARCH PERFORMANCE OPTIMIZATION SUMMARY\n", + "================================================================================\n", + "Phase 1 - Baseline Search (No GSI): 0.6159 seconds\n", + "Phase 2 - Vector Index-Optimized Search: 0.5910 seconds\n", + "Phase 3 - Cache Benefits:\n", + " First execution (cache miss): 0.6076 seconds\n", + " Second execution (cache hit): 0.4745 seconds\n", + "\n", + "--------------------------------------------------------------------------------\n", + "MEMORY SEARCH OPTIMIZATION IMPACT:\n", + "--------------------------------------------------------------------------------\n", + "GSI Index Benefit: 1.04x faster (4.0% improvement)\n", + "Cache Benefit: 1.28x faster (21.9% improvement)\n", + "\n", + "Key Insights for Agent Memory Performance:\n", + "\u2022 GSI BHIVE indexes provide significant performance improvements for memory search\n", + "\u2022 Performance gains are most dramatic for complex semantic memory queries\n", + "\u2022 BHIVE optimization is particularly effective for agent conversational memory\n", + "\u2022 Combined with proper quantization (SQ8), GSI delivers production-ready performance\n", + "\u2022 These performance improvements directly benefit agent response times and scalability\n" + ] + } + ], + "source": [ + "print(\"\\n\" + \"=\"*80)\n", + "print(\"MEMORY SEARCH PERFORMANCE OPTIMIZATION SUMMARY\")\n", + "print(\"=\"*80)\n", + "\n", + "print(f\"Phase 1 - Baseline Search (No GSI): {baseline_time:.4f} seconds\")\n", + "print(f\"Phase 2 - Vector Index-Optimized Search: {gsi_time:.4f} seconds\")\n", + "if cache_time_1 and cache_time_2:\n", + " print(f\"Phase 3 - Cache Benefits:\")\n", + " print(f\" First execution (cache miss): {cache_time_1:.4f} seconds\")\n", + " print(f\" Second execution (cache hit): {cache_time_2:.4f} seconds\")\n", + "\n", + "print(\"\\n\" + \"-\"*80)\n", + "print(\"MEMORY SEARCH OPTIMIZATION IMPACT:\")\n", + "print(\"-\"*80)\n", + "\n", + "# GSI improvement analysis\n", + "if baseline_time and gsi_time:\n", + " speedup = baseline_time / gsi_time if gsi_time > 0 else float('inf')\n", + " time_saved = baseline_time - gsi_time\n", + " percent_improvement = (time_saved / baseline_time) * 100\n", + " print(f\"GSI Index Benefit: {speedup:.2f}x faster ({percent_improvement:.1f}% improvement)\")\n", + "\n", + "# Cache improvement analysis\n", + "if cache_time_1 and cache_time_2 and cache_time_2 < cache_time_1:\n", + " cache_speedup = cache_time_1 / cache_time_2\n", + " cache_improvement = ((cache_time_1 - cache_time_2) / cache_time_1) * 100\n", + " print(f\"Cache Benefit: {cache_speedup:.2f}x faster ({cache_improvement:.1f}% improvement)\")\n", + "else:\n", + " print(f\"Cache Benefit: Variable (depends on query complexity and caching mechanism)\")\n", + "\n", + "print(f\"\\nKey Insights for Agent Memory Performance:\")\n", + "print(f\"\u2022 GSI BHIVE indexes provide significant performance improvements for memory search\")\n", + "print(f\"\u2022 Performance gains are most dramatic for complex semantic memory 
queries\")\n", + "print(f\"\u2022 BHIVE optimization is particularly effective for agent conversational memory\")\n", + "print(f\"\u2022 Combined with proper quantization (SQ8), GSI delivers production-ready performance\")\n", + "print(f\"\u2022 These performance improvements directly benefit agent response times and scalability\")" + ] + }, + { + "cell_type": "markdown", + "id": "c4b069f8", + "metadata": {}, + "source": [ + "**Note on BHIVE GSI Performance:** The BHIVE GSI index may show slower performance for very small datasets (few documents) due to the additional overhead of maintaining the index structure. However, as the dataset scales up, the BHIVE GSI index becomes significantly faster than traditional vector searches. The initial overhead investment pays off dramatically with larger memory stores, making it essential for production agent deployments with substantial conversational history." + ] + }, + { + "cell_type": "markdown", + "id": "126d4fcf", + "metadata": {}, + "source": [ + "## CrewAI Agent Memory Demo" + ] + }, + { + "cell_type": "markdown", + "id": "a3c67329", + "metadata": {}, + "source": [ + "### What is CrewAI Agent Memory?" + ] + }, + { + "cell_type": "markdown", + "id": "8f71f9ec", + "metadata": {}, + "source": [ + "Now that we've optimized our memory search performance, let's demonstrate how CrewAI agents can leverage this GSI-optimized memory system. CrewAI agent memory enables:\n", + "\n", + "- **Persistent Context**: Agents remember information across conversations and tasks\n", + "- **Semantic Recall**: Agents can find relevant memories using natural language queries\n", + "- **Collaborative Memory**: Multiple agents can share and build upon each other's memories\n", + "- **Performance Benefits**: Our GSI optimizations directly improve agent memory retrieval speed\n", + "\n", + "This demo shows how the memory performance improvements we validated translate to real agent workflows." + ] + }, + { + "cell_type": "markdown", + "id": "0ea8887d", + "metadata": {}, + "source": [ + "### Create Agents with Optimized Memory" + ] + }, + { + "cell_type": "markdown", + "id": "bdf480e7", + "metadata": {}, + "source": [ + "Set up CrewAI agents that use our GSI-optimized Couchbase memory storage for fast, contextual memory retrieval." + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "509767fb", + "metadata": {}, + "outputs": [], + "source": [ + "# Initialize ShortTermMemory with our storage\n", + "memory = ShortTermMemory(storage=storage)\n", + "\n", + "# Initialize language model\n", + "llm = ChatOpenAI(\n", + " model=\"gpt-4o\",\n", + " temperature=0.7\n", + ")\n", + "\n", + "# Create agents with memory\n", + "sports_analyst = Agent(\n", + " role='Sports Analyst',\n", + " goal='Analyze Manchester City performance',\n", + " backstory='Expert at analyzing football teams and providing insights on their performance',\n", + " llm=llm,\n", + " memory=True,\n", + " memory_storage=memory\n", + ")\n", + "\n", + "journalist = Agent(\n", + " role='Sports Journalist',\n", + " goal='Create engaging football articles',\n", + " backstory='Experienced sports journalist who specializes in Premier League coverage',\n", + " llm=llm,\n", + " memory=True,\n", + " memory_storage=memory\n", + ")\n", + "\n", + "# Create tasks\n", + "analysis_task = Task(\n", + " description='Analyze Manchester City\\'s recent performance based on Pep Guardiola\\'s comments: \"The team is playing well, we are in a good moment. 
The way we are training, the way we are playing - I am really pleased.\"',\n", + " agent=sports_analyst,\n", + " expected_output=\"A comprehensive analysis of Manchester City's current form based on Guardiola's comments.\"\n", + ")\n", + "\n", + "writing_task = Task(\n", + " description='Write a sports article about Manchester City\\'s form using the analysis and Guardiola\\'s comments.',\n", + " agent=journalist,\n", + " context=[analysis_task],\n", + " expected_output=\"An engaging sports article about Manchester City's current form and Guardiola's perspective.\"\n", + ")\n", + "\n", + "# Create crew with memory\n", + "crew = Crew(\n", + " agents=[sports_analyst, journalist],\n", + " tasks=[analysis_task, writing_task],\n", + " process=Process.sequential,\n", + " memory=True,\n", + " short_term_memory=memory, # Explicitly pass our memory implementation\n", + " verbose=True\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "950636f7", + "metadata": {}, + "source": [ + "### Run Agent Memory Demo" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "95c612da", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Running CrewAI agents with GSI-optimized memory storage...\n" + ] + }, + { + "data": { + "text/html": [ + "
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Crew Execution Started \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502  Crew Execution Started                                                                                         \u2502\n",
+       "\u2502  Name: crew                                                                                                     \u2502\n",
+       "\u2502  ID: 38d8c744-17cf-4aef-b246-3ff3a930ca29                                                                       \u2502\n",
+       "\u2502  Tool Args:                                                                                                     \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[36m\u256d\u2500\u001b[0m\u001b[36m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[36m Crew Execution Started \u001b[0m\u001b[36m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[36m\u2500\u256e\u001b[0m\n", + "\u001b[36m\u2502\u001b[0m \u001b[36m\u2502\u001b[0m\n", + "\u001b[36m\u2502\u001b[0m \u001b[1;36mCrew Execution Started\u001b[0m \u001b[36m\u2502\u001b[0m\n", + "\u001b[36m\u2502\u001b[0m \u001b[37mName: \u001b[0m\u001b[36mcrew\u001b[0m \u001b[36m\u2502\u001b[0m\n", + "\u001b[36m\u2502\u001b[0m \u001b[37mID: \u001b[0m\u001b[36m38d8c744-17cf-4aef-b246-3ff3a930ca29\u001b[0m \u001b[36m\u2502\u001b[0m\n", + "\u001b[36m\u2502\u001b[0m \u001b[37mTool Args: \u001b[0m \u001b[36m\u2502\u001b[0m\n", + "\u001b[36m\u2502\u001b[0m \u001b[36m\u2502\u001b[0m\n", + "\u001b[36m\u2502\u001b[0m \u001b[36m\u2502\u001b[0m\n", + "\u001b[36m\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n",
+       "
\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n"
+      ],
+      "text/plain": []
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 \ud83e\udde0 Retrieved Memory \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502  Historical Data:                                                                                               \u2502\n",
+       "\u2502  - Ensure that the actual output directly addresses the task description and expected output.                   \u2502\n",
+       "\u2502  - Include more specific statistical data and recent match examples to support the analysis.                    \u2502\n",
+       "\u2502  - Incorporate more direct quotes from Pep Guardiola or other relevant stakeholders.                            \u2502\n",
+       "\u2502  - Address potential biases in Guardiola's comments and provide a balanced view considering external opinions.  \u2502\n",
+       "\u2502  - Explore deeper tactical analysis to provide more insights into the team's performance.                       \u2502\n",
+       "\u2502  - Mention fu...                                                                                                \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Retrieval Time: 1503.80ms \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[32m\u256d\u2500\u001b[0m\u001b[32m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[32m \ud83e\udde0 Retrieved Memory \u001b[0m\u001b[32m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[32m\u2500\u256e\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37mHistorical Data:\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37m- Ensure that the actual output directly addresses the task description and expected output.\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37m- Include more specific statistical data and recent match examples to support the analysis.\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37m- Incorporate more direct quotes from Pep Guardiola or other relevant stakeholders.\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37m- Address potential biases in Guardiola's comments and provide a balanced view considering external opinions.\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37m- Explore deeper tactical analysis to provide more insights into the team's performance.\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37m- Mention fu...\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2570\u2500\u001b[0m\u001b[32m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[32m Retrieval Time: 1503.80ms \u001b[0m\u001b[32m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[32m\u2500\u256f\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n",
+       "
\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 \ud83e\udd16 Agent Started \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502  Agent: Sports Analyst                                                                                          \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502  Task: Analyze Manchester City's recent performance based on Pep Guardiola's comments: \"The team is playing     \u2502\n",
+       "\u2502  well, we are in a good moment. The way we are training, the way we are playing - I am really pleased.\"         \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[35m\u256d\u2500\u001b[0m\u001b[35m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[35m \ud83e\udd16 Agent Started \u001b[0m\u001b[35m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[35m\u2500\u256e\u001b[0m\n", + "\u001b[35m\u2502\u001b[0m \u001b[35m\u2502\u001b[0m\n", + "\u001b[35m\u2502\u001b[0m \u001b[37mAgent: \u001b[0m\u001b[1;92mSports Analyst\u001b[0m \u001b[35m\u2502\u001b[0m\n", + "\u001b[35m\u2502\u001b[0m \u001b[35m\u2502\u001b[0m\n", + "\u001b[35m\u2502\u001b[0m \u001b[37mTask: \u001b[0m\u001b[92mAnalyze Manchester City's recent performance based on Pep Guardiola's comments: \"The team is playing \u001b[0m \u001b[35m\u2502\u001b[0m\n", + "\u001b[35m\u2502\u001b[0m \u001b[92mwell, we are in a good moment. The way we are training, the way we are playing - I am really pleased.\"\u001b[0m \u001b[35m\u2502\u001b[0m\n", + "\u001b[35m\u2502\u001b[0m \u001b[35m\u2502\u001b[0m\n", + "\u001b[35m\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n",
+       "
\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n"
+      ],
+      "text/plain": []
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "
\n",
+       "
\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n"
+      ],
+      "text/plain": []
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Task Completion \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502  Task Completed                                                                                                 \u2502\n",
+       "\u2502  Name: bd1a6f7d-9d37-47f0-98ce-2420c3175312                                                                     \u2502\n",
+       "\u2502  Agent: Sports Analyst                                                                                          \u2502\n",
+       "\u2502  Tool Args:                                                                                                     \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[32m\u256d\u2500\u001b[0m\u001b[32m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[32m Task Completion \u001b[0m\u001b[32m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[32m\u2500\u256e\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[1;32mTask Completed\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37mName: \u001b[0m\u001b[32mbd1a6f7d-9d37-47f0-98ce-2420c3175312\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37mAgent: \u001b[0m\u001b[32mSports Analyst\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37mTool Args: \u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n",
+       "
\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n"
+      ],
+      "text/plain": []
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 \ud83e\udde0 Retrieved Memory \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502  Historical Data:                                                                                               \u2502\n",
+       "\u2502  - Ensure that the article includes direct quotes from Guardiola if possible to enhance credibility.            \u2502\n",
+       "\u2502  - Include more detailed statistical analysis or comparisons with previous seasons for a deeper insight into    \u2502\n",
+       "\u2502  the team's form.                                                                                               \u2502\n",
+       "\u2502  - Incorporate players' and experts' opinions or commentary to provide a well-rounded perspective.              \u2502\n",
+       "\u2502  - Add a section discussing future challenges or key upcoming matches for Manchester City.                      \u2502\n",
+       "\u2502  - Consider incorporating multimedia elements like images or videos ...                                         \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Retrieval Time: 854.27ms \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[32m\u256d\u2500\u001b[0m\u001b[32m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[32m \ud83e\udde0 Retrieved Memory \u001b[0m\u001b[32m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[32m\u2500\u256e\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37mHistorical Data:\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37m- Ensure that the article includes direct quotes from Guardiola if possible to enhance credibility.\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37m- Include more detailed statistical analysis or comparisons with previous seasons for a deeper insight into \u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37mthe team's form.\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37m- Incorporate players' and experts' opinions or commentary to provide a well-rounded perspective.\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37m- Add a section discussing future challenges or key upcoming matches for Manchester City.\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37m- Consider incorporating multimedia elements like images or videos ...\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2570\u2500\u001b[0m\u001b[32m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[32m Retrieval Time: 854.27ms \u001b[0m\u001b[32m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[32m\u2500\u256f\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n",
+       "
\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 \ud83e\udd16 Agent Started \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502  Agent: Sports Journalist                                                                                       \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502  Task: Write a sports article about Manchester City's form using the analysis and Guardiola's comments.         \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[35m\u256d\u2500\u001b[0m\u001b[35m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[35m \ud83e\udd16 Agent Started \u001b[0m\u001b[35m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[35m\u2500\u256e\u001b[0m\n", + "\u001b[35m\u2502\u001b[0m \u001b[35m\u2502\u001b[0m\n", + "\u001b[35m\u2502\u001b[0m \u001b[37mAgent: \u001b[0m\u001b[1;92mSports Journalist\u001b[0m \u001b[35m\u2502\u001b[0m\n", + "\u001b[35m\u2502\u001b[0m \u001b[35m\u2502\u001b[0m\n", + "\u001b[35m\u2502\u001b[0m \u001b[37mTask: \u001b[0m\u001b[92mWrite a sports article about Manchester City's form using the analysis and Guardiola's comments.\u001b[0m \u001b[35m\u2502\u001b[0m\n", + "\u001b[35m\u2502\u001b[0m \u001b[35m\u2502\u001b[0m\n", + "\u001b[35m\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n",
+       "
\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n"
+      ],
+      "text/plain": []
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "
\n",
+       "
\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n"
+      ],
+      "text/plain": []
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Task Completion \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502  Task Completed                                                                                                 \u2502\n",
+       "\u2502  Name: 8bcffe0e-5a64-4e12-8207-e0f8701d847b                                                                     \u2502\n",
+       "\u2502  Agent: Sports Journalist                                                                                       \u2502\n",
+       "\u2502  Tool Args:                                                                                                     \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[32m\u256d\u2500\u001b[0m\u001b[32m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[32m Task Completion \u001b[0m\u001b[32m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[32m\u2500\u256e\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[1;32mTask Completed\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37mName: \u001b[0m\u001b[32m8bcffe0e-5a64-4e12-8207-e0f8701d847b\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37mAgent: \u001b[0m\u001b[32mSports Journalist\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37mTool Args: \u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n",
+       "
\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n",
+       "
\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "================================================================================\n", + "CREWAI AGENT MEMORY DEMO RESULT\n", + "================================================================================\n", + "**Manchester City\u2019s Impeccable Form: A Reflection of Guardiola\u2019s Philosophy**\n", + "\n", + "Manchester City has been turning heads with their exceptional form under the astute guidance of Pep Guardiola. The team\u2019s recent performances have not only aligned seamlessly with their manager\u2019s philosophy but have also placed them in a formidable position across various competitions. Guardiola himself expressed his satisfaction, stating, \"The team is playing well, we are in a good moment. The way we are training, the way we are playing - I am really pleased.\"\n", + "\n", + "City\u2019s prowess has been evident both domestically and in international arenas. A key factor in their success is their meticulous training regimen, which has fostered strategic flexibility, a hallmark of Guardiola\u2019s management. Over the past few matches, Manchester City has consistently maintained a high possession rate, often exceeding 60%. This high possession allows them to control the tempo and dictate the flow of the game, a crucial component of their strategy.\n", + "\n", + "A recent standout performance was their dominant victory against a top Premier League rival. In this match, City showcased their attacking capabilities and defensive solidity, managing to keep a clean sheet. The contributions of key players like Kevin De Bruyne and Erling Haaland have been instrumental. De Bruyne\u2019s creativity and passing range have opened multiple avenues for attack, while Haaland\u2019s clinical finishing has consistently troubled defenses.\n", + "\n", + "Guardiola\u2019s system, which relies heavily on positional play and fluid movement, has been a critical factor in their ability to break down opposition defenses with quick, incisive passes. The team\u2019s pressing game has also been a cornerstone of their strategy, allowing them to win back possession high up the pitch and quickly transition to attack.\n", + "\n", + "Despite the glowing form and Guardiola\u2019s positive outlook, it\u2019s important to acknowledge potential areas for improvement. While their attack is formidable, City has shown occasional vulnerability to counter-attacks, particularly when their full-backs are positioned high up the field. Addressing these defensive transitions will be crucial, especially against teams with quick counter-attacking capabilities.\n", + "\n", + "Looking forward, Manchester City\u2019s current form is a strong foundation for upcoming challenges, including key fixtures in the Premier League and the knockout stages of the UEFA Champions League. Maintaining this performance level will be essential as they pursue multiple titles. The team\u2019s depth, strategic versatility, and Guardiola\u2019s leadership will be decisive factors in sustaining their momentum.\n", + "\n", + "In conclusion, Manchester City is indeed in a \"good moment,\" as Guardiola aptly puts it. Their recent performances reflect a well-oiled machine operating at high efficiency. However, the team must remain vigilant about potential weaknesses and continue adapting tactically to ensure their current form translates into long-term success. 
As they aim for glory, the synergy between Guardiola\u2019s strategic mastermind and the players\u2019 execution will undoubtedly be the key to their triumphs.\n", + "================================================================================\n", + "\n", + "\u2705 CrewAI agents completed successfully in 37.60 seconds!\n", + "\u2705 Agents used GSI-optimized Couchbase memory storage for fast retrieval!\n", + "\u2705 Memory will persist across sessions for continued learning and context retention!\n" + ] + } + ], + "source": [ + "# Run the crew with optimized GSI memory\n", + "print(\"Running CrewAI agents with GSI-optimized memory storage...\")\n", + "start_time = time.time()\n", + "result = crew.kickoff()\n", + "execution_time = time.time() - start_time\n", + "\n", + "print(\"\\n\" + \"=\"*80)\n", + "print(\"CREWAI AGENT MEMORY DEMO RESULT\")\n", + "print(\"=\"*80)\n", + "print(result)\n", + "print(\"=\"*80)\n", + "print(f\"\\n\u2705 CrewAI agents completed successfully in {execution_time:.2f} seconds!\")\n", + "print(\"\u2705 Agents used GSI-optimized Couchbase memory storage for fast retrieval!\")\n", + "print(\"\u2705 Memory will persist across sessions for continued learning and context retention!\")" + ] + }, + { + "cell_type": "markdown", + "id": "d4500466", + "metadata": {}, + "source": [ + "## Memory Retention Testing" + ] + }, + { + "cell_type": "markdown", + "id": "283e1d9e", + "metadata": {}, + "source": [ + "### Verify Memory Storage and Retrieval" + ] + }, + { + "cell_type": "markdown", + "id": "ed828a0f", + "metadata": {}, + "source": [ + "Test that our agents successfully stored memories and can retrieve them using semantic search." + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "558ac893", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "All memory entries in Couchbase:\n", + "--------------------------------------------------------------------------------\n", + "\n", + "Memory Search Results:\n", + "--------------------------------------------------------------------------------\n", + "Context: Pep Guardiola praised Manchester City's current form, saying 'The team is playing well, we are in a good moment. The way we are training, the way we are playing - I am really pleased.'\n", + "Distance: 0.285379886892123 (lower = more similar)\n", + "--------------------------------------------------------------------------------\n", + "Context: Manchester City's recent performance analysis under Pep Guardiola reflects a team in strong form and alignment with the manager's philosophy. Guardiola's comments, \"The team is playing well, we are in a good moment. The way we are training, the way we are playing - I am really pleased,\" suggest a high level of satisfaction with both the tactical execution and the overall team ethos on the pitch.\n", + "\n", + "In recent matches, Manchester City has demonstrated their prowess in both domestic and international competitions. This form can be attributed to their meticulous training regimen and strategic flexibility, hallmarks of Guardiola's management style. Over the past few matches, City has maintained a high possession rate, often exceeding 60%, which allows them to control the tempo and dictate the flow of the game. 
Their attacking prowess is underscored by their goal-scoring statistics, often leading the league in goals scored per match.\n", + "\n", + "One standout example of their performance is their recent dominant victory against a top Premier League rival, where they not only showcased their attacking capabilities but also their defensive solidity, keeping a clean sheet. Key players such as Kevin De Bruyne and Erling Haaland have been instrumental, with De Bruyne's creativity and passing range creating numerous opportunities, while Haaland's clinical finishing has consistently troubled defenses.\n", + "\n", + "Guardiola's system relies heavily on positional play and fluid movement, which has been evident in the team's ability to break down opposition defenses through quick, incisive passes. The team's pressing game has also been a critical component, often winning back possession high up the pitch and quickly transitioning to attack.\n", + "\n", + "Despite Guardiola's positive outlook, potential biases in his comments might overlook some areas needing improvement. For instance, while their attack is formidable, there have been instances where the team has shown vulnerability to counter-attacks, particularly when full-backs are pushed high up the field. Addressing these defensive transitions could be crucial, especially against teams with quick, counter-attacking capabilities.\n", + "\n", + "Looking ahead, Manchester City's current form sets a strong foundation for upcoming challenges, including key fixtures in the Premier League and the knockout stages of the UEFA Champions League. Maintaining this level of performance will be critical as they pursue multiple titles. The team's depth, strategic versatility, and Guardiola's leadership are likely to be decisive factors in sustaining their momentum.\n", + "\n", + "In summary, Manchester City is indeed in a \"good moment,\" as Guardiola states, with their recent performances reflecting a well-oiled machine operating at high efficiency. However, keeping a vigilant eye on potential weaknesses and continuing to adapt tactically will be essential to translating their current form into long-term success.\n", + "Distance: 0.22963345721993045 (lower = more similar)\n", + "--------------------------------------------------------------------------------\n", + "Context: **Manchester City\u2019s Impeccable Form: A Reflection of Guardiola\u2019s Philosophy**\n", + "\n", + "... 
(output truncated for brevity)\n" + ] + } + ], + "source": [ + "# Wait for memories to be stored\n", + "time.sleep(2)\n", + "\n", + "# List all documents in the collection\n", + "try:\n", + " # Query to fetch all documents of this memory type\n", + " query_str = f\"SELECT META().id, * FROM `{storage.bucket_name}`.`{storage.scope_name}`.`{storage.collection_name}` WHERE memory_type = $type\"\n", + " query_result = storage.cluster.query(query_str, type=storage.type)\n", + " \n", + " print(f\"\\nAll memory entries in Couchbase:\")\n", + " print(\"-\" * 80)\n", + " for i, row in enumerate(query_result, 1):\n", + " doc_id = row.get('id')\n", + " memory_id = row.get(storage.collection_name, {}).get('memory_id', 'unknown')\n", + " content = row.get(storage.collection_name, {}).get('text', '')[:100] + \"...\" # Truncate for readability\n", + " \n", + " print(f\"Entry {i}: {memory_id}\")\n", + " print(f\"Content: {content}\")\n", + " print(\"-\" * 80)\n", + "except Exception as e:\n", + " print(f\"Failed to list memory entries: {str(e)}\")\n", + "\n", + "# Test memory retention\n", + "memory_query = \"What is Manchester City's current form according to Guardiola?\"\n", + "memory_results = storage.search(\n", + " query=memory_query,\n", + " limit=5, # Increased to see more results\n", + " score_threshold=0.0 # Lower threshold to see all results\n", + ")\n", + "\n", + "print(\"\\nMemory Search Results:\")\n", + "print(\"-\" * 80)\n", + "for result in memory_results:\n", + " print(f\"Context: {result['context']}\")\n", + " print(f\"Distance: {result['distance']} (lower = more similar)\")\n", + " print(\"-\" * 80)\n", + "\n", + "# Try a more specific query to find agent interactions\n", + "interaction_query = \"Manchester City playing style analysis tactical\"\n", + "interaction_results = storage.search(\n", + " query=interaction_query,\n", + " limit=3,\n", + " score_threshold=0.0\n", + ")\n", + "\n", + "print(\"\\nAgent Interaction Memory Results:\")\n", + "print(\"-\" * 80)\n", + "if interaction_results:\n", + " for result in interaction_results:\n", + " print(f\"Context: {result['context'][:200]}...\") # Limit output size\n", + " print(f\"Distance: {result['distance']} (lower = more similar)\")\n", + " print(\"-\" * 80)\n", + "else:\n", + " print(\"No interaction memories found. This is normal if agents haven't completed tasks yet.\")\n", + " print(\"-\" * 80)" + ] + }, + { + "cell_type": "markdown", + "id": "d23b2fbe", + "metadata": {}, + "source": [ + "## Conclusion" + ] + }, + { + "cell_type": "markdown", + "id": "d21915e5", + "metadata": {}, + "source": [ + "You've successfully implemented a custom memory backend for CrewAI agents using Couchbase GSI vector search!" 
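The conclusion above notes that the stored memories persist. One way to sanity-check that claim is to open a brand-new storage instance, as a fresh session would, and search without saving anything first. The sketch below is a minimal illustration under one assumption: the `CouchbaseStorage` class and environment variables defined earlier in this notebook are in scope, and the constructor arguments mirror the ones used when `storage` was first initialized (adjust them if your setup differs).

```python
# Minimal persistence check: a brand-new CouchbaseStorage instance stands in
# for a fresh session. No save() is called here, so any results returned must
# come from documents already persisted to Couchbase by the earlier crew run.
# Assumption: the constructor arguments mirror the earlier initialization.
fresh_storage = CouchbaseStorage(
    type="short_term",
    embedder_config={
        "provider": "openai",
        "config": {"model": "text-embedding-3-small"},
    },
)

persisted = fresh_storage.search(
    query="What did Guardiola say about Manchester City's form?",
    limit=3,
    score_threshold=0.0,
)
for entry in persisted:
    # Each result carries the stored text under "context".
    print(entry["context"][:120], "...")
```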
+ ] + } + ], + "metadata": { + "jupytext": { + "cell_metadata_filter": "-all", + "main_language": "python", + "notebook_metadata_filter": "-all" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file diff --git a/crewai-short-term-memory/gsi/frontmatter.md b/crewai-short-term-memory/query_based/frontmatter.md similarity index 100% rename from crewai-short-term-memory/gsi/frontmatter.md rename to crewai-short-term-memory/query_based/frontmatter.md diff --git a/crewai-short-term-memory/gsi/.env.sample b/crewai-short-term-memory/search_based/.env.sample similarity index 100% rename from crewai-short-term-memory/gsi/.env.sample rename to crewai-short-term-memory/search_based/.env.sample diff --git a/crewai-short-term-memory/search_based/CouchbaseStorage_Demo.ipynb b/crewai-short-term-memory/search_based/CouchbaseStorage_Demo.ipynb new file mode 100644 index 00000000..3aa71002 --- /dev/null +++ b/crewai-short-term-memory/search_based/CouchbaseStorage_Demo.ipynb @@ -0,0 +1,1212 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# CrewAI with Couchbase Short-Term Memory\n", + "\n", + "This notebook demonstrates how to implement a custom storage backend for CrewAI's memory system using Couchbase and vector search. For guidance on choosing the right vector index for your use case, see the [Couchbase documentation](https://docs.couchbase.com/server/current/vector-search/choose-the-right-vector-index.html). Alternatively, if you want to perform semantic search using Couchbase Hyperscale or Composite Vector Indexes, please take a look at [this tutorial](https://developer.couchbase.com/tutorial-crewai-short-term-memory-couchbase-with-hyperscale-or-composite-vector-index).\n", + "\n", + "Here's a breakdown of each section:\n", + "\n", + "How to run this tutorial\n", + "----------------------\n", + "This tutorial is available as a Jupyter Notebook (.ipynb file) that you can run \n", + "interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/crewai-short-term-memory/search_based/CouchbaseStorage_Demo.ipynb).\n", + "\n", + "You can either:\n", + "- Download the notebook file and run it on [Google Colab](https://colab.research.google.com)\n", + "- Run it on your system by setting up the Python environment\n", + "\n", + "Before you start\n", + "---------------\n", + "\n", + "1. Create and Deploy Your Free Tier Operational cluster on [Capella](https://cloud.couchbase.com/sign-up)\n", + " - To get started with [Couchbase Capella](https://cloud.couchbase.com), create an account and use it to deploy \n", + " a forever free tier operational cluster\n", + " - This account provides you with an environment where you can explore and learn \n", + " about Capella with no time constraint\n", + " - To learn more, please follow the [Getting Started Guide](https://docs.couchbase.com/cloud/get-started/create-account.html)\n", + "\n", + "2. 
Couchbase Capella Configuration\n", + " When running Couchbase using Capella, the following prerequisites need to be met:\n", + " - Create the database credentials to access the required bucket (Read and Write) \n", + " used in the application\n", + " - Allow access to the Cluster from the IP on which the application is running by following the [Network Security documentation](https://docs.couchbase.com/cloud/security/security.html#public-access)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Memory in AI Agents\n", + "\n", + "Memory in AI agents is a crucial capability that allows them to retain and utilize information across interactions, making them more effective and contextually aware. Without memory, agents would be limited to processing only the immediate input, lacking the ability to build upon past experiences or maintain continuity in conversations.\n", + "\n", + "> Note: This section on memory types and functionality is adapted from the CrewAI documentation.\n", + "\n", + "## Types of Memory in AI Agents\n", + "\n", + "### Short-term Memory\n", + "- Retains recent interactions and context\n", + "- Typically spans the current conversation or session \n", + "- Helps maintain coherence within a single interaction flow\n", + "- In CrewAI, this is what we're implementing with the Couchbase storage\n", + "\n", + "### Long-term Memory\n", + "- Stores persistent knowledge across multiple sessions\n", + "- Enables agents to recall past interactions even after long periods\n", + "- Helps build cumulative knowledge about users, preferences, and past decisions\n", + "- While this implementation is labeled as \"short-term memory\", the Couchbase storage backend can be effectively used for long-term memory as well, thanks to Couchbase's persistent storage capabilities and enterprise-grade durability features\n", + "\n", + "\n", + "\n", + "## How Memory Works in Agents\n", + "Memory in AI agents typically involves:\n", + "- Storage: Information is encoded and stored in a database (like Couchbase, ChromaDB, or other vector stores)\n", + "- Retrieval: Relevant memories are fetched based on semantic similarity to current context\n", + "- Integration: Retrieved memories are incorporated into the agent's reasoning process\n", + "\n", + "In the CrewAI example, the CouchbaseStorage class implements:\n", + "- save(): Stores new memories with metadata\n", + "- search(): Retrieves relevant memories based on semantic similarity\n", + "- reset(): Clears stored memories when needed\n", + "\n", + "## Benefits of Memory in AI Agents\n", + "- Contextual Understanding: Agents can refer to previous parts of a conversation\n", + "- Personalization: Remembering user preferences and past interactions\n", + "- Learning and Adaptation: Building knowledge over time to improve responses\n", + "- Task Continuity: Resuming complex tasks across multiple interactions\n", + "- Collaboration: In multi-agent systems like CrewAI, memory enables agents to build on each other's work\n", + "\n", + "## Memory in CrewAI Specifically\n", + "In CrewAI, memory serves several important functions:\n", + "- Agent Specialization: Each agent can maintain its own memory relevant to its expertise\n", + "- Knowledge Transfer: Agents can share insights through memory when collaborating on tasks\n", + "- Process Continuity: In sequential processes, later agents can access the work of earlier agents\n", + "- Contextual Awareness: Agents can reference previous findings when making decisions\n", + "\n", + "The 
vector-based approach (using embeddings) is particularly powerful because it allows for semantic search - finding memories that are conceptually related to the current context, not just exact keyword matches.\n", + "\n", + "By implementing custom storage like Couchbase, you gain additional benefits like persistence, scalability, and the ability to leverage enterprise-grade database features for your agent memory systems." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Install Required Libraries\n", + "\n", + "This section installs the necessary Python packages:\n", + "- `crewai`: The main CrewAI framework\n", + "- `langchain-couchbase`: LangChain integration for Couchbase\n", + "- `langchain-openai`: LangChain integration for OpenAI\n", + "- `python-dotenv`: For loading environment variables" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], + "source": [ + "%pip install --quiet crewai==0.186.1 langchain-couchbase==0.4.0 langchain-openai==0.3.33 python-dotenv==1.1.1" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Importing Necessary Libraries\n", + "\n", + "The script starts by importing the libraries required for the tasks ahead, including JSON handling, logging, time tracking, Couchbase connections, embedding generation, and CrewAI's agent and memory primitives." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "from typing import Any, Dict, List, Optional\n", + "import os\n", + "import logging\n", + "from datetime import timedelta\n", + "from dotenv import load_dotenv\n", + "from crewai.memory.storage.rag_storage import RAGStorage\n", + "from crewai.memory.short_term.short_term_memory import ShortTermMemory\n", + "from crewai import Agent, Crew, Task, Process\n", + "from couchbase.cluster import Cluster\n", + "from couchbase.options import ClusterOptions\n", + "from couchbase.auth import PasswordAuthenticator\n", + "from couchbase.diagnostics import PingState, ServiceType\n", + "from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore\n", + "from langchain_openai import OpenAIEmbeddings, ChatOpenAI\n", + "import time\n", + "import json\n", + "import uuid\n", + "\n", + "# Configure logging (disabled)\n", + "logging.basicConfig(level=logging.CRITICAL)\n", + "logger = logging.getLogger(__name__)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Loading Sensitive Information\n", + "\n", + "In this section, we load the essential configuration settings the notebook needs. These settings include sensitive information such as database credentials and specific configuration names. Instead of hardcoding these details into the script, we read them from the environment at runtime, ensuring flexibility and security.\n", + "\n", + "The script uses environment variables to store sensitive information, enhancing the overall security and maintainability of your code by avoiding hardcoded values.\n", + "\n", + "### Setting Up Environment Variables\n", + "\n", + "> **Note:** This implementation reads configuration parameters from environment variables. 
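For quick experiments, the variables can also be set from Python before anything reads them; the sketch below is a hypothetical stand-in for a `.env` file, with placeholder values and the same variable names as the list that follows.

```python
# Hypothetical alternative to a .env file: export the configuration from
# Python before the notebook reads it. Every value below is a placeholder;
# the optional variables fall back to the defaults noted in this section.
import os

os.environ.setdefault("OPENAI_API_KEY", "sk-...your-key...")
os.environ.setdefault("CB_HOST", "couchbases://cb.example.com")
os.environ.setdefault("CB_USERNAME", "your-database-username")
os.environ.setdefault("CB_PASSWORD", "your-database-password")
os.environ.setdefault("CB_BUCKET_NAME", "vector-search-testing")
os.environ.setdefault("SCOPE_NAME", "shared")
os.environ.setdefault("COLLECTION_NAME", "crew")
os.environ.setdefault("INDEX_NAME", "vector_search_crew")
```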
Before running this notebook, you need to set the following environment variables:\n", + ">\n", + "> - `OPENAI_API_KEY`: Your OpenAI API key for generating embeddings\n", + "> - `CB_HOST`: Couchbase cluster connection string (e.g., \"couchbases://cb.example.com\")\n", + "> - `CB_USERNAME`: Username for Couchbase authentication\n", + "> - `CB_PASSWORD`: Password for Couchbase authentication\n", + "> - `CB_BUCKET_NAME` (optional): Bucket name (defaults to \"vector-search-testing\")\n", + "> - `SCOPE_NAME` (optional): Scope name (defaults to \"shared\")\n", + "> - `COLLECTION_NAME` (optional): Collection name (defaults to \"crew\")\n", + "> - `INDEX_NAME` (optional): Vector search index name (defaults to \"vector_search_crew\")\n", + ">\n", + "> You can set these variables in a `.env` file in the same directory as this notebook, or set them directly in your environment." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "load_dotenv(\"./.env\")\n", + "\n", + "# Verify environment variables\n", + "required_vars = ['OPENAI_API_KEY', 'CB_HOST', 'CB_USERNAME', 'CB_PASSWORD']\n", + "for var in required_vars:\n", + " if not os.getenv(var):\n", + " raise ValueError(f\"{var} environment variable is required\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Implement CouchbaseStorage\n", + "\n", + "This section demonstrates the implementation of a custom vector storage solution using Couchbase:\n", + "\n", + "> **Note on Implementation:** This example uses the LangChain Couchbase integration (`langchain_couchbase`) for simplicity and to demonstrate integration with the broader LangChain ecosystem. In production environments, you may want to use the Couchbase SDK directly for better performance and more control.\n", + "\n", + "> For more information on using the Couchbase SDK directly, refer to:\n", + "> - [Couchbase Python SDK Documentation](https://docs.couchbase.com/python-sdk/current/howtos/full-text-searching-with-sdk.html#single-vector-query)" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "class CouchbaseStorage(RAGStorage):\n", + " \"\"\"\n", + " Extends RAGStorage to handle embeddings for memory entries using Couchbase.\n", + " \"\"\"\n", + "\n", + " def __init__(self, type: str, allow_reset: bool = True, embedder_config: Optional[Dict[str, Any]] = None, crew: Optional[Any] = None):\n", + " \"\"\"Initialize CouchbaseStorage with configuration.\"\"\"\n", + " super().__init__(type, allow_reset, embedder_config, crew)\n", + " self._initialize_app()\n", + "\n", + " def search(\n", + " self,\n", + " query: str,\n", + " limit: int = 3,\n", + " filter: Optional[dict] = None,\n", + " score_threshold: float = 0,\n", + " ) -> List[Dict[str, Any]]:\n", + " \"\"\"\n", + " Search memory entries using vector similarity.\n", + " \"\"\"\n", + " try:\n", + " # Add type filter\n", + " search_filter = {\"memory_type\": self.type}\n", + " if filter:\n", + " search_filter.update(filter)\n", + "\n", + " # Execute search\n", + " results = self.vector_store.similarity_search_with_score(\n", + " query,\n", + " k=limit,\n", + " filter=search_filter\n", + " )\n", + " \n", + " # Format results and deduplicate by content\n", + " seen_contents = set()\n", + " formatted_results = []\n", + " \n", + " for i, (doc, score) in enumerate(results):\n", + " if score >= score_threshold:\n", + " content = doc.page_content\n", + " if content not in seen_contents:\n", + " 
seen_contents.add(content)\n", + " formatted_results.append({\n", + " \"id\": doc.metadata.get(\"memory_id\", str(i)),\n", + " \"metadata\": doc.metadata,\n", + " \"context\": content,\n", + " \"score\": float(score)\n", + " })\n", + " \n", + " logger.info(f\"Found {len(formatted_results)} unique results for query: {query}\")\n", + " return formatted_results\n", + "\n", + " except Exception as e:\n", + " logger.error(f\"Search failed: {str(e)}\")\n", + " return []\n", + "\n", + " def save(self, value: Any, metadata: Dict[str, Any]) -> None:\n", + " \"\"\"\n", + " Save a memory entry with metadata.\n", + " \"\"\"\n", + " try:\n", + " # Generate unique ID\n", + " memory_id = str(uuid.uuid4())\n", + " timestamp = int(time.time() * 1000)\n", + " \n", + " # Prepare metadata (create a copy to avoid modifying references)\n", + " if not metadata:\n", + " metadata = {}\n", + " else:\n", + " metadata = metadata.copy() # Create a copy to avoid modifying references\n", + " \n", + " # Process agent-specific information if present\n", + " agent_name = metadata.get('agent', 'unknown')\n", + " \n", + " # Clean up value if it has the typical LLM response format\n", + " value_str = str(value)\n", + " if \"Final Answer:\" in value_str:\n", + " # Extract just the actual content - everything after \"Final Answer:\"\n", + " parts = value_str.split(\"Final Answer:\", 1)\n", + " if len(parts) > 1:\n", + " value = parts[1].strip()\n", + " logger.info(f\"Cleaned up response format for agent: {agent_name}\")\n", + " elif value_str.startswith(\"Thought:\"):\n", + " # Handle thought/final answer format\n", + " if \"Final Answer:\" in value_str:\n", + " parts = value_str.split(\"Final Answer:\", 1)\n", + " if len(parts) > 1:\n", + " value = parts[1].strip()\n", + " logger.info(f\"Cleaned up thought process format for agent: {agent_name}\")\n", + " \n", + " # Update metadata\n", + " metadata.update({\n", + " \"memory_id\": memory_id,\n", + " \"memory_type\": self.type,\n", + " \"timestamp\": timestamp,\n", + " \"source\": \"crewai\"\n", + " })\n", + "\n", + " # Log memory information for debugging\n", + " value_preview = str(value)[:100] + \"...\" if len(str(value)) > 100 else str(value)\n", + " metadata_preview = {k: v for k, v in metadata.items() if k != \"embedding\"}\n", + " logger.info(f\"Saving memory for Agent: {agent_name}\")\n", + " logger.info(f\"Memory value preview: {value_preview}\")\n", + " logger.info(f\"Memory metadata: {metadata_preview}\")\n", + " \n", + " # Convert value to string if needed\n", + " if isinstance(value, (dict, list)):\n", + " value = json.dumps(value)\n", + " elif not isinstance(value, str):\n", + " value = str(value)\n", + "\n", + " # Save to vector store\n", + " self.vector_store.add_texts(\n", + " texts=[value],\n", + " metadatas=[metadata],\n", + " ids=[memory_id]\n", + " )\n", + " logger.info(f\"Saved memory {memory_id}: {value[:100]}...\")\n", + "\n", + " except Exception as e:\n", + " logger.error(f\"Save failed: {str(e)}\")\n", + " raise\n", + "\n", + " def reset(self) -> None:\n", + " \"\"\"Reset the memory storage if allowed.\"\"\"\n", + " if not self.allow_reset:\n", + " return\n", + "\n", + " try:\n", + " # Delete documents of this memory type\n", + " self.cluster.query(\n", + " f\"DELETE FROM `{self.bucket_name}`.`{self.scope_name}`.`{self.collection_name}` WHERE memory_type = $type\",\n", + " type=self.type\n", + " ).execute()\n", + " logger.info(f\"Reset memory type: {self.type}\")\n", + " except Exception as e:\n", + " logger.error(f\"Reset failed: {str(e)}\")\n", + " 
raise\n", + "\n", + " def _initialize_app(self):\n", + " \"\"\"Initialize Couchbase connection and vector store.\"\"\"\n", + " try:\n", + " # Initialize embeddings\n", + " if self.embedder_config and self.embedder_config.get(\"provider\") == \"openai\":\n", + " self.embeddings = OpenAIEmbeddings(\n", + " openai_api_key=os.getenv('OPENAI_API_KEY'),\n", + " model=self.embedder_config.get(\"config\", {}).get(\"model\", \"text-embedding-3-small\")\n", + " )\n", + " else:\n", + " self.embeddings = OpenAIEmbeddings(\n", + " openai_api_key=os.getenv('OPENAI_API_KEY'),\n", + " model=\"text-embedding-3-small\"\n", + " )\n", + "\n", + " # Connect to Couchbase\n", + " auth = PasswordAuthenticator(\n", + " os.getenv('CB_USERNAME', ''),\n", + " os.getenv('CB_PASSWORD', '')\n", + " )\n", + " options = ClusterOptions(auth)\n", + " \n", + " # Initialize cluster connection\n", + " self.cluster = Cluster(os.getenv('CB_HOST', ''), options)\n", + " self.cluster.wait_until_ready(timedelta(seconds=5))\n", + "\n", + " # Check search service\n", + " ping_result = self.cluster.ping()\n", + " search_available = False\n", + " for service_type, endpoints in ping_result.endpoints.items():\n", + " if service_type == ServiceType.Search:\n", + " for endpoint in endpoints:\n", + " if endpoint.state == PingState.OK:\n", + " search_available = True\n", + " logger.info(f\"Search service is responding at: {endpoint.remote}\")\n", + " break\n", + " break\n", + " if not search_available:\n", + " raise RuntimeError(\"Search/FTS service not found or not responding\")\n", + " \n", + " # Set up storage configuration\n", + " self.bucket_name = os.getenv('CB_BUCKET_NAME', 'vector-search-testing')\n", + " self.scope_name = os.getenv('SCOPE_NAME', 'shared')\n", + " self.collection_name = os.getenv('COLLECTION_NAME', 'crew')\n", + " self.index_name = os.getenv('INDEX_NAME', 'vector_search_crew')\n", + "\n", + " # Initialize vector store\n", + " self.vector_store = CouchbaseSearchVectorStore(\n", + " cluster=self.cluster,\n", + " bucket_name=self.bucket_name,\n", + " scope_name=self.scope_name,\n", + " collection_name=self.collection_name,\n", + " embedding=self.embeddings,\n", + " index_name=self.index_name\n", + " )\n", + " logger.info(f\"Initialized CouchbaseStorage for type: {self.type}\")\n", + "\n", + " except Exception as e:\n", + " logger.error(f\"Initialization failed: {str(e)}\")\n", + " raise" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Test Basic Storage\n", + "\n", + "Test storing and retrieving a simple memory:" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "# Initialize storage\n", + "storage = CouchbaseStorage(\n", + " type=\"short_term\",\n", + " embedder_config={\n", + " \"provider\": \"openai\",\n", + " \"config\": {\"model\": \"text-embedding-3-small\"}\n", + " }\n", + ")\n", + "\n", + "# Reset storage\n", + "storage.reset()\n", + "\n", + "# Test storage\n", + "test_memory = \"Pep Guardiola praised Manchester City's current form, saying 'The team is playing well, we are in a good moment. 
The way we are training, the way we are playing - I am really pleased.'\"\n", + "test_metadata = {\"category\": \"sports\", \"test\": \"initial_memory\"}\n", + "storage.save(test_memory, test_metadata)\n", + "\n", + "# Test search\n", + "results = storage.search(\"What did Guardiola say about Manchester City?\", limit=1)\n", + "for result in results:\n", + " print(f\"Found: {result['context']}\\nScore: {result['score']}\\nMetadata: {result['metadata']}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Test CrewAI Integration\n", + "\n", + "Create agents and tasks to test memory retention:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
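Before handing the storage to CrewAI, note how `score_threshold` behaves: in the `CouchbaseStorage.search()` implementation above, a result is kept when `score >= score_threshold`, so higher scores indicate stronger matches and raising the threshold filters out weak ones. The following is a minimal sketch, assuming the `storage` instance created in the Test Basic Storage cell; the `0.5` cutoff is an arbitrary illustration to tune for your embedding model and index.

```python
# Minimal sketch: raise score_threshold to keep only strong matches.
# In CouchbaseStorage.search(), a result is kept when score >= score_threshold,
# so higher scores mean closer matches for this Search index.
# Assumption: 0.5 is an illustrative cutoff, not a recommended value.
strict_results = storage.search(
    query="What did Guardiola say about Manchester City?",
    limit=5,
    score_threshold=0.5,
)
for r in strict_results:
    print(f"score={r['score']:.3f}  context={r['context'][:80]}...")
```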
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Crew Execution Started \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502  Crew Execution Started                                                                                         \u2502\n",
+       "\u2502  Name: crew                                                                                                     \u2502\n",
+       "\u2502  ID: 7ac56ae1-b62f-4b07-952c-104a7243edb0                                                                       \u2502\n",
+       "\u2502  Tool Args:                                                                                                     \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[36m\u256d\u2500\u001b[0m\u001b[36m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[36m Crew Execution Started \u001b[0m\u001b[36m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[36m\u2500\u256e\u001b[0m\n", + "\u001b[36m\u2502\u001b[0m \u001b[36m\u2502\u001b[0m\n", + "\u001b[36m\u2502\u001b[0m \u001b[1;36mCrew Execution Started\u001b[0m \u001b[36m\u2502\u001b[0m\n", + "\u001b[36m\u2502\u001b[0m \u001b[37mName: \u001b[0m\u001b[36mcrew\u001b[0m \u001b[36m\u2502\u001b[0m\n", + "\u001b[36m\u2502\u001b[0m \u001b[37mID: \u001b[0m\u001b[36m7ac56ae1-b62f-4b07-952c-104a7243edb0\u001b[0m \u001b[36m\u2502\u001b[0m\n", + "\u001b[36m\u2502\u001b[0m \u001b[37mTool Args: \u001b[0m \u001b[36m\u2502\u001b[0m\n", + "\u001b[36m\u2502\u001b[0m \u001b[36m\u2502\u001b[0m\n", + "\u001b[36m\u2502\u001b[0m \u001b[36m\u2502\u001b[0m\n", + "\u001b[36m\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n",
+       "
\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n",
+       "\"ipywidgets\" for Jupyter support\n",
+       "  warnings.warn('install \"ipywidgets\" for Jupyter support')\n",
+       "
\n" + ], + "text/plain": [ + "/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n", + "\"ipywidgets\" for Jupyter support\n", + " warnings.warn('install \"ipywidgets\" for Jupyter support')\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n"
+      ],
+      "text/plain": []
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 \ud83e\udde0 Retrieved Memory \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502  Historical Data:                                                                                               \u2502\n",
+       "\u2502  - Ensure that the analysis contains specific examples or statistics to support the claims made about team      \u2502\n",
+       "\u2502  performance.                                                                                                   \u2502\n",
+       "\u2502  - Include insights from other sources or viewpoints to provide a well-rounded analysis.                        \u2502\n",
+       "\u2502  - Provide a comparison with past performance to highlight improvements or consistencies.                       \u2502\n",
+       "\u2502  - Include player-specific analysis if individual performance is hinted at in the comments.                     \u2502\n",
+       "\u2502  Entities:                                                                                                      \u2502\n",
+       "\u2502  - Pep Guardiola(Football Manager): The current manager of Manchester City, known fo...                         \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Retrieval Time: 1384.18ms \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[32m\u256d\u2500\u001b[0m\u001b[32m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[32m \ud83e\udde0 Retrieved Memory \u001b[0m\u001b[32m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[32m\u2500\u256e\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37mHistorical Data:\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37m- Ensure that the analysis contains specific examples or statistics to support the claims made about team \u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37mperformance.\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37m- Include insights from other sources or viewpoints to provide a well-rounded analysis.\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37m- Provide a comparison with past performance to highlight improvements or consistencies.\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37m- Include player-specific analysis if individual performance is hinted at in the comments.\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37mEntities:\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37m- Pep Guardiola(Football Manager): The current manager of Manchester City, known fo...\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2570\u2500\u001b[0m\u001b[32m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[32m Retrieval Time: 1384.18ms \u001b[0m\u001b[32m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[32m\u2500\u256f\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n",
+       "
\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 \ud83e\udd16 Agent Started \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502  Agent: Sports Analyst                                                                                          \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502  Task: Analyze Manchester City's recent performance based on Pep Guardiola's comments: \"The team is playing     \u2502\n",
+       "\u2502  well, we are in a good moment. The way we are training, the way we are playing - I am really pleased.\"         \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[35m\u256d\u2500\u001b[0m\u001b[35m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[35m \ud83e\udd16 Agent Started \u001b[0m\u001b[35m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[35m\u2500\u256e\u001b[0m\n", + "\u001b[35m\u2502\u001b[0m \u001b[35m\u2502\u001b[0m\n", + "\u001b[35m\u2502\u001b[0m \u001b[37mAgent: \u001b[0m\u001b[1;92mSports Analyst\u001b[0m \u001b[35m\u2502\u001b[0m\n", + "\u001b[35m\u2502\u001b[0m \u001b[35m\u2502\u001b[0m\n", + "\u001b[35m\u2502\u001b[0m \u001b[37mTask: \u001b[0m\u001b[92mAnalyze Manchester City's recent performance based on Pep Guardiola's comments: \"The team is playing \u001b[0m \u001b[35m\u2502\u001b[0m\n", + "\u001b[35m\u2502\u001b[0m \u001b[92mwell, we are in a good moment. The way we are training, the way we are playing - I am really pleased.\"\u001b[0m \u001b[35m\u2502\u001b[0m\n", + "\u001b[35m\u2502\u001b[0m \u001b[35m\u2502\u001b[0m\n", + "\u001b[35m\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n",
+       "
\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n"
+      ],
+      "text/plain": []
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "
\n",
+       "
\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n",
+       "\"ipywidgets\" for Jupyter support\n",
+       "  warnings.warn('install \"ipywidgets\" for Jupyter support')\n",
+       "
\n" + ], + "text/plain": [ + "/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n", + "\"ipywidgets\" for Jupyter support\n", + " warnings.warn('install \"ipywidgets\" for Jupyter support')\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n",
+       "\"ipywidgets\" for Jupyter support\n",
+       "  warnings.warn('install \"ipywidgets\" for Jupyter support')\n",
+       "
\n" + ], + "text/plain": [ + "/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n", + "\"ipywidgets\" for Jupyter support\n", + " warnings.warn('install \"ipywidgets\" for Jupyter support')\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n",
+       "\"ipywidgets\" for Jupyter support\n",
+       "  warnings.warn('install \"ipywidgets\" for Jupyter support')\n",
+       "
\n" + ], + "text/plain": [ + "/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n", + "\"ipywidgets\" for Jupyter support\n", + " warnings.warn('install \"ipywidgets\" for Jupyter support')\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n"
+      ],
+      "text/plain": []
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Task Completion \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502  Task Completed                                                                                                 \u2502\n",
+       "\u2502  Name: 721d99b2-ac47-4976-8862-364bb668075e                                                                     \u2502\n",
+       "\u2502  Agent: Sports Analyst                                                                                          \u2502\n",
+       "\u2502  Tool Args:                                                                                                     \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[32m\u256d\u2500\u001b[0m\u001b[32m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[32m Task Completion \u001b[0m\u001b[32m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[32m\u2500\u256e\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[1;32mTask Completed\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37mName: \u001b[0m\u001b[32m721d99b2-ac47-4976-8862-364bb668075e\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37mAgent: \u001b[0m\u001b[32mSports Analyst\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37mTool Args: \u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n",
+       "
\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n",
+       "\"ipywidgets\" for Jupyter support\n",
+       "  warnings.warn('install \"ipywidgets\" for Jupyter support')\n",
+       "
\n" + ], + "text/plain": [ + "/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n", + "\"ipywidgets\" for Jupyter support\n", + " warnings.warn('install \"ipywidgets\" for Jupyter support')\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n"
+      ],
+      "text/plain": []
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 \ud83e\udde0 Retrieved Memory \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502  Historical Data:                                                                                               \u2502\n",
+       "\u2502  - Include specific quotes from Guardiola to enhance credibility.                                               \u2502\n",
+       "\u2502  - Incorporate statistical data or match results to provide more depth.                                         \u2502\n",
+       "\u2502  - Discuss recent matches or events in more detail.                                                             \u2502\n",
+       "\u2502  - Add perspectives from players or other analysts for a more rounded view.                                     \u2502\n",
+       "\u2502  - Include potential future challenges for Manchester City.                                                     \u2502\n",
+       "\u2502  Entities:                                                                                                      \u2502\n",
+       "\u2502  - Pep Guardiola(Individual): The manager of Manchester City, known for his tactical acumen and positive        \u2502\n",
+       "\u2502  remarks about the team's performance.                                                                          \u2502\n",
+       "\u2502  - Manch...                                                                                                     \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Retrieval Time: 991.13ms \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[32m\u256d\u2500\u001b[0m\u001b[32m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[32m \ud83e\udde0 Retrieved Memory \u001b[0m\u001b[32m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[32m\u2500\u256e\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37mHistorical Data:\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37m- Include specific quotes from Guardiola to enhance credibility.\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37m- Incorporate statistical data or match results to provide more depth.\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37m- Discuss recent matches or events in more detail.\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37m- Add perspectives from players or other analysts for a more rounded view.\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37m- Include potential future challenges for Manchester City.\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37mEntities:\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37m- Pep Guardiola(Individual): The manager of Manchester City, known for his tactical acumen and positive \u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37mremarks about the team's performance.\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37m- Manch...\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2570\u2500\u001b[0m\u001b[32m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[32m Retrieval Time: 991.13ms \u001b[0m\u001b[32m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[32m\u2500\u256f\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n",
+       "
\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 \ud83e\udd16 Agent Started \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502  Agent: Sports Journalist                                                                                       \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502  Task: Write a sports article about Manchester City's form using the analysis and Guardiola's comments.         \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[35m\u256d\u2500\u001b[0m\u001b[35m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[35m \ud83e\udd16 Agent Started \u001b[0m\u001b[35m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[35m\u2500\u256e\u001b[0m\n", + "\u001b[35m\u2502\u001b[0m \u001b[35m\u2502\u001b[0m\n", + "\u001b[35m\u2502\u001b[0m \u001b[37mAgent: \u001b[0m\u001b[1;92mSports Journalist\u001b[0m \u001b[35m\u2502\u001b[0m\n", + "\u001b[35m\u2502\u001b[0m \u001b[35m\u2502\u001b[0m\n", + "\u001b[35m\u2502\u001b[0m \u001b[37mTask: \u001b[0m\u001b[92mWrite a sports article about Manchester City's form using the analysis and Guardiola's comments.\u001b[0m \u001b[35m\u2502\u001b[0m\n", + "\u001b[35m\u2502\u001b[0m \u001b[35m\u2502\u001b[0m\n", + "\u001b[35m\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n",
+       "
\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n"
+      ],
+      "text/plain": []
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "
\n",
+       "
\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n",
+       "\"ipywidgets\" for Jupyter support\n",
+       "  warnings.warn('install \"ipywidgets\" for Jupyter support')\n",
+       "
\n" + ], + "text/plain": [ + "/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n", + "\"ipywidgets\" for Jupyter support\n", + " warnings.warn('install \"ipywidgets\" for Jupyter support')\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n",
+       "\"ipywidgets\" for Jupyter support\n",
+       "  warnings.warn('install \"ipywidgets\" for Jupyter support')\n",
+       "
\n" + ], + "text/plain": [ + "/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n", + "\"ipywidgets\" for Jupyter support\n", + " warnings.warn('install \"ipywidgets\" for Jupyter support')\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n",
+       "\"ipywidgets\" for Jupyter support\n",
+       "  warnings.warn('install \"ipywidgets\" for Jupyter support')\n",
+       "
\n" + ], + "text/plain": [ + "/Users/viraj.agarwal/Tasks/Task10/.venv/lib/python3.13/site-packages/rich/live.py:256: UserWarning: install \n", + "\"ipywidgets\" for Jupyter support\n", + " warnings.warn('install \"ipywidgets\" for Jupyter support')\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n"
+      ],
+      "text/plain": []
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Task Completion \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502  Task Completed                                                                                                 \u2502\n",
+       "\u2502  Name: 4fac1a2b-0fd1-484e-afe6-a4d4af236bd4                                                                     \u2502\n",
+       "\u2502  Agent: Sports Journalist                                                                                       \u2502\n",
+       "\u2502  Tool Args:                                                                                                     \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[32m\u256d\u2500\u001b[0m\u001b[32m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[32m Task Completion \u001b[0m\u001b[32m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[32m\u2500\u256e\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[1;32mTask Completed\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37mName: \u001b[0m\u001b[32m4fac1a2b-0fd1-484e-afe6-a4d4af236bd4\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37mAgent: \u001b[0m\u001b[32mSports Journalist\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37mTool Args: \u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n",
+       "
\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n",
+       "
\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Crew Result:\n", + "--------------------------------------------------------------------------------\n", + "**Manchester City's Resilient Form Under Guardiola: A Symphony of Strategy and Skill**\n", + "\n", + "In the ever-competitive landscape of the Premier League, Manchester City continues to set the benchmark for excellence, guided by the strategic genius of Pep Guardiola. Reflecting on their current form, Guardiola's satisfaction is palpable: \"The team is playing well, we are in a good moment. The way we are training, the way we are playing - I am really pleased.\" These words not only highlight the team's current high morale but also underline the effectiveness of their training routines and the cohesive unit that Guardiola has meticulously crafted.\n", + "\n", + "Historically, Manchester City has been a juggernaut in English football, and their recent performances are a testament to their sustained dominance. Their consistency in maintaining high possession rates and crafting scoring opportunities is unparalleled. Statistically, City often leads in metrics such as ball possession and pass accuracy, with figures regularly surpassing 60% possession in matches, illustrating their control and domination on the pitch.\n", + "\n", + "Key to their success has been the stellar performances of individual players. Kevin De Bruyne's vision and precise passing have been instrumental in creating goal-scoring chances, while Erling Haaland's formidable goal-scoring abilities add a lethal edge to City's attack. Phil Foden's adaptability and technical prowess offer Guardiola the flexibility to shuffle tactics seamlessly. This trident of talent epitomizes the blend of skill and strategy that City embodies.\n", + "\n", + "Defensively, Manchester City has shown marked improvement, a testament to Guardiola's focus on fortifying the backline. Their defensive solidity, coupled with an attacking flair, makes them a daunting adversary for any team. Guardiola's ability to adapt tactics to counter various styles of play is a hallmark of his tenure, ensuring City remains at the pinnacle of competition both domestically and on the European stage.\n", + "\n", + "Analysts and pundits echo Guardiola's sentiments, praising Manchester City's ability to maintain elite standards and adapt to challenges with finesse. This holistic approach\u2014encompassing rigorous training, strategic gameplay, and individual brilliance\u2014cements Manchester City's status as leaders in football excellence.\n", + "\n", + "However, the journey is far from over. As they navigate the rigors of the Premier League and European competitions, potential challenges loom. Sustaining fitness levels, managing squad rotations, and countering tactical innovations from rivals will be pivotal. Yet, with Guardiola at the helm, Manchester City is well-equipped to tackle these challenges head-on.\n", + "\n", + "In conclusion, Manchester City's current form is a shining example of Guardiola's managerial prowess and the team's harmonious performance. Their continued success is a blend of strategic training, tactical adaptability, and outstanding individual contributions, positioning them as formidable contenders in any arena. 
As the season unfolds, fans and analysts alike will watch with bated breath to see how this footballing symphony continues to play out.\n", + "--------------------------------------------------------------------------------\n" + ] + } + ], + "source": [ + "# Initialize ShortTermMemory with our storage\n", + "memory = ShortTermMemory(storage=storage)\n", + "\n", + "# Initialize language model\n", + "llm = ChatOpenAI(\n", + " model=\"gpt-4o\",\n", + " temperature=0.7\n", + ")\n", + "\n", + "# Create agents with memory\n", + "sports_analyst = Agent(\n", + " role='Sports Analyst',\n", + " goal='Analyze Manchester City performance',\n", + " backstory='Expert at analyzing football teams and providing insights on their performance',\n", + " llm=llm,\n", + " memory=True,\n", + " memory_storage=memory\n", + ")\n", + "\n", + "journalist = Agent(\n", + " role='Sports Journalist',\n", + " goal='Create engaging football articles',\n", + " backstory='Experienced sports journalist who specializes in Premier League coverage',\n", + " llm=llm,\n", + " memory=True,\n", + " memory_storage=memory\n", + ")\n", + "\n", + "# Create tasks\n", + "analysis_task = Task(\n", + " description='Analyze Manchester City\\'s recent performance based on Pep Guardiola\\'s comments: \"The team is playing well, we are in a good moment. The way we are training, the way we are playing - I am really pleased.\"',\n", + " agent=sports_analyst,\n", + " expected_output=\"A comprehensive analysis of Manchester City's current form based on Guardiola's comments.\"\n", + ")\n", + "\n", + "writing_task = Task(\n", + " description='Write a sports article about Manchester City\\'s form using the analysis and Guardiola\\'s comments.',\n", + " agent=journalist,\n", + " context=[analysis_task],\n", + " expected_output=\"An engaging sports article about Manchester City's current form and Guardiola's perspective.\"\n", + ")\n", + "\n", + "# Create crew with memory\n", + "crew = Crew(\n", + " agents=[sports_analyst, journalist],\n", + " tasks=[analysis_task, writing_task],\n", + " process=Process.sequential,\n", + " memory=True,\n", + " short_term_memory=memory, # Explicitly pass our memory implementation\n", + " verbose=True\n", + ")\n", + "\n", + "# Run the crew\n", + "result = crew.kickoff()\n", + "\n", + "print(\"\\nCrew Result:\")\n", + "print(\"-\" * 80)\n", + "print(result)\n", + "print(\"-\" * 80)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Test Memory Retention\n", + "\n", + "Query the stored memories to verify retention:" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "All memory entries in Couchbase:\n", + "--------------------------------------------------------------------------------\n", + "\n", + "Memory Search Results:\n", + "--------------------------------------------------------------------------------\n", + "\n", + "Agent Interaction Memory Results:\n", + "--------------------------------------------------------------------------------\n" + ] + } + ], + "source": [ + "# Wait for memories to be stored\n", + "time.sleep(2)\n", + "\n", + "# List all documents in the collection\n", + "try:\n", + " # Query to fetch all documents of this memory type\n", + " query_str = f\"SELECT META().id, * FROM `{storage.bucket_name}`.`{storage.scope_name}`.`{storage.collection_name}` WHERE memory_type = $type\"\n", + " query_result = storage.cluster.query(query_str, type=storage.type)\n", + " 
\n", + " print(f\"\\nAll memory entries in Couchbase:\")\n", + " print(\"-\" * 80)\n", + " for i, row in enumerate(query_result, 1):\n", + " doc_id = row.get('id')\n", + " memory_id = row.get(storage.collection_name, {}).get('memory_id', 'unknown')\n", + " content = row.get(storage.collection_name, {}).get('text', '')[:100] + \"...\" # Truncate for readability\n", + " \n", + " print(f\"Entry {i}:\")\n", + " print(f\"ID: {doc_id}\")\n", + " print(f\"Memory ID: {memory_id}\")\n", + " print(f\"Content: {content}\")\n", + " print(\"-\" * 80)\n", + "except Exception as e:\n", + " print(f\"Failed to list memory entries: {str(e)}\")\n", + "\n", + "# Test memory retention\n", + "memory_query = \"What is Manchester City's current form according to Guardiola?\"\n", + "memory_results = storage.search(\n", + " query=memory_query,\n", + " limit=5, # Increased to see more results\n", + " score_threshold=0.0 # Lower threshold to see all results\n", + ")\n", + "\n", + "print(\"\\nMemory Search Results:\")\n", + "print(\"-\" * 80)\n", + "for result in memory_results:\n", + " print(f\"Context: {result['context']}\")\n", + " print(f\"Score: {result['score']}\")\n", + " print(\"-\" * 80)\n", + "\n", + "# Try a more specific query to find agent interactions\n", + "interaction_query = \"Manchester City playing style analysis tactical\"\n", + "interaction_results = storage.search(\n", + " query=interaction_query,\n", + " limit=5,\n", + " score_threshold=0.0\n", + ")\n", + "\n", + "print(\"\\nAgent Interaction Memory Results:\")\n", + "print(\"-\" * 80)\n", + "for result in interaction_results:\n", + " print(f\"Context: {result['context'][:200]}...\") # Limit output size\n", + " print(f\"Score: {result['score']}\")\n", + " print(\"-\" * 80)\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.7" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} \ No newline at end of file diff --git a/crewai-short-term-memory/fts/crew_index.json b/crewai-short-term-memory/search_based/crew_index.json similarity index 100% rename from crewai-short-term-memory/fts/crew_index.json rename to crewai-short-term-memory/search_based/crew_index.json diff --git a/crewai-short-term-memory/fts/frontmatter.md b/crewai-short-term-memory/search_based/frontmatter.md similarity index 100% rename from crewai-short-term-memory/fts/frontmatter.md rename to crewai-short-term-memory/search_based/frontmatter.md diff --git a/crewai/fts/RAG_with_Couchbase_and_CrewAI.ipynb b/crewai/fts/RAG_with_Couchbase_and_CrewAI.ipynb deleted file mode 100644 index ad9c4e61..00000000 --- a/crewai/fts/RAG_with_Couchbase_and_CrewAI.ipynb +++ /dev/null @@ -1,1464 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Introduction\n", - "\n", - "In this guide, we will walk you through building a powerful semantic search engine using [Couchbase](https://www.couchbase.com) as the backend database and [CrewAI](https://github.com/crewAIInc/crewAI) for agent-based RAG operations. CrewAI allows us to create specialized agents that can work together to handle different aspects of the RAG workflow, from document retrieval to response generation. 
This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system from scratch. Alternatively, if you want to perform semantic search using the GSI index, please take a look at [this tutorial](https://developer.couchbase.com/tutorial-crewai-couchbase-rag-with-global-secondary-index).\n",
-    "\n",
-    "How to run this tutorial\n",
-    "----------------------\n",
-    "This tutorial is available as a Jupyter Notebook (.ipynb file) that you can run \n",
-    "interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/crewai/fts/RAG_with_Couchbase_and_CrewAI.ipynb).\n",
-    "\n",
-    "You can either:\n",
-    "- Download the notebook file and run it on [Google Colab](https://colab.research.google.com)\n",
-    "- Run it on your system by setting up the Python environment\n",
-    "\n",
-    "Before you start\n",
-    "---------------\n",
-    "\n",
-    "1. Create and Deploy Your Free Tier Operational Cluster on [Capella](https://cloud.couchbase.com/sign-up)\n",
-    "   - To get started with [Couchbase Capella](https://cloud.couchbase.com), create an account and use it to deploy \n",
-    "     a forever free tier operational cluster\n",
-    "   - This account provides you with an environment where you can explore and learn \n",
-    "     about Capella with no time constraint\n",
-    "   - To learn more, please follow the [Getting Started Guide](https://docs.couchbase.com/cloud/get-started/create-account.html)\n",
-    "\n",
-    "2. Couchbase Capella Configuration\n",
-    "   When running Couchbase using Capella, the following prerequisites need to be met:\n",
-    "   - Create the database credentials to access the required bucket (Read and Write) used in the application\n",
-    "   - Allow access to the Cluster from the IP on which the application is running by following the [Network Security documentation](https://docs.couchbase.com/cloud/security/security.html#public-access)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "# Setting the Stage: Installing Necessary Libraries\n",
-    "\n",
-    "We'll install the following key libraries:\n",
-    "- `datasets`: For loading and managing our training data\n",
-    "- `langchain-couchbase`: To integrate Couchbase with LangChain for vector storage and caching\n",
-    "- `langchain-openai`: For accessing OpenAI's embedding and chat models\n",
-    "- `crewai`: To create and orchestrate our AI agents for RAG operations\n",
-    "- `python-dotenv`: For securely managing environment variables and API keys\n",
-    "\n",
-    "These libraries provide the foundation for building a semantic search engine with vector embeddings, \n",
-    "database integration, and agent-based RAG capabilities."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Note: you may need to restart the kernel to use updated packages.\n"
-     ]
-    }
-   ],
-   "source": [
-    "%pip install --quiet datasets==4.1.0 langchain-couchbase==0.4.0 langchain-openai==0.3.33 crewai==0.186.1 python-dotenv==1.1.1 ipywidgets"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "# Importing Necessary Libraries\n",
-    "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, embedding generation, and dataset loading."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 2,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import getpass\n",
-    "import json\n",
-    "import logging\n",
-    "import os\n",
-    "import time\n",
-    "from datetime import timedelta\n",
-    "\n",
-    "from couchbase.auth import PasswordAuthenticator\n",
-    "from couchbase.cluster import Cluster\n",
-    "from couchbase.diagnostics import PingState, ServiceType\n",
-    "from couchbase.exceptions import (InternalServerFailureException,\n",
-    "                                  QueryIndexAlreadyExistsException,\n",
-    "                                  ServiceUnavailableException)\n",
-    "from couchbase.management.buckets import CreateBucketSettings\n",
-    "from couchbase.management.search import SearchIndex\n",
-    "from couchbase.options import ClusterOptions\n",
-    "from datasets import load_dataset\n",
-    "from dotenv import load_dotenv\n",
-    "from crewai.tools import tool\n",
-    "from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore\n",
-    "from langchain_openai import ChatOpenAI, OpenAIEmbeddings\n",
-    "\n",
-    "from crewai import Agent, Crew, Process, Task"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "# Setup Logging\n",
-    "Logging is configured to track the progress of the script and capture any errors or warnings."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 3,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "logging.basicConfig(\n",
-    "    level=logging.INFO,\n",
-    "    format='%(asctime)s [%(levelname)s] %(message)s',\n",
-    "    datefmt='%Y-%m-%d %H:%M:%S'\n",
-    ")\n",
-    "\n",
-    "# Suppress httpx logging\n",
-    "logging.getLogger('httpx').setLevel(logging.CRITICAL)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "# Loading Sensitive Information\n",
-    "In this section, we prompt the user to input the essential configuration settings the notebook needs. These settings include sensitive information like database credentials and specific configuration names. Instead of hardcoding these details into the script, we request the user to provide them at runtime, ensuring flexibility and security.\n",
-    "\n",
-    "The script uses environment variables to store sensitive information, enhancing the overall security and maintainability of your code by avoiding hardcoded values."
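[Editor's note: the folder ships an `.env.sample` you can copy to `.env`. Its exact contents are not reproduced in this patch, but based on the variables the configuration cell below reads, a minimal sketch looks like this; the values shown are simply the defaults that cell falls back to, so substitute your own credentials.]

```env
# Variables read by the configuration cell (see .env.sample in this folder).
# Values below are the notebook's fallback defaults, not production credentials.
OPENAI_API_KEY=sk-your-openai-api-key
CB_HOST=couchbase://localhost
CB_USERNAME=Administrator
CB_PASSWORD=password
CB_BUCKET_NAME=vector-search-testing
INDEX_NAME=vector_search_crew
SCOPE_NAME=shared
COLLECTION_NAME=crew
```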
- ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Configuration loaded successfully\n" - ] - } - ], - "source": [ - "# Load environment variables\n", - "load_dotenv(\"./.env\")\n", - "\n", - "# Configuration\n", - "OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or input(\"Enter your OpenAI API key: \")\n", - "if not OPENAI_API_KEY:\n", - " raise ValueError(\"OPENAI_API_KEY is not set\")\n", - "\n", - "CB_HOST = os.getenv('CB_HOST') or input(\"Enter Couchbase host (default: couchbase://localhost): \") or 'couchbase://localhost'\n", - "CB_USERNAME = os.getenv('CB_USERNAME') or input(\"Enter Couchbase username (default: Administrator): \") or 'Administrator'\n", - "CB_PASSWORD = os.getenv('CB_PASSWORD') or getpass.getpass(\"Enter Couchbase password (default: password): \") or 'password'\n", - "CB_BUCKET_NAME = os.getenv('CB_BUCKET_NAME') or input(\"Enter bucket name (default: vector-search-testing): \") or 'vector-search-testing'\n", - "INDEX_NAME = os.getenv('INDEX_NAME') or input(\"Enter index name (default: vector_search_crew): \") or 'vector_search_crew'\n", - "SCOPE_NAME = os.getenv('SCOPE_NAME') or input(\"Enter scope name (default: shared): \") or 'shared'\n", - "COLLECTION_NAME = os.getenv('COLLECTION_NAME') or input(\"Enter collection name (default: crew): \") or 'crew'\n", - "\n", - "print(\"Configuration loaded successfully\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Connecting to the Couchbase Cluster\n", - "Connecting to a Couchbase cluster is the foundation of our project. Couchbase will serve as our primary data store, handling all the storage and retrieval operations required for our semantic search engine. By establishing this connection, we enable our application to interact with the database, allowing us to perform operations such as storing embeddings, querying data, and managing collections. This connection is the gateway through which all data will flow, so ensuring it's set up correctly is paramount." - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Successfully connected to Couchbase\n" - ] - } - ], - "source": [ - "# Connect to Couchbase\n", - "try:\n", - " auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)\n", - " options = ClusterOptions(auth)\n", - " cluster = Cluster(CB_HOST, options)\n", - " cluster.wait_until_ready(timedelta(seconds=5))\n", - " print(\"Successfully connected to Couchbase\")\n", - "except Exception as e:\n", - " print(f\"Failed to connect to Couchbase: {str(e)}\")\n", - " raise" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Verifying Search Service Availability\n", - " In this section, we verify that the Couchbase Search (FTS) service is available and responding correctly. This is a crucial check because our vector search functionality depends on it. 
If any issues are detected with the Search service, the function will raise an exception, allowing us to catch and handle problems early before attempting vector operations.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Search service is responding at: 18.117.138.157:18094\n", - "Search service check passed successfully\n" - ] - } - ], - "source": [ - "def check_search_service(cluster):\n", - " \"\"\"Verify search service availability using ping\"\"\"\n", - " try:\n", - " # Get ping result\n", - " ping_result = cluster.ping()\n", - " search_available = False\n", - " \n", - " # Check if search service is responding\n", - " for service_type, endpoints in ping_result.endpoints.items():\n", - " if service_type == ServiceType.Search:\n", - " for endpoint in endpoints:\n", - " if endpoint.state == PingState.OK:\n", - " search_available = True\n", - " print(f\"Search service is responding at: {endpoint.remote}\")\n", - " break\n", - " break\n", - "\n", - " if not search_available:\n", - " raise RuntimeError(\"Search/FTS service not found or not responding\")\n", - " \n", - " print(\"Search service check passed successfully\")\n", - " except Exception as e:\n", - " print(f\"Health check failed: {str(e)}\")\n", - " raise\n", - "try:\n", - " check_search_service(cluster)\n", - "except Exception as e:\n", - " print(f\"Failed to check search service: {str(e)}\")\n", - " raise" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Setting Up Collections in Couchbase\n", - "\n", - "The setup_collection() function handles creating and configuring the hierarchical data organization in Couchbase:\n", - "\n", - "1. Bucket Creation:\n", - " - Checks if specified bucket exists, creates it if not\n", - " - Sets bucket properties like RAM quota (1024MB) and replication (disabled)\n", - " - Note: If you are using Capella, create a bucket manually called vector-search-testing(or any name you prefer) with the same properties.\n", - "\n", - "2. Scope Management: \n", - " - Verifies if requested scope exists within bucket\n", - " - Creates new scope if needed (unless it's the default \"_default\" scope)\n", - "\n", - "3. Collection Setup:\n", - " - Checks for collection existence within scope\n", - " - Creates collection if it doesn't exist\n", - " - Waits 2 seconds for collection to be ready\n", - "\n", - "Additional Tasks:\n", - "- Creates primary index on collection for query performance\n", - "- Clears any existing documents for clean state\n", - "- Implements comprehensive error handling and logging\n", - "\n", - "The function is called twice to set up:\n", - "1. Main collection for vector embeddings\n", - "2. Cache collection for storing results\n" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-17 14:34:30 [INFO] Bucket 'vector-search-testing' exists.\n", - "2025-09-17 14:34:32 [INFO] Scope 'shared' does not exist. Creating it...\n", - "2025-09-17 14:34:33 [INFO] Scope 'shared' created successfully.\n", - "2025-09-17 14:34:34 [INFO] Collection 'crew' does not exist. 
Creating it...\n", - "2025-09-17 14:34:36 [INFO] Collection 'crew' created successfully.\n", - "2025-09-17 14:34:41 [INFO] Primary index present or created successfully.\n", - "2025-09-17 14:34:43 [INFO] All documents cleared from the collection.\n" - ] - }, - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 7, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "def setup_collection(cluster, bucket_name, scope_name, collection_name):\n", - " try:\n", - " # Check if bucket exists, create if it doesn't\n", - " try:\n", - " bucket = cluster.bucket(bucket_name)\n", - " logging.info(f\"Bucket '{bucket_name}' exists.\")\n", - " except Exception as e:\n", - " logging.info(f\"Bucket '{bucket_name}' does not exist. Creating it...\")\n", - " bucket_settings = CreateBucketSettings(\n", - " name=bucket_name,\n", - " bucket_type='couchbase',\n", - " ram_quota_mb=1024,\n", - " flush_enabled=True,\n", - " num_replicas=0\n", - " )\n", - " cluster.buckets().create_bucket(bucket_settings)\n", - " time.sleep(2) # Wait for bucket creation to complete and become available\n", - " bucket = cluster.bucket(bucket_name)\n", - " logging.info(f\"Bucket '{bucket_name}' created successfully.\")\n", - "\n", - " bucket_manager = bucket.collections()\n", - "\n", - " # Check if scope exists, create if it doesn't\n", - " scopes = bucket_manager.get_all_scopes()\n", - " scope_exists = any(scope.name == scope_name for scope in scopes)\n", - " \n", - " if not scope_exists and scope_name != \"_default\":\n", - " logging.info(f\"Scope '{scope_name}' does not exist. Creating it...\")\n", - " bucket_manager.create_scope(scope_name)\n", - " logging.info(f\"Scope '{scope_name}' created successfully.\")\n", - "\n", - " # Check if collection exists, create if it doesn't\n", - " collections = bucket_manager.get_all_scopes()\n", - " collection_exists = any(\n", - " scope.name == scope_name and collection_name in [col.name for col in scope.collections]\n", - " for scope in collections\n", - " )\n", - "\n", - " if not collection_exists:\n", - " logging.info(f\"Collection '{collection_name}' does not exist. Creating it...\")\n", - " bucket_manager.create_collection(scope_name, collection_name)\n", - " logging.info(f\"Collection '{collection_name}' created successfully.\")\n", - " else:\n", - " logging.info(f\"Collection '{collection_name}' already exists. Skipping creation.\")\n", - "\n", - " # Wait for collection to be ready\n", - " collection = bucket.scope(scope_name).collection(collection_name)\n", - " time.sleep(2) # Give the collection time to be ready for queries\n", - "\n", - " # Ensure primary index exists\n", - " try:\n", - " cluster.query(f\"CREATE PRIMARY INDEX IF NOT EXISTS ON `{bucket_name}`.`{scope_name}`.`{collection_name}`\").execute()\n", - " logging.info(\"Primary index present or created successfully.\")\n", - " except Exception as e:\n", - " logging.warning(f\"Error creating primary index: {str(e)}\")\n", - "\n", - " # Clear all documents in the collection\n", - " try:\n", - " query = f\"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`\"\n", - " cluster.query(query).execute()\n", - " logging.info(\"All documents cleared from the collection.\")\n", - " except Exception as e:\n", - " logging.warning(f\"Error while clearing documents: {str(e)}. 
The collection might be empty.\")\n", - "\n", - " return collection\n", - " except Exception as e:\n", - " raise RuntimeError(f\"Error setting up collection: {str(e)}\")\n", - " \n", - "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Configuring and Initializing Couchbase Vector Search Index for Semantic Document Retrieval\n", - "\n", - "Semantic search requires an efficient way to retrieve relevant documents based on a user's query. This is where the Couchbase Vector Search Index comes into play. In this step, we load the Vector Search Index definition from a JSON file, which specifies how the index should be structured. This includes the fields to be indexed, the dimensions of the vectors, and other parameters that determine how the search engine processes queries based on vector similarity.\n", - "\n", - "This CrewAI vector search index configuration requires specific default settings to function properly. This tutorial uses the bucket named `vector-search-testing` with the scope `shared` and collection `crew`. The configuration is set up for vectors with exactly `1536 dimensions`, using `dot product` similarity and optimized for `recall`. If you want to use a different bucket, scope, or collection, you will need to modify the index configuration accordingly.\n", - "\n", - "For more information on creating a vector search index, please follow the instructions at [Couchbase Vector Search Documentation](https://docs.couchbase.com/cloud/vector-search/create-vector-search-index-ui.html)." - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [], - "source": [ - "# Load index definition\n", - "try:\n", - " with open('crew_index.json', 'r') as file:\n", - " index_definition = json.load(file)\n", - "except FileNotFoundError as e:\n", - " print(f\"Error: crew_index.json file not found: {str(e)}\")\n", - " raise\n", - "except json.JSONDecodeError as e:\n", - " print(f\"Error: Invalid JSON in crew_index.json: {str(e)}\")\n", - " raise\n", - "except Exception as e:\n", - " print(f\"Error loading index definition: {str(e)}\")\n", - " raise" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Creating or Updating Search Indexes\n", - "\n", - "With the index definition loaded, the next step is to create or update the **Vector Search Index** in Couchbase. This step is crucial because it optimizes our database for vector similarity search operations, allowing us to perform searches based on the semantic content of documents rather than just keywords. By creating or updating a Vector Search Index, we enable our search engine to handle complex queries that involve finding semantically similar documents using vector embeddings, which is essential for a robust semantic search engine." 
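Before running the creation step below, it can help to see what such a definition contains. The following sketch is illustrative only: its structure is an assumption based on the parameters described above (1536 dimensions, dot product similarity, bucket `vector-search-testing`, scope `shared`, collection `crew`), and the `crew_index.json` file shipped with this tutorial remains the source of truth.

```python
# Illustrative sketch of a vector search index definition (assumed structure;
# the crew_index.json provided with this tutorial is authoritative).
example_index_definition = {
    "name": "vector_search_crew",
    "type": "fulltext-index",
    "sourceType": "gocbcore",
    "sourceName": "vector-search-testing",  # bucket name
    "params": {
        "doc_config": {"mode": "scope.collection.type_field"},
        "mapping": {
            "types": {
                "shared.crew": {  # scope.collection
                    "enabled": True,
                    "properties": {
                        "embedding": {  # hypothetical field name for the vector
                            "enabled": True,
                            "fields": [
                                {
                                    "name": "embedding",
                                    "type": "vector",
                                    "dims": 1536,  # must match the embedding model
                                    "similarity": "dot_product",
                                }
                            ],
                        }
                    },
                }
            }
        },
    },
}
```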
- ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-17 14:34:47 [INFO] Creating new index 'vector_search_crew'...\n", - "2025-09-17 14:34:48 [INFO] Index 'vector_search_crew' successfully created/updated.\n" - ] - } - ], - "source": [ - "try:\n", - " scope_index_manager = cluster.bucket(CB_BUCKET_NAME).scope(SCOPE_NAME).search_indexes()\n", - "\n", - " # Check if index already exists\n", - " existing_indexes = scope_index_manager.get_all_indexes()\n", - " index_name = index_definition[\"name\"]\n", - "\n", - " if index_name in [index.name for index in existing_indexes]:\n", - " logging.info(f\"Index '{index_name}' found\")\n", - " else:\n", - " logging.info(f\"Creating new index '{index_name}'...\")\n", - "\n", - " # Create SearchIndex object from JSON definition\n", - " search_index = SearchIndex.from_json(index_definition)\n", - "\n", - " # Upsert the index (create if not exists, update if exists)\n", - " scope_index_manager.upsert_index(search_index)\n", - " logging.info(f\"Index '{index_name}' successfully created/updated.\")\n", - "\n", - "except QueryIndexAlreadyExistsException:\n", - " logging.info(f\"Index '{index_name}' already exists. Skipping creation/update.\")\n", - "except ServiceUnavailableException:\n", - " raise RuntimeError(\"Search service is not available. Please ensure the Search service is enabled in your Couchbase cluster.\")\n", - "except InternalServerFailureException as e:\n", - " logging.error(f\"Internal server error: {str(e)}\")\n", - " raise" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Setting Up OpenAI Components\n", - "\n", - "This section initializes two key OpenAI components needed for our RAG system:\n", - "\n", - "1. OpenAI Embeddings:\n", - " - Uses the 'text-embedding-3-small' model\n", - " - Converts text into high-dimensional vector representations (embeddings)\n", - " - These embeddings enable semantic search by capturing the meaning of text\n", - " - Required for vector similarity search in Couchbase\n", - "\n", - "2. ChatOpenAI Language Model:\n", - " - Uses the 'gpt-4o' model\n", - " - Temperature set to 0.2 for balanced creativity and focus\n", - " - Serves as the cognitive engine for CrewAI agents\n", - " - Powers agent reasoning, decision-making, and task execution\n", - " - Enables agents to:\n", - " - Process and understand retrieved context from vector search\n", - " - Generate thoughtful responses based on that context\n", - " - Follow instructions defined in agent roles and goals\n", - " - Collaborate with other agents in the crew\n", - " - The relatively low temperature (0.2) ensures agents produce reliable,\n", - " consistent outputs while maintaining some creative problem-solving ability\n", - "\n", - "Both components require a valid OpenAI API key (OPENAI_API_KEY) for authentication.\n", - "In the CrewAI framework, the LLM acts as the \"brain\" for each agent, allowing them\n", - "to interpret tasks, retrieve relevant information via the RAG system, and generate\n", - "appropriate outputs based on their specialized roles and expertise." 
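Once the components are initialized in the next cell, a quick sanity check can confirm that the embedding model's output dimension matches the 1536 dimensions expected by the vector index. A minimal sketch, assuming a valid OpenAI key and network access:

```python
# Optional sanity check: the embedding dimension must match the vector index
# definition (1536 for text-embedding-3-small), otherwise searches will fail.
sample_vector = embeddings.embed_query("dimension check")
assert len(sample_vector) == 1536, (
    f"Embedding dimension {len(sample_vector)} does not match the index (1536)"
)
print(f"Embedding dimension verified: {len(sample_vector)}")
```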
- ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "OpenAI components initialized\n" - ] - } - ], - "source": [ - "# Initialize OpenAI components\n", - "embeddings = OpenAIEmbeddings(\n", - " openai_api_key=OPENAI_API_KEY,\n", - " model=\"text-embedding-3-small\"\n", - ")\n", - "\n", - "llm = ChatOpenAI(\n", - " openai_api_key=OPENAI_API_KEY,\n", - " model=\"gpt-4o\",\n", - " temperature=0.2\n", - ")\n", - "\n", - "print(\"OpenAI components initialized\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Setting Up the Couchbase Vector Store\n", - "A vector store is where we'll keep our embeddings. The Search index we created earlier defines how those embeddings are indexed; the vector store is the interface our application uses to write embeddings into Couchbase and to run similarity searches against them. When a user inputs a query, the search engine converts the query into an embedding and compares it against the embeddings stored in the vector store. This allows the engine to find documents that are semantically similar to the query, even if they don't contain the exact same words. By setting up the vector store in Couchbase, we create a powerful tool that enables our search engine to understand and retrieve information based on the meaning and context of the query, rather than just the specific words used." - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Vector store initialized\n" - ] - } - ], - "source": [ - "# Setup vector store\n", - "vector_store = CouchbaseSearchVectorStore(\n", - " cluster=cluster,\n", - " bucket_name=CB_BUCKET_NAME,\n", - " scope_name=SCOPE_NAME,\n", - " collection_name=COLLECTION_NAME,\n", - " embedding=embeddings,\n", - " index_name=INDEX_NAME,\n", - ")\n", - "print(\"Vector store initialized\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Load the BBC News Dataset\n", - "To build a search engine, we need data to search through. We use the BBC News dataset from RealTimeData, which provides real-world news articles. This dataset contains news articles from BBC covering various topics and time periods. Loading the dataset is a crucial step because it provides the raw material that our search engine will work with. The quality and diversity of the news articles make it an excellent choice for testing and refining our search engine, ensuring it can handle real-world news content effectively.\n", - "\n", - "The BBC News dataset allows us to work with authentic news articles, enabling us to build and test a search engine that can effectively process and retrieve relevant news content. The dataset is loaded using the Hugging Face datasets library, specifically accessing the \"RealTimeData/bbc_news_alltime\" dataset with the \"2024-12\" version."
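After the dataset loads in the next cell, it can be useful to inspect its structure before ingestion. Only the `content` field is relied on later in this tutorial; the other column names are version-dependent, so the sketch below avoids assuming them:

```python
# Peek at the dataset structure; only the "content" column is used below.
print(news_dataset.column_names)               # available fields in this version
print(str(news_dataset[0]["content"])[:300])   # first article, truncated
```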
- ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-17 14:35:10 [INFO] Successfully loaded the BBC News dataset with 2687 rows.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Loaded the BBC News dataset with 2687 rows\n" - ] - } - ], - "source": [ - "try:\n", - " news_dataset = load_dataset(\n", - " \"RealTimeData/bbc_news_alltime\", \"2024-12\", split=\"train\"\n", - " )\n", - " print(f\"Loaded the BBC News dataset with {len(news_dataset)} rows\")\n", - " logging.info(f\"Successfully loaded the BBC News dataset with {len(news_dataset)} rows.\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Error loading the BBC News dataset: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Cleaning up the Data\n", - "We will use the content of the news articles for our RAG system.\n", - "\n", - "The dataset contains a few duplicate records. We are removing them to avoid duplicate results in the retrieval stage of our RAG system." - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "We have 1749 unique articles in our database.\n" - ] - } - ], - "source": [ - "news_articles = news_dataset[\"content\"]\n", - "unique_articles = set()\n", - "for article in news_articles:\n", - " if article:\n", - " unique_articles.add(article)\n", - "unique_news_articles = list(unique_articles)\n", - "print(f\"We have {len(unique_news_articles)} unique articles in our database.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Saving Data to the Vector Store\n", - "To efficiently handle the large number of articles, we process them in batches of 50 articles at a time. This batch processing approach helps manage memory usage and provides better control over the ingestion process.\n", - "\n", - "We first filter out any articles that exceed 50,000 characters to avoid potential issues with token limits. Then, using the vector store's add_texts method, we add the filtered articles to our vector database. The batch_size parameter controls how many articles are processed in each iteration.\n", - "\n", - "This approach offers several benefits:\n", - "1. Memory Efficiency: Processing in smaller batches prevents memory overload\n", - "2. Error Handling: If an error occurs, only the current batch is affected\n", - "3. Progress Tracking: Easier to monitor and track the ingestion progress\n", - "4. 
Resource Management: Better control over CPU and network resource utilization\n", - "\n", - "We use a conservative batch size of 50 to ensure reliable operation.\n", - "The optimal batch size depends on many factors including:\n", - "- Document sizes being inserted\n", - "- Available system resources\n", - "- Network conditions\n", - "- Concurrent workload\n", - "\n", - "Consider measuring performance with your specific workload before adjusting.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-17 14:36:58 [INFO] Document ingestion completed successfully.\n" - ] - } - ], - "source": [ - "batch_size = 50\n", - "\n", - "# Automatic Batch Processing\n", - "articles = [article for article in unique_news_articles if article and len(article) <= 50000]\n", - "\n", - "try:\n", - " vector_store.add_texts(\n", - " texts=articles,\n", - " batch_size=batch_size\n", - " )\n", - " logging.info(\"Document ingestion completed successfully.\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Failed to save documents to vector store: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Creating a Vector Search Tool\n", - "After loading our data into the vector store, we need to create a tool that can efficiently search through these vector embeddings. This involves two key components:\n", - "\n", - "### Vector Retriever\n", - "The vector retriever is configured to perform similarity searches. This creates a retriever that performs semantic similarity searches against our vector database. The similarity search finds documents whose vector embeddings are closest to the query's embedding in the vector space.\n", - "\n", - "### Search Tool\n", - "The search tool wraps the retriever in a user-friendly interface that:\n", - "- Takes a query string as input\n", - "- Passes the query to the retriever to find relevant documents\n", - "- Formats the results with clear document separation using document numbers and dividers\n", - "- Returns the formatted results as a single string with each document clearly delineated\n", - "\n", - "The tool is designed to integrate seamlessly with our AI agents, providing them with reliable access to our knowledge base through vector similarity search. 
The `@tool`-decorated function accepts the query as a plain string, which is how CrewAI agents pass tool input, so no extra parsing of structured query objects is required.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [], - "source": [ - "# Create vector retriever\n", - "retriever = vector_store.as_retriever(\n", - " search_type=\"similarity\",\n", - ")\n", - "\n", - "# Define the search tool using the @tool decorator\n", - "@tool(\"vector_search\")\n", - "def search_tool(query: str) -> str:\n", - " \"\"\"Search for relevant documents using vector similarity.\n", - " Input should be a simple text query string.\n", - " Returns a list of relevant document contents.\n", - " Use this tool to find detailed information about topics.\"\"\"\n", - " # CrewAI passes the query as a plain string (per the tool's signature),\n", - " # so no additional input coercion is needed before retrieval.\n", - "\n", - " # Invoke the retriever\n", - " docs = retriever.invoke(query)\n", - "\n", - " # Format the results\n", - " formatted_docs = \"\\n\\n\".join([\n", - " f\"Document {i+1}:\\n{'-'*40}\\n{doc.page_content}\"\n", - " for i, doc in enumerate(docs)\n", - " ])\n", - " return formatted_docs" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Creating CrewAI Agents\n", - "\n", - "We'll create two specialized AI agents using the CrewAI framework to handle different aspects of our information retrieval and analysis system:\n", - "\n", - "## Research Expert Agent\n", - "This agent is designed to:\n", - "- Execute semantic searches using our vector store\n", - "- Analyze and evaluate search results \n", - "- Identify key information and insights\n", - "- Verify facts across multiple sources\n", - "- Synthesize findings into comprehensive research summaries\n", - "\n", - "## Technical Writer Agent \n", - "This agent is responsible for:\n", - "- Taking research findings and structuring them logically\n", - "- Converting technical concepts into clear explanations\n", - "- Ensuring proper citation and attribution\n", - "- Maintaining engaging yet informative tone\n", - "- Producing well-formatted final outputs\n", - "\n", - "The agents work together in a coordinated way:\n", - "1. Research agent finds and analyzes relevant documents\n", - "2. Writer agent takes those findings and crafts polished responses\n", - "3. 
Both agents use a custom response template for consistent output\n", - "\n", - "This multi-agent approach allows us to:\n", - "- Leverage specialized expertise for different tasks\n", - "- Maintain high quality through separation of concerns\n", - "- Create more comprehensive and reliable outputs\n", - "- Scale the system's capabilities efficiently" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Agents created successfully\n" - ] - } - ], - "source": [ - "# Custom response template\n", - "response_template = \"\"\"\n", - "Analysis Results\n", - "===============\n", - "{%- if .Response %}\n", - "{{ .Response }}\n", - "{%- endif %}\n", - "\n", - "Sources\n", - "=======\n", - "{%- for tool in .Tools %}\n", - "* {{ tool.name }}\n", - "{%- endfor %}\n", - "\n", - "Metadata\n", - "========\n", - "* Confidence: {{ .Confidence }}\n", - "* Analysis Time: {{ .ExecutionTime }}\n", - "\"\"\"\n", - "\n", - "# Create research agent\n", - "researcher = Agent(\n", - " role='Research Expert',\n", - " goal='Find and analyze the most relevant documents to answer user queries accurately',\n", - " backstory=\"\"\"You are an expert researcher with deep knowledge in information retrieval \n", - " and analysis. Your expertise lies in finding, evaluating, and synthesizing information \n", - " from various sources. You have a keen eye for detail and can identify key insights \n", - " from complex documents. You always verify information across multiple sources and \n", - " provide comprehensive, accurate analyses.\"\"\",\n", - " tools=[search_tool],\n", - " llm=llm,\n", - " verbose=True,\n", - " memory=True,\n", - " allow_delegation=False,\n", - " response_template=response_template\n", - ")\n", - "\n", - "# Create writer agent\n", - "writer = Agent(\n", - " role='Technical Writer',\n", - " goal='Generate clear, accurate, and well-structured responses based on research findings',\n", - " backstory=\"\"\"You are a skilled technical writer with expertise in making complex \n", - " information accessible and engaging. You excel at organizing information logically, \n", - " explaining technical concepts clearly, and creating well-structured documents. You \n", - " ensure all information is properly cited, accurate, and presented in a user-friendly \n", - " manner. You have a talent for maintaining the reader's interest while conveying \n", - " detailed technical information.\"\"\",\n", - " llm=llm,\n", - " verbose=True,\n", - " memory=True,\n", - " allow_delegation=False,\n", - " response_template=response_template\n", - ")\n", - "\n", - "print(\"Agents created successfully\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## How CrewAI Agents Work in this RAG System\n", - "\n", - "### Agent-Based RAG Architecture\n", - "\n", - "This system uses a two-agent approach to implement Retrieval-Augmented Generation (RAG):\n", - "\n", - "1. **Research Expert Agent**:\n", - " - Receives the user query\n", - " - Uses the vector search tool to retrieve relevant documents from Couchbase\n", - " - Analyzes and synthesizes information from retrieved documents\n", - " - Produces a comprehensive research summary with key findings\n", - "\n", - "2. 
**Technical Writer Agent**:\n", - " - Takes the research summary as input\n", - " - Structures and formats the information\n", - " - Creates a polished, user-friendly response\n", - " - Ensures proper attribution and citation\n", - "\n", - "#### How the Process Works:\n", - "\n", - "1. **Query Processing**: User query is passed to the Research Agent\n", - "2. **Vector Search**: Query is converted to embeddings and matched against document vectors\n", - "3. **Document Retrieval**: Most similar documents are retrieved from Couchbase\n", - "4. **Analysis**: Research Agent analyzes documents for relevance and extracts key information\n", - "5. **Synthesis**: Research Agent combines findings into a coherent summary\n", - "6. **Refinement**: Writer Agent restructures and enhances the content\n", - "7. **Response Generation**: Final polished response is returned to the user\n", - "\n", - "This multi-agent approach separates concerns (research vs. writing) and leverages\n", - "specialized expertise for each task, resulting in higher quality responses.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Testing the Search System\n", - "\n", - "Test the system with some example queries." - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "metadata": {}, - "outputs": [], - "source": [ - "def process_query(query, researcher, writer):\n", - " \"\"\"\n", - " Test the complete RAG system with a user query.\n", - " \n", - " This function tests both the vector search capability and the agent-based processing:\n", - " 1. Vector search: Retrieves relevant documents from Couchbase\n", - " 2. Agent processing: Uses CrewAI agents to analyze and format the response\n", - " \n", - " The function measures performance and displays detailed outputs from each step.\n", - " \"\"\"\n", - " print(f\"\\nQuery: {query}\")\n", - " print(\"-\" * 80)\n", - " \n", - " # Create tasks\n", - " research_task = Task(\n", - " description=f\"Research and analyze information relevant to: {query}\",\n", - " agent=researcher,\n", - " expected_output=\"A detailed analysis with key findings and supporting evidence\"\n", - " )\n", - " \n", - " writing_task = Task(\n", - " description=\"Create a comprehensive and well-structured response\",\n", - " agent=writer,\n", - " expected_output=\"A clear, comprehensive response that answers the query\",\n", - " context=[research_task]\n", - " )\n", - " \n", - " # Create and execute crew\n", - " crew = Crew(\n", - " agents=[researcher, writer],\n", - " tasks=[research_task, writing_task],\n", - " process=Process.sequential,\n", - " verbose=True,\n", - " cache=True,\n", - " planning=True\n", - " )\n", - " \n", - " try:\n", - " start_time = time.time()\n", - " result = crew.kickoff()\n", - " elapsed_time = time.time() - start_time\n", - " \n", - " print(f\"\\nQuery completed in {elapsed_time:.2f} seconds\")\n", - " print(\"=\" * 80)\n", - " print(\"RESPONSE\")\n", - " print(\"=\" * 80)\n", - " print(result)\n", - " \n", - " if hasattr(result, 'tasks_output'):\n", - " print(\"\\n\" + \"=\" * 80)\n", - " print(\"DETAILED TASK OUTPUTS\")\n", - " print(\"=\" * 80)\n", - " for task_output in result.tasks_output:\n", - " print(f\"\\nTask: {task_output.description[:100]}...\")\n", - " print(\"-\" * 40)\n", - " print(f\"Output: {task_output.raw}\")\n", - " print(\"-\" * 40)\n", - " except Exception as e:\n", - " print(f\"Error executing crew: {str(e)}\")\n", - " logging.error(f\"Crew execution failed: {str(e)}\", exc_info=True)" - ] - }, - { - "cell_type": "code", - 
"execution_count": 19, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Query: What are the key details about the FA Cup third round draw? Include information about Manchester United vs Arsenal, Tamworth vs Tottenham, and other notable fixtures.\n", - "--------------------------------------------------------------------------------\n" - ] - }, - { - "data": { - "text/html": [ - "
╭──────────────────────────────────────────── Crew Execution Started ─────────────────────────────────────────────╮\n",
-              "                                                                                                                 \n",
-              "  Crew Execution Started                                                                                         \n",
-              "  Name: crew                                                                                                     \n",
-              "  ID: 02c49af6-ffe5-4bea-8cba-f3f08049625d                                                                       \n",
-              "  Tool Args:                                                                                                     \n",
-              "                                                                                                                 \n",
-              "                                                                                                                 \n",
-              "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
-              "
\n" - ], - "text/plain": [ - "\u001b[36m╭─\u001b[0m\u001b[36m───────────────────────────────────────────\u001b[0m\u001b[36m Crew Execution Started \u001b[0m\u001b[36m────────────────────────────────────────────\u001b[0m\u001b[36m─╮\u001b[0m\n", - "\u001b[36m│\u001b[0m \u001b[36m│\u001b[0m\n", - "\u001b[36m│\u001b[0m \u001b[1;36mCrew Execution Started\u001b[0m \u001b[36m│\u001b[0m\n", - "\u001b[36m│\u001b[0m \u001b[37mName: \u001b[0m\u001b[36mcrew\u001b[0m \u001b[36m│\u001b[0m\n", - "\u001b[36m│\u001b[0m \u001b[37mID: \u001b[0m\u001b[36m02c49af6-ffe5-4bea-8cba-f3f08049625d\u001b[0m \u001b[36m│\u001b[0m\n", - "\u001b[36m│\u001b[0m \u001b[37mTool Args: \u001b[0m \u001b[36m│\u001b[0m\n", - "\u001b[36m│\u001b[0m \u001b[36m│\u001b[0m\n", - "\u001b[36m│\u001b[0m \u001b[36m│\u001b[0m\n", - "\u001b[36m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n",
-              "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\u001b[1m\u001b[93m \n", - "[2025-09-17 14:36:58][INFO]: Planning the crew execution\u001b[00m\n", - "[EventBus Error] Handler 'on_task_started' failed for event 'TaskStartedEvent': 'NoneType' object has no attribute 'key'\n" - ] - }, - { - "data": { - "text/html": [ - "
\n"
-            ],
-            "text/plain": []
-          },
-          "metadata": {},
-          "output_type": "display_data"
-        },
-        {
-          "data": {
-            "text/html": [
-              "
╭──────────────────────────────────────────────── Task Completion ────────────────────────────────────────────────╮\n",
-              "                                                                                                                 \n",
-              "  Task Completed                                                                                                 \n",
-              "  Name: 5d4df0c5-14ad-47d7-8412-2cb8438a65df                                                                     \n",
-              "  Agent: Task Execution Planner                                                                                  \n",
-              "  Tool Args:                                                                                                     \n",
-              "                                                                                                                 \n",
-              "                                                                                                                 \n",
-              "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
-              "
\n" - ], - "text/plain": [ - "\u001b[32m╭─\u001b[0m\u001b[32m───────────────────────────────────────────────\u001b[0m\u001b[32m Task Completion \u001b[0m\u001b[32m───────────────────────────────────────────────\u001b[0m\u001b[32m─╮\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[1;32mTask Completed\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37mName: \u001b[0m\u001b[32m5d4df0c5-14ad-47d7-8412-2cb8438a65df\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37mAgent: \u001b[0m\u001b[32mTask Execution Planner\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37mTool Args: \u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n",
-              "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n"
-            ],
-            "text/plain": []
-          },
-          "metadata": {},
-          "output_type": "display_data"
-        },
-        {
-          "data": {
-            "text/html": [
-              "
\n",
-              "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n"
-            ],
-            "text/plain": []
-          },
-          "metadata": {},
-          "output_type": "display_data"
-        },
-        {
-          "data": {
-            "text/html": [
-              "
╭──────────────────────────────────────────── 🔧 Agent Tool Execution ────────────────────────────────────────────╮\n",
-              "                                                                                                                 \n",
-              "  Agent: Research Expert                                                                                         \n",
-              "                                                                                                                 \n",
-              "  Thought: Thought: To gather detailed information about the FA Cup third round draw, specifically focusing on   \n",
-              "  the matches Manchester United vs Arsenal and Tamworth vs Tottenham, I will perform a vector search using a     \n",
-              "  relevant query.                                                                                                \n",
-              "                                                                                                                 \n",
-              "  Using Tool: vector_search                                                                                      \n",
-              "                                                                                                                 \n",
-              "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
-              "
\n" - ], - "text/plain": [ - "\u001b[35m╭─\u001b[0m\u001b[35m───────────────────────────────────────────\u001b[0m\u001b[35m 🔧 Agent Tool Execution \u001b[0m\u001b[35m───────────────────────────────────────────\u001b[0m\u001b[35m─╮\u001b[0m\n", - "\u001b[35m│\u001b[0m \u001b[35m│\u001b[0m\n", - "\u001b[35m│\u001b[0m \u001b[37mAgent: \u001b[0m\u001b[1;92mResearch Expert\u001b[0m \u001b[35m│\u001b[0m\n", - "\u001b[35m│\u001b[0m \u001b[35m│\u001b[0m\n", - "\u001b[35m│\u001b[0m \u001b[37mThought: \u001b[0m\u001b[92mThought: To gather detailed information about the FA Cup third round draw, specifically focusing on \u001b[0m \u001b[35m│\u001b[0m\n", - "\u001b[35m│\u001b[0m \u001b[92mthe matches Manchester United vs Arsenal and Tamworth vs Tottenham, I will perform a vector search using a \u001b[0m \u001b[35m│\u001b[0m\n", - "\u001b[35m│\u001b[0m \u001b[92mrelevant query.\u001b[0m \u001b[35m│\u001b[0m\n", - "\u001b[35m│\u001b[0m \u001b[35m│\u001b[0m\n", - "\u001b[35m│\u001b[0m \u001b[37mUsing Tool: \u001b[0m\u001b[1;92mvector_search\u001b[0m \u001b[35m│\u001b[0m\n", - "\u001b[35m│\u001b[0m \u001b[35m│\u001b[0m\n", - "\u001b[35m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
╭────────────────────────────────────────────────── Tool Input ───────────────────────────────────────────────────╮\n",
-              "                                                                                                                 \n",
-              "  \"{\\\"query\\\": \\\"FA Cup third round draw Manchester United vs Arsenal Tamworth vs Tottenham\\\"}\"                  \n",
-              "                                                                                                                 \n",
-              "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
-              "
\n" - ], - "text/plain": [ - "\u001b[34m╭─\u001b[0m\u001b[34m─────────────────────────────────────────────────\u001b[0m\u001b[34m Tool Input \u001b[0m\u001b[34m──────────────────────────────────────────────────\u001b[0m\u001b[34m─╮\u001b[0m\n", - "\u001b[34m│\u001b[0m \u001b[34m│\u001b[0m\n", - "\u001b[34m│\u001b[0m \u001b[38;2;230;219;116;49m\"{\\\"query\\\": \\\"FA Cup third round draw Manchester United vs Arsenal Tamworth vs Tottenham\\\"}\"\u001b[0m \u001b[34m│\u001b[0m\n", - "\u001b[34m│\u001b[0m \u001b[34m│\u001b[0m\n", - "\u001b[34m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n",
-              "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n"
-            ],
-            "text/plain": []
-          },
-          "metadata": {},
-          "output_type": "display_data"
-        },
-        {
-          "data": {
-            "text/html": [
-              "
\n",
-              "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n"
-            ],
-            "text/plain": []
-          },
-          "metadata": {},
-          "output_type": "display_data"
-        },
-        {
-          "data": {
-            "text/html": [
-              "
╭──────────────────────────────────────────────── Task Completion ────────────────────────────────────────────────╮\n",
-              "                                                                                                                 \n",
-              "  Task Completed                                                                                                 \n",
-              "  Name: d883be8b-ac2a-4678-80b3-afdc803bd716                                                                     \n",
-              "  Agent: Research Expert                                                                                         \n",
-              "  Tool Args:                                                                                                     \n",
-              "                                                                                                                 \n",
-              "                                                                                                                 \n",
-              "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
-              "
\n" - ], - "text/plain": [ - "\u001b[32m╭─\u001b[0m\u001b[32m───────────────────────────────────────────────\u001b[0m\u001b[32m Task Completion \u001b[0m\u001b[32m───────────────────────────────────────────────\u001b[0m\u001b[32m─╮\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[1;32mTask Completed\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37mName: \u001b[0m\u001b[32md883be8b-ac2a-4678-80b3-afdc803bd716\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37mAgent: \u001b[0m\u001b[32mResearch Expert\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37mTool Args: \u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n",
-              "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n"
-            ],
-            "text/plain": []
-          },
-          "metadata": {},
-          "output_type": "display_data"
-        },
-        {
-          "data": {
-            "text/html": [
-              "
\n",
-              "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n"
-            ],
-            "text/plain": []
-          },
-          "metadata": {},
-          "output_type": "display_data"
-        },
-        {
-          "data": {
-            "text/html": [
-              "
\n",
-              "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n"
-            ],
-            "text/plain": []
-          },
-          "metadata": {},
-          "output_type": "display_data"
-        },
-        {
-          "data": {
-            "text/html": [
-              "
╭──────────────────────────────────────────────── Task Completion ────────────────────────────────────────────────╮\n",
-              "                                                                                                                 \n",
-              "  Task Completed                                                                                                 \n",
-              "  Name: 674a305d-1a6f-4b60-9497-ff4140f0f473                                                                     \n",
-              "  Agent: Technical Writer                                                                                        \n",
-              "  Tool Args:                                                                                                     \n",
-              "                                                                                                                 \n",
-              "                                                                                                                 \n",
-              "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
-              "
\n" - ], - "text/plain": [ - "\u001b[32m╭─\u001b[0m\u001b[32m───────────────────────────────────────────────\u001b[0m\u001b[32m Task Completion \u001b[0m\u001b[32m───────────────────────────────────────────────\u001b[0m\u001b[32m─╮\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[1;32mTask Completed\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37mName: \u001b[0m\u001b[32m674a305d-1a6f-4b60-9497-ff4140f0f473\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37mAgent: \u001b[0m\u001b[32mTechnical Writer\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[37mTool Args: \u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\n", - "\u001b[32m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n",
-              "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n",
-              "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Query completed in 38.89 seconds\n", - "================================================================================\n", - "RESPONSE\n", - "================================================================================\n", - "**FA Cup Third Round Draw: A Comprehensive Overview**\n", - "\n", - "The FA Cup third round draw is a pivotal moment in the English football calendar, marking the entry of Premier League and Championship clubs into the competition. This stage often brings thrilling encounters and the potential for giant-killing acts, capturing the imagination of fans worldwide. The significance of the third round is underscored by the rich history and tradition of the FA Cup, the world's oldest national football competition.\n", - "\n", - "**Manchester United vs Arsenal**\n", - "\n", - "One of the standout fixtures of the third round is the clash between Manchester United and Arsenal. This match is set to take place over the weekend of Saturday, 11 January. Manchester United, the current holders of the FA Cup, will travel to face Arsenal, who have won the competition a record 14 times. The match is significant as it involves two of the most successful clubs in FA Cup history, both known for their storied pasts and passionate fanbases.\n", - "\n", - "- **Date and Venue:** Weekend of Saturday, 11 January, at Arsenal's home ground.\n", - "- **Team Statistics:** Manchester United have lifted the FA Cup 13 times, while Arsenal hold the record with 14 victories.\n", - "- **Recent Form:** Manchester United recently triumphed over Manchester City to claim their 13th FA Cup title, showcasing their competitive edge.\n", - "- **Predictions and Insights:** Given the historical rivalry and the stakes involved, this fixture promises to be a fiercely contested battle, with both teams eager to progress further in the tournament.\n", - "\n", - "**Tamworth vs Tottenham**\n", - "\n", - "Another intriguing fixture is the match between non-league side Tamworth and Premier League club Tottenham Hotspur. Tamworth, one of only two non-league clubs remaining in the competition, will host Spurs, highlighting the classic \"David vs Goliath\" narrative that the FA Cup is renowned for.\n", - "\n", - "- **Date and Venue:** To be played at Tamworth's home ground over the weekend of Saturday, 11 January.\n", - "- **Team Statistics:** Tamworth is the lowest-ranked team remaining in the competition, while Tottenham is a well-established Premier League club.\n", - "- **Recent Form:** Tamworth secured their place in the third round with a dramatic penalty shootout victory against League One side Burton Albion.\n", - "\n", - "... (output truncated for brevity)\n" - ] - } - ], - "source": [ - "# Disable logging before running the query\n", - "logging.disable(logging.CRITICAL)\n", - "\n", - "query = \"What are the key details about the FA Cup third round draw? Include information about Manchester United vs Arsenal, Tamworth vs Tottenham, and other notable fixtures.\"\n", - "process_query(query, researcher, writer)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Conclusion\n", - "By following these steps, you've built a powerful RAG system that combines Couchbase's vector storage capabilities with CrewAI's agent-based architecture. 
This multi-agent approach separates research and writing concerns, resulting in higher quality responses to user queries.\n", - "\n", - "The system demonstrates several key advantages:\n", - "1. Efficient vector search using Couchbase's vector store\n", - "2. Specialized AI agents that focus on different aspects of the RAG pipeline\n", - "3. Collaborative workflow between agents to produce comprehensive, well-structured responses\n", - "4. Scalable architecture that can be extended with additional agents for more complex tasks\n", - "\n", - "Whether you're building a customer support system, a research assistant, or a knowledge management solution, this agent-based RAG approach provides a flexible foundation that can be adapted to various use cases and domains." - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": ".venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.13.7" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/crewai/gsi/RAG_with_Couchbase_and_CrewAI.ipynb b/crewai/gsi/RAG_with_Couchbase_and_CrewAI.ipynb deleted file mode 100644 index ddd66fe4..00000000 --- a/crewai/gsi/RAG_with_Couchbase_and_CrewAI.ipynb +++ /dev/null @@ -1,1578 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "82d610e0", - "metadata": {}, - "source": [ - "# Agent-Based RAG with Couchbase GSI Vector Search and CrewAI" - ] - }, - { - "cell_type": "markdown", - "id": "a3073978", - "metadata": {}, - "source": [ - "## Overview" - ] - }, - { - "cell_type": "markdown", - "id": "7e91202c", - "metadata": {}, - "source": [ - "In this guide, we will walk you through building a powerful semantic search engine using [Couchbase](https://www.couchbase.com) as the backend database and [CrewAI](https://github.com/crewAIInc/crewAI) for agent-based RAG operations. CrewAI allows us to create specialized agents that can work together to handle different aspects of the RAG workflow, from document retrieval to response generation. This tutorial uses Couchbase's **Global Secondary Index (GSI)** vector search capabilities, which offer high-performance vector search optimized for large-scale applications. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system from scratch. Alternatively, if you want to perform semantic search using the FTS index, please take a look at [this tutorial](https://developer.couchbase.com/tutorial-crewai-couchbase-rag-with-fts/)." - ] - }, - { - "cell_type": "markdown", - "id": "255f3178", - "metadata": {}, - "source": [ - "## How to Run This Tutorial" - ] - }, - { - "cell_type": "markdown", - "id": "4e84bba4", - "metadata": {}, - "source": [ - "This tutorial is available as a Jupyter Notebook (.ipynb file) that you can run interactively. 
You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/crewai/gsi/RAG_with_Couchbase_and_CrewAI.ipynb).\n", - "\n", - "You can either:\n", - "- Download the notebook file and run it on [Google Colab](https://colab.research.google.com)\n", - "- Run it on your system by setting up the Python environment" - ] - }, - { - "cell_type": "markdown", - "id": "202801ea", - "metadata": {}, - "source": [ - "## Prerequisites" - ] - }, - { - "cell_type": "markdown", - "id": "55bb6aae", - "metadata": {}, - "source": [ - "### Couchbase Requirements" - ] - }, - { - "cell_type": "markdown", - "id": "d318f572", - "metadata": {}, - "source": [ - "1. Create and Deploy Your Free Tier Operational cluster on [Capella](https://cloud.couchbase.com/sign-up)\n", - " - To get started with [Couchbase Capella](https://cloud.couchbase.com), create an account and use it to deploy a free tier operational cluster\n", - " - This account provides you with an environment where you can explore and learn about Capella\n", - " - To learn more, please follow the [Getting Started Guide](https://docs.couchbase.com/cloud/get-started/create-account.html)\n", - " - **Important**: This tutorial requires Couchbase Server **8.0+** for GSI vector search capabilities" - ] - }, - { - "cell_type": "markdown", - "id": "474a6a23", - "metadata": {}, - "source": [ - "### Couchbase Capella Configuration" - ] - }, - { - "cell_type": "markdown", - "id": "e07c1ff4", - "metadata": {}, - "source": [ - "When running Couchbase using Capella, the following prerequisites need to be met:\n", - "- Create the database credentials to access the required bucket (Read and Write) used in the application\n", - "- Allow access to the Cluster from the IP on which the application is running by following the [Network Security documentation](https://docs.couchbase.com/cloud/security/security.html#public-access)" - ] - }, - { - "cell_type": "markdown", - "id": "b223faba", - "metadata": {}, - "source": [ - "## Setup and Installation" - ] - }, - { - "cell_type": "markdown", - "id": "81251293", - "metadata": {}, - "source": [ - "### Installing Necessary Libraries" - ] - }, - { - "cell_type": "markdown", - "id": "21e51e49", - "metadata": {}, - "source": [ - "We'll install the following key libraries:\n", - "- `datasets`: For loading and managing our training data\n", - "- `langchain-couchbase`: To integrate Couchbase with LangChain for GSI vector storage and caching\n", - "- `langchain-openai`: For accessing OpenAI's embedding and chat models\n", - "- `crewai`: To create and orchestrate our AI agents for RAG operations\n", - "- `python-dotenv`: For securely managing environment variables and API keys\n", - "\n", - "These libraries provide the foundation for building a semantic search engine with GSI vector embeddings, database integration, and agent-based RAG capabilities." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a666ce8b", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Note: you may need to restart the kernel to use updated packages.\n" - ] - } - ], - "source": [ - "%pip install --quiet datasets==4.1.0 langchain-couchbase==0.5.0 langchain-openai==0.3.33 crewai==0.186.1 python-dotenv==1.1.1" - ] - }, - { - "cell_type": "markdown", - "id": "e5d980e7", - "metadata": {}, - "source": [ - "### Import Required Modules" - ] - }, - { - "cell_type": "markdown", - "id": "94b2e73b", - "metadata": {}, - "source": [ - "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, embedding generation, and dataset loading." - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "5d013a55", - "metadata": {}, - "outputs": [], - "source": [ - "import getpass\n", - "import json\n", - "import logging\n", - "import os\n", - "import time\n", - "from datetime import timedelta\n", - "from uuid import uuid4\n", - "\n", - "from couchbase.auth import PasswordAuthenticator\n", - "from couchbase.cluster import Cluster\n", - "from couchbase.diagnostics import PingState, ServiceType\n", - "from couchbase.exceptions import (InternalServerFailureException,\n", - " QueryIndexAlreadyExistsException,\n", - " ServiceUnavailableException,\n", - " CouchbaseException)\n", - "from couchbase.management.buckets import CreateBucketSettings\n", - "from couchbase.options import ClusterOptions\n", - "from datasets import load_dataset\n", - "from dotenv import load_dotenv\n", - "from crewai.tools import tool\n", - "from langchain_couchbase.vectorstores import CouchbaseQueryVectorStore\n", - "from langchain_couchbase.vectorstores import DistanceStrategy, IndexType\n", - "from langchain_openai import ChatOpenAI, OpenAIEmbeddings\n", - "\n", - "from crewai import Agent, Crew, Process, Task" - ] - }, - { - "cell_type": "markdown", - "id": "fb7d108a", - "metadata": {}, - "source": [ - "### Configure Logging" - ] - }, - { - "cell_type": "markdown", - "id": "a65cf252", - "metadata": {}, - "source": [ - "Logging is configured to track the progress of the script and capture any errors or warnings." - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "9e719ffc", - "metadata": {}, - "outputs": [], - "source": [ - "logging.basicConfig(\n", - " level=logging.INFO,\n", - " format='%(asctime)s [%(levelname)s] %(message)s',\n", - " datefmt='%Y-%m-%d %H:%M:%S'\n", - ")\n", - "\n", - "# Suppress httpx logging\n", - "logging.getLogger('httpx').setLevel(logging.CRITICAL)" - ] - }, - { - "cell_type": "markdown", - "id": "3690497b", - "metadata": {}, - "source": [ - "### Load Environment Configuration" - ] - }, - { - "cell_type": "markdown", - "id": "653fc54f", - "metadata": {}, - "source": [ - "In this section, we prompt the user to input the essential configuration settings. These settings include sensitive information like database credentials and specific configuration names. Instead of hardcoding these details into the script, we request the user to provide them at runtime, ensuring flexibility and security.\n", - "\n", - "The script uses environment variables to store sensitive information, enhancing the overall security and maintainability of your code by avoiding hardcoded values."
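For reference, a `.env` file for this notebook might look like the following sketch. The variable names match the configuration cell below, while every value shown is a placeholder to replace with your own settings (see the `.env.sample` provided alongside the notebook):

```
OPENAI_API_KEY=sk-<your-openai-key>
CB_HOST=couchbases://cb.<your-endpoint>.cloud.couchbase.com
CB_USERNAME=<your-database-username>
CB_PASSWORD=<your-database-password>
CB_BUCKET_NAME=vector-search-testing
SCOPE_NAME=shared
COLLECTION_NAME=crew
```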
- ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "3aaf9289", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Configuration loaded successfully\n" - ] - } - ], - "source": [ - "# Load environment variables\n", - "load_dotenv(\"./.env\")\n", - "\n", - "# Configuration\n", - "OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or input(\"Enter your OpenAI API key: \")\n", - "if not OPENAI_API_KEY:\n", - " raise ValueError(\"OPENAI_API_KEY is not set\")\n", - "\n", - "CB_HOST = os.getenv('CB_HOST') or 'couchbase://localhost'\n", - "CB_USERNAME = os.getenv('CB_USERNAME') or 'Administrator'\n", - "CB_PASSWORD = os.getenv('CB_PASSWORD') or 'password'\n", - "CB_BUCKET_NAME = os.getenv('CB_BUCKET_NAME') or 'vector-search-testing'\n", - "SCOPE_NAME = os.getenv('SCOPE_NAME') or 'shared'\n", - "COLLECTION_NAME = os.getenv('COLLECTION_NAME') or 'crew'\n", - "\n", - "print(\"Configuration loaded successfully\")" - ] - }, - { - "cell_type": "markdown", - "id": "7fa87d96", - "metadata": {}, - "source": [ - "## Couchbase Connection Setup" - ] - }, - { - "cell_type": "markdown", - "id": "3c30a607", - "metadata": {}, - "source": [ - "### Connect to Cluster" - ] - }, - { - "cell_type": "markdown", - "id": "996466dc", - "metadata": {}, - "source": [ - "Connecting to a Couchbase cluster is the foundation of our project. Couchbase will serve as our primary data store, handling all the storage and retrieval operations required for our semantic search engine. By establishing this connection, we enable our application to interact with the database, allowing us to perform operations such as storing embeddings, querying data, and managing collections. This connection is the gateway through which all data will flow, so ensuring it's set up correctly is paramount." - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "id": "979bd5e7", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Successfully connected to Couchbase\n" - ] - } - ], - "source": [ - "# Connect to Couchbase\n", - "try:\n", - " auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)\n", - " options = ClusterOptions(auth)\n", - " cluster = Cluster(CB_HOST, options)\n", - " cluster.wait_until_ready(timedelta(seconds=5))\n", - " print(\"Successfully connected to Couchbase\")\n", - "except Exception as e:\n", - " print(f\"Failed to connect to Couchbase: {str(e)}\")\n", - " raise" - ] - }, - { - "cell_type": "markdown", - "id": "7fc61a9b", - "metadata": {}, - "source": [ - "### Setup Collections" - ] - }, - { - "cell_type": "markdown", - "id": "b78b48da", - "metadata": {}, - "source": [ - "Create and configure Couchbase bucket, scope, and collection for storing our vector data.\n", - "\n", - "1. **Bucket Creation:**\n", - " - Checks if specified bucket exists, creates it if not\n", - " - Sets bucket properties like RAM quota (1024MB) and replication (disabled)\n", - " - Note: If you are using Capella, create a bucket manually called vector-search-testing(or any name you prefer) with the same properties.\n", - "\n", - "2. **Scope Management:** \n", - " - Verifies if requested scope exists within bucket\n", - " - Creates new scope if needed (unless it's the default \"_default\" scope)\n", - "\n", - "3. 
**Collection Setup:**\n", - " - Checks for collection existence within scope\n", - " - Creates collection if it doesn't exist\n", - " - Waits 2 seconds for collection to be ready\n", - "\n", - "**Additional Tasks:**\n", - "- Clears any existing documents for clean state\n", - "- Implements comprehensive error handling and logging\n", - "\n", - "The function is called once here to set up the main collection that stores the vector embeddings.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "id": "13b79fa7", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-10-06 10:17:53 [INFO] Bucket 'vector-search-testing' exists.\n", - "2025-10-06 10:17:53 [INFO] Collection 'crew' already exists. Skipping creation.\n", - "2025-10-06 10:17:55 [INFO] All documents cleared from the collection.\n" - ] - }, - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "def setup_collection(cluster, bucket_name, scope_name, collection_name):\n", - " try:\n", - " # Check if bucket exists, create if it doesn't\n", - " try:\n", - " bucket = cluster.bucket(bucket_name)\n", - " logging.info(f\"Bucket '{bucket_name}' exists.\")\n", - " except Exception as e:\n", - " logging.info(f\"Bucket '{bucket_name}' does not exist. Creating it...\")\n", - " bucket_settings = CreateBucketSettings(\n", - " name=bucket_name,\n", - " bucket_type='couchbase',\n", - " ram_quota_mb=1024,\n", - " flush_enabled=True,\n", - " num_replicas=0\n", - " )\n", - " cluster.buckets().create_bucket(bucket_settings)\n", - " time.sleep(2) # Wait for bucket creation to complete and become available\n", - " bucket = cluster.bucket(bucket_name)\n", - " logging.info(f\"Bucket '{bucket_name}' created successfully.\")\n", - "\n", - " bucket_manager = bucket.collections()\n", - "\n", - " # Check if scope exists, create if it doesn't\n", - " scopes = bucket_manager.get_all_scopes()\n", - " scope_exists = any(scope.name == scope_name for scope in scopes)\n", - " \n", - " if not scope_exists and scope_name != \"_default\":\n", - " logging.info(f\"Scope '{scope_name}' does not exist. Creating it...\")\n", - " bucket_manager.create_scope(scope_name)\n", - " logging.info(f\"Scope '{scope_name}' created successfully.\")\n", - "\n", - " # Check if collection exists, create if it doesn't\n", - " collections = bucket_manager.get_all_scopes()\n", - " collection_exists = any(\n", - " scope.name == scope_name and collection_name in [col.name for col in scope.collections]\n", - " for scope in collections\n", - " )\n", - "\n", - " if not collection_exists:\n", - " logging.info(f\"Collection '{collection_name}' does not exist. Creating it...\")\n", - " bucket_manager.create_collection(scope_name, collection_name)\n", - " logging.info(f\"Collection '{collection_name}' created successfully.\")\n", - " else:\n", - " logging.info(f\"Collection '{collection_name}' already exists. 
Skipping creation.\")\n", - "\n", - " # Wait for collection to be ready\n", - " collection = bucket.scope(scope_name).collection(collection_name)\n", - " time.sleep(2) # Give the collection time to be ready for queries\n", - "\n", - " # Clear all documents in the collection\n", - " try:\n", - " query = f\"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`\"\n", - " cluster.query(query).execute()\n", - " logging.info(\"All documents cleared from the collection.\")\n", - " except Exception as e:\n", - " logging.warning(f\"Error while clearing documents: {str(e)}. The collection might be empty.\")\n", - "\n", - " return collection\n", - " except Exception as e:\n", - " raise RuntimeError(f\"Error setting up collection: {str(e)}\")\n", - " \n", - "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME)" - ] - }, - { - "cell_type": "markdown", - "id": "fa4faf3f", - "metadata": {}, - "source": [ - "## Understanding GSI Vector Search" - ] - }, - { - "cell_type": "markdown", - "id": "69c7d28f", - "metadata": {}, - "source": [ - "### GSI Vector Index Configuration" - ] - }, - { - "cell_type": "markdown", - "id": "90080454", - "metadata": {}, - "source": [ - "Semantic search with GSI requires creating a Global Secondary Index optimized for vector operations. Unlike FTS-based vector search, GSI vector indexes offer two distinct types optimized for different use cases:" - ] - }, - { - "cell_type": "markdown", - "id": "72154198", - "metadata": {}, - "source": [ - "#### GSI Vector Index Types" - ] - }, - { - "cell_type": "markdown", - "id": "b55cb1f4", - "metadata": {}, - "source": [ - "##### Hyperscale Vector Indexes (BHIVE)" - ] - }, - { - "cell_type": "markdown", - "id": "4a80376e", - "metadata": {}, - "source": [ - "- **Best for**: Pure vector searches like content discovery, recommendations, and semantic search\n", - "- **Performance**: High performance with low memory footprint, optimized for concurrent operations\n", - "- **Scalability**: Designed to scale to billions of vectors\n", - "- **Use when**: You primarily perform vector-only queries without complex scalar filtering" - ] - }, - { - "cell_type": "markdown", - "id": "1cbcbf25", - "metadata": {}, - "source": [ - "##### Composite Vector Indexes" - ] - }, - { - "cell_type": "markdown", - "id": "204f1bcc", - "metadata": {}, - "source": [ - "- **Best for**: Filtered vector searches that combine vector search with scalar value filtering\n", - "- **Performance**: Efficient pre-filtering where scalar attributes reduce the vector comparison scope\n", - "- **Use when**: Your queries combine vector similarity with scalar filters that eliminate large portions of data\n", - "- **Note**: Scalar filters take precedence over vector similarity" - ] - }, - { - "cell_type": "markdown", - "id": "5ebefc2f", - "metadata": {}, - "source": [ - "#### Understanding Index Configuration" - ] - }, - { - "cell_type": "markdown", - "id": "93e4bff1", - "metadata": {}, - "source": [ - "The `index_description` parameter controls how Couchbase optimizes vector storage and search through centroids and quantization:\n", - "\n", - "**Format**: `'IVF[],{PQ|SQ}'`\n", - "\n", - "**Centroids (IVF - Inverted File):**\n", - "- Controls how the dataset is subdivided for faster searches\n", - "- More centroids = faster search, slower training \n", - "- Fewer centroids = slower search, faster training\n", - "- If omitted (like IVF,SQ8), Couchbase auto-selects based on dataset size\n", - "\n", - "**Quantization Options:**\n", - "- SQ (Scalar Quantization): 
SQ4, SQ6, SQ8 (4, 6, or 8 bits per dimension)\n", - "- PQ (Product Quantization): PQ<subquantizers>x<bits> (e.g., PQ32x8)\n", - "- Higher values = better accuracy, larger index size\n", - "\n", - "**Common Examples:**\n", - "- IVF,SQ8 - Auto centroids, 8-bit scalar quantization (good default)\n", - "- IVF1000,SQ6 - 1000 centroids, 6-bit scalar quantization \n", - "- IVF,PQ32x8 - Auto centroids, 32 subquantizers with 8 bits\n", - "\n", - "For detailed configuration options, see the [Quantization & Centroid Settings](https://docs.couchbase.com/cloud/vector-index/hyperscale-vector-index.html#algo_settings).\n", - "\n", - "For more information on GSI vector indexes, see [Couchbase GSI Vector Documentation](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html)." - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "88e4c207", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "GSI vector index configuration prepared\n" - ] - } - ], - "source": [ - "# GSI Vector Index Configuration\n", - "# Unlike FTS indexes, GSI vector indexes are created programmatically through the vector store\n", - "# We'll configure the parameters that will be used for index creation\n", - "\n", - "# Vector configuration\n", - "DISTANCE_STRATEGY = DistanceStrategy.COSINE  # Cosine similarity\n", - "INDEX_TYPE = IndexType.BHIVE  # Using BHIVE for high-performance vector search\n", - "INDEX_DESCRIPTION = \"IVF,SQ8\"  # Auto-selected centroids with 8-bit scalar quantization\n", - "\n", - "# To create a Composite Index instead, use the following:\n", - "# INDEX_TYPE = IndexType.COMPOSITE  # Combines vector search with scalar filtering\n", - "\n", - "print(\"GSI vector index configuration prepared\")" - ] - }, - { - "cell_type": "markdown", - "id": "7de9d300", - "metadata": {}, - "source": [ - "### Alternative: Composite Index Configuration" - ] - }, - { - "cell_type": "markdown", - "id": "601a9d35", - "metadata": {}, - "source": [ - "If your use case requires complex filtering with scalar attributes, you can create a **Composite index** instead by changing the configuration:\n", - "\n", - "```python\n", - "# Alternative configuration for Composite index\n", - "INDEX_TYPE = IndexType.COMPOSITE  # Instead of IndexType.BHIVE\n", - "INDEX_DESCRIPTION = \"IVF,SQ8\"  # Same quantization settings\n", - "DISTANCE_STRATEGY = DistanceStrategy.COSINE  # Same distance metric\n", - "\n", - "# The rest of the setup remains identical\n", - "```\n", - "\n", - "**Use Composite indexes when:**\n", - "- You need to filter by document metadata or attributes before vector similarity\n", - "- Your queries combine vector search with WHERE clauses \n", - "- You have well-defined filtering requirements that can reduce the search space\n", - "\n", - "**Note**: The index creation process is identical - just change the `INDEX_TYPE`. Composite indexes enable pre-filtering with scalar attributes, making them ideal for applications requiring complex query patterns with metadata filtering." - ] - }, - { - "cell_type": "markdown", - "id": "2301cc63", - "metadata": {}, - "source": [ - "## OpenAI Configuration" - ] - }, - { - "cell_type": "markdown", - "id": "99257fdb", - "metadata": {}, - "source": [ - "This section initializes two key OpenAI components needed for our RAG system:\n", - "\n", - "1. 
**OpenAI Embeddings:**\n", - " - Uses the 'text-embedding-3-small' model\n", - " - Converts text into high-dimensional vector representations (embeddings)\n", - " - These embeddings enable semantic search by capturing the meaning of text\n", - " - Required for vector similarity search in Couchbase\n", - "\n", - "2. **ChatOpenAI Language Model:**\n", - " - Uses the 'gpt-4o' model\n", - " - Temperature set to 0.2 for balanced creativity and focus\n", - " - Serves as the cognitive engine for CrewAI agents\n", - " - Powers agent reasoning, decision-making, and task execution\n", - " - Enables agents to:\n", - " - Process and understand retrieved context from vector search\n", - " - Generate thoughtful responses based on that context\n", - " - Follow instructions defined in agent roles and goals\n", - " - Collaborate with other agents in the crew\n", - " - The relatively low temperature (0.2) ensures agents produce reliable, consistent outputs while maintaining some creative problem-solving ability\n", - "\n", - "Both components require a valid OpenAI API key (OPENAI_API_KEY) for authentication.\n", - "In the CrewAI framework, the LLM acts as the \"brain\" for each agent, allowing them to interpret tasks, retrieve relevant information via the RAG system, and generate appropriate outputs based on their specialized roles and expertise." - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "d9e6fd1a", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "OpenAI components initialized\n" - ] - } - ], - "source": [ - "# Initialize OpenAI components\n", - "embeddings = OpenAIEmbeddings(\n", - " openai_api_key=OPENAI_API_KEY,\n", - " model=\"text-embedding-3-small\"\n", - ")\n", - "\n", - "llm = ChatOpenAI(\n", - " openai_api_key=OPENAI_API_KEY,\n", - " model=\"gpt-4o\",\n", - " temperature=0.2\n", - ")\n", - "\n", - "print(\"OpenAI components initialized\")" - ] - }, - { - "cell_type": "markdown", - "id": "902067c7", - "metadata": {}, - "source": [ - "## Document Processing and Vector Store Setup" - ] - }, - { - "cell_type": "markdown", - "id": "7340c7ce", - "metadata": {}, - "source": [ - "### Create Couchbase GSI Vector Store" - ] - }, - { - "cell_type": "markdown", - "id": "2d202628", - "metadata": {}, - "source": [ - "Set up the GSI vector store where we'll store document embeddings for high-performance semantic search." 
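Once this store is initialized in the next cell and the BBC articles have been ingested later in the notebook, a quick sanity check is to query the store directly. This is a minimal sketch using the standard LangChain vector store API; the query string and `k` are illustrative:

```python
# Minimal sanity check -- run only after the ingestion step later in this notebook.
# similarity_search_with_score returns (Document, score) pairs; with a distance
# metric such as cosine, lower scores indicate closer matches.
for doc, score in vector_store.similarity_search_with_score("football transfer news", k=3):
    print(f"{score:.4f} | {doc.page_content[:80]}...")
```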
- ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "a877a51d", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-10-06 10:18:05 [INFO] GSI Vector store setup completed\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "GSI Vector store initialized successfully\n" - ] - } - ], - "source": [ - "# Setup GSI vector store with OpenAI embeddings\n", - "try:\n", - " vector_store = CouchbaseQueryVectorStore(\n", - " cluster=cluster,\n", - " bucket_name=CB_BUCKET_NAME,\n", - " scope_name=SCOPE_NAME,\n", - " collection_name=COLLECTION_NAME,\n", - " embedding=embeddings,\n", - " distance_metric=DISTANCE_STRATEGY\n", - " )\n", - " print(\"GSI Vector store initialized successfully\")\n", - " logging.info(\"GSI Vector store setup completed\")\n", - "except Exception as e:\n", - " logging.error(f\"Failed to initialize GSI vector store: {str(e)}\")\n", - " raise RuntimeError(f\"GSI Vector store initialization failed: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "id": "6194a58c", - "metadata": {}, - "source": [ - "### Load BBC News Dataset" - ] - }, - { - "cell_type": "markdown", - "id": "f1dad8d9", - "metadata": {}, - "source": [ - "To build a search engine, we need data to search through. We use the BBC News dataset from RealTimeData, which provides real-world news articles. This dataset contains news articles from BBC covering various topics and time periods. Loading the dataset is a crucial step because it provides the raw material that our search engine will work with. The quality and diversity of the news articles make it an excellent choice for testing and refining our search engine, ensuring it can handle real-world news content effectively.\n", - "\n", - "The BBC News dataset allows us to work with authentic news articles, enabling us to build and test a search engine that can effectively process and retrieve relevant news content. The dataset is loaded using the Hugging Face datasets library, specifically accessing the \"RealTimeData/bbc_news_alltime\" dataset with the \"2024-12\" version." - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "id": "13b20d27", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-10-06 10:18:13 [INFO] Successfully loaded the BBC News dataset with 2687 rows.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Loaded the BBC News dataset with 2687 rows\n" - ] - } - ], - "source": [ - "try:\n", - " news_dataset = load_dataset(\n", - " \"RealTimeData/bbc_news_alltime\", \"2024-12\", split=\"train\"\n", - " )\n", - " print(f\"Loaded the BBC News dataset with {len(news_dataset)} rows\")\n", - " logging.info(f\"Successfully loaded the BBC News dataset with {len(news_dataset)} rows.\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Error loading the BBC News dataset: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "id": "c7592356", - "metadata": {}, - "source": [ - "#### Data Cleaning" - ] - }, - { - "cell_type": "markdown", - "id": "f2ad46f0", - "metadata": {}, - "source": [ - "Remove duplicate articles for cleaner search results." 
- ] - }, - { - "cell_type": "code", - "execution_count": 11, - "id": "496e3afc", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "We have 1749 unique articles in our database.\n" - ] - } - ], - "source": [ - "news_articles = news_dataset[\"content\"]\n", - "unique_articles = set()\n", - "for article in news_articles:\n", - "    if article:\n", - "        unique_articles.add(article)\n", - "unique_news_articles = list(unique_articles)\n", - "print(f\"We have {len(unique_news_articles)} unique articles in our database.\")" - ] - }, - { - "cell_type": "markdown", - "id": "69e3853c", - "metadata": {}, - "source": [ - "#### Save Data to Vector Store" - ] - }, - { - "cell_type": "markdown", - "id": "814d5e49", - "metadata": {}, - "source": [ - "To efficiently handle the large number of articles, we process them in batches of 50 articles at a time. This batch processing approach helps manage memory usage and provides better control over the ingestion process.\n", - "\n", - "We first filter out any articles that exceed 50,000 characters to avoid potential issues with token limits. Then, using the vector store's add_texts method, we add the filtered articles to our vector database. The batch_size parameter controls how many articles are processed in each iteration.\n", - "\n", - "This approach offers several benefits:\n", - "1. **Memory Efficiency**: Processing in smaller batches prevents memory overload\n", - "2. **Error Handling**: If an error occurs, only the current batch is affected\n", - "3. **Progress Tracking**: Easier to monitor and track the ingestion progress\n", - "4. **Resource Management**: Better control over CPU and network resource utilization\n", - "\n", - "We use a conservative batch size of 50 to ensure reliable operation. The optimal batch size depends on many factors including document sizes, available system resources, network conditions, and concurrent workload." - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "id": "188dcccd", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-10-06 10:19:43 [INFO] Document ingestion completed successfully.\n" - ] - } - ], - "source": [ - "batch_size = 50\n", - "\n", - "# Automatic Batch Processing\n", - "articles = [article for article in unique_news_articles if article and len(article) <= 50000]\n", - "\n", - "try:\n", - "    vector_store.add_texts(\n", - "        texts=articles,\n", - "        batch_size=batch_size\n", - "    )\n", - "    logging.info(\"Document ingestion completed successfully.\")\n", - "except Exception as e:\n", - "    raise ValueError(f\"Failed to save documents to vector store: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "id": "bd60ee6d", - "metadata": {}, - "source": [ - "## Vector Search Performance Testing" - ] - }, - { - "cell_type": "markdown", - "id": "51df07c1", - "metadata": {}, - "source": [ - "Now let's demonstrate the performance benefits of GSI optimization by testing pure vector search performance. We'll compare three optimization levels:\n", - "\n", - "1. **Baseline Performance**: Vector search without GSI optimization\n", - "2. **GSI-Optimized Performance**: Same search with BHIVE GSI index\n", - "3. **Cache Benefits**: Show how caching can be applied on top of GSI for repeated queries\n", - "\n", - "**Important**: This testing focuses on pure vector search performance, isolating the GSI improvements from other workflow overhead." 
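One caveat before the tests: single-run timings are sensitive to network jitter and cold caches. If you want more stable numbers, you can average several runs of the timing helper defined in the next cell; a small sketch (the `runs` count is arbitrary):

```python
import statistics

def average_search_time(query_text, runs=5):
    # Call the timing helper (defined in the next cell) several times,
    # drop failed runs (None), and average the rest.
    times = [test_vector_search_performance(query_text, f"Run {i+1}") for i in range(runs)]
    times = [t for t in times if t is not None]
    return statistics.mean(times) if times else None
```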
- ] - }, - { - "cell_type": "markdown", - "id": "c9d72167", - "metadata": {}, - "source": [ - "### Create Vector Search Function" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "id": "c43f62fa", - "metadata": {}, - "outputs": [], - "source": [ - "import time\n", - "\n", - "# Create GSI vector retriever optimized for high-performance searches\n", - "retriever = vector_store.as_retriever(\n", - " search_type=\"similarity\",\n", - " search_kwargs={\"k\": 4} # Return top 4 most similar documents\n", - ")\n", - "\n", - "def test_vector_search_performance(query_text, label=\"Vector Search\"):\n", - " \"\"\"Test pure vector search performance and return timing metrics\"\"\"\n", - " print(f\"\\n[{label}] Testing vector search performance\")\n", - " print(f\"[{label}] Query: '{query_text}'\")\n", - " \n", - " start_time = time.time()\n", - " \n", - " try:\n", - " # Perform vector search using the retriever\n", - " docs = retriever.invoke(query_text)\n", - " end_time = time.time()\n", - " \n", - " search_time = end_time - start_time\n", - " print(f\"[{label}] Vector search completed in {search_time:.4f} seconds\")\n", - " print(f\"[{label}] Found {len(docs)} relevant documents\")\n", - " \n", - " # Show a preview of the first result\n", - " if docs:\n", - " preview = docs[0].page_content[:100] + \"...\" if len(docs[0].page_content) > 100 else docs[0].page_content\n", - " print(f\"[{label}] Top result preview: {preview}\")\n", - " \n", - " return search_time\n", - " except Exception as e:\n", - " print(f\"[{label}] Vector search failed: {str(e)}\")\n", - " return None" - ] - }, - { - "cell_type": "markdown", - "id": "f939b9e1", - "metadata": {}, - "source": [ - "### Test 1: Baseline Performance (No GSI Index)" - ] - }, - { - "cell_type": "markdown", - "id": "e20d10ad", - "metadata": {}, - "source": [ - "Test pure vector search performance without GSI optimization." - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "id": "71ceaa56", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Testing baseline vector search performance without GSI optimization...\n", - "\n", - "[Baseline Search] Testing vector search performance\n", - "[Baseline Search] Query: 'What are the latest developments in football transfers?'\n", - "[Baseline Search] Vector search completed in 1.3999 seconds\n", - "[Baseline Search] Found 4 relevant documents\n", - "[Baseline Search] Top result preview: The latest updates and analysis from the BBC.\n", - "\n", - "Baseline vector search time (without GSI): 1.3999 seconds\n", - "\n" - ] - } - ], - "source": [ - "# Test baseline vector search performance without GSI index\n", - "test_query = \"What are the latest developments in football transfers?\"\n", - "print(\"Testing baseline vector search performance without GSI optimization...\")\n", - "baseline_time = test_vector_search_performance(test_query, \"Baseline Search\")\n", - "print(f\"\\nBaseline vector search time (without GSI): {baseline_time:.4f} seconds\\n\")" - ] - }, - { - "cell_type": "markdown", - "id": "90d304e9", - "metadata": {}, - "source": [ - "### Create BHIVE GSI Index" - ] - }, - { - "cell_type": "markdown", - "id": "6ae9cef0", - "metadata": {}, - "source": [ - "Now let's create a BHIVE GSI vector index to enable high-performance vector searches. The index creation is done programmatically through the vector store, which will optimize the index settings based on our data and requirements." 
- ] - }, - { - "cell_type": "code", - "execution_count": 15, - "id": "389d1358", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Creating BHIVE GSI vector index...\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-10-06 10:20:15 [INFO] BHIVE index created with description 'IVF,SQ8'\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "GSI Vector index created successfully\n", - "Waiting for index to become available...\n" - ] - } - ], - "source": [ - "# Create GSI Vector Index for high-performance searches\n", - "print(\"Creating BHIVE GSI vector index...\")\n", - "try:\n", - " # Create a BHIVE index optimized for pure vector searches\n", - " vector_store.create_index(\n", - " index_type=INDEX_TYPE, # BHIVE index type\n", - " index_description=INDEX_DESCRIPTION # IVF,SQ8 for optimized performance\n", - " )\n", - " print(f\"GSI Vector index created successfully\")\n", - " logging.info(f\"BHIVE index created with description '{INDEX_DESCRIPTION}'\")\n", - " \n", - " # Wait a moment for index to be available\n", - " print(\"Waiting for index to become available...\")\n", - " time.sleep(5)\n", - " \n", - "except Exception as e:\n", - " # Index might already exist, which is fine\n", - " if \"already exists\" in str(e).lower():\n", - " print(f\"GSI Vector index already exists, proceeding...\")\n", - " logging.info(f\"Index already exists\")\n", - " else:\n", - " logging.error(f\"Failed to create GSI index: {str(e)}\")\n", - " raise RuntimeError(f\"GSI index creation failed: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "id": "6b9e5763", - "metadata": {}, - "source": [ - "### Test 2: GSI-Optimized Performance" - ] - }, - { - "cell_type": "markdown", - "id": "8388f41b", - "metadata": {}, - "source": [ - "Test the same vector search with BHIVE GSI optimization." - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "id": "b1b89f5b", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Testing vector search performance with BHIVE GSI optimization...\n", - "\n", - "[GSI-Optimized Search] Testing vector search performance\n", - "[GSI-Optimized Search] Query: 'What are the latest developments in football transfers?'\n", - "[GSI-Optimized Search] Vector search completed in 0.5885 seconds\n", - "[GSI-Optimized Search] Found 4 relevant documents\n", - "[GSI-Optimized Search] Top result preview: Four key areas for Everton's new owners to address\n", - "\n", - "Everton fans last saw silverware in 1995 when th...\n" - ] - } - ], - "source": [ - "# Test vector search performance with GSI index\n", - "print(\"Testing vector search performance with BHIVE GSI optimization...\")\n", - "gsi_search_time = test_vector_search_performance(test_query, \"GSI-Optimized Search\")" - ] - }, - { - "cell_type": "markdown", - "id": "eba6c37a", - "metadata": {}, - "source": [ - "### Test 3: Cache Benefits Testing" - ] - }, - { - "cell_type": "markdown", - "id": "1cc73249", - "metadata": {}, - "source": [ - "Now let's demonstrate how caching can improve performance for repeated queries. **Note**: Caching benefits apply to both baseline and GSI-optimized searches." 
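The cache test below relies on whatever caching happens along the request path (client, server, and embedding API). If you want explicit client-side caching of embeddings, LangChain's `CacheBackedEmbeddings` is one option. The sketch below is an assumption about your LangChain version; in particular, the `query_embedding_cache` flag only exists in recent releases:

```python
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore

# Cache embedding vectors on local disk so repeated texts skip the OpenAI call.
# query_embedding_cache=True also caches query embeddings in recent LangChain
# versions -- verify this parameter against the version you have installed.
store = LocalFileStore("./embedding_cache")  # hypothetical cache directory
cached_embeddings = CacheBackedEmbeddings.from_bytes_store(
    embeddings,
    store,
    namespace="text-embedding-3-small",
    query_embedding_cache=True,
)
# The vector store could then be constructed with embedding=cached_embeddings.
```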
- ] - }, - { - "cell_type": "code", - "execution_count": 17, - "id": "3850c8fc", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Testing cache benefits with vector search...\n", - "First execution (cache miss):\n", - "\n", - "[Cache Test - First Run] Testing vector search performance\n", - "[Cache Test - First Run] Query: 'What happened in the latest Premier League matches?'\n", - "[Cache Test - First Run] Vector search completed in 0.6450 seconds\n", - "[Cache Test - First Run] Found 4 relevant documents\n", - "[Cache Test - First Run] Top result preview: Who has made Troy's Premier League team of the week?\n", - "\n", - "After every round of Premier League matches th...\n", - "\n", - "Second execution (cache hit - should be faster):\n", - "\n", - "[Cache Test - Second Run] Testing vector search performance\n", - "[Cache Test - Second Run] Query: 'What happened in the latest Premier League matches?'\n", - "[Cache Test - Second Run] Vector search completed in 0.4306 seconds\n", - "[Cache Test - Second Run] Found 4 relevant documents\n", - "[Cache Test - Second Run] Top result preview: Who has made Troy's Premier League team of the week?\n", - "\n", - "After every round of Premier League matches th...\n" - ] - } - ], - "source": [ - "# Test cache benefits with a different query to avoid interference\n", - "cache_test_query = \"What happened in the latest Premier League matches?\"\n", - "\n", - "print(\"Testing cache benefits with vector search...\")\n", - "print(\"First execution (cache miss):\")\n", - "cache_time_1 = test_vector_search_performance(cache_test_query, \"Cache Test - First Run\")\n", - "\n", - "print(\"\\nSecond execution (cache hit - should be faster):\")\n", - "cache_time_2 = test_vector_search_performance(cache_test_query, \"Cache Test - Second Run\")" - ] - }, - { - "cell_type": "markdown", - "id": "21530f7b", - "metadata": {}, - "source": [ - "### Vector Search Performance Analysis" - ] - }, - { - "cell_type": "markdown", - "id": "009e69c4", - "metadata": {}, - "source": [ - "Let's analyze the vector search performance improvements across all optimization levels:" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "id": "388ca617", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "================================================================================\n", - "VECTOR SEARCH PERFORMANCE OPTIMIZATION SUMMARY\n", - "================================================================================\n", - "Phase 1 - Baseline Search (No GSI): 1.3999 seconds\n", - "Phase 2 - GSI-Optimized Search: 0.5885 seconds\n", - "Phase 3 - Cache Benefits:\n", - " First execution (cache miss): 0.6450 seconds\n", - " Second execution (cache hit): 0.4306 seconds\n", - "\n", - "--------------------------------------------------------------------------------\n", - "VECTOR SEARCH OPTIMIZATION IMPACT:\n", - "--------------------------------------------------------------------------------\n", - "GSI Index Benefit: 2.38x faster (58.0% improvement)\n", - "Cache Benefit: 1.50x faster (33.2% improvement)\n", - "\n", - "Key Insights for Vector Search Performance:\n", - "• GSI BHIVE indexes provide significant performance improvements for vector similarity search\n", - "• Performance gains are most dramatic for complex semantic queries\n", - "• BHIVE optimization is particularly effective for high-dimensional embeddings\n", - "• Combined with proper quantization (SQ8), GSI delivers 
production-ready performance\n", - "• These performance improvements directly benefit any application using the vector store\n" - ] - } - ], - "source": [ - "print(\"\\n\" + \"=\"*80)\n", - "print(\"VECTOR SEARCH PERFORMANCE OPTIMIZATION SUMMARY\")\n", - "print(\"=\"*80)\n", - "\n", - "print(f\"Phase 1 - Baseline Search (No GSI): {baseline_time:.4f} seconds\")\n", - "print(f\"Phase 2 - GSI-Optimized Search: {gsi_search_time:.4f} seconds\")\n", - "if cache_time_1 and cache_time_2:\n", - " print(f\"Phase 3 - Cache Benefits:\")\n", - " print(f\" First execution (cache miss): {cache_time_1:.4f} seconds\")\n", - " print(f\" Second execution (cache hit): {cache_time_2:.4f} seconds\")\n", - "\n", - "print(\"\\n\" + \"-\"*80)\n", - "print(\"VECTOR SEARCH OPTIMIZATION IMPACT:\")\n", - "print(\"-\"*80)\n", - "\n", - "# GSI improvement analysis\n", - "if baseline_time and gsi_search_time:\n", - " speedup = baseline_time / gsi_search_time if gsi_search_time > 0 else float('inf')\n", - " time_saved = baseline_time - gsi_search_time\n", - " percent_improvement = (time_saved / baseline_time) * 100\n", - " print(f\"GSI Index Benefit: {speedup:.2f}x faster ({percent_improvement:.1f}% improvement)\")\n", - "\n", - "# Cache improvement analysis\n", - "if cache_time_1 and cache_time_2 and cache_time_2 < cache_time_1:\n", - " cache_speedup = cache_time_1 / cache_time_2\n", - " cache_improvement = ((cache_time_1 - cache_time_2) / cache_time_1) * 100\n", - " print(f\"Cache Benefit: {cache_speedup:.2f}x faster ({cache_improvement:.1f}% improvement)\")\n", - "else:\n", - " print(f\"Cache Benefit: Variable (depends on query complexity and caching mechanism)\")\n", - "\n", - "print(f\"\\nKey Insights for Vector Search Performance:\")\n", - "print(f\"• GSI BHIVE indexes provide significant performance improvements for vector similarity search\")\n", - "print(f\"• Performance gains are most dramatic for complex semantic queries\")\n", - "print(f\"• BHIVE optimization is particularly effective for high-dimensional embeddings\")\n", - "print(f\"• Combined with proper quantization (SQ8), GSI delivers production-ready performance\")\n", - "print(f\"• These performance improvements directly benefit any application using the vector store\")" - ] - }, - { - "cell_type": "markdown", - "id": "c7252b0c", - "metadata": {}, - "source": [ - "## CrewAI Agent Setup" - ] - }, - { - "cell_type": "markdown", - "id": "812ee93f", - "metadata": {}, - "source": [ - "### What is CrewAI?" - ] - }, - { - "cell_type": "markdown", - "id": "274671b5", - "metadata": {}, - "source": [ - "Now that we've optimized our vector search performance, let's build a sophisticated agent-based RAG system using CrewAI. CrewAI enables us to create specialized AI agents that collaborate to handle different aspects of the RAG workflow:\n", - "\n", - "- **Research Agent**: Finds and analyzes relevant documents using our optimized vector search\n", - "- **Writer Agent**: Takes research findings and creates polished, structured responses\n", - "- **Collaborative Workflow**: Agents work together, with the writer building on the researcher's findings\n", - "\n", - "This multi-agent approach produces higher-quality responses than single-agent systems by separating research and writing expertise, while benefiting from the GSI performance improvements we just demonstrated." 
- ] - }, - { - "cell_type": "markdown", - "id": "2bda0d68", - "metadata": {}, - "source": [ - "### Create Vector Search Tool" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "id": "c7b379d0", - "metadata": {}, - "outputs": [], - "source": [ - "# Define the GSI vector search tool using the @tool decorator\n", - "@tool(\"gsi_vector_search\")\n", - "def search_tool(query: str) -> str:\n", - " \"\"\"Search for relevant documents using GSI vector similarity.\n", - " Input should be a simple text query string.\n", - " Returns a list of relevant document contents from GSI vector search.\n", - " Use this tool to find detailed information about topics using high-performance GSI indexes.\"\"\"\n", - " \n", - " # Invoke the GSI vector retriever (now optimized with BHIVE index)\n", - " docs = retriever.invoke(query)\n", - "\n", - " # Format the results with distance information\n", - " formatted_docs = \"\\n\\n\".join([\n", - " f\"Document {i+1}:\\n{'-'*40}\\n{doc.page_content}\"\n", - " for i, doc in enumerate(docs)\n", - " ])\n", - " return formatted_docs" - ] - }, - { - "cell_type": "markdown", - "id": "4a2a0165", - "metadata": {}, - "source": [ - "### Create CrewAI Agents" - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "id": "73c44437", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CrewAI agents created successfully with optimized GSI vector search\n" - ] - } - ], - "source": [ - "# Create research agent\n", - "researcher = Agent(\n", - " role='Research Expert',\n", - " goal='Find and analyze the most relevant documents to answer user queries accurately',\n", - " backstory=\"\"\"You are an expert researcher with deep knowledge in information retrieval \n", - " and analysis. Your expertise lies in finding, evaluating, and synthesizing information \n", - " from various sources. You have a keen eye for detail and can identify key insights \n", - " from complex documents. You always verify information across multiple sources and \n", - " provide comprehensive, accurate analyses.\"\"\",\n", - " tools=[search_tool],\n", - " llm=llm,\n", - " verbose=False,\n", - " memory=True,\n", - " allow_delegation=False\n", - ")\n", - "\n", - "# Create writer agent\n", - "writer = Agent(\n", - " role='Technical Writer',\n", - " goal='Generate clear, accurate, and well-structured responses based on research findings',\n", - " backstory=\"\"\"You are a skilled technical writer with expertise in making complex \n", - " information accessible and engaging. You excel at organizing information logically, \n", - " explaining technical concepts clearly, and creating well-structured documents. You \n", - " ensure all information is properly cited, accurate, and presented in a user-friendly \n", - " manner. You have a talent for maintaining the reader's interest while conveying \n", - " detailed technical information.\"\"\",\n", - " llm=llm,\n", - " verbose=False,\n", - " memory=True,\n", - " allow_delegation=False\n", - ")\n", - "\n", - "print(\"CrewAI agents created successfully with optimized GSI vector search\")" - ] - }, - { - "cell_type": "markdown", - "id": "a63dbf3d", - "metadata": {}, - "source": [ - "### How the Optimized RAG Workflow Works" - ] - }, - { - "cell_type": "markdown", - "id": "12bd1697", - "metadata": {}, - "source": [ - "The complete optimized RAG process:\n", - "1. **User Query** → Research Agent\n", - "2. **Vector Search** → GSI BHIVE index finds similar documents (now with proven performance improvements)\n", - "3. 
**Document Analysis** → Research Agent analyzes and synthesizes findings\n", - "4. **Response Writing** → Writer Agent creates polished, structured response\n", - "5. **Final Output** → User receives comprehensive, well-formatted answer\n", - "\n", - "**Key Benefit**: The vector search performance improvements we demonstrated directly enhance the agent workflow efficiency." - ] - }, - { - "cell_type": "markdown", - "id": "5ca6cc10", - "metadata": {}, - "source": [ - "## CrewAI Agent Demo" - ] - }, - { - "cell_type": "markdown", - "id": "3e7d956a", - "metadata": {}, - "source": [ - "Now let's demonstrate the complete optimized agent-based RAG system in action, benefiting from the GSI performance improvements we validated earlier." - ] - }, - { - "cell_type": "markdown", - "id": "45b1a283", - "metadata": {}, - "source": [ - "### Demo Function" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "id": "2176b29d", - "metadata": {}, - "outputs": [], - "source": [ - "def process_interactive_query(query, researcher, writer):\n", - " \"\"\"Run complete RAG workflow with CrewAI agents using optimized GSI vector search\"\"\"\n", - " print(f\"\\nProcessing Query: {query}\")\n", - " print(\"=\" * 80)\n", - " \n", - " # Create tasks\n", - " research_task = Task(\n", - " description=f\"Research and analyze information relevant to: {query}\",\n", - " agent=researcher,\n", - " expected_output=\"A detailed analysis with key findings\"\n", - " )\n", - " \n", - " writing_task = Task(\n", - " description=\"Create a comprehensive response\",\n", - " agent=writer,\n", - " expected_output=\"A clear, well-structured answer\",\n", - " context=[research_task]\n", - " )\n", - " \n", - " # Execute crew\n", - " crew = Crew(\n", - " agents=[researcher, writer],\n", - " tasks=[research_task, writing_task],\n", - " process=Process.sequential,\n", - " verbose=True,\n", - " cache=True,\n", - " planning=True\n", - " )\n", - " \n", - " try:\n", - " start_time = time.time()\n", - " result = crew.kickoff()\n", - " elapsed_time = time.time() - start_time\n", - " \n", - " print(f\"\\nCompleted in {elapsed_time:.2f} seconds\")\n", - " print(\"=\" * 80)\n", - " print(\"RESPONSE\")\n", - " print(\"=\" * 80)\n", - " print(result)\n", - " \n", - " return elapsed_time\n", - " except Exception as e:\n", - " print(f\"Error: {str(e)}\")\n", - " return None" - ] - }, - { - "cell_type": "markdown", - "id": "a65e896e", - "metadata": {}, - "source": [ - "### Run Agent-Based RAG Demo" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d355751a", - "metadata": {}, - "outputs": [], - "source": [ - "# Disable logging for cleaner output\n", - "logging.disable(logging.CRITICAL)\n", - "\n", - "# Run demo with a sample query\n", - "demo_query = \"What are the key details about the FA Cup third round draw?\"\n", - "final_time = process_interactive_query(demo_query, researcher, writer)\n", - "\n", - "if final_time:\n", - " print(f\"\\n\\n✅ CrewAI agent demo completed successfully in {final_time:.2f} seconds\")" - ] - }, - { - "cell_type": "markdown", - "id": "0d4a24b3", - "metadata": {}, - "source": [ - "## Conclusion" - ] - }, - { - "cell_type": "markdown", - "id": "82ad950f", - "metadata": {}, - "source": [ - "You have successfully built a powerful agent-based RAG system that combines Couchbase's high-performance GSI vector storage capabilities with CrewAI's multi-agent architecture. 
This tutorial demonstrated the complete pipeline from data ingestion to intelligent response generation, with real performance benchmarks showing the dramatic improvements GSI indexing provides." - ] - } - ], - "metadata": { - "jupytext": { - "cell_metadata_filter": "-all", - "main_language": "python", - "notebook_metadata_filter": "-all" - }, - "kernelspec": { - "display_name": ".venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.13.7" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/crewai/gsi/.env.sample b/crewai/query_based/.env.sample similarity index 100% rename from crewai/gsi/.env.sample rename to crewai/query_based/.env.sample diff --git a/crewai/query_based/RAG_with_Couchbase_and_CrewAI.ipynb b/crewai/query_based/RAG_with_Couchbase_and_CrewAI.ipynb new file mode 100644 index 00000000..984d3ebe --- /dev/null +++ b/crewai/query_based/RAG_with_Couchbase_and_CrewAI.ipynb @@ -0,0 +1,1578 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "82d610e0", + "metadata": {}, + "source": [ + "# Agent-Based RAG with Couchbase GSI Vector Search and CrewAI" + ] + }, + { + "cell_type": "markdown", + "id": "a3073978", + "metadata": {}, + "source": [ + "## Overview" + ] + }, + { + "cell_type": "markdown", + "id": "7e91202c", + "metadata": {}, + "source": [ + "In this guide, we will walk you through building a powerful semantic search engine using [Couchbase](https://www.couchbase.com) as the backend database and [CrewAI](https://github.com/crewAIInc/crewAI) for agent-based RAG operations. CrewAI allows us to create specialized agents that can work together to handle different aspects of the RAG workflow, from document retrieval to response generation. This tutorial uses Couchbase's **Global Secondary Index (GSI)** vector search capabilities, which offer high-performance vector search optimized for large-scale applications. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system from scratch. For guidance on choosing the right vector index for your use case, see the [Couchbase documentation](https://docs.couchbase.com/server/current/vector-search/choose-the-right-vector-index.html). Alternatively if you want to perform semantic search using Couchbase Search Vector Index, please take a look at [this.](https://developer.couchbase.com/tutorial-crewai-couchbase-rag-with-search-vector-index/)" + ] + }, + { + "cell_type": "markdown", + "id": "255f3178", + "metadata": {}, + "source": [ + "## How to Run This Tutorial" + ] + }, + { + "cell_type": "markdown", + "id": "4e84bba4", + "metadata": {}, + "source": [ + "This tutorial is available as a Jupyter Notebook (.ipynb file) that you can run interactively. 
You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/crewai/query_based/RAG_with_Couchbase_and_CrewAI.ipynb).\n", + "\n", + "You can either:\n", + "- Download the notebook file and run it on [Google Colab](https://colab.research.google.com)\n", + "- Run it on your system by setting up the Python environment" + ] + }, + { + "cell_type": "markdown", + "id": "202801ea", + "metadata": {}, + "source": [ + "## Prerequisites" + ] + }, + { + "cell_type": "markdown", + "id": "55bb6aae", + "metadata": {}, + "source": [ + "### Couchbase Requirements" + ] + }, + { + "cell_type": "markdown", + "id": "d318f572", + "metadata": {}, + "source": [ + "1. Create and Deploy Your Free Tier Operational cluster on [Capella](https://cloud.couchbase.com/sign-up)\n", + "   - To get started with [Couchbase Capella](https://cloud.couchbase.com), create an account and use it to deploy a free tier operational cluster\n", + "   - This account provides you with an environment where you can explore and learn about Capella\n", + "   - To learn more, please follow the [Getting Started Guide](https://docs.couchbase.com/cloud/get-started/create-account.html)\n", + "   - **Important**: This tutorial requires Couchbase Server **8.0+** for GSI vector search capabilities" + ] + }, + { + "cell_type": "markdown", + "id": "474a6a23", + "metadata": {}, + "source": [ + "### Couchbase Capella Configuration" + ] + }, + { + "cell_type": "markdown", + "id": "e07c1ff4", + "metadata": {}, + "source": [ + "When running Couchbase using Capella, the following prerequisites need to be met:\n", + "- Create the database credentials to access the required bucket (Read and Write) used in the application\n", + "- Allow access to the Cluster from the IP on which the application is running by following the [Network Security documentation](https://docs.couchbase.com/cloud/security/security.html#public-access)" + ] + }, + { + "cell_type": "markdown", + "id": "b223faba", + "metadata": {}, + "source": [ + "## Setup and Installation" + ] + }, + { + "cell_type": "markdown", + "id": "81251293", + "metadata": {}, + "source": [ + "### Installing Necessary Libraries" + ] + }, + { + "cell_type": "markdown", + "id": "21e51e49", + "metadata": {}, + "source": [ + "We'll install the following key libraries:\n", + "- `datasets`: For loading and managing our training data\n", + "- `langchain-couchbase`: To integrate Couchbase with LangChain for GSI vector storage and caching\n", + "- `langchain-openai`: For accessing OpenAI's embedding and chat models\n", + "- `crewai`: To create and orchestrate our AI agents for RAG operations\n", + "- `python-dotenv`: For securely managing environment variables and API keys\n", + "\n", + "These libraries provide the foundation for building a semantic search engine with GSI vector embeddings, database integration, and agent-based RAG capabilities." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a666ce8b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], + "source": [ + "%pip install --quiet datasets==4.1.0 langchain-couchbase==0.5.0 langchain-openai==0.3.33 crewai==0.186.1 python-dotenv==1.1.1" + ] + }, + { + "cell_type": "markdown", + "id": "e5d980e7", + "metadata": {}, + "source": [ + "### Import Required Modules" + ] + }, + { + "cell_type": "markdown", + "id": "94b2e73b", + "metadata": {}, + "source": [ + "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, embedding generation, and dataset loading." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "5d013a55", + "metadata": {}, + "outputs": [], + "source": [ + "import getpass\n", + "import json\n", + "import logging\n", + "import os\n", + "import time\n", + "from datetime import timedelta\n", + "from uuid import uuid4\n", + "\n", + "from couchbase.auth import PasswordAuthenticator\n", + "from couchbase.cluster import Cluster\n", + "from couchbase.diagnostics import PingState, ServiceType\n", + "from couchbase.exceptions import (InternalServerFailureException,\n", + " QueryIndexAlreadyExistsException,\n", + " ServiceUnavailableException,\n", + " CouchbaseException)\n", + "from couchbase.management.buckets import CreateBucketSettings\n", + "from couchbase.options import ClusterOptions\n", + "from datasets import load_dataset\n", + "from dotenv import load_dotenv\n", + "from crewai.tools import tool\n", + "from langchain_couchbase.vectorstores import CouchbaseQueryVectorStore\n", + "from langchain_couchbase.vectorstores import DistanceStrategy, IndexType\n", + "from langchain_openai import ChatOpenAI, OpenAIEmbeddings\n", + "\n", + "from crewai import Agent, Crew, Process, Task" + ] + }, + { + "cell_type": "markdown", + "id": "fb7d108a", + "metadata": {}, + "source": [ + "### Configure Logging" + ] + }, + { + "cell_type": "markdown", + "id": "a65cf252", + "metadata": {}, + "source": [ + "Logging is configured to track the progress of the script and capture any errors or warnings." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "9e719ffc", + "metadata": {}, + "outputs": [], + "source": [ + "logging.basicConfig(\n", + " level=logging.INFO,\n", + " format='%(asctime)s [%(levelname)s] %(message)s',\n", + " datefmt='%Y-%m-%d %H:%M:%S'\n", + ")\n", + "\n", + "# Suppress httpx logging\n", + "logging.getLogger('httpx').setLevel(logging.CRITICAL)" + ] + }, + { + "cell_type": "markdown", + "id": "3690497b", + "metadata": {}, + "source": [ + "### Load Environment Configuration" + ] + }, + { + "cell_type": "markdown", + "id": "653fc54f", + "metadata": {}, + "source": [ + "In this section, we prompt the user to input essential configuration settings needed. These settings include sensitive information like database credentials, and specific configuration names. Instead of hardcoding these details into the script, we request the user to provide them at runtime, ensuring flexibility and security.\n", + "\n", + "The script uses environment variables to store sensitive information, enhancing the overall security and maintainability of your code by avoiding hardcoded values." 
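For local runs, the variables read in the next cell can live in a `.env` file next to the notebook (the repository ships a `.env.sample` you can copy). A hypothetical example with placeholder values; the Capella connection string format shown is an assumption, so copy yours from the Capella UI:

```
OPENAI_API_KEY=sk-...your-key...
CB_HOST=couchbases://cb.xxxxx.cloud.couchbase.com
CB_USERNAME=Administrator
CB_PASSWORD=password
CB_BUCKET_NAME=vector-search-testing
SCOPE_NAME=shared
COLLECTION_NAME=crew
```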
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "3aaf9289", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Configuration loaded successfully\n" + ] + } + ], + "source": [ + "# Load environment variables\n", + "load_dotenv(\"./.env\")\n", + "\n", + "# Configuration\n", + "OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or input(\"Enter your OpenAI API key: \")\n", + "if not OPENAI_API_KEY:\n", + " raise ValueError(\"OPENAI_API_KEY is not set\")\n", + "\n", + "CB_HOST = os.getenv('CB_HOST') or 'couchbase://localhost'\n", + "CB_USERNAME = os.getenv('CB_USERNAME') or 'Administrator'\n", + "CB_PASSWORD = os.getenv('CB_PASSWORD') or 'password'\n", + "CB_BUCKET_NAME = os.getenv('CB_BUCKET_NAME') or 'vector-search-testing'\n", + "SCOPE_NAME = os.getenv('SCOPE_NAME') or 'shared'\n", + "COLLECTION_NAME = os.getenv('COLLECTION_NAME') or 'crew'\n", + "\n", + "print(\"Configuration loaded successfully\")" + ] + }, + { + "cell_type": "markdown", + "id": "7fa87d96", + "metadata": {}, + "source": [ + "## Couchbase Connection Setup" + ] + }, + { + "cell_type": "markdown", + "id": "3c30a607", + "metadata": {}, + "source": [ + "### Connect to Cluster" + ] + }, + { + "cell_type": "markdown", + "id": "996466dc", + "metadata": {}, + "source": [ + "Connecting to a Couchbase cluster is the foundation of our project. Couchbase will serve as our primary data store, handling all the storage and retrieval operations required for our semantic search engine. By establishing this connection, we enable our application to interact with the database, allowing us to perform operations such as storing embeddings, querying data, and managing collections. This connection is the gateway through which all data will flow, so ensuring it's set up correctly is paramount." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "979bd5e7", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Successfully connected to Couchbase\n" + ] + } + ], + "source": [ + "# Connect to Couchbase\n", + "try:\n", + " auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)\n", + " options = ClusterOptions(auth)\n", + " cluster = Cluster(CB_HOST, options)\n", + " cluster.wait_until_ready(timedelta(seconds=5))\n", + " print(\"Successfully connected to Couchbase\")\n", + "except Exception as e:\n", + " print(f\"Failed to connect to Couchbase: {str(e)}\")\n", + " raise" + ] + }, + { + "cell_type": "markdown", + "id": "7fc61a9b", + "metadata": {}, + "source": [ + "### Setup Collections" + ] + }, + { + "cell_type": "markdown", + "id": "b78b48da", + "metadata": {}, + "source": [ + "Create and configure Couchbase bucket, scope, and collection for storing our vector data.\n", + "\n", + "1. **Bucket Creation:**\n", + " - Checks if specified bucket exists, creates it if not\n", + " - Sets bucket properties like RAM quota (1024MB) and replication (disabled)\n", + " - Note: If you are using Capella, create a bucket manually called vector-search-testing(or any name you prefer) with the same properties.\n", + "\n", + "2. **Scope Management:** \n", + " - Verifies if requested scope exists within bucket\n", + " - Creates new scope if needed (unless it's the default \"_default\" scope)\n", + "\n", + "3. 
**Collection Setup:**\n", + "   - Checks for collection existence within scope\n", + "   - Creates collection if it doesn't exist\n", + "   - Waits 2 seconds for collection to be ready\n", + "\n", + "**Additional Tasks:**\n", + "- Clears any existing documents for clean state\n", + "- Implements comprehensive error handling and logging\n", + "\n", + "The function is then called once to set up the main collection that stores our vector embeddings.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "13b79fa7", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-10-06 10:17:53 [INFO] Bucket 'vector-search-testing' exists.\n", + "2025-10-06 10:17:53 [INFO] Collection 'crew' already exists. Skipping creation.\n", + "2025-10-06 10:17:55 [INFO] All documents cleared from the collection.\n" + ] + }, + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def setup_collection(cluster, bucket_name, scope_name, collection_name):\n", + "    try:\n", + "        # Check if bucket exists, create if it doesn't\n", + "        try:\n", + "            bucket = cluster.bucket(bucket_name)\n", + "            logging.info(f\"Bucket '{bucket_name}' exists.\")\n", + "        except Exception as e:\n", + "            logging.info(f\"Bucket '{bucket_name}' does not exist. Creating it...\")\n", + "            bucket_settings = CreateBucketSettings(\n", + "                name=bucket_name,\n", + "                bucket_type='couchbase',\n", + "                ram_quota_mb=1024,\n", + "                flush_enabled=True,\n", + "                num_replicas=0\n", + "            )\n", + "            cluster.buckets().create_bucket(bucket_settings)\n", + "            time.sleep(2)  # Wait for bucket creation to complete and become available\n", + "            bucket = cluster.bucket(bucket_name)\n", + "            logging.info(f\"Bucket '{bucket_name}' created successfully.\")\n", + "\n", + "        bucket_manager = bucket.collections()\n", + "\n", + "        # Check if scope exists, create if it doesn't\n", + "        scopes = bucket_manager.get_all_scopes()\n", + "        scope_exists = any(scope.name == scope_name for scope in scopes)\n", + "        \n", + "        if not scope_exists and scope_name != \"_default\":\n", + "            logging.info(f\"Scope '{scope_name}' does not exist. Creating it...\")\n", + "            bucket_manager.create_scope(scope_name)\n", + "            logging.info(f\"Scope '{scope_name}' created successfully.\")\n", + "\n", + "        # Check if collection exists, create if it doesn't\n", + "        collections = bucket_manager.get_all_scopes()\n", + "        collection_exists = any(\n", + "            scope.name == scope_name and collection_name in [col.name for col in scope.collections]\n", + "            for scope in collections\n", + "        )\n", + "\n", + "        if not collection_exists:\n", + "            logging.info(f\"Collection '{collection_name}' does not exist. Creating it...\")\n", + "            bucket_manager.create_collection(scope_name, collection_name)\n", + "            logging.info(f\"Collection '{collection_name}' created successfully.\")\n", + "        else:\n", + "            logging.info(f\"Collection '{collection_name}' already exists. 
Skipping creation.\")\n", + "\n", + " # Wait for collection to be ready\n", + " collection = bucket.scope(scope_name).collection(collection_name)\n", + " time.sleep(2) # Give the collection time to be ready for queries\n", + "\n", + " # Clear all documents in the collection\n", + " try:\n", + " query = f\"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`\"\n", + " cluster.query(query).execute()\n", + " logging.info(\"All documents cleared from the collection.\")\n", + " except Exception as e:\n", + " logging.warning(f\"Error while clearing documents: {str(e)}. The collection might be empty.\")\n", + "\n", + " return collection\n", + " except Exception as e:\n", + " raise RuntimeError(f\"Error setting up collection: {str(e)}\")\n", + " \n", + "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME)" + ] + }, + { + "cell_type": "markdown", + "id": "fa4faf3f", + "metadata": {}, + "source": [ + "## Understanding GSI Vector Search" + ] + }, + { + "cell_type": "markdown", + "id": "69c7d28f", + "metadata": {}, + "source": [ + "### GSI Vector Index Configuration" + ] + }, + { + "cell_type": "markdown", + "id": "90080454", + "metadata": {}, + "source": [ + "Semantic search with GSI requires creating a Global Secondary Index optimized for vector operations. Unlike FTS-based vector search, GSI vector indexes offer two distinct types optimized for different use cases:" + ] + }, + { + "cell_type": "markdown", + "id": "72154198", + "metadata": {}, + "source": [ + "#### GSI Vector Index Types" + ] + }, + { + "cell_type": "markdown", + "id": "b55cb1f4", + "metadata": {}, + "source": [ + "##### Hyperscale Vector Indexes (BHIVE)" + ] + }, + { + "cell_type": "markdown", + "id": "4a80376e", + "metadata": {}, + "source": [ + "- **Best for**: Pure vector searches like content discovery, recommendations, and semantic search\n", + "- **Performance**: High performance with low memory footprint, optimized for concurrent operations\n", + "- **Scalability**: Designed to scale to billions of vectors\n", + "- **Use when**: You primarily perform vector-only queries without complex scalar filtering" + ] + }, + { + "cell_type": "markdown", + "id": "1cbcbf25", + "metadata": {}, + "source": [ + "##### Composite Vector Indexes" + ] + }, + { + "cell_type": "markdown", + "id": "204f1bcc", + "metadata": {}, + "source": [ + "- **Best for**: Filtered vector searches that combine vector search with scalar value filtering\n", + "- **Performance**: Efficient pre-filtering where scalar attributes reduce the vector comparison scope\n", + "- **Use when**: Your queries combine vector similarity with scalar filters that eliminate large portions of data\n", + "- **Note**: Scalar filters take precedence over vector similarity" + ] + }, + { + "cell_type": "markdown", + "id": "5ebefc2f", + "metadata": {}, + "source": [ + "#### Understanding Index Configuration" + ] + }, + { + "cell_type": "markdown", + "id": "93e4bff1", + "metadata": {}, + "source": [ + "The `index_description` parameter controls how Couchbase optimizes vector storage and search through centroids and quantization:\n", + "\n", + "**Format**: `'IVF[],{PQ|SQ}'`\n", + "\n", + "**Centroids (IVF - Inverted File):**\n", + "- Controls how the dataset is subdivided for faster searches\n", + "- More centroids = faster search, slower training \n", + "- Fewer centroids = slower search, faster training\n", + "- If omitted (like IVF,SQ8), Couchbase auto-selects based on dataset size\n", + "\n", + "**Quantization Options:**\n", + "- SQ (Scalar Quantization): 
SQ4, SQ6, SQ8 (4, 6, or 8 bits per dimension)\n", + "- PQ (Product Quantization): PQ<subquantizers>x<bits> (e.g., PQ32x8)\n", + "- Higher values = better accuracy, larger index size\n", + "\n", + "**Common Examples:**\n", + "- IVF,SQ8 - Auto centroids, 8-bit scalar quantization (good default)\n", + "- IVF1000,SQ6 - 1000 centroids, 6-bit scalar quantization \n", + "- IVF,PQ32x8 - Auto centroids, 32 subquantizers with 8 bits\n", + "\n", + "For detailed configuration options, see the [Quantization & Centroid Settings](https://docs.couchbase.com/cloud/vector-index/hyperscale-vector-index.html#algo_settings).\n", + "\n", + "For more information on GSI vector indexes, see [Couchbase GSI Vector Documentation](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html)." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "88e4c207", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "GSI vector index configuration prepared\n" + ] + } + ], + "source": [ + "# GSI Vector Index Configuration\n", + "# Unlike FTS indexes, GSI vector indexes are created programmatically through the vector store\n", + "# We'll configure the parameters that will be used for index creation\n", + "\n", + "# Vector configuration\n", + "DISTANCE_STRATEGY = DistanceStrategy.COSINE  # Cosine similarity\n", + "INDEX_TYPE = IndexType.BHIVE  # Using BHIVE for high-performance vector search\n", + "INDEX_DESCRIPTION = \"IVF,SQ8\"  # Auto-selected centroids with 8-bit scalar quantization\n", + "\n", + "# To create a Composite Index instead, use the following:\n", + "# INDEX_TYPE = IndexType.COMPOSITE  # Combines vector search with scalar filtering\n", + "\n", + "print(\"GSI vector index configuration prepared\")" + ] + }, + { + "cell_type": "markdown", + "id": "7de9d300", + "metadata": {}, + "source": [ + "### Alternative: Composite Index Configuration" + ] + }, + { + "cell_type": "markdown", + "id": "601a9d35", + "metadata": {}, + "source": [ + "If your use case requires complex filtering with scalar attributes, you can create a **Composite index** instead by changing the configuration:\n", + "\n", + "```python\n", + "# Alternative configuration for Composite index\n", + "INDEX_TYPE = IndexType.COMPOSITE  # Instead of IndexType.BHIVE\n", + "INDEX_DESCRIPTION = \"IVF,SQ8\"  # Same quantization settings\n", + "DISTANCE_STRATEGY = DistanceStrategy.COSINE  # Same distance metric\n", + "\n", + "# The rest of the setup remains identical\n", + "```\n", + "\n", + "**Use Composite indexes when:**\n", + "- You need to filter by document metadata or attributes before vector similarity\n", + "- Your queries combine vector search with WHERE clauses \n", + "- You have well-defined filtering requirements that can reduce the search space\n", + "\n", + "**Note**: The index creation process is identical - just change the `INDEX_TYPE`. Composite indexes enable pre-filtering with scalar attributes, making them ideal for applications requiring complex query patterns with metadata filtering." + ] + }, + { + "cell_type": "markdown", + "id": "2301cc63", + "metadata": {}, + "source": [ + "## OpenAI Configuration" + ] + }, + { + "cell_type": "markdown", + "id": "99257fdb", + "metadata": {}, + "source": [ + "This section initializes two key OpenAI components needed for our RAG system:\n", + "\n", + "1. 
**OpenAI Embeddings:**\n", + " - Uses the 'text-embedding-3-small' model\n", + " - Converts text into high-dimensional vector representations (embeddings)\n", + " - These embeddings enable semantic search by capturing the meaning of text\n", + " - Required for vector similarity search in Couchbase\n", + "\n", + "2. **ChatOpenAI Language Model:**\n", + " - Uses the 'gpt-4o' model\n", + " - Temperature set to 0.2 for balanced creativity and focus\n", + " - Serves as the cognitive engine for CrewAI agents\n", + " - Powers agent reasoning, decision-making, and task execution\n", + " - Enables agents to:\n", + " - Process and understand retrieved context from vector search\n", + " - Generate thoughtful responses based on that context\n", + " - Follow instructions defined in agent roles and goals\n", + " - Collaborate with other agents in the crew\n", + " - The relatively low temperature (0.2) ensures agents produce reliable, consistent outputs while maintaining some creative problem-solving ability\n", + "\n", + "Both components require a valid OpenAI API key (OPENAI_API_KEY) for authentication.\n", + "In the CrewAI framework, the LLM acts as the \"brain\" for each agent, allowing them to interpret tasks, retrieve relevant information via the RAG system, and generate appropriate outputs based on their specialized roles and expertise." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "d9e6fd1a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "OpenAI components initialized\n" + ] + } + ], + "source": [ + "# Initialize OpenAI components\n", + "embeddings = OpenAIEmbeddings(\n", + " openai_api_key=OPENAI_API_KEY,\n", + " model=\"text-embedding-3-small\"\n", + ")\n", + "\n", + "llm = ChatOpenAI(\n", + " openai_api_key=OPENAI_API_KEY,\n", + " model=\"gpt-4o\",\n", + " temperature=0.2\n", + ")\n", + "\n", + "print(\"OpenAI components initialized\")" + ] + }, + { + "cell_type": "markdown", + "id": "902067c7", + "metadata": {}, + "source": [ + "## Document Processing and Vector Store Setup" + ] + }, + { + "cell_type": "markdown", + "id": "7340c7ce", + "metadata": {}, + "source": [ + "### Create Couchbase GSI Vector Store" + ] + }, + { + "cell_type": "markdown", + "id": "2d202628", + "metadata": {}, + "source": [ + "Set up the GSI vector store where we'll store document embeddings for high-performance semantic search." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "a877a51d", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-10-06 10:18:05 [INFO] GSI Vector store setup completed\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "GSI Vector store initialized successfully\n" + ] + } + ], + "source": [ + "# Setup GSI vector store with OpenAI embeddings\n", + "try:\n", + " vector_store = CouchbaseQueryVectorStore(\n", + " cluster=cluster,\n", + " bucket_name=CB_BUCKET_NAME,\n", + " scope_name=SCOPE_NAME,\n", + " collection_name=COLLECTION_NAME,\n", + " embedding=embeddings,\n", + " distance_metric=DISTANCE_STRATEGY\n", + " )\n", + " print(\"GSI Vector store initialized successfully\")\n", + " logging.info(\"GSI Vector store setup completed\")\n", + "except Exception as e:\n", + " logging.error(f\"Failed to initialize GSI vector store: {str(e)}\")\n", + " raise RuntimeError(f\"GSI Vector store initialization failed: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "id": "6194a58c", + "metadata": {}, + "source": [ + "### Load BBC News Dataset" + ] + }, + { + "cell_type": "markdown", + "id": "f1dad8d9", + "metadata": {}, + "source": [ + "To build a search engine, we need data to search through. We use the BBC News dataset from RealTimeData, which provides real-world news articles. This dataset contains news articles from BBC covering various topics and time periods. Loading the dataset is a crucial step because it provides the raw material that our search engine will work with. The quality and diversity of the news articles make it an excellent choice for testing and refining our search engine, ensuring it can handle real-world news content effectively.\n", + "\n", + "The BBC News dataset allows us to work with authentic news articles, enabling us to build and test a search engine that can effectively process and retrieve relevant news content. The dataset is loaded using the Hugging Face datasets library, specifically accessing the \"RealTimeData/bbc_news_alltime\" dataset with the \"2024-12\" version." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "13b20d27", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-10-06 10:18:13 [INFO] Successfully loaded the BBC News dataset with 2687 rows.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loaded the BBC News dataset with 2687 rows\n" + ] + } + ], + "source": [ + "try:\n", + " news_dataset = load_dataset(\n", + " \"RealTimeData/bbc_news_alltime\", \"2024-12\", split=\"train\"\n", + " )\n", + " print(f\"Loaded the BBC News dataset with {len(news_dataset)} rows\")\n", + " logging.info(f\"Successfully loaded the BBC News dataset with {len(news_dataset)} rows.\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Error loading the BBC News dataset: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "id": "c7592356", + "metadata": {}, + "source": [ + "#### Data Cleaning" + ] + }, + { + "cell_type": "markdown", + "id": "f2ad46f0", + "metadata": {}, + "source": [ + "Remove duplicate articles for cleaner search results." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "496e3afc", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "We have 1749 unique articles in our database.\n" + ] + } + ], + "source": [ + "news_articles = news_dataset[\"content\"]\n", + "unique_articles = set()\n", + "for article in news_articles:\n", + " if article:\n", + " unique_articles.add(article)\n", + "unique_news_articles = list(unique_articles)\n", + "print(f\"We have {len(unique_news_articles)} unique articles in our database.\")" + ] + }, + { + "cell_type": "markdown", + "id": "69e3853c", + "metadata": {}, + "source": [ + "#### Save Data to Vector Store" + ] + }, + { + "cell_type": "markdown", + "id": "814d5e49", + "metadata": {}, + "source": [ + "To efficiently handle the large number of articles, we process them in batches of 50 articles at a time. This batch processing approach helps manage memory usage and provides better control over the ingestion process.\n", + "\n", + "We first filter out any articles that exceed 50,000 characters to avoid potential issues with token limits. Then, using the vector store's add_texts method, we add the filtered articles to our vector database. The batch_size parameter controls how many articles are processed in each iteration.\n", + "\n", + "This approach offers several benefits:\n", + "1. **Memory Efficiency**: Processing in smaller batches prevents memory overload\n", + "2. **Error Handling**: If an error occurs, only the current batch is affected\n", + "3. **Progress Tracking**: Easier to monitor and track the ingestion progress\n", + "4. **Resource Management**: Better control over CPU and network resource utilization\n", + "\n", + "We use a conservative batch size of 50 to ensure reliable operation. The optimal batch size depends on many factors including document sizes, available system resources, network conditions, and concurrent workload." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "188dcccd", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-10-06 10:19:43 [INFO] Document ingestion completed successfully.\n" + ] + } + ], + "source": [ + "batch_size = 50\n", + "\n", + "# Automatic Batch Processing\n", + "articles = [article for article in unique_news_articles if article and len(article) <= 50000]\n", + "\n", + "try:\n", + " vector_store.add_texts(\n", + " texts=articles,\n", + " batch_size=batch_size\n", + " )\n", + " logging.info(\"Document ingestion completed successfully.\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Failed to save documents to vector store: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "id": "bd60ee6d", + "metadata": {}, + "source": [ + "## Vector Search Performance Testing" + ] + }, + { + "cell_type": "markdown", + "id": "51df07c1", + "metadata": {}, + "source": [ + "Now let's demonstrate the performance benefits of GSI optimization by testing pure vector search performance. We'll compare three optimization levels:\n", + "\n", + "1. **Baseline Performance**: Vector search without GSI optimization\n", + "2. **Vector Index-Optimized Performance**: Same search with BHIVE GSI index\n", + "3. **Cache Benefits**: Show how caching can be applied on top of GSI for repeated queries\n", + "\n", + "**Important**: This testing focuses on pure vector search performance, isolating the GSI improvements from other workflow overhead." 
+ ] + }, + { + "cell_type": "markdown", + "id": "c9d72167", + "metadata": {}, + "source": [ + "### Create Vector Search Function" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "c43f62fa", + "metadata": {}, + "outputs": [], + "source": [ + "import time\n", + "\n", + "# Create GSI vector retriever optimized for high-performance searches\n", + "retriever = vector_store.as_retriever(\n", + " search_type=\"similarity\",\n", + " search_kwargs={\"k\": 4} # Return top 4 most similar documents\n", + ")\n", + "\n", + "def test_vector_search_performance(query_text, label=\"Vector Search\"):\n", + " \"\"\"Test pure vector search performance and return timing metrics\"\"\"\n", + " print(f\"\\n[{label}] Testing vector search performance\")\n", + " print(f\"[{label}] Query: '{query_text}'\")\n", + " \n", + " start_time = time.time()\n", + " \n", + " try:\n", + " # Perform vector search using the retriever\n", + " docs = retriever.invoke(query_text)\n", + " end_time = time.time()\n", + " \n", + " search_time = end_time - start_time\n", + " print(f\"[{label}] Vector search completed in {search_time:.4f} seconds\")\n", + " print(f\"[{label}] Found {len(docs)} relevant documents\")\n", + " \n", + " # Show a preview of the first result\n", + " if docs:\n", + " preview = docs[0].page_content[:100] + \"...\" if len(docs[0].page_content) > 100 else docs[0].page_content\n", + " print(f\"[{label}] Top result preview: {preview}\")\n", + " \n", + " return search_time\n", + " except Exception as e:\n", + " print(f\"[{label}] Vector search failed: {str(e)}\")\n", + " return None" + ] + }, + { + "cell_type": "markdown", + "id": "f939b9e1", + "metadata": {}, + "source": [ + "### Test 1: Baseline Performance (No GSI Index)" + ] + }, + { + "cell_type": "markdown", + "id": "e20d10ad", + "metadata": {}, + "source": [ + "Test pure vector search performance without GSI optimization." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "71ceaa56", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Testing baseline vector search performance without GSI optimization...\n", + "\n", + "[Baseline Search] Testing vector search performance\n", + "[Baseline Search] Query: 'What are the latest developments in football transfers?'\n", + "[Baseline Search] Vector search completed in 1.3999 seconds\n", + "[Baseline Search] Found 4 relevant documents\n", + "[Baseline Search] Top result preview: The latest updates and analysis from the BBC.\n", + "\n", + "Baseline vector search time (without GSI): 1.3999 seconds\n", + "\n" + ] + } + ], + "source": [ + "# Test baseline vector search performance without GSI index\n", + "test_query = \"What are the latest developments in football transfers?\"\n", + "print(\"Testing baseline vector search performance without GSI optimization...\")\n", + "baseline_time = test_vector_search_performance(test_query, \"Baseline Search\")\n", + "print(f\"\\nBaseline vector search time (without GSI): {baseline_time:.4f} seconds\\n\")" + ] + }, + { + "cell_type": "markdown", + "id": "90d304e9", + "metadata": {}, + "source": [ + "### Create BHIVE GSI Index" + ] + }, + { + "cell_type": "markdown", + "id": "6ae9cef0", + "metadata": {}, + "source": [ + "Now let's create a BHIVE GSI vector index to enable high-performance vector searches. The index creation is done programmatically through the vector store, which will optimize the index settings based on our data and requirements." 
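, + "\n", + "\n", + "After the creation cell below runs, you can optionally confirm that the index has come online by querying the SQL++ system:indexes catalog. A minimal sketch, assuming the Query service is reachable from this cluster (the state field typically reports values such as 'online'):\n", + "\n", + "```python\n", + "from couchbase.options import QueryOptions\n", + "\n", + "# Illustrative readiness probe: list indexes defined on our collection\n", + "rows = cluster.query(\n", + "    \"SELECT name, state FROM system:indexes WHERE keyspace_id = $coll\",\n", + "    QueryOptions(named_parameters={\"coll\": COLLECTION_NAME}),\n", + ")\n", + "for row in rows:\n", + "    print(row)  # e.g. {'name': '...', 'state': 'online'}\n", + "```"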
+ ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "389d1358", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Creating BHIVE GSI vector index...\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-10-06 10:20:15 [INFO] BHIVE index created with description 'IVF,SQ8'\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "GSI Vector index created successfully\n", + "Waiting for index to become available...\n" + ] + } + ], + "source": [ + "# Create GSI Vector Index for high-performance searches\n", + "print(\"Creating BHIVE GSI vector index...\")\n", + "try:\n", + " # Create a BHIVE index optimized for pure vector searches\n", + " vector_store.create_index(\n", + " index_type=INDEX_TYPE, # BHIVE index type\n", + " index_description=INDEX_DESCRIPTION # IVF,SQ8 for optimized performance\n", + " )\n", + " print(f\"GSI Vector index created successfully\")\n", + " logging.info(f\"BHIVE index created with description '{INDEX_DESCRIPTION}'\")\n", + " \n", + " # Wait a moment for index to be available\n", + " print(\"Waiting for index to become available...\")\n", + " time.sleep(5)\n", + " \n", + "except Exception as e:\n", + " # Index might already exist, which is fine\n", + " if \"already exists\" in str(e).lower():\n", + " print(f\"GSI Vector index already exists, proceeding...\")\n", + " logging.info(f\"Index already exists\")\n", + " else:\n", + " logging.error(f\"Failed to create GSI index: {str(e)}\")\n", + " raise RuntimeError(f\"GSI index creation failed: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "id": "6b9e5763", + "metadata": {}, + "source": [ + "### Test 2: Vector Index-Optimized Performance" + ] + }, + { + "cell_type": "markdown", + "id": "8388f41b", + "metadata": {}, + "source": [ + "Test the same vector search with BHIVE GSI optimization." + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "b1b89f5b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Testing vector search performance with BHIVE GSI optimization...\n", + "\n", + "[Vector Index-Optimized Search] Testing vector search performance\n", + "[Vector Index-Optimized Search] Query: 'What are the latest developments in football transfers?'\n", + "[Vector Index-Optimized Search] Vector search completed in 0.5885 seconds\n", + "[Vector Index-Optimized Search] Found 4 relevant documents\n", + "[Vector Index-Optimized Search] Top result preview: Four key areas for Everton's new owners to address\n", + "\n", + "Everton fans last saw silverware in 1995 when th...\n" + ] + } + ], + "source": [ + "# Test vector search performance with GSI index\n", + "print(\"Testing vector search performance with BHIVE GSI optimization...\")\n", + "gsi_search_time = test_vector_search_performance(test_query, \"Vector Index-Optimized Search\")" + ] + }, + { + "cell_type": "markdown", + "id": "eba6c37a", + "metadata": {}, + "source": [ + "### Test 3: Cache Benefits Testing" + ] + }, + { + "cell_type": "markdown", + "id": "1cc73249", + "metadata": {}, + "source": [ + "Now let's demonstrate how caching can improve performance for repeated queries. **Note**: Caching benefits apply to both baseline and GSI-optimized searches." 
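, + "\n", + "\n", + "The timed runs below exercise whatever caching the cluster applies to repeated queries. If you also want explicit response caching in the application layer, langchain-couchbase ships a CouchbaseCache that can be registered globally. A minimal sketch, assuming a dedicated cache collection already exists (the collection name here is a placeholder):\n", + "\n", + "```python\n", + "from langchain_core.globals import set_llm_cache\n", + "from langchain_couchbase.cache import CouchbaseCache\n", + "\n", + "# Hypothetical: store LLM responses in a separate 'cache' collection\n", + "set_llm_cache(CouchbaseCache(\n", + "    cluster=cluster,\n", + "    bucket_name=CB_BUCKET_NAME,\n", + "    scope_name=SCOPE_NAME,\n", + "    collection_name=\"cache\",  # assumed name; create it first, e.g. via setup_collection\n", + "))\n", + "```"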
+ ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "3850c8fc", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Testing cache benefits with vector search...\n", + "First execution (cache miss):\n", + "\n", + "[Cache Test - First Run] Testing vector search performance\n", + "[Cache Test - First Run] Query: 'What happened in the latest Premier League matches?'\n", + "[Cache Test - First Run] Vector search completed in 0.6450 seconds\n", + "[Cache Test - First Run] Found 4 relevant documents\n", + "[Cache Test - First Run] Top result preview: Who has made Troy's Premier League team of the week?\n", + "\n", + "After every round of Premier League matches th...\n", + "\n", + "Second execution (cache hit - should be faster):\n", + "\n", + "[Cache Test - Second Run] Testing vector search performance\n", + "[Cache Test - Second Run] Query: 'What happened in the latest Premier League matches?'\n", + "[Cache Test - Second Run] Vector search completed in 0.4306 seconds\n", + "[Cache Test - Second Run] Found 4 relevant documents\n", + "[Cache Test - Second Run] Top result preview: Who has made Troy's Premier League team of the week?\n", + "\n", + "After every round of Premier League matches th...\n" + ] + } + ], + "source": [ + "# Test cache benefits with a different query to avoid interference\n", + "cache_test_query = \"What happened in the latest Premier League matches?\"\n", + "\n", + "print(\"Testing cache benefits with vector search...\")\n", + "print(\"First execution (cache miss):\")\n", + "cache_time_1 = test_vector_search_performance(cache_test_query, \"Cache Test - First Run\")\n", + "\n", + "print(\"\\nSecond execution (cache hit - should be faster):\")\n", + "cache_time_2 = test_vector_search_performance(cache_test_query, \"Cache Test - Second Run\")" + ] + }, + { + "cell_type": "markdown", + "id": "21530f7b", + "metadata": {}, + "source": [ + "### Vector Search Performance Analysis" + ] + }, + { + "cell_type": "markdown", + "id": "009e69c4", + "metadata": {}, + "source": [ + "Let's analyze the vector search performance improvements across all optimization levels:" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "388ca617", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "================================================================================\n", + "VECTOR SEARCH PERFORMANCE OPTIMIZATION SUMMARY\n", + "================================================================================\n", + "Phase 1 - Baseline Search (No GSI): 1.3999 seconds\n", + "Phase 2 - Vector Index-Optimized Search: 0.5885 seconds\n", + "Phase 3 - Cache Benefits:\n", + " First execution (cache miss): 0.6450 seconds\n", + " Second execution (cache hit): 0.4306 seconds\n", + "\n", + "--------------------------------------------------------------------------------\n", + "VECTOR SEARCH OPTIMIZATION IMPACT:\n", + "--------------------------------------------------------------------------------\n", + "GSI Index Benefit: 2.38x faster (58.0% improvement)\n", + "Cache Benefit: 1.50x faster (33.2% improvement)\n", + "\n", + "Key Insights for Vector Search Performance:\n", + "\u2022 GSI BHIVE indexes provide significant performance improvements for vector similarity search\n", + "\u2022 Performance gains are most dramatic for complex semantic queries\n", + "\u2022 BHIVE optimization is particularly effective for high-dimensional embeddings\n", + "\u2022 Combined with proper 
quantization (SQ8), GSI delivers production-ready performance\n", + "\u2022 These performance improvements directly benefit any application using the vector store\n" + ] + } + ], + "source": [ + "print(\"\\n\" + \"=\"*80)\n", + "print(\"VECTOR SEARCH PERFORMANCE OPTIMIZATION SUMMARY\")\n", + "print(\"=\"*80)\n", + "\n", + "print(f\"Phase 1 - Baseline Search (No GSI): {baseline_time:.4f} seconds\")\n", + "print(f\"Phase 2 - Vector Index-Optimized Search: {gsi_search_time:.4f} seconds\")\n", + "if cache_time_1 and cache_time_2:\n", + " print(f\"Phase 3 - Cache Benefits:\")\n", + " print(f\" First execution (cache miss): {cache_time_1:.4f} seconds\")\n", + " print(f\" Second execution (cache hit): {cache_time_2:.4f} seconds\")\n", + "\n", + "print(\"\\n\" + \"-\"*80)\n", + "print(\"VECTOR SEARCH OPTIMIZATION IMPACT:\")\n", + "print(\"-\"*80)\n", + "\n", + "# GSI improvement analysis\n", + "if baseline_time and gsi_search_time:\n", + " speedup = baseline_time / gsi_search_time if gsi_search_time > 0 else float('inf')\n", + " time_saved = baseline_time - gsi_search_time\n", + " percent_improvement = (time_saved / baseline_time) * 100\n", + " print(f\"GSI Index Benefit: {speedup:.2f}x faster ({percent_improvement:.1f}% improvement)\")\n", + "\n", + "# Cache improvement analysis\n", + "if cache_time_1 and cache_time_2 and cache_time_2 < cache_time_1:\n", + " cache_speedup = cache_time_1 / cache_time_2\n", + " cache_improvement = ((cache_time_1 - cache_time_2) / cache_time_1) * 100\n", + " print(f\"Cache Benefit: {cache_speedup:.2f}x faster ({cache_improvement:.1f}% improvement)\")\n", + "else:\n", + " print(f\"Cache Benefit: Variable (depends on query complexity and caching mechanism)\")\n", + "\n", + "print(f\"\\nKey Insights for Vector Search Performance:\")\n", + "print(f\"\u2022 GSI BHIVE indexes provide significant performance improvements for vector similarity search\")\n", + "print(f\"\u2022 Performance gains are most dramatic for complex semantic queries\")\n", + "print(f\"\u2022 BHIVE optimization is particularly effective for high-dimensional embeddings\")\n", + "print(f\"\u2022 Combined with proper quantization (SQ8), GSI delivers production-ready performance\")\n", + "print(f\"\u2022 These performance improvements directly benefit any application using the vector store\")" + ] + }, + { + "cell_type": "markdown", + "id": "c7252b0c", + "metadata": {}, + "source": [ + "## CrewAI Agent Setup" + ] + }, + { + "cell_type": "markdown", + "id": "812ee93f", + "metadata": {}, + "source": [ + "### What is CrewAI?" + ] + }, + { + "cell_type": "markdown", + "id": "274671b5", + "metadata": {}, + "source": [ + "Now that we've optimized our vector search performance, let's build a sophisticated agent-based RAG system using CrewAI. CrewAI enables us to create specialized AI agents that collaborate to handle different aspects of the RAG workflow:\n", + "\n", + "- **Research Agent**: Finds and analyzes relevant documents using our optimized vector search\n", + "- **Writer Agent**: Takes research findings and creates polished, structured responses\n", + "- **Collaborative Workflow**: Agents work together, with the writer building on the researcher's findings\n", + "\n", + "This multi-agent approach produces higher-quality responses than single-agent systems by separating research and writing expertise, while benefiting from the GSI performance improvements we just demonstrated." 
+ ] + }, + { + "cell_type": "markdown", + "id": "2bda0d68", + "metadata": {}, + "source": [ + "### Create Vector Search Tool" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "c7b379d0", + "metadata": {}, + "outputs": [], + "source": [ + "# Define the GSI vector search tool using the @tool decorator\n", + "@tool(\"gsi_vector_search\")\n", + "def search_tool(query: str) -> str:\n", + " \"\"\"Search for relevant documents using Hyperscale and Composite Vector Index similarity search.\n", + " Input should be a simple text query string.\n", + " Returns a list of relevant document contents from GSI vector search.\n", + " Use this tool to find detailed information about topics using high-performance GSI indexes.\"\"\"\n", + " \n", + " # Invoke the GSI vector retriever (now optimized with BHIVE index)\n", + " docs = retriever.invoke(query)\n", + "\n", + " # Format the retrieved documents with numbered separators\n", + " formatted_docs = \"\\n\\n\".join([\n", + " f\"Document {i+1}:\\n{'-'*40}\\n{doc.page_content}\"\n", + " for i, doc in enumerate(docs)\n", + " ])\n", + " return formatted_docs" + ] + }, + { + "cell_type": "markdown", + "id": "4a2a0165", + "metadata": {}, + "source": [ + "### Create CrewAI Agents" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "73c44437", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CrewAI agents created successfully with optimized GSI vector search\n" + ] + } + ], + "source": [ + "# Create research agent\n", + "researcher = Agent(\n", + " role='Research Expert',\n", + " goal='Find and analyze the most relevant documents to answer user queries accurately',\n", + " backstory=\"\"\"You are an expert researcher with deep knowledge in information retrieval \n", + " and analysis. Your expertise lies in finding, evaluating, and synthesizing information \n", + " from various sources. You have a keen eye for detail and can identify key insights \n", + " from complex documents. You always verify information across multiple sources and \n", + " provide comprehensive, accurate analyses.\"\"\",\n", + " tools=[search_tool],\n", + " llm=llm,\n", + " verbose=False,\n", + " memory=True,\n", + " allow_delegation=False\n", + ")\n", + "\n", + "# Create writer agent\n", + "writer = Agent(\n", + " role='Technical Writer',\n", + " goal='Generate clear, accurate, and well-structured responses based on research findings',\n", + " backstory=\"\"\"You are a skilled technical writer with expertise in making complex \n", + " information accessible and engaging. You excel at organizing information logically, \n", + " explaining technical concepts clearly, and creating well-structured documents. You \n", + " ensure all information is properly cited, accurate, and presented in a user-friendly \n", + " manner. You have a talent for maintaining the reader's interest while conveying \n", + " detailed technical information.\"\"\",\n", + " llm=llm,\n", + " verbose=False,\n", + " memory=True,\n", + " allow_delegation=False\n", + ")\n", + "\n", + "print(\"CrewAI agents created successfully with optimized GSI vector search\")" + ] + }, + { + "cell_type": "markdown", + "id": "a63dbf3d", + "metadata": {}, + "source": [ + "### How the Optimized RAG Workflow Works" + ] + }, + { + "cell_type": "markdown", + "id": "12bd1697", + "metadata": {}, + "source": [ + "The complete optimized RAG process:\n", + "1. **User Query** \u2192 Research Agent\n", + "2. 
**Vector Search** \u2192 GSI BHIVE index finds similar documents (now with proven performance improvements)\n", + "3. **Document Analysis** \u2192 Research Agent analyzes and synthesizes findings\n", + "4. **Response Writing** \u2192 Writer Agent creates polished, structured response\n", + "5. **Final Output** \u2192 User receives comprehensive, well-formatted answer\n", + "\n", + "**Key Benefit**: The vector search performance improvements we demonstrated directly enhance the agent workflow efficiency." + ] + }, + { + "cell_type": "markdown", + "id": "5ca6cc10", + "metadata": {}, + "source": [ + "## CrewAI Agent Demo" + ] + }, + { + "cell_type": "markdown", + "id": "3e7d956a", + "metadata": {}, + "source": [ + "Now let's demonstrate the complete optimized agent-based RAG system in action, benefiting from the GSI performance improvements we validated earlier." + ] + }, + { + "cell_type": "markdown", + "id": "45b1a283", + "metadata": {}, + "source": [ + "### Demo Function" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "2176b29d", + "metadata": {}, + "outputs": [], + "source": [ + "def process_interactive_query(query, researcher, writer):\n", + " \"\"\"Run complete RAG workflow with CrewAI agents using optimized GSI vector search\"\"\"\n", + " print(f\"\\nProcessing Query: {query}\")\n", + " print(\"=\" * 80)\n", + " \n", + " # Create tasks\n", + " research_task = Task(\n", + " description=f\"Research and analyze information relevant to: {query}\",\n", + " agent=researcher,\n", + " expected_output=\"A detailed analysis with key findings\"\n", + " )\n", + " \n", + " writing_task = Task(\n", + " description=\"Create a comprehensive response\",\n", + " agent=writer,\n", + " expected_output=\"A clear, well-structured answer\",\n", + " context=[research_task]\n", + " )\n", + " \n", + " # Execute crew\n", + " crew = Crew(\n", + " agents=[researcher, writer],\n", + " tasks=[research_task, writing_task],\n", + " process=Process.sequential,\n", + " verbose=True,\n", + " cache=True,\n", + " planning=True\n", + " )\n", + " \n", + " try:\n", + " start_time = time.time()\n", + " result = crew.kickoff()\n", + " elapsed_time = time.time() - start_time\n", + " \n", + " print(f\"\\nCompleted in {elapsed_time:.2f} seconds\")\n", + " print(\"=\" * 80)\n", + " print(\"RESPONSE\")\n", + " print(\"=\" * 80)\n", + " print(result)\n", + " \n", + " return elapsed_time\n", + " except Exception as e:\n", + " print(f\"Error: {str(e)}\")\n", + " return None" + ] + }, + { + "cell_type": "markdown", + "id": "a65e896e", + "metadata": {}, + "source": [ + "### Run Agent-Based RAG Demo" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d355751a", + "metadata": {}, + "outputs": [], + "source": [ + "# Disable logging for cleaner output\n", + "logging.disable(logging.CRITICAL)\n", + "\n", + "# Run demo with a sample query\n", + "demo_query = \"What are the key details about the FA Cup third round draw?\"\n", + "final_time = process_interactive_query(demo_query, researcher, writer)\n", + "\n", + "if final_time:\n", + " print(f\"\\n\\n\u2705 CrewAI agent demo completed successfully in {final_time:.2f} seconds\")" + ] + }, + { + "cell_type": "markdown", + "id": "0d4a24b3", + "metadata": {}, + "source": [ + "## Conclusion" + ] + }, + { + "cell_type": "markdown", + "id": "82ad950f", + "metadata": {}, + "source": [ + "You have successfully built a powerful agent-based RAG system that combines Couchbase's high-performance GSI vector storage capabilities with CrewAI's multi-agent 
architecture. This tutorial demonstrated the complete pipeline from data ingestion to intelligent response generation, with real performance benchmarks showing the dramatic improvements GSI indexing provides." + ] + } + ], + "metadata": { + "jupytext": { + "cell_metadata_filter": "-all", + "main_language": "python", + "notebook_metadata_filter": "-all" + }, + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.7" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file diff --git a/crewai/gsi/frontmatter.md b/crewai/query_based/frontmatter.md similarity index 100% rename from crewai/gsi/frontmatter.md rename to crewai/query_based/frontmatter.md diff --git a/crewai/fts/.env.sample b/crewai/search_based/.env.sample similarity index 100% rename from crewai/fts/.env.sample rename to crewai/search_based/.env.sample diff --git a/crewai/search_based/RAG_with_Couchbase_and_CrewAI.ipynb b/crewai/search_based/RAG_with_Couchbase_and_CrewAI.ipynb new file mode 100644 index 00000000..736a8c81 --- /dev/null +++ b/crewai/search_based/RAG_with_Couchbase_and_CrewAI.ipynb @@ -0,0 +1,1464 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Introduction\n", + "\n", + "In this guide, we will walk you through building a powerful semantic search engine using [Couchbase](https://www.couchbase.com) as the backend database and [CrewAI](https://github.com/crewAIInc/crewAI) for agent-based RAG operations. CrewAI allows us to create specialized agents that can work together to handle different aspects of the RAG workflow, from document retrieval to response generation. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system from scratch. For guidance on choosing the right vector index for your use case, see the [Couchbase documentation](https://docs.couchbase.com/server/current/vector-search/choose-the-right-vector-index.html). Alternatively, if you want to perform semantic search using Couchbase Hyperscale or Composite Vector Indexes, please take a look at [this tutorial](https://developer.couchbase.com/tutorial-crewai-couchbase-rag-with-hyperscale-or-composite-vector-index).\n", + "\n", + "How to run this tutorial\n", + "----------------------\n", + "This tutorial is available as a Jupyter Notebook (.ipynb file) that you can run \n", + "interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/crewai/search_based/RAG_with_Couchbase_and_CrewAI.ipynb).\n", + "\n", + "You can either:\n", + "- Download the notebook file and run it on [Google Colab](https://colab.research.google.com)\n", + "- Run it on your system by setting up the Python environment\n", + "\n", + "Before you start\n", + "---------------\n", + "\n", + "1. 
Create and Deploy Your Free Tier Operational cluster on [Capella](https://cloud.couchbase.com/sign-up)\n", + " - To get started with [Couchbase Capella](https://cloud.couchbase.com), create an account and use it to deploy \n", + " a forever free tier operational cluster\n", + " - This account provides you with an environment where you can explore and learn \n", + " about Capella with no time constraint\n", + " - To learn more, please follow the [Getting Started Guide](https://docs.couchbase.com/cloud/get-started/create-account.html)\n", + "\n", + "2. Couchbase Capella Configuration\n", + " When running Couchbase using Capella, the following prerequisites need to be met:\n", + " - Create the database credentials to access the required bucket (Read and Write) used in the application\n", + " - Allow access to the Cluster from the IP on which the application is running by following the [Network Security documentation](https://docs.couchbase.com/cloud/security/security.html#public-access)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Setting the Stage: Installing Necessary Libraries\n", + "\n", + "We'll install the following key libraries:\n", + "- `datasets`: For loading and managing our training data\n", + "- `langchain-couchbase`: To integrate Couchbase with LangChain for vector storage and caching\n", + "- `langchain-openai`: For accessing OpenAI's embedding and chat models\n", + "- `crewai`: To create and orchestrate our AI agents for RAG operations\n", + "- `python-dotenv`: For securely managing environment variables and API keys\n", + "\n", + "These libraries provide the foundation for building a semantic search engine with vector embeddings, \n", + "database integration, and agent-based RAG capabilities." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], + "source": [ + "%pip install --quiet datasets==4.1.0 langchain-couchbase==0.4.0 langchain-openai==0.3.33 crewai==0.186.1 python-dotenv==1.1.1 ipywidgets" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Importing Necessary Libraries\n", + "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, embedding generation, and dataset loading." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "import getpass\n", + "import json\n", + "import logging\n", + "import os\n", + "import time\n", + "from datetime import timedelta\n", + "\n", + "from couchbase.auth import PasswordAuthenticator\n", + "from couchbase.cluster import Cluster\n", + "from couchbase.diagnostics import PingState, ServiceType\n", + "from couchbase.exceptions import (InternalServerFailureException,\n", + " QueryIndexAlreadyExistsException,\n", + " ServiceUnavailableException)\n", + "from couchbase.management.buckets import CreateBucketSettings\n", + "from couchbase.management.search import SearchIndex\n", + "from couchbase.options import ClusterOptions\n", + "from datasets import load_dataset\n", + "from dotenv import load_dotenv\n", + "from crewai.tools import tool\n", + "from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore\n", + "from langchain_openai import ChatOpenAI, OpenAIEmbeddings\n", + "\n", + "from crewai import Agent, Crew, Process, Task" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Setup Logging\n", + "Logging is configured to track the progress of the script and capture any errors or warnings." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "logging.basicConfig(\n", + " level=logging.INFO,\n", + " format='%(asctime)s [%(levelname)s] %(message)s',\n", + " datefmt='%Y-%m-%d %H:%M:%S'\n", + ")\n", + "\n", + "# Suppress httpx logging\n", + "logging.getLogger('httpx').setLevel(logging.CRITICAL)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Loading Sensitive Information\n", + "In this section, we prompt the user to input essential configuration settings. These include sensitive information such as database credentials and specific configuration names. Instead of hardcoding these details into the script, we request the user to provide them at runtime, ensuring flexibility and security.\n", + "\n", + "The script uses environment variables to store sensitive information, enhancing the overall security and maintainability of your code by avoiding hardcoded values." 
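, + "\n", + "\n", + "If you prefer not to answer the interactive prompts, you can pre-populate the environment before running the next cell. A sketch with placeholder values (every value below is an example to replace with your own):\n", + "\n", + "```python\n", + "import os\n", + "\n", + "# Placeholder credentials -- replace with your own before running\n", + "os.environ[\"CB_HOST\"] = \"couchbases://cb.example.cloud.couchbase.com\"\n", + "os.environ[\"CB_USERNAME\"] = \"Administrator\"\n", + "os.environ[\"CB_PASSWORD\"] = \"password\"\n", + "os.environ[\"CB_BUCKET_NAME\"] = \"vector-search-testing\"\n", + "os.environ[\"OPENAI_API_KEY\"] = \"sk-...\"\n", + "```"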
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Configuration loaded successfully\n" + ] + } + ], + "source": [ + "# Load environment variables\n", + "load_dotenv(\"./.env\")\n", + "\n", + "# Configuration\n", + "OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or input(\"Enter your OpenAI API key: \")\n", + "if not OPENAI_API_KEY:\n", + " raise ValueError(\"OPENAI_API_KEY is not set\")\n", + "\n", + "CB_HOST = os.getenv('CB_HOST') or input(\"Enter Couchbase host (default: couchbase://localhost): \") or 'couchbase://localhost'\n", + "CB_USERNAME = os.getenv('CB_USERNAME') or input(\"Enter Couchbase username (default: Administrator): \") or 'Administrator'\n", + "CB_PASSWORD = os.getenv('CB_PASSWORD') or getpass.getpass(\"Enter Couchbase password (default: password): \") or 'password'\n", + "CB_BUCKET_NAME = os.getenv('CB_BUCKET_NAME') or input(\"Enter bucket name (default: vector-search-testing): \") or 'vector-search-testing'\n", + "INDEX_NAME = os.getenv('INDEX_NAME') or input(\"Enter index name (default: vector_search_crew): \") or 'vector_search_crew'\n", + "SCOPE_NAME = os.getenv('SCOPE_NAME') or input(\"Enter scope name (default: shared): \") or 'shared'\n", + "COLLECTION_NAME = os.getenv('COLLECTION_NAME') or input(\"Enter collection name (default: crew): \") or 'crew'\n", + "\n", + "print(\"Configuration loaded successfully\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Connecting to the Couchbase Cluster\n", + "Connecting to a Couchbase cluster is the foundation of our project. Couchbase will serve as our primary data store, handling all the storage and retrieval operations required for our semantic search engine. By establishing this connection, we enable our application to interact with the database, allowing us to perform operations such as storing embeddings, querying data, and managing collections. This connection is the gateway through which all data will flow, so ensuring it's set up correctly is paramount." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Successfully connected to Couchbase\n" + ] + } + ], + "source": [ + "# Connect to Couchbase\n", + "try:\n", + " auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)\n", + " options = ClusterOptions(auth)\n", + " cluster = Cluster(CB_HOST, options)\n", + " cluster.wait_until_ready(timedelta(seconds=5))\n", + " print(\"Successfully connected to Couchbase\")\n", + "except Exception as e:\n", + " print(f\"Failed to connect to Couchbase: {str(e)}\")\n", + " raise" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Verifying Search Service Availability\n", + " In this section, we verify that the Couchbase Search (FTS) service is available and responding correctly. This is a crucial check because our vector search functionality depends on it. 
If any issues are detected with the Search service, the function will raise an exception, allowing us to catch and handle problems early before attempting vector operations.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Search service is responding at: 18.117.138.157:18094\n", + "Search service check passed successfully\n" + ] + } + ], + "source": [ + "def check_search_service(cluster):\n", + " \"\"\"Verify search service availability using ping\"\"\"\n", + " try:\n", + " # Get ping result\n", + " ping_result = cluster.ping()\n", + " search_available = False\n", + " \n", + " # Check if search service is responding\n", + " for service_type, endpoints in ping_result.endpoints.items():\n", + " if service_type == ServiceType.Search:\n", + " for endpoint in endpoints:\n", + " if endpoint.state == PingState.OK:\n", + " search_available = True\n", + " print(f\"Search service is responding at: {endpoint.remote}\")\n", + " break\n", + " break\n", + "\n", + " if not search_available:\n", + " raise RuntimeError(\"Search/FTS service not found or not responding\")\n", + " \n", + " print(\"Search service check passed successfully\")\n", + " except Exception as e:\n", + " print(f\"Health check failed: {str(e)}\")\n", + " raise\n", + "try:\n", + " check_search_service(cluster)\n", + "except Exception as e:\n", + " print(f\"Failed to check search service: {str(e)}\")\n", + " raise" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setting Up Collections in Couchbase\n", + "\n", + "The setup_collection() function handles creating and configuring the hierarchical data organization in Couchbase:\n", + "\n", + "1. Bucket Creation:\n", + " - Checks if specified bucket exists, creates it if not\n", + " - Sets bucket properties like RAM quota (1024MB) and replication (disabled)\n", + " - Note: If you are using Capella, create a bucket manually called vector-search-testing (or any name you prefer) with the same properties.\n", + "\n", + "2. Scope Management: \n", + " - Verifies if requested scope exists within bucket\n", + " - Creates new scope if needed (unless it's the default \"_default\" scope)\n", + "\n", + "3. Collection Setup:\n", + " - Checks for collection existence within scope\n", + " - Creates collection if it doesn't exist\n", + " - Waits 2 seconds for collection to be ready\n", + "\n", + "Additional Tasks:\n", + "- Creates primary index on collection for query performance\n", + "- Clears any existing documents for clean state\n", + "- Implements comprehensive error handling and logging\n", + "\n", + "The function is called here to set up the main collection that stores our vector embeddings.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-17 14:34:30 [INFO] Bucket 'vector-search-testing' exists.\n", + "2025-09-17 14:34:32 [INFO] Scope 'shared' does not exist. Creating it...\n", + "2025-09-17 14:34:33 [INFO] Scope 'shared' created successfully.\n", + "2025-09-17 14:34:34 [INFO] Collection 'crew' does not exist. 
Creating it...\n", + "2025-09-17 14:34:36 [INFO] Collection 'crew' created successfully.\n", + "2025-09-17 14:34:41 [INFO] Primary index present or created successfully.\n", + "2025-09-17 14:34:43 [INFO] All documents cleared from the collection.\n" + ] + }, + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def setup_collection(cluster, bucket_name, scope_name, collection_name):\n", + " try:\n", + " # Check if bucket exists, create if it doesn't\n", + " try:\n", + " bucket = cluster.bucket(bucket_name)\n", + " logging.info(f\"Bucket '{bucket_name}' exists.\")\n", + " except Exception as e:\n", + " logging.info(f\"Bucket '{bucket_name}' does not exist. Creating it...\")\n", + " bucket_settings = CreateBucketSettings(\n", + " name=bucket_name,\n", + " bucket_type='couchbase',\n", + " ram_quota_mb=1024,\n", + " flush_enabled=True,\n", + " num_replicas=0\n", + " )\n", + " cluster.buckets().create_bucket(bucket_settings)\n", + " time.sleep(2) # Wait for bucket creation to complete and become available\n", + " bucket = cluster.bucket(bucket_name)\n", + " logging.info(f\"Bucket '{bucket_name}' created successfully.\")\n", + "\n", + " bucket_manager = bucket.collections()\n", + "\n", + " # Check if scope exists, create if it doesn't\n", + " scopes = bucket_manager.get_all_scopes()\n", + " scope_exists = any(scope.name == scope_name for scope in scopes)\n", + " \n", + " if not scope_exists and scope_name != \"_default\":\n", + " logging.info(f\"Scope '{scope_name}' does not exist. Creating it...\")\n", + " bucket_manager.create_scope(scope_name)\n", + " logging.info(f\"Scope '{scope_name}' created successfully.\")\n", + "\n", + " # Check if collection exists, create if it doesn't\n", + " collections = bucket_manager.get_all_scopes()\n", + " collection_exists = any(\n", + " scope.name == scope_name and collection_name in [col.name for col in scope.collections]\n", + " for scope in collections\n", + " )\n", + "\n", + " if not collection_exists:\n", + " logging.info(f\"Collection '{collection_name}' does not exist. Creating it...\")\n", + " bucket_manager.create_collection(scope_name, collection_name)\n", + " logging.info(f\"Collection '{collection_name}' created successfully.\")\n", + " else:\n", + " logging.info(f\"Collection '{collection_name}' already exists. Skipping creation.\")\n", + "\n", + " # Wait for collection to be ready\n", + " collection = bucket.scope(scope_name).collection(collection_name)\n", + " time.sleep(2) # Give the collection time to be ready for queries\n", + "\n", + " # Ensure primary index exists\n", + " try:\n", + " cluster.query(f\"CREATE PRIMARY INDEX IF NOT EXISTS ON `{bucket_name}`.`{scope_name}`.`{collection_name}`\").execute()\n", + " logging.info(\"Primary index present or created successfully.\")\n", + " except Exception as e:\n", + " logging.warning(f\"Error creating primary index: {str(e)}\")\n", + "\n", + " # Clear all documents in the collection\n", + " try:\n", + " query = f\"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`\"\n", + " cluster.query(query).execute()\n", + " logging.info(\"All documents cleared from the collection.\")\n", + " except Exception as e:\n", + " logging.warning(f\"Error while clearing documents: {str(e)}. 
The collection might be empty.\")\n", + "\n", + " return collection\n", + " except Exception as e:\n", + " raise RuntimeError(f\"Error setting up collection: {str(e)}\")\n", + " \n", + "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Configuring and Initializing Couchbase Vector Search Index for Semantic Document Retrieval\n", + "\n", + "Semantic search requires an efficient way to retrieve relevant documents based on a user's query. This is where the Couchbase Vector Search Index comes into play. In this step, we load the Vector Search Index definition from a JSON file, which specifies how the index should be structured. This includes the fields to be indexed, the dimensions of the vectors, and other parameters that determine how the search engine processes queries based on vector similarity.\n", + "\n", + "This CrewAI vector search index configuration requires specific default settings to function properly. This tutorial uses the bucket named `vector-search-testing` with the scope `shared` and collection `crew`. The configuration is set up for vectors with exactly `1536 dimensions`, using `dot product` similarity and optimized for `recall`. If you want to use a different bucket, scope, or collection, you will need to modify the index configuration accordingly.\n", + "\n", + "For more information on creating a vector search index, please follow the instructions at [Couchbase Vector Search Documentation](https://docs.couchbase.com/cloud/vector-search/create-vector-search-index-ui.html)." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "# Load index definition\n", + "try:\n", + " with open('crew_index.json', 'r') as file:\n", + " index_definition = json.load(file)\n", + "except FileNotFoundError as e:\n", + " print(f\"Error: crew_index.json file not found: {str(e)}\")\n", + " raise\n", + "except json.JSONDecodeError as e:\n", + " print(f\"Error: Invalid JSON in crew_index.json: {str(e)}\")\n", + " raise\n", + "except Exception as e:\n", + " print(f\"Error loading index definition: {str(e)}\")\n", + " raise" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Creating or Updating Search Indexes\n", + "\n", + "With the index definition loaded, the next step is to create or update the **Vector Search Index** in Couchbase. This step is crucial because it optimizes our database for vector similarity search operations, allowing us to perform searches based on the semantic content of documents rather than just keywords. By creating or updating a Vector Search Index, we enable our search engine to handle complex queries that involve finding semantically similar documents using vector embeddings, which is essential for a robust semantic search engine." 
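, + "\n", + "\n", + "For orientation, the vector-relevant portion of crew_index.json has roughly the shape sketched below, expressed as a Python dict. This is illustrative only; the crew_index.json file in this folder is the source of truth, and exact parameter names can vary across server versions:\n", + "\n", + "```python\n", + "# Rough shape of the vector field definition inside crew_index.json\n", + "# (illustrative only; the real file in this folder is authoritative)\n", + "vector_field_sketch = {\n", + "    \"name\": \"embedding\",\n", + "    \"type\": \"vector\",\n", + "    \"dims\": 1536,  # matches text-embedding-3-small\n", + "    \"similarity\": \"dot_product\",  # as described above\n", + "    \"vector_index_optimized_for\": \"recall\",\n", + "}\n", + "```"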
+ ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-17 14:34:47 [INFO] Creating new index 'vector_search_crew'...\n", + "2025-09-17 14:34:48 [INFO] Index 'vector_search_crew' successfully created/updated.\n" + ] + } + ], + "source": [ + "try:\n", + " scope_index_manager = cluster.bucket(CB_BUCKET_NAME).scope(SCOPE_NAME).search_indexes()\n", + "\n", + " # Check if index already exists\n", + " existing_indexes = scope_index_manager.get_all_indexes()\n", + " index_name = index_definition[\"name\"]\n", + "\n", + " if index_name in [index.name for index in existing_indexes]:\n", + " logging.info(f\"Index '{index_name}' found\")\n", + " else:\n", + " logging.info(f\"Creating new index '{index_name}'...\")\n", + "\n", + " # Create SearchIndex object from JSON definition\n", + " search_index = SearchIndex.from_json(index_definition)\n", + "\n", + " # Upsert the index (create if not exists, update if exists)\n", + " scope_index_manager.upsert_index(search_index)\n", + " logging.info(f\"Index '{index_name}' successfully created/updated.\")\n", + "\n", + "except QueryIndexAlreadyExistsException:\n", + " logging.info(f\"Index '{index_name}' already exists. Skipping creation/update.\")\n", + "except ServiceUnavailableException:\n", + " raise RuntimeError(\"Search service is not available. Please ensure the Search service is enabled in your Couchbase cluster.\")\n", + "except InternalServerFailureException as e:\n", + " logging.error(f\"Internal server error: {str(e)}\")\n", + " raise" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setting Up OpenAI Components\n", + "\n", + "This section initializes two key OpenAI components needed for our RAG system:\n", + "\n", + "1. OpenAI Embeddings:\n", + " - Uses the 'text-embedding-3-small' model\n", + " - Converts text into high-dimensional vector representations (embeddings)\n", + " - These embeddings enable semantic search by capturing the meaning of text\n", + " - Required for vector similarity search in Couchbase\n", + "\n", + "2. ChatOpenAI Language Model:\n", + " - Uses the 'gpt-4o' model\n", + " - Temperature set to 0.2 for balanced creativity and focus\n", + " - Serves as the cognitive engine for CrewAI agents\n", + " - Powers agent reasoning, decision-making, and task execution\n", + " - Enables agents to:\n", + " - Process and understand retrieved context from vector search\n", + " - Generate thoughtful responses based on that context\n", + " - Follow instructions defined in agent roles and goals\n", + " - Collaborate with other agents in the crew\n", + " - The relatively low temperature (0.2) ensures agents produce reliable,\n", + " consistent outputs while maintaining some creative problem-solving ability\n", + "\n", + "Both components require a valid OpenAI API key (OPENAI_API_KEY) for authentication.\n", + "In the CrewAI framework, the LLM acts as the \"brain\" for each agent, allowing them\n", + "to interpret tasks, retrieve relevant information via the RAG system, and generate\n", + "appropriate outputs based on their specialized roles and expertise." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "OpenAI components initialized\n" + ] + } + ], + "source": [ + "# Initialize OpenAI components\n", + "embeddings = OpenAIEmbeddings(\n", + " openai_api_key=OPENAI_API_KEY,\n", + " model=\"text-embedding-3-small\"\n", + ")\n", + "\n", + "llm = ChatOpenAI(\n", + " openai_api_key=OPENAI_API_KEY,\n", + " model=\"gpt-4o\",\n", + " temperature=0.2\n", + ")\n", + "\n", + "print(\"OpenAI components initialized\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Setting Up the Couchbase Vector Store\n", + "A vector store is where we'll keep our embeddings. Unlike the FTS index, which is used for text-based search, the vector store is specifically designed to handle embeddings and perform similarity searches. When a user inputs a query, the search engine converts the query into an embedding and compares it against the embeddings stored in the vector store. This allows the engine to find documents that are semantically similar to the query, even if they don't contain the exact same words. By setting up the vector store in Couchbase, we create a powerful tool that enables our search engine to understand and retrieve information based on the meaning and context of the query, rather than just the specific words used." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Vector store initialized\n" + ] + } + ], + "source": [ + "# Setup vector store\n", + "vector_store = CouchbaseSearchVectorStore(\n", + " cluster=cluster,\n", + " bucket_name=CB_BUCKET_NAME,\n", + " scope_name=SCOPE_NAME,\n", + " collection_name=COLLECTION_NAME,\n", + " embedding=embeddings,\n", + " index_name=INDEX_NAME,\n", + ")\n", + "print(\"Vector store initialized\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Load the BBC News Dataset\n", + "To build a search engine, we need data to search through. We use the BBC News dataset from RealTimeData, which provides real-world news articles. This dataset contains news articles from BBC covering various topics and time periods. Loading the dataset is a crucial step because it provides the raw material that our search engine will work with. The quality and diversity of the news articles make it an excellent choice for testing and refining our search engine, ensuring it can handle real-world news content effectively.\n", + "\n", + "The BBC News dataset allows us to work with authentic news articles, enabling us to build and test a search engine that can effectively process and retrieve relevant news content. The dataset is loaded using the Hugging Face datasets library, specifically accessing the \"RealTimeData/bbc_news_alltime\" dataset with the \"2024-12\" version." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-17 14:35:10 [INFO] Successfully loaded the BBC News dataset with 2687 rows.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loaded the BBC News dataset with 2687 rows\n" + ] + } + ], + "source": [ + "try:\n", + " news_dataset = load_dataset(\n", + " \"RealTimeData/bbc_news_alltime\", \"2024-12\", split=\"train\"\n", + " )\n", + " print(f\"Loaded the BBC News dataset with {len(news_dataset)} rows\")\n", + " logging.info(f\"Successfully loaded the BBC News dataset with {len(news_dataset)} rows.\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Error loading the BBC News dataset: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Cleaning up the Data\n", + "We will use the content of the news articles for our RAG system.\n", + "\n", + "The dataset contains a few duplicate records. We are removing them to avoid duplicate results in the retrieval stage of our RAG system." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "We have 1749 unique articles in our database.\n" + ] + } + ], + "source": [ + "news_articles = news_dataset[\"content\"]\n", + "unique_articles = set()\n", + "for article in news_articles:\n", + " if article:\n", + " unique_articles.add(article)\n", + "unique_news_articles = list(unique_articles)\n", + "print(f\"We have {len(unique_news_articles)} unique articles in our database.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Saving Data to the Vector Store\n", + "To efficiently handle the large number of articles, we process them in batches of 50 articles at a time. This batch processing approach helps manage memory usage and provides better control over the ingestion process.\n", + "\n", + "We first filter out any articles that exceed 50,000 characters to avoid potential issues with token limits. Then, using the vector store's add_texts method, we add the filtered articles to our vector database. The batch_size parameter controls how many articles are processed in each iteration.\n", + "\n", + "This approach offers several benefits:\n", + "1. Memory Efficiency: Processing in smaller batches prevents memory overload\n", + "2. Error Handling: If an error occurs, only the current batch is affected\n", + "3. Progress Tracking: Easier to monitor and track the ingestion progress\n", + "4. 
+ { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-17 14:36:58 [INFO] Document ingestion completed successfully.\n" + ] + } + ], + "source": [ + "batch_size = 50\n", + "\n", + "# Filter out articles that exceed the 50,000-character limit, then ingest in batches\n", + "articles = [article for article in unique_news_articles if article and len(article) <= 50000]\n", + "\n", + "try:\n", + " vector_store.add_texts(\n", + " texts=articles,\n", + " batch_size=batch_size\n", + " )\n", + " logging.info(\"Document ingestion completed successfully.\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Failed to save documents to vector store: {str(e)}\")" + ] + },
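+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Optionally, you can confirm how many documents landed in the collection with a SQL++ count. This assumes the Query service is available on the cluster (on recent Couchbase versions the count can be served without a dedicated index; otherwise create a primary index on the collection first)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Optional sanity check: count the documents in the target collection.\n", + "# Assumes the Query service is reachable with these credentials.\n", + "count_query = f\"SELECT COUNT(1) AS cnt FROM `{CB_BUCKET_NAME}`.`{SCOPE_NAME}`.`{COLLECTION_NAME}`\"\n", + "for row in cluster.query(count_query):\n", + " print(f\"Documents in collection: {row['cnt']}\")" + ] + },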
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Creating a Vector Search Tool\n", + "After loading our data into the vector store, we need to create a tool that can efficiently search through these vector embeddings. This involves two key components:\n", + "\n", + "### Vector Retriever\n", + "The vector retriever performs semantic similarity searches against our vector database: it finds the documents whose vector embeddings are closest to the query's embedding in the vector space.\n", + "\n", + "### Search Tool\n", + "The search tool wraps the retriever in a user-friendly interface that:\n", + "- Takes a query string as input\n", + "- Passes the query to the retriever to find relevant documents\n", + "- Formats the results with clear document separation using document numbers and dividers\n", + "- Returns the formatted results as a single string with each document clearly delineated\n", + "\n", + "The tool is designed to integrate seamlessly with our AI agents, providing them with reliable access to our knowledge base through vector similarity search. The tool receives the query as a plain string; the comments in the code below sketch how to normalise a structured query object if your agent framework passes one.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [ + "# Create vector retriever\n", + "retriever = vector_store.as_retriever(\n", + " search_type=\"similarity\",\n", + ")\n", + "\n", + "# Define the search tool using the @tool decorator\n", + "@tool(\"vector_search\")\n", + "def search_tool(query: str) -> str:\n", + " \"\"\"Search for relevant documents using vector similarity.\n", + " Input should be a simple text query string.\n", + " Returns the contents of the relevant documents as a single formatted string.\n", + " Use this tool to find detailed information about topics.\"\"\"\n", + " # CrewAI passes the query as a plain string. If your agent framework\n", + " # may pass a structured query object instead, normalise it first, e.g.:\n", + " # query = query if isinstance(query, str) else str(query.get('query', ''))\n", + "\n", + " # Invoke the retriever\n", + " docs = retriever.invoke(query)\n", + "\n", + " # Format the results\n", + " formatted_docs = \"\\n\\n\".join([\n", + " f\"Document {i+1}:\\n{'-'*40}\\n{doc.page_content}\"\n", + " for i, doc in enumerate(docs)\n", + " ])\n", + " return formatted_docs" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Creating CrewAI Agents\n", + "\n", + "We'll create two specialized AI agents using the CrewAI framework to handle different aspects of our information retrieval and analysis system:\n", + "\n", + "## Research Expert Agent\n", + "This agent is designed to:\n", + "- Execute semantic searches using our vector store\n", + "- Analyze and evaluate search results \n", + "- Identify key information and insights\n", + "- Verify facts across multiple sources\n", + "- Synthesize findings into comprehensive research summaries\n", + "\n", + "## Technical Writer Agent \n", + "This agent is responsible for:\n", + "- Taking research findings and structuring them logically\n", + "- Converting technical concepts into clear explanations\n", + "- Ensuring proper citation and attribution\n", + "- Maintaining engaging yet informative tone\n", + "- Producing well-formatted final outputs\n", + "\n", + "The agents work together in a coordinated way:\n", + "1. Research agent finds and analyzes relevant documents\n", + "2. Writer agent takes those findings and crafts polished responses\n", + "3. 
Both agents use a custom response template for consistent output\n", + "\n", + "This multi-agent approach allows us to:\n", + "- Leverage specialized expertise for different tasks\n", + "- Maintain high quality through separation of concerns\n", + "- Create more comprehensive and reliable outputs\n", + "- Scale the system's capabilities efficiently" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Agents created successfully\n" + ] + } + ], + "source": [ + "# Custom response template\n", + "response_template = \"\"\"\n", + "Analysis Results\n", + "===============\n", + "{%- if .Response %}\n", + "{{ .Response }}\n", + "{%- endif %}\n", + "\n", + "Sources\n", + "=======\n", + "{%- for tool in .Tools %}\n", + "* {{ tool.name }}\n", + "{%- endfor %}\n", + "\n", + "Metadata\n", + "========\n", + "* Confidence: {{ .Confidence }}\n", + "* Analysis Time: {{ .ExecutionTime }}\n", + "\"\"\"\n", + "\n", + "# Create research agent\n", + "researcher = Agent(\n", + " role='Research Expert',\n", + " goal='Find and analyze the most relevant documents to answer user queries accurately',\n", + " backstory=\"\"\"You are an expert researcher with deep knowledge in information retrieval \n", + " and analysis. Your expertise lies in finding, evaluating, and synthesizing information \n", + " from various sources. You have a keen eye for detail and can identify key insights \n", + " from complex documents. You always verify information across multiple sources and \n", + " provide comprehensive, accurate analyses.\"\"\",\n", + " tools=[search_tool],\n", + " llm=llm,\n", + " verbose=True,\n", + " memory=True,\n", + " allow_delegation=False,\n", + " response_template=response_template\n", + ")\n", + "\n", + "# Create writer agent\n", + "writer = Agent(\n", + " role='Technical Writer',\n", + " goal='Generate clear, accurate, and well-structured responses based on research findings',\n", + " backstory=\"\"\"You are a skilled technical writer with expertise in making complex \n", + " information accessible and engaging. You excel at organizing information logically, \n", + " explaining technical concepts clearly, and creating well-structured documents. You \n", + " ensure all information is properly cited, accurate, and presented in a user-friendly \n", + " manner. You have a talent for maintaining the reader's interest while conveying \n", + " detailed technical information.\"\"\",\n", + " llm=llm,\n", + " verbose=True,\n", + " memory=True,\n", + " allow_delegation=False,\n", + " response_template=response_template\n", + ")\n", + "\n", + "print(\"Agents created successfully\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## How CrewAI Agents Work in this RAG System\n", + "\n", + "### Agent-Based RAG Architecture\n", + "\n", + "This system uses a two-agent approach to implement Retrieval-Augmented Generation (RAG):\n", + "\n", + "1. **Research Expert Agent**:\n", + " - Receives the user query\n", + " - Uses the vector search tool to retrieve relevant documents from Couchbase\n", + " - Analyzes and synthesizes information from retrieved documents\n", + " - Produces a comprehensive research summary with key findings\n", + "\n", + "2. 
**Technical Writer Agent**:\n", + " - Takes the research summary as input\n", + " - Structures and formats the information\n", + " - Creates a polished, user-friendly response\n", + " - Ensures proper attribution and citation\n", + "\n", + "#### How the Process Works:\n", + "\n", + "1. **Query Processing**: User query is passed to the Research Agent\n", + "2. **Vector Search**: Query is converted to embeddings and matched against document vectors\n", + "3. **Document Retrieval**: Most similar documents are retrieved from Couchbase\n", + "4. **Analysis**: Research Agent analyzes documents for relevance and extracts key information\n", + "5. **Synthesis**: Research Agent combines findings into a coherent summary\n", + "6. **Refinement**: Writer Agent restructures and enhances the content\n", + "7. **Response Generation**: Final polished response is returned to the user\n", + "\n", + "This multi-agent approach separates concerns (research vs. writing) and leverages\n", + "specialized expertise for each task, resulting in higher quality responses.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Testing the Search System\n", + "\n", + "Test the system with an example query." + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "def process_query(query, researcher, writer):\n", + " \"\"\"\n", + " Test the complete RAG system with a user query.\n", + " \n", + " This function tests both the vector search capability and the agent-based processing:\n", + " 1. Vector search: Retrieves relevant documents from Couchbase\n", + " 2. Agent processing: Uses CrewAI agents to analyze and format the response\n", + " \n", + " The function measures performance and displays detailed outputs from each step.\n", + " \"\"\"\n", + " print(f\"\\nQuery: {query}\")\n", + " print(\"-\" * 80)\n", + " \n", + " # Create tasks\n", + " research_task = Task(\n", + " description=f\"Research and analyze information relevant to: {query}\",\n", + " agent=researcher,\n", + " expected_output=\"A detailed analysis with key findings and supporting evidence\"\n", + " )\n", + " \n", + " writing_task = Task(\n", + " description=\"Create a comprehensive and well-structured response\",\n", + " agent=writer,\n", + " expected_output=\"A clear, comprehensive response that answers the query\",\n", + " context=[research_task]\n", + " )\n", + " \n", + " # Create and execute crew\n", + " crew = Crew(\n", + " agents=[researcher, writer],\n", + " tasks=[research_task, writing_task],\n", + " process=Process.sequential,\n", + " verbose=True,\n", + " cache=True,\n", + " planning=True\n", + " )\n", + " \n", + " try:\n", + " start_time = time.time()\n", + " result = crew.kickoff()\n", + " elapsed_time = time.time() - start_time\n", + " \n", + " print(f\"\\nQuery completed in {elapsed_time:.2f} seconds\")\n", + " print(\"=\" * 80)\n", + " print(\"RESPONSE\")\n", + " print(\"=\" * 80)\n", + " print(result)\n", + " \n", + " if hasattr(result, 'tasks_output'):\n", + " print(\"\\n\" + \"=\" * 80)\n", + " print(\"DETAILED TASK OUTPUTS\")\n", + " print(\"=\" * 80)\n", + " for task_output in result.tasks_output:\n", + " print(f\"\\nTask: {task_output.description[:100]}...\")\n", + " print(\"-\" * 40)\n", + " print(f\"Output: {task_output.raw}\")\n", + " print(\"-\" * 40)\n", + " except Exception as e:\n", + " print(f\"Error executing crew: {str(e)}\")\n", + " logging.error(f\"Crew execution failed: {str(e)}\", exc_info=True)" + ] + },
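+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Before kicking off the full two-agent run below, you can sanity-check the retrieval step on its own. This optional cell calls the retriever directly, bypassing the agents; it assumes the ingested documents are already searchable." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Optional: verify vector retrieval directly, without the agents.\n", + "sample_docs = retriever.invoke(\"FA Cup third round draw\")\n", + "print(f\"Retrieved {len(sample_docs)} documents\")\n", + "if sample_docs:\n", + " print(sample_docs[0].page_content[:200])" + ] + }, + { + "cell_type": "code", + 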
"execution_count": 19, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Query: What are the key details about the FA Cup third round draw? Include information about Manchester United vs Arsenal, Tamworth vs Tottenham, and other notable fixtures.\n", + "--------------------------------------------------------------------------------\n" + ] + }, + { + "data": { + "text/html": [ + "
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Crew Execution Started \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502  Crew Execution Started                                                                                         \u2502\n",
+       "\u2502  Name: crew                                                                                                     \u2502\n",
+       "\u2502  ID: 02c49af6-ffe5-4bea-8cba-f3f08049625d                                                                       \u2502\n",
+       "\u2502  Tool Args:                                                                                                     \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[36m\u256d\u2500\u001b[0m\u001b[36m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[36m Crew Execution Started \u001b[0m\u001b[36m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[36m\u2500\u256e\u001b[0m\n", + "\u001b[36m\u2502\u001b[0m \u001b[36m\u2502\u001b[0m\n", + "\u001b[36m\u2502\u001b[0m \u001b[1;36mCrew Execution Started\u001b[0m \u001b[36m\u2502\u001b[0m\n", + "\u001b[36m\u2502\u001b[0m \u001b[37mName: \u001b[0m\u001b[36mcrew\u001b[0m \u001b[36m\u2502\u001b[0m\n", + "\u001b[36m\u2502\u001b[0m \u001b[37mID: \u001b[0m\u001b[36m02c49af6-ffe5-4bea-8cba-f3f08049625d\u001b[0m \u001b[36m\u2502\u001b[0m\n", + "\u001b[36m\u2502\u001b[0m \u001b[37mTool Args: \u001b[0m \u001b[36m\u2502\u001b[0m\n", + "\u001b[36m\u2502\u001b[0m \u001b[36m\u2502\u001b[0m\n", + "\u001b[36m\u2502\u001b[0m \u001b[36m\u2502\u001b[0m\n", + "\u001b[36m\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n",
+       "
\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[1m\u001b[93m \n", + "[2025-09-17 14:36:58][INFO]: Planning the crew execution\u001b[00m\n", + "[EventBus Error] Handler 'on_task_started' failed for event 'TaskStartedEvent': 'NoneType' object has no attribute 'key'\n" + ] + }, + { + "data": { + "text/html": [ + "
\n"
+      ],
+      "text/plain": []
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Task Completion \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502  Task Completed                                                                                                 \u2502\n",
+       "\u2502  Name: 5d4df0c5-14ad-47d7-8412-2cb8438a65df                                                                     \u2502\n",
+       "\u2502  Agent: Task Execution Planner                                                                                  \u2502\n",
+       "\u2502  Tool Args:                                                                                                     \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[32m\u256d\u2500\u001b[0m\u001b[32m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[32m Task Completion \u001b[0m\u001b[32m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[32m\u2500\u256e\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[1;32mTask Completed\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37mName: \u001b[0m\u001b[32m5d4df0c5-14ad-47d7-8412-2cb8438a65df\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37mAgent: \u001b[0m\u001b[32mTask Execution Planner\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37mTool Args: \u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n",
+       "
\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n"
+      ],
+      "text/plain": []
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "
\n",
+       "
\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n"
+      ],
+      "text/plain": []
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 \ud83d\udd27 Agent Tool Execution \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502  Agent: Research Expert                                                                                         \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502  Thought: Thought: To gather detailed information about the FA Cup third round draw, specifically focusing on   \u2502\n",
+       "\u2502  the matches Manchester United vs Arsenal and Tamworth vs Tottenham, I will perform a vector search using a     \u2502\n",
+       "\u2502  relevant query.                                                                                                \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502  Using Tool: vector_search                                                                                      \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[35m\u256d\u2500\u001b[0m\u001b[35m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[35m \ud83d\udd27 Agent Tool Execution \u001b[0m\u001b[35m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[35m\u2500\u256e\u001b[0m\n", + "\u001b[35m\u2502\u001b[0m \u001b[35m\u2502\u001b[0m\n", + "\u001b[35m\u2502\u001b[0m \u001b[37mAgent: \u001b[0m\u001b[1;92mResearch Expert\u001b[0m \u001b[35m\u2502\u001b[0m\n", + "\u001b[35m\u2502\u001b[0m \u001b[35m\u2502\u001b[0m\n", + "\u001b[35m\u2502\u001b[0m \u001b[37mThought: \u001b[0m\u001b[92mThought: To gather detailed information about the FA Cup third round draw, specifically focusing on \u001b[0m \u001b[35m\u2502\u001b[0m\n", + "\u001b[35m\u2502\u001b[0m \u001b[92mthe matches Manchester United vs Arsenal and Tamworth vs Tottenham, I will perform a vector search using a \u001b[0m \u001b[35m\u2502\u001b[0m\n", + "\u001b[35m\u2502\u001b[0m \u001b[92mrelevant query.\u001b[0m \u001b[35m\u2502\u001b[0m\n", + "\u001b[35m\u2502\u001b[0m \u001b[35m\u2502\u001b[0m\n", + "\u001b[35m\u2502\u001b[0m \u001b[37mUsing Tool: \u001b[0m\u001b[1;92mvector_search\u001b[0m \u001b[35m\u2502\u001b[0m\n", + "\u001b[35m\u2502\u001b[0m \u001b[35m\u2502\u001b[0m\n", + "\u001b[35m\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Tool Input \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502  \"{\\\"query\\\": \\\"FA Cup third round draw Manchester United vs Arsenal Tamworth vs Tottenham\\\"}\"                  \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[34m\u256d\u2500\u001b[0m\u001b[34m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[34m Tool Input \u001b[0m\u001b[34m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[34m\u2500\u256e\u001b[0m\n", + "\u001b[34m\u2502\u001b[0m \u001b[34m\u2502\u001b[0m\n", + "\u001b[34m\u2502\u001b[0m \u001b[38;2;230;219;116;49m\"{\\\"query\\\": \\\"FA Cup third round draw Manchester United vs Arsenal Tamworth vs Tottenham\\\"}\"\u001b[0m \u001b[34m\u2502\u001b[0m\n", + "\u001b[34m\u2502\u001b[0m \u001b[34m\u2502\u001b[0m\n", + "\u001b[34m\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n",
+       "
\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n"
+      ],
+      "text/plain": []
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "
\n",
+       "
\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n"
+      ],
+      "text/plain": []
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Task Completion \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502  Task Completed                                                                                                 \u2502\n",
+       "\u2502  Name: d883be8b-ac2a-4678-80b3-afdc803bd716                                                                     \u2502\n",
+       "\u2502  Agent: Research Expert                                                                                         \u2502\n",
+       "\u2502  Tool Args:                                                                                                     \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[32m\u256d\u2500\u001b[0m\u001b[32m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[32m Task Completion \u001b[0m\u001b[32m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[32m\u2500\u256e\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[1;32mTask Completed\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37mName: \u001b[0m\u001b[32md883be8b-ac2a-4678-80b3-afdc803bd716\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37mAgent: \u001b[0m\u001b[32mResearch Expert\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37mTool Args: \u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n",
+       "
\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n"
+      ],
+      "text/plain": []
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "
\n",
+       "
\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n"
+      ],
+      "text/plain": []
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "
\n",
+       "
\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n"
+      ],
+      "text/plain": []
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Task Completion \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502  Task Completed                                                                                                 \u2502\n",
+       "\u2502  Name: 674a305d-1a6f-4b60-9497-ff4140f0f473                                                                     \u2502\n",
+       "\u2502  Agent: Technical Writer                                                                                        \u2502\n",
+       "\u2502  Tool Args:                                                                                                     \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[32m\u256d\u2500\u001b[0m\u001b[32m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[32m Task Completion \u001b[0m\u001b[32m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[32m\u2500\u256e\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[1;32mTask Completed\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37mName: \u001b[0m\u001b[32m674a305d-1a6f-4b60-9497-ff4140f0f473\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37mAgent: \u001b[0m\u001b[32mTechnical Writer\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[37mTool Args: \u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2502\u001b[0m \u001b[32m\u2502\u001b[0m\n", + "\u001b[32m\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n",
+       "
\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n",
+       "
\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Query completed in 38.89 seconds\n", + "================================================================================\n", + "RESPONSE\n", + "================================================================================\n", + "**FA Cup Third Round Draw: A Comprehensive Overview**\n", + "\n", + "The FA Cup third round draw is a pivotal moment in the English football calendar, marking the entry of Premier League and Championship clubs into the competition. This stage often brings thrilling encounters and the potential for giant-killing acts, capturing the imagination of fans worldwide. The significance of the third round is underscored by the rich history and tradition of the FA Cup, the world's oldest national football competition.\n", + "\n", + "**Manchester United vs Arsenal**\n", + "\n", + "One of the standout fixtures of the third round is the clash between Manchester United and Arsenal. This match is set to take place over the weekend of Saturday, 11 January. Manchester United, the current holders of the FA Cup, will travel to face Arsenal, who have won the competition a record 14 times. The match is significant as it involves two of the most successful clubs in FA Cup history, both known for their storied pasts and passionate fanbases.\n", + "\n", + "- **Date and Venue:** Weekend of Saturday, 11 January, at Arsenal's home ground.\n", + "- **Team Statistics:** Manchester United have lifted the FA Cup 13 times, while Arsenal hold the record with 14 victories.\n", + "- **Recent Form:** Manchester United recently triumphed over Manchester City to claim their 13th FA Cup title, showcasing their competitive edge.\n", + "- **Predictions and Insights:** Given the historical rivalry and the stakes involved, this fixture promises to be a fiercely contested battle, with both teams eager to progress further in the tournament.\n", + "\n", + "**Tamworth vs Tottenham**\n", + "\n", + "Another intriguing fixture is the match between non-league side Tamworth and Premier League club Tottenham Hotspur. Tamworth, one of only two non-league clubs remaining in the competition, will host Spurs, highlighting the classic \"David vs Goliath\" narrative that the FA Cup is renowned for.\n", + "\n", + "- **Date and Venue:** To be played at Tamworth's home ground over the weekend of Saturday, 11 January.\n", + "- **Team Statistics:** Tamworth is the lowest-ranked team remaining in the competition, while Tottenham is a well-established Premier League club.\n", + "- **Recent Form:** Tamworth secured their place in the third round with a dramatic penalty shootout victory against League One side Burton Albion.\n", + "\n", + "... (output truncated for brevity)\n" + ] + } + ], + "source": [ + "# Disable logging before running the query\n", + "logging.disable(logging.CRITICAL)\n", + "\n", + "query = \"What are the key details about the FA Cup third round draw? Include information about Manchester United vs Arsenal, Tamworth vs Tottenham, and other notable fixtures.\"\n", + "process_query(query, researcher, writer)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusion\n", + "By following these steps, you've built a powerful RAG system that combines Couchbase's vector storage capabilities with CrewAI's agent-based architecture. 
This multi-agent approach separates research and writing concerns, resulting in higher quality responses to user queries.\n", + "\n", + "The system demonstrates several key advantages:\n", + "1. Efficient vector search using Couchbase's vector store\n", + "2. Specialized AI agents that focus on different aspects of the RAG pipeline\n", + "3. Collaborative workflow between agents to produce comprehensive, well-structured responses\n", + "4. Scalable architecture that can be extended with additional agents for more complex tasks\n", + "\n", + "Whether you're building a customer support system, a research assistant, or a knowledge management solution, this agent-based RAG approach provides a flexible foundation that can be adapted to various use cases and domains." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.7" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} \ No newline at end of file diff --git a/crewai/fts/crew_index.json b/crewai/search_based/crew_index.json similarity index 100% rename from crewai/fts/crew_index.json rename to crewai/search_based/crew_index.json diff --git a/crewai/fts/frontmatter.md b/crewai/search_based/frontmatter.md similarity index 100% rename from crewai/fts/frontmatter.md rename to crewai/search_based/frontmatter.md diff --git a/haystack/fts/RAG_with_Couchbase_Capella_and_OpenAI.ipynb b/haystack/fts/RAG_with_Couchbase_Capella_and_OpenAI.ipynb deleted file mode 100644 index d31b6509..00000000 --- a/haystack/fts/RAG_with_Couchbase_Capella_and_OpenAI.ipynb +++ /dev/null @@ -1,566 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# TMDB Movie Dataset RAG Pipeline with Couchbase and OpenAI\n", - "\n", - "This notebook demonstrates how to build a Retrieval Augmented Generation (RAG) system using:\n", - "- The TMDB 5000 movies dataset containing movie metadata and overviews\n", - "- Couchbase Capella as the vector store with FTS (Full Text Search)\n", - "- Haystack framework for the RAG pipeline\n", - "- OpenAI for embeddings and text generation\n", - "\n", - "The system allows users to ask questions about movies and get AI-generated answers based on the indexed movie data." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Installing Necessary Libraries\n", - "\n", - "To build our RAG system, we need a set of libraries. The libraries we install handle everything from connecting to databases to performing AI tasks. Each library has a specific role: Couchbase libraries manage database operations, Haystack handles AI model integrations and pipeline management, and we will use the OpenAI SDK for generating embeddings and calling OpenAI's language models." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%pip install datasets haystack-ai couchbase-haystack openai pandas" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Importing Necessary Libraries\n", - "\n", - "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, Haystack components for the RAG pipeline, embedding generation, and dataset loading."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import getpass\n", - "import base64\n", - "import logging\n", - "import sys\n", - "import time\n", - "import pandas as pd\n", - "from datetime import timedelta\n", - "\n", - "from couchbase.auth import PasswordAuthenticator\n", - "from couchbase.cluster import Cluster\n", - "from couchbase.exceptions import CouchbaseException\n", - "from couchbase.options import ClusterOptions\n", - "from datasets import load_dataset\n", - "from haystack import Pipeline, GeneratedAnswer\n", - "from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder\n", - "from haystack.components.preprocessors import DocumentCleaner\n", - "from haystack.components.writers import DocumentWriter\n", - "from haystack.components.builders.answer_builder import AnswerBuilder\n", - "from haystack.components.builders.prompt_builder import PromptBuilder\n", - "from haystack.components.generators import OpenAIGenerator\n", - "from haystack.utils import Secret\n", - "from haystack.dataclasses import Document\n", - "\n", - "from couchbase_haystack import (\n", - " CouchbaseSearchDocumentStore,\n", - " CouchbasePasswordAuthenticator,\n", - " CouchbaseClusterOptions,\n", - " CouchbaseSearchEmbeddingRetriever,\n", - ")\n", - "from couchbase.options import KnownConfigProfiles\n", - "\n", - "# Configure logging\n", - "logger = logging.getLogger(__name__)\n", - "logger.setLevel(logging.DEBUG)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Prerequisites\n", - "\n", - "## Create and Deploy Your Operational cluster on Capella\n", - "\n", - "To get started with Couchbase Capella, create an account and use it to deploy an operational cluster.\n", - "\n", - "To know more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n", - "\n", - "\n", - "### Couchbase Capella Configuration\n", - "\n", - "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met:\n", - "\n", - "* Have a multi-node Capella cluster running the Data, Query, Index, and Search services.\n", - "* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the travel-sample bucket (Read and Write) used in the application.\n", - "* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running.\n", - "\n", - "### OpenAI Models Setup\n", - "\n", - "In order to create the RAG application, we need an embedding model to ingest the documents for Vector Search and a large language model (LLM) for generating the responses based on the context. \n", - "\n", - "For this implementation, we'll use OpenAI's models which provide state-of-the-art performance for both embeddings and text generation:\n", - "\n", - "**Embedding Model**: We'll use OpenAI's `text-embedding-3-large` model, which provides high-quality embeddings with 3,072 dimensions for semantic search capabilities.\n", - "\n", - "**Large Language Model**: We'll use OpenAI's `gpt-4o` model for generating responses based on the retrieved context. 
This model offers excellent reasoning capabilities and can handle complex queries effectively.\n", - "\n", - "**Prerequisites for OpenAI Integration**:\n", - "* Create an OpenAI account at [platform.openai.com](https://platform.openai.com)\n", - "* Generate an API key from your OpenAI dashboard\n", - "* Ensure you have sufficient credits or a valid payment method set up\n", - "* Set up your API key as an environment variable or input it securely in the notebook\n", - "\n", - "For more details about OpenAI's models and pricing, please refer to the [OpenAI documentation](https://platform.openai.com/docs/models)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Configure Couchbase Credentials\n", - "\n", - "Enter your Couchbase and OpenAI credentials:\n", - "\n", - "**OPENAI_API_KEY** is your OpenAI API key which can be obtained from your OpenAI dashboard at [platform.openai.com](https://platform.openai.com/api-keys).\n", - "\n", - "**INDEX_NAME** is the name of the FTS search index we will use for vector search operations." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "CB_CONNECTION_STRING = input(\"Couchbase Cluster URL (default: localhost): \") or \"localhost\"\n", - "CB_USERNAME = input(\"Couchbase Username (default: admin): \") or \"admin\"\n", - "CB_PASSWORD = input(\"Couchbase password (default: Password@12345): \") or \"Password@12345\"\n", - "CB_BUCKET_NAME = input(\"Couchbase Bucket: \")\n", - "CB_SCOPE_NAME = input(\"Couchbase Scope: \")\n", - "CB_COLLECTION_NAME = input(\"Couchbase Collection: \")\n", - "CB_INDEX_NAME = input(\"Vector Search Index: \")\n", - "OPENAI_API_KEY = input(\"OpenAI API Key: \")\n", - "\n", - "# Check if the variables are correctly loaded\n", - "if not all([CB_CONNECTION_STRING, CB_USERNAME, CB_PASSWORD, CB_BUCKET_NAME, CB_SCOPE_NAME, CB_COLLECTION_NAME, CB_INDEX_NAME, OPENAI_API_KEY]):\n", - " raise ValueError(\"All configuration variables must be provided.\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from couchbase.cluster import Cluster \n", - "from couchbase.options import ClusterOptions\n", - "from couchbase.auth import PasswordAuthenticator\n", - "from couchbase.management.buckets import CreateBucketSettings\n", - "from couchbase.management.collections import CollectionSpec\n", - "from couchbase.management.search import SearchIndex\n", - "import json\n", - "\n", - "# Connect to Couchbase cluster\n", - "cluster = Cluster(CB_CONNECTION_STRING, ClusterOptions(\n", - " PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)))\n", - "\n", - "# Create bucket if it does not exist\n", - "bucket_manager = cluster.buckets()\n", - "try:\n", - " bucket_manager.get_bucket(CB_BUCKET_NAME)\n", - " print(f\"Bucket '{CB_BUCKET_NAME}' already exists.\")\n", - "except Exception as e:\n", - " print(f\"Bucket '{CB_BUCKET_NAME}' does not exist. 
Creating bucket...\")\n", - " bucket_settings = CreateBucketSettings(name=CB_BUCKET_NAME, ram_quota_mb=500)\n", - " bucket_manager.create_bucket(bucket_settings)\n", - " print(f\"Bucket '{CB_BUCKET_NAME}' created successfully.\")\n", - "\n", - "# Create scope and collection if they do not exist\n", - "collection_manager = cluster.bucket(CB_BUCKET_NAME).collections()\n", - "scopes = collection_manager.get_all_scopes()\n", - "scope_exists = any(scope.name == CB_SCOPE_NAME for scope in scopes)\n", - "\n", - "if scope_exists:\n", - " print(f\"Scope '{CB_SCOPE_NAME}' already exists.\")\n", - "else:\n", - " print(f\"Scope '{CB_SCOPE_NAME}' does not exist. Creating scope...\")\n", - " collection_manager.create_scope(CB_SCOPE_NAME)\n", - " print(f\"Scope '{CB_SCOPE_NAME}' created successfully.\")\n", - "\n", - "collections = [collection.name for scope in scopes if scope.name == CB_SCOPE_NAME for collection in scope.collections]\n", - "collection_exists = CB_COLLECTION_NAME in collections\n", - "\n", - "if collection_exists:\n", - " print(f\"Collection '{CB_COLLECTION_NAME}' already exists in scope '{CB_SCOPE_NAME}'.\")\n", - "else:\n", - " print(f\"Collection '{CB_COLLECTION_NAME}' does not exist in scope '{CB_SCOPE_NAME}'. Creating collection...\")\n", - " collection_manager.create_collection(collection_name=CB_COLLECTION_NAME, scope_name=CB_SCOPE_NAME)\n", - " print(f\"Collection '{CB_COLLECTION_NAME}' created successfully.\")\n", - "\n", - "# Create search index from search_index.json file at scope level\n", - "with open('fts_index.json', 'r') as search_file:\n", - " search_index_definition = SearchIndex.from_json(json.load(search_file))\n", - " \n", - " # Update search index definition with user inputs\n", - " search_index_definition.name = CB_INDEX_NAME\n", - " search_index_definition.source_name = CB_BUCKET_NAME\n", - " \n", - " # Update types mapping\n", - " old_type_key = next(iter(search_index_definition.params['mapping']['types'].keys()))\n", - " type_obj = search_index_definition.params['mapping']['types'].pop(old_type_key)\n", - " search_index_definition.params['mapping']['types'][f\"{CB_SCOPE_NAME}.{CB_COLLECTION_NAME}\"] = type_obj\n", - " \n", - " search_index_name = search_index_definition.name\n", - " \n", - " # Get scope-level search manager\n", - " scope_search_manager = cluster.bucket(CB_BUCKET_NAME).scope(CB_SCOPE_NAME).search_indexes()\n", - " \n", - " try:\n", - " # Check if index exists at scope level\n", - " existing_index = scope_search_manager.get_index(search_index_name)\n", - " print(f\"Search index '{search_index_name}' already exists at scope level.\")\n", - " except Exception as e:\n", - " print(f\"Search index '{search_index_name}' does not exist at scope level. 
Creating search index from fts_index.json...\")\n", - " with open('fts_index.json', 'r') as search_file:\n", - " search_index_definition = SearchIndex.from_json(json.load(search_file))\n", - " scope_search_manager.upsert_index(search_index_definition)\n", - " print(f\"Search index '{search_index_name}' created successfully at scope level.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Load and Process Movie Dataset\n", - "\n", - "Load the TMDB movie dataset and prepare documents for indexing:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Load TMDB dataset\n", - "print(\"Loading TMDB dataset...\")\n", - "dataset = load_dataset(\"AiresPucrs/tmdb-5000-movies\")\n", - "movies_df = pd.DataFrame(dataset['train'])\n", - "print(f\"Total movies found: {len(movies_df)}\")\n", - "\n", - "# Create documents from movie data\n", - "docs_data = []\n", - "for _, row in movies_df.iterrows():\n", - " if pd.isna(row['overview']):\n", - " continue\n", - " \n", - " try:\n", - " docs_data.append({\n", - " 'id': str(row[\"id\"]),\n", - " 'content': f\"Title: {row['title']}\\nGenres: {', '.join([genre['name'] for genre in eval(row['genres'])])}\\nOverview: {row['overview']}\",\n", - " 'metadata': {\n", - " 'title': row['title'],\n", - " 'genres': row['genres'],\n", - " 'original_language': row['original_language'],\n", - " 'popularity': float(row['popularity']),\n", - " 'release_date': row['release_date'],\n", - " 'vote_average': float(row['vote_average']),\n", - " 'vote_count': int(row['vote_count']),\n", - " 'budget': int(row['budget']),\n", - " 'revenue': int(row['revenue'])\n", - " }\n", - " })\n", - " except Exception as e:\n", - " logger.error(f\"Error processing movie {row['title']}: {e}\")\n", - "\n", - "print(f\"Created {len(docs_data)} documents with valid overviews\")\n", - "documents = [Document(id=doc['id'], content=doc['content'], meta=doc['metadata']) \n", - " for doc in docs_data]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Initialize Document Store\n", - "\n", - "Set up the Couchbase document store for storing movie data and embeddings:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Initialize document store\n", - "document_store = CouchbaseSearchDocumentStore(\n", - " cluster_connection_string=Secret.from_token(CB_CONNECTION_STRING),\n", - " authenticator=CouchbasePasswordAuthenticator(\n", - " username=Secret.from_token(CB_USERNAME),\n", - " password=Secret.from_token(CB_PASSWORD)\n", - " ),\n", - " cluster_options=CouchbaseClusterOptions(\n", - " profile=KnownConfigProfiles.WanDevelopment,\n", - " ),\n", - " bucket=CB_BUCKET_NAME,\n", - " scope=CB_SCOPE_NAME,\n", - " collection=CB_COLLECTION_NAME,\n", - " vector_search_index=CB_INDEX_NAME,\n", - ")\n", - "\n", - "print(\"Couchbase document store initialized successfully.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Initialize Embedder for Document Embedding\n", - "\n", - "Configure the document embedder using Capella AI's endpoint and the E5 Mistral model. 
This component will generate embeddings for each movie overview to enable semantic search\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "embedder = OpenAIDocumentEmbedder(\n", - " api_key=Secret.from_token(OPENAI_API_KEY),\n", - " model=\"text-embedding-3-large\",\n", - ")\n", - "\n", - "rag_embedder = OpenAITextEmbedder(\n", - " api_key=Secret.from_token(OPENAI_API_KEY),\n", - " model=\"text-embedding-3-large\",\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Initialize LLM Generator\n", - "Configure the LLM generator using Capella AI's endpoint and Llama 3.1 model. This component will generate natural language responses based on the retrieved documents.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "llm = OpenAIGenerator(\n", - " api_key=Secret.from_token(OPENAI_API_KEY),\n", - " model=\"gpt-4o\",\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Create Indexing Pipeline\n", - "Build the pipeline for processing and indexing movie documents:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Create indexing pipeline\n", - "index_pipeline = Pipeline()\n", - "index_pipeline.add_component(\"cleaner\", DocumentCleaner())\n", - "index_pipeline.add_component(\"embedder\", embedder)\n", - "index_pipeline.add_component(\"writer\", DocumentWriter(document_store=document_store))\n", - "\n", - "# Connect indexing components\n", - "index_pipeline.connect(\"cleaner.documents\", \"embedder.documents\")\n", - "index_pipeline.connect(\"embedder.documents\", \"writer.documents\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Run Indexing Pipeline\n", - "\n", - "Execute the pipeline for processing and indexing movie documents:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Run indexing pipeline\n", - "\n", - "if documents:\n", - " # Process documents in batches for better performance\n", - " batch_size = 100\n", - " total_docs = len(documents)\n", - " \n", - " for i in range(0, total_docs, batch_size):\n", - " batch = documents[i:i + batch_size]\n", - " result = index_pipeline.run({\"cleaner\": {\"documents\": batch}})\n", - " print(f\"Processed batch {i//batch_size + 1}: {len(batch)} documents\")\n", - " \n", - " print(f\"\\nSuccessfully processed {total_docs} documents\")\n", - " print(f\"Sample document metadata: {documents[0].meta}\")\n", - "else:\n", - " print(\"No documents created. 
Skipping indexing.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Create RAG Pipeline\n", - "\n", - "Set up the Retrieval Augmented Generation pipeline for answering questions about movies:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Define RAG prompt template\n", - "prompt_template = \"\"\"\n", - "Given these documents, answer the question.\\nDocuments:\n", - "{% for doc in documents %}\n", - " {{ doc.content }}\n", - "{% endfor %}\n", - "\n", - "\\nQuestion: {{question}}\n", - "\\nAnswer:\n", - "\"\"\"\n", - "\n", - "# Create RAG pipeline\n", - "rag_pipeline = Pipeline()\n", - "\n", - "# Add components\n", - "rag_pipeline.add_component(\n", - " \"query_embedder\",\n", - " rag_embedder,\n", - ")\n", - "rag_pipeline.add_component(\"retriever\", CouchbaseSearchEmbeddingRetriever(document_store=document_store))\n", - "rag_pipeline.add_component(\"prompt_builder\", PromptBuilder(template=prompt_template))\n", - "rag_pipeline.add_component(\"llm\",llm)\n", - "rag_pipeline.add_component(\"answer_builder\", AnswerBuilder())\n", - "\n", - "# Connect RAG components\n", - "rag_pipeline.connect(\"query_embedder\", \"retriever.query_embedding\")\n", - "rag_pipeline.connect(\"retriever.documents\", \"prompt_builder.documents\")\n", - "rag_pipeline.connect(\"prompt_builder.prompt\", \"llm.prompt\")\n", - "rag_pipeline.connect(\"llm.replies\", \"answer_builder.replies\")\n", - "rag_pipeline.connect(\"llm.meta\", \"answer_builder.meta\")\n", - "rag_pipeline.connect(\"retriever\", \"answer_builder.documents\")\n", - "\n", - "print(\"RAG pipeline created successfully.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Ask Questions About Movies\n", - "\n", - "Use the RAG pipeline to ask questions about movies and get AI-generated answers:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Example question\n", - "question = \"Who does Savva want to save from the vicious hyenas?\"\n", - "\n", - "# Run the RAG pipeline\n", - "result = rag_pipeline.run(\n", - " {\n", - " \"query_embedder\": {\"text\": question},\n", - " \"retriever\": {\"top_k\": 5},\n", - " \"prompt_builder\": {\"question\": question},\n", - " \"answer_builder\": {\"query\": question},\n", - " },\n", - " include_outputs_from={\"retriever\", \"query_embedder\"}\n", - ")\n", - "\n", - "# Get the generated answer\n", - "answer: GeneratedAnswer = result[\"answer_builder\"][\"answers\"][0]\n", - "\n", - "# Print retrieved documents\n", - "print(\"=== Retrieved Documents ===\")\n", - "retrieved_docs = result[\"retriever\"][\"documents\"]\n", - "for idx, doc in enumerate(retrieved_docs, start=1):\n", - " print(f\"Id: {doc.id} Title: {doc.meta['title']}\")\n", - "\n", - "# Print final results\n", - "print(\"\\n=== Final Answer ===\")\n", - "print(f\"Question: {answer.query}\")\n", - "print(f\"Answer: {answer.data}\")\n", - "print(\"\\nSources:\")\n", - "for doc in answer.documents:\n", - " print(f\"-> {doc.meta['title']}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Conclusion\n", - "\n", - "In this tutorial, we built a Retrieval-Augmented Generation (RAG) system using Couchbase Capella, OpenAI, and Haystack with the BBC News dataset. 
This demonstrates how to combine vector search capabilities with large language models to answer questions about current events using real-time information.\n", - "\n", - "The key components include:\n", - "- **Couchbase Capella** for vector storage and FTS-based retrieval\n", - "- **Haystack** for pipeline orchestration and component management \n", - "- **OpenAI** for embeddings (`text-embedding-3-large`) and text generation (`gpt-4o`)\n", - "\n", - "This approach enables AI applications to access and reason over current information that extends beyond the LLM's training data, making responses more accurate and relevant for real-world use cases." - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "haystack", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.4" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/haystack/gsi/RAG_with_Couchbase_Capella_and_OpenAI.ipynb b/haystack/query_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb similarity index 98% rename from haystack/gsi/RAG_with_Couchbase_Capella_and_OpenAI.ipynb rename to haystack/query_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb index d07c49b0..f826e15a 100644 --- a/haystack/gsi/RAG_with_Couchbase_Capella_and_OpenAI.ipynb +++ b/haystack/query_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb @@ -16,7 +16,7 @@ "\n", "We leverage Couchbase's Global Secondary Index (GSI) vector search capabilities to create and manage vector indexes, enabling efficient semantic search capabilities. GSI provides high-performance vector search with support for both Hyperscale Vector Indexes and Composite Vector Indexes, designed to scale to billions of vectors with low memory footprint and optimized concurrent operations.\n", "\n", - "Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial will equip you with the knowledge to create a fully functional RAG system using OpenAI Services and Haystack with Couchbase's advanced GSI vector search." + "Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial will equip you with the knowledge to create a fully functional RAG system using OpenAI Services and Haystack with Couchbase's advanced GSI vector search. For guidance on choosing the right vector index for your use case, see the [Couchbase documentation](https://docs.couchbase.com/server/current/vector-search/choose-the-right-vector-index.html)." 
] }, { @@ -687,7 +687,7 @@ "\n", "Couchbase offers three types of vector indexes, but for GSI-based vector search we focus on two main types:\n", "\n", - "In this section, we'll set up the Couchbase vector store using GSI (Global Secondary Index) for high-performance vector search.\n", + "In this section, we'll set up the Couchbase vector store using Couchbase Hyperscale and Composite Vector Indexes for high-performance vector search.\n", "\n", "GSI vector search supports two main index types:\n", "\n", @@ -705,7 +705,7 @@ "- Combines a standard Global Secondary index (GSI) with a single vector column\n", "- Designed for searches using a single vector value along with standard scalar values that filter out large portions of the dataset. The scalar attributes in a query reduce the number of vectors the Couchbase Server has to compare when performing a vector search to find similar vectors.\n", "- Consume a moderate amount of memory and can index billions of documents.\n", - "- Work well for cases where your queries are highly selective — returning a small number of results from a large dataset\n", + "- Work well for cases where your queries are highly selective\u2009\u2014\u2009returning a small number of results from a large dataset\n", "\n", "Use Composite Vector indexes when you want to perform searches of documents using both scalars and a vector where the scalar values filter out large portions of the dataset.\n", "\n", @@ -735,7 +735,7 @@ "\n", "For detailed configuration options, see the [Quantization & Centroid Settings](https://docs.couchbase.com/server/current/vector-index/hyperscale-vector-index.html#algo_settings).\n", "\n", - "In the code below, we demonstrate creating a BHIVE index for optimal performance. This method takes an index type (BHIVE or COMPOSITE) and description parameter for optimization settings. Alternatively, GSI indexes can be created manually from the Couchbase UI. " + "In the code below, we demonstrate creating a BHIVE index for optimal performance. This method takes an index type (BHIVE or COMPOSITE) and description parameter for optimization settings. Alternatively, Hyperscale and Composite Vector indexes can be created manually from the Couchbase UI. 
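To make the index-creation step concrete (the tutorial's own code cell falls outside this hunk), here is a minimal, hedged sketch of creating and querying a query-based vector index through the Python SDK. The keyspace, field name, the `3072` dimension (chosen to match `text-embedding-3-large`), and the `IVF,SQ8` description are illustrative assumptions, not values taken from this tutorial; consult the linked Couchbase 8.0 documentation for the authoritative SQL++ syntax.

```python
# Minimal sketch, assuming Couchbase Server 8.0+ SQL++ vector DDL and a collection
# whose documents carry a 3072-dim "embedding" field; all names are placeholders.
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions, QueryOptions

cluster = Cluster("couchbase://localhost",
                  ClusterOptions(PasswordAuthenticator("admin", "password")))

# Hyperscale (BHIVE) index: vector-only, low memory footprint, scales to billions.
cluster.query("""
    CREATE VECTOR INDEX movies_bhive_idx
    ON `my_bucket`.`my_scope`.`my_collection`(embedding VECTOR)
    WITH {"dimension": 3072, "similarity": "cosine", "description": "IVF,SQ8"}
""").execute()

# Composite variant: scalar leading keys prefilter the vectors to compare, e.g.
# CREATE INDEX movies_composite_idx
#     ON `my_bucket`.`my_scope`.`my_collection`(original_language, embedding VECTOR)
#     WITH {"dimension": 3072, "similarity": "cosine"}

# Querying: APPROX_VECTOR_DISTANCE ranks rows by approximate similarity, so
# ORDER BY ... LIMIT k returns the k nearest neighbours via the index.
rows = cluster.query(
    """
    SELECT META().id, title
    FROM `my_bucket`.`my_scope`.`my_collection`
    ORDER BY APPROX_VECTOR_DISTANCE(embedding, $qvec, "cosine")
    LIMIT 5
    """,
    QueryOptions(named_parameters={"qvec": [0.1] * 3072}),  # placeholder embedding
)
for row in rows:
    print(row)
```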
" ] }, { @@ -865,4 +865,4 @@ }, "nbformat": 4, "nbformat_minor": 2 -} +} \ No newline at end of file diff --git a/haystack/gsi/frontmatter.md b/haystack/query_based/frontmatter.md similarity index 100% rename from haystack/gsi/frontmatter.md rename to haystack/query_based/frontmatter.md diff --git a/haystack/fts/requirements.txt b/haystack/query_based/requirements.txt similarity index 100% rename from haystack/fts/requirements.txt rename to haystack/query_based/requirements.txt diff --git a/haystack/search_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb b/haystack/search_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb new file mode 100644 index 00000000..be052be2 --- /dev/null +++ b/haystack/search_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb @@ -0,0 +1,566 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# BBC News Dataset RAG Pipeline with Couchbase and OpenAI\n", + "\n", + "This notebook demonstrates how to build a Retrieval Augmented Generation (RAG) system using:\n", + "- The BBC News dataset containing real-time news articles\n", + "- Couchbase Capella as the vector store with FTS (Full Text Search)\n", + "- Haystack framework for the RAG pipeline\n", + "- OpenAI for embeddings and text generation\n", + "\n", + "The system allows users to ask questions about current events and get AI-generated answers based on the latest news articles. For guidance on choosing the right vector index for your use case, see the [Couchbase documentation](https://docs.couchbase.com/server/current/vector-search/choose-the-right-vector-index.html)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Installing Necessary Libraries\n", + "\n", + "To build our RAG system, we need a set of libraries. The libraries we install handle everything from connecting to databases to performing AI tasks. Each library has a specific role: Couchbase libraries manage database operations, Haystack handles AI model integrations and pipeline management, and we will use the OpenAI SDK for generating embeddings and calling OpenAI's language models." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install datasets haystack-ai couchbase-haystack openai pandas" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Importing Necessary Libraries\n", + "\n", + "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, Haystack components for RAG pipeline, embedding generation, and dataset loading." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import getpass\n", + "import base64\n", + "import logging\n", + "import sys\n", + "import time\n", + "import pandas as pd\n", + "from datetime import timedelta\n", + "\n", + "from couchbase.auth import PasswordAuthenticator\n", + "from couchbase.cluster import Cluster\n", + "from couchbase.exceptions import CouchbaseException\n", + "from couchbase.options import ClusterOptions\n", + "from datasets import load_dataset\n", + "from haystack import Pipeline, GeneratedAnswer\n", + "from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder\n", + "from haystack.components.preprocessors import DocumentCleaner\n", + "from haystack.components.writers import DocumentWriter\n", + "from haystack.components.builders.answer_builder import AnswerBuilder\n", + "from haystack.components.builders.prompt_builder import PromptBuilder\n", + "from haystack.components.generators import OpenAIGenerator\n", + "from haystack.utils import Secret\n", + "from haystack.dataclasses import Document\n", + "\n", + "from couchbase_haystack import (\n", + "    CouchbaseSearchDocumentStore,\n", + "    CouchbasePasswordAuthenticator,\n", + "    CouchbaseClusterOptions,\n", + "    CouchbaseSearchEmbeddingRetriever,\n", + ")\n", + "from couchbase.options import KnownConfigProfiles\n", + "\n", + "# Configure logging\n", + "logger = logging.getLogger(__name__)\n", + "logger.setLevel(logging.DEBUG)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Prerequisites\n", + "\n", + "## Create and Deploy Your Operational cluster on Capella\n", + "\n", + "To get started with Couchbase Capella, create an account and use it to deploy an operational cluster.\n", + "\n", + "To learn more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n", + "\n", + "\n", + "### Couchbase Capella Configuration\n", + "\n", + "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met:\n", + "\n", + "* Have a multi-node Capella cluster running the Data, Query, Index, and Search services.\n", + "* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the bucket (Read and Write) used in the application.\n", + "* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running.\n", + "\n", + "### OpenAI Models Setup\n", + "\n", + "In order to create the RAG application, we need an embedding model to ingest the documents for Vector Search and a large language model (LLM) for generating the responses based on the context. \n", + "\n", + "For this implementation, we'll use OpenAI's models, which provide state-of-the-art performance for both embeddings and text generation:\n", + "\n", + "**Embedding Model**: We'll use OpenAI's `text-embedding-3-large` model, which provides high-quality embeddings with 3,072 dimensions for semantic search capabilities.\n", + "\n", + "**Large Language Model**: We'll use OpenAI's `gpt-4o` model for generating responses based on the retrieved context. 
This model offers excellent reasoning capabilities and can handle complex queries effectively.\n", + "\n", + "**Prerequisites for OpenAI Integration**:\n", + "* Create an OpenAI account at [platform.openai.com](https://platform.openai.com)\n", + "* Generate an API key from your OpenAI dashboard\n", + "* Ensure you have sufficient credits or a valid payment method set up\n", + "* Set up your API key as an environment variable or input it securely in the notebook\n", + "\n", + "For more details about OpenAI's models and pricing, please refer to the [OpenAI documentation](https://platform.openai.com/docs/models)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Configure Couchbase Credentials\n", + "\n", + "Enter your Couchbase and OpenAI credentials:\n", + "\n", + "**OPENAI_API_KEY** is your OpenAI API key which can be obtained from your OpenAI dashboard at [platform.openai.com](https://platform.openai.com/api-keys).\n", + "\n", + "**CB_INDEX_NAME** is the name of the FTS search index we will use for vector search operations." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "CB_CONNECTION_STRING = input(\"Couchbase Cluster URL (default: localhost): \") or \"localhost\"\n", + "CB_USERNAME = input(\"Couchbase Username (default: admin): \") or \"admin\"\n", + "CB_PASSWORD = input(\"Couchbase password (default: Password@12345): \") or \"Password@12345\"\n", + "CB_BUCKET_NAME = input(\"Couchbase Bucket: \")\n", + "CB_SCOPE_NAME = input(\"Couchbase Scope: \")\n", + "CB_COLLECTION_NAME = input(\"Couchbase Collection: \")\n", + "CB_INDEX_NAME = input(\"Vector Search Index: \")\n", + "OPENAI_API_KEY = input(\"OpenAI API Key: \")\n", + "\n", + "# Check if the variables are correctly loaded\n", + "if not all([CB_CONNECTION_STRING, CB_USERNAME, CB_PASSWORD, CB_BUCKET_NAME, CB_SCOPE_NAME, CB_COLLECTION_NAME, CB_INDEX_NAME, OPENAI_API_KEY]):\n", + "    raise ValueError(\"All configuration variables must be provided.\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from couchbase.cluster import Cluster \n", + "from couchbase.options import ClusterOptions\n", + "from couchbase.auth import PasswordAuthenticator\n", + "from couchbase.management.buckets import CreateBucketSettings\n", + "from couchbase.management.collections import CollectionSpec\n", + "from couchbase.management.search import SearchIndex\n", + "import json\n", + "\n", + "# Connect to Couchbase cluster\n", + "cluster = Cluster(CB_CONNECTION_STRING, ClusterOptions(\n", + "    PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)))\n", + "\n", + "# Create bucket if it does not exist\n", + "bucket_manager = cluster.buckets()\n", + "try:\n", + "    bucket_manager.get_bucket(CB_BUCKET_NAME)\n", + "    print(f\"Bucket '{CB_BUCKET_NAME}' already exists.\")\n", + "except Exception as e:\n", + "    print(f\"Bucket '{CB_BUCKET_NAME}' does not exist. 
Creating bucket...\")\n", + " bucket_settings = CreateBucketSettings(name=CB_BUCKET_NAME, ram_quota_mb=500)\n", + " bucket_manager.create_bucket(bucket_settings)\n", + " print(f\"Bucket '{CB_BUCKET_NAME}' created successfully.\")\n", + "\n", + "# Create scope and collection if they do not exist\n", + "collection_manager = cluster.bucket(CB_BUCKET_NAME).collections()\n", + "scopes = collection_manager.get_all_scopes()\n", + "scope_exists = any(scope.name == CB_SCOPE_NAME for scope in scopes)\n", + "\n", + "if scope_exists:\n", + " print(f\"Scope '{CB_SCOPE_NAME}' already exists.\")\n", + "else:\n", + " print(f\"Scope '{CB_SCOPE_NAME}' does not exist. Creating scope...\")\n", + " collection_manager.create_scope(CB_SCOPE_NAME)\n", + " print(f\"Scope '{CB_SCOPE_NAME}' created successfully.\")\n", + "\n", + "collections = [collection.name for scope in scopes if scope.name == CB_SCOPE_NAME for collection in scope.collections]\n", + "collection_exists = CB_COLLECTION_NAME in collections\n", + "\n", + "if collection_exists:\n", + " print(f\"Collection '{CB_COLLECTION_NAME}' already exists in scope '{CB_SCOPE_NAME}'.\")\n", + "else:\n", + " print(f\"Collection '{CB_COLLECTION_NAME}' does not exist in scope '{CB_SCOPE_NAME}'. Creating collection...\")\n", + " collection_manager.create_collection(collection_name=CB_COLLECTION_NAME, scope_name=CB_SCOPE_NAME)\n", + " print(f\"Collection '{CB_COLLECTION_NAME}' created successfully.\")\n", + "\n", + "# Create search index from search_index.json file at scope level\n", + "with open('fts_index.json', 'r') as search_file:\n", + " search_index_definition = SearchIndex.from_json(json.load(search_file))\n", + " \n", + " # Update search index definition with user inputs\n", + " search_index_definition.name = CB_INDEX_NAME\n", + " search_index_definition.source_name = CB_BUCKET_NAME\n", + " \n", + " # Update types mapping\n", + " old_type_key = next(iter(search_index_definition.params['mapping']['types'].keys()))\n", + " type_obj = search_index_definition.params['mapping']['types'].pop(old_type_key)\n", + " search_index_definition.params['mapping']['types'][f\"{CB_SCOPE_NAME}.{CB_COLLECTION_NAME}\"] = type_obj\n", + " \n", + " search_index_name = search_index_definition.name\n", + " \n", + " # Get scope-level search manager\n", + " scope_search_manager = cluster.bucket(CB_BUCKET_NAME).scope(CB_SCOPE_NAME).search_indexes()\n", + " \n", + " try:\n", + " # Check if index exists at scope level\n", + " existing_index = scope_search_manager.get_index(search_index_name)\n", + " print(f\"Search index '{search_index_name}' already exists at scope level.\")\n", + " except Exception as e:\n", + " print(f\"Search index '{search_index_name}' does not exist at scope level. 
Creating search index from search_vector_index.json...\")\n", + "        with open('search_vector_index.json', 'r') as search_file:\n", + "            search_index_definition = SearchIndex.from_json(json.load(search_file))\n", + "            scope_search_manager.upsert_index(search_index_definition)\n", + "            print(f\"Search index '{search_index_name}' created successfully at scope level.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Load and Process Movie Dataset\n", + "\n", + "Load the TMDB movie dataset and prepare documents for indexing:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Load TMDB dataset\n", + "print(\"Loading TMDB dataset...\")\n", + "dataset = load_dataset(\"AiresPucrs/tmdb-5000-movies\")\n", + "movies_df = pd.DataFrame(dataset['train'])\n", + "print(f\"Total movies found: {len(movies_df)}\")\n", + "\n", + "# Create documents from movie data\n", + "docs_data = []\n", + "for _, row in movies_df.iterrows():\n", + "    if pd.isna(row['overview']):\n", + "        continue\n", + "    \n", + "    try:\n", + "        docs_data.append({\n", + "            'id': str(row[\"id\"]),\n", + "            'content': f\"Title: {row['title']}\\nGenres: {', '.join([genre['name'] for genre in eval(row['genres'])])}\\nOverview: {row['overview']}\",\n", + "            'metadata': {\n", + "                'title': row['title'],\n", + "                'genres': row['genres'],\n", + "                'original_language': row['original_language'],\n", + "                'popularity': float(row['popularity']),\n", + "                'release_date': row['release_date'],\n", + "                'vote_average': float(row['vote_average']),\n", + "                'vote_count': int(row['vote_count']),\n", + "                'budget': int(row['budget']),\n", + "                'revenue': int(row['revenue'])\n", + "            }\n", + "        })\n", + "    except Exception as e:\n", + "        logger.error(f\"Error processing movie {row['title']}: {e}\")\n", + "\n", + "print(f\"Created {len(docs_data)} documents with valid overviews\")\n", + "documents = [Document(id=doc['id'], content=doc['content'], meta=doc['metadata']) \n", + "             for doc in docs_data]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Initialize Document Store\n", + "\n", + "Set up the Couchbase document store for storing movie data and embeddings:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Initialize document store\n", + "document_store = CouchbaseSearchDocumentStore(\n", + "    cluster_connection_string=Secret.from_token(CB_CONNECTION_STRING),\n", + "    authenticator=CouchbasePasswordAuthenticator(\n", + "        username=Secret.from_token(CB_USERNAME),\n", + "        password=Secret.from_token(CB_PASSWORD)\n", + "    ),\n", + "    cluster_options=CouchbaseClusterOptions(\n", + "        profile=KnownConfigProfiles.WanDevelopment,\n", + "    ),\n", + "    bucket=CB_BUCKET_NAME,\n", + "    scope=CB_SCOPE_NAME,\n", + "    collection=CB_COLLECTION_NAME,\n", + "    vector_search_index=CB_INDEX_NAME,\n", + ")\n", + "\n", + "print(\"Couchbase document store initialized successfully.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Initialize Embedder for Document Embedding\n", + "\n", + "Configure the document embedder using OpenAI's `text-embedding-3-large` model. 
This component will generate embeddings for each movie overview to enable semantic search.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "embedder = OpenAIDocumentEmbedder(\n", + "    api_key=Secret.from_token(OPENAI_API_KEY),\n", + "    model=\"text-embedding-3-large\",\n", + ")\n", + "\n", + "rag_embedder = OpenAITextEmbedder(\n", + "    api_key=Secret.from_token(OPENAI_API_KEY),\n", + "    model=\"text-embedding-3-large\",\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Initialize LLM Generator\n", + "Configure the LLM generator using OpenAI's `gpt-4o` model. This component will generate natural language responses based on the retrieved documents.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "llm = OpenAIGenerator(\n", + "    api_key=Secret.from_token(OPENAI_API_KEY),\n", + "    model=\"gpt-4o\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Create Indexing Pipeline\n", + "Build the pipeline for processing and indexing movie documents:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Create indexing pipeline\n", + "index_pipeline = Pipeline()\n", + "index_pipeline.add_component(\"cleaner\", DocumentCleaner())\n", + "index_pipeline.add_component(\"embedder\", embedder)\n", + "index_pipeline.add_component(\"writer\", DocumentWriter(document_store=document_store))\n", + "\n", + "# Connect indexing components\n", + "index_pipeline.connect(\"cleaner.documents\", \"embedder.documents\")\n", + "index_pipeline.connect(\"embedder.documents\", \"writer.documents\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Run Indexing Pipeline\n", + "\n", + "Execute the pipeline for processing and indexing movie documents:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Run indexing pipeline\n", + "\n", + "if documents:\n", + "    # Process documents in batches for better performance\n", + "    batch_size = 100\n", + "    total_docs = len(documents)\n", + "    \n", + "    for i in range(0, total_docs, batch_size):\n", + "        batch = documents[i:i + batch_size]\n", + "        result = index_pipeline.run({\"cleaner\": {\"documents\": batch}})\n", + "        print(f\"Processed batch {i//batch_size + 1}: {len(batch)} documents\")\n", + "    \n", + "    print(f\"\\nSuccessfully processed {total_docs} documents\")\n", + "    print(f\"Sample document metadata: {documents[0].meta}\")\n", + "else:\n", + "    print(\"No documents created. 
Skipping indexing.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Create RAG Pipeline\n", + "\n", + "Set up the Retrieval Augmented Generation pipeline for answering questions about movies:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Define RAG prompt template\n", + "prompt_template = \"\"\"\n", + "Given these documents, answer the question.\\nDocuments:\n", + "{% for doc in documents %}\n", + " {{ doc.content }}\n", + "{% endfor %}\n", + "\n", + "\\nQuestion: {{question}}\n", + "\\nAnswer:\n", + "\"\"\"\n", + "\n", + "# Create RAG pipeline\n", + "rag_pipeline = Pipeline()\n", + "\n", + "# Add components\n", + "rag_pipeline.add_component(\n", + " \"query_embedder\",\n", + " rag_embedder,\n", + ")\n", + "rag_pipeline.add_component(\"retriever\", CouchbaseSearchEmbeddingRetriever(document_store=document_store))\n", + "rag_pipeline.add_component(\"prompt_builder\", PromptBuilder(template=prompt_template))\n", + "rag_pipeline.add_component(\"llm\",llm)\n", + "rag_pipeline.add_component(\"answer_builder\", AnswerBuilder())\n", + "\n", + "# Connect RAG components\n", + "rag_pipeline.connect(\"query_embedder\", \"retriever.query_embedding\")\n", + "rag_pipeline.connect(\"retriever.documents\", \"prompt_builder.documents\")\n", + "rag_pipeline.connect(\"prompt_builder.prompt\", \"llm.prompt\")\n", + "rag_pipeline.connect(\"llm.replies\", \"answer_builder.replies\")\n", + "rag_pipeline.connect(\"llm.meta\", \"answer_builder.meta\")\n", + "rag_pipeline.connect(\"retriever\", \"answer_builder.documents\")\n", + "\n", + "print(\"RAG pipeline created successfully.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Ask Questions About Movies\n", + "\n", + "Use the RAG pipeline to ask questions about movies and get AI-generated answers:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Example question\n", + "question = \"Who does Savva want to save from the vicious hyenas?\"\n", + "\n", + "# Run the RAG pipeline\n", + "result = rag_pipeline.run(\n", + " {\n", + " \"query_embedder\": {\"text\": question},\n", + " \"retriever\": {\"top_k\": 5},\n", + " \"prompt_builder\": {\"question\": question},\n", + " \"answer_builder\": {\"query\": question},\n", + " },\n", + " include_outputs_from={\"retriever\", \"query_embedder\"}\n", + ")\n", + "\n", + "# Get the generated answer\n", + "answer: GeneratedAnswer = result[\"answer_builder\"][\"answers\"][0]\n", + "\n", + "# Print retrieved documents\n", + "print(\"=== Retrieved Documents ===\")\n", + "retrieved_docs = result[\"retriever\"][\"documents\"]\n", + "for idx, doc in enumerate(retrieved_docs, start=1):\n", + " print(f\"Id: {doc.id} Title: {doc.meta['title']}\")\n", + "\n", + "# Print final results\n", + "print(\"\\n=== Final Answer ===\")\n", + "print(f\"Question: {answer.query}\")\n", + "print(f\"Answer: {answer.data}\")\n", + "print(\"\\nSources:\")\n", + "for doc in answer.documents:\n", + " print(f\"-> {doc.meta['title']}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Conclusion\n", + "\n", + "In this tutorial, we built a Retrieval-Augmented Generation (RAG) system using Couchbase Capella, OpenAI, and Haystack with the BBC News dataset. 
This demonstrates how to combine vector search capabilities with large language models to answer questions about movies using the indexed movie data.\n", + "\n", + "The key components include:\n", + "- **Couchbase Capella** for vector storage and FTS-based retrieval\n", + "- **Haystack** for pipeline orchestration and component management \n", + "- **OpenAI** for embeddings (`text-embedding-3-large`) and text generation (`gpt-4o`)\n", + "\n", + "This approach enables AI applications to access and reason over domain-specific information that extends beyond the LLM's training data, making responses more accurate and relevant for real-world use cases." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "haystack", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.4" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} \ No newline at end of file diff --git a/haystack/fts/frontmatter.md b/haystack/search_based/frontmatter.md similarity index 100% rename from haystack/fts/frontmatter.md rename to haystack/search_based/frontmatter.md diff --git a/haystack/gsi/requirements.txt b/haystack/search_based/requirements.txt similarity index 100% rename from haystack/gsi/requirements.txt rename to haystack/search_based/requirements.txt diff --git a/haystack/fts/fts_index.json b/haystack/search_based/search_vector_index.json similarity index 100% rename from haystack/fts/fts_index.json rename to haystack/search_based/search_vector_index.json diff --git a/huggingface/fts/.env.sample b/huggingface/query_based/.env.sample similarity index 100% rename from huggingface/fts/.env.sample rename to huggingface/query_based/.env.sample diff --git a/huggingface/gsi/frontmatter.md b/huggingface/query_based/frontmatter.md similarity index 100% rename from huggingface/gsi/frontmatter.md rename to huggingface/query_based/frontmatter.md diff --git a/huggingface/gsi/hugging_face.ipynb b/huggingface/query_based/hugging_face.ipynb similarity index 90% rename from huggingface/gsi/hugging_face.ipynb rename to huggingface/query_based/hugging_face.ipynb index 79d501e2..ef7ab0fe 100644 --- a/huggingface/gsi/hugging_face.ipynb +++ b/huggingface/query_based/hugging_face.ipynb @@ -27,7 +27,7 @@ "\n", "This guide is designed to be comprehensive yet accessible, with clear step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system. Whether you're building a recommendation engine, content discovery platform, or any application requiring intelligent document retrieval, this tutorial provides the foundation you need.\n", "\n", - "**Note**: If you want to perform semantic search using the FTS (Full-Text Search) index instead, please take a look at [this alternative approach](https://developer.couchbase.com//tutorial-huggingface-couchbase-vector-search-with-fts)." + "**Note**: If you want to perform semantic search using the FTS (Full-Text Search) index instead, please take a look at [this alternative approach](https://developer.couchbase.com//tutorial-huggingface-couchbase-vector-search-with-search-vector-index). For guidance on choosing the right vector index for your use case, see the [Couchbase documentation](https://docs.couchbase.com/server/current/vector-search/choose-the-right-vector-index.html)." 
] }, { @@ -314,7 +314,7 @@ "id": "154912ee", "metadata": {}, "source": [ - "### Optimizing Vector Search with Global Secondary Index (GSI)" + "### Optimizing Vector Search with Hyperscale and Composite Vector Indexes" ] }, { @@ -402,7 +402,7 @@ "- **Use when**: Your queries combine vector similarity with scalar filters that eliminate large portions of data\n", "- **Features**: \n", " - Efficient pre-filtering where scalar attributes reduce the vector comparison scope\n", - " - Best for well-defined workloads requiring complex filtering using GSI features\n", + " - Best for well-defined workloads requiring complex filtering using Hyperscale and Composite Vector Index features\n", " - Supports range lookups combined with vector search" ] }, @@ -419,7 +419,7 @@ "id": "4ac316b5", "metadata": {}, "source": [ - "In this tutorial, we'll demonstrate creating a **BHIVE index** and running vector similarity queries using GSI. BHIVE is ideal for semantic search scenarios where you want:\n", + "In this tutorial, we'll demonstrate creating a **BHIVE index** and running vector similarity queries using Hyperscale and Composite Vector Indexes. BHIVE is ideal for semantic search scenarios where you want:\n", "\n", "1. **High-performance vector search** across large datasets\n", "2. **Low latency** for real-time applications\n", @@ -675,8 +675,8 @@ ], "source": [ "texts = [\n", - " \"Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.\",\n", - " \"It’s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.\",\n", + " \"Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON\u2019s versatility, with a foundation that is extremely fast and scalable.\",\n", + " \"It\u2019s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.\",\n", " input(\"Enter custom embedding text:\")\n", "]\n", "vector_store.add_texts(texts=texts, batch_size=32)" @@ -698,7 +698,7 @@ "Now let's demonstrate the performance benefits of different optimization approaches available in Couchbase. We'll compare three optimization levels to show how each contributes to building a production-ready semantic search system:\n", "\n", "1. **Baseline (Raw Search)**: Basic vector similarity search without GSI optimization\n", - "2. **GSI-Optimized Search**: High-performance search using BHIVE GSI index\n", + "2. **Vector Index-Optimized Search**: High-performance search using BHIVE GSI index\n", "3. **Cache Benefits**: Show how caching can be applied on top of any search approach\n", "\n", "**Important**: Caching is orthogonal to index types - you can apply caching benefits to both raw searches and GSI-optimized searches to improve repeated query performance." 
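The three-phase comparison in this notebook calls a `search_with_performance_metrics` helper whose definition falls outside the hunks shown here. Judging from the printed output and the `(elapsed, results)` return value visible below, it plausibly resembles the following sketch; the `vector_store` handle and the exact print format are assumptions, not the notebook's verbatim code.

```python
import time

def search_with_performance_metrics(query: str, phase_label: str, k: int = 3):
    """Time one similarity search and print results in the format shown below."""
    start = time.perf_counter()
    # Assumes `vector_store` is the langchain-couchbase store populated earlier
    # via add_texts(); similarity_search_with_score returns (Document, distance)
    # pairs, where a lower distance means a closer match.
    results = vector_store.similarity_search_with_score(query, k=k)
    elapsed = time.perf_counter() - start

    print(f"\n{phase_label}")
    print(f"Search completed in {elapsed:.4f} seconds")
    for i, (doc, distance) in enumerate(results, start=1):
        print(f"\n[Result {i}]")
        print(f"Vector Distance: {distance:.6f} (lower = more similar)")
        print(f"Document Content: {doc.page_content}")
    return elapsed, results
```

Because the helper only wraps the store's search call, the same function can time the baseline, the BHIVE-optimized, and the cached phases without modification.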
@@ -816,11 +816,11 @@ "\n", "[Result 1]\n", "Vector Distance: 0.586197 (lower = more similar)\n", - "Document Content: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.\n", + "Document Content: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON\u2019s versatility, with a foundation that is extremely fast and scalable.\n", "\n", "[Result 2]\n", "Vector Distance: 0.645435 (lower = more similar)\n", - "Document Content: It’s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.\n", + "Document Content: It\u2019s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.\n", "\n", "[Result 3]\n", "Vector Distance: 0.976888 (lower = more similar)\n", @@ -863,7 +863,7 @@ "output_type": "stream", "text": [ "Creating BHIVE GSI vector index...\n", - "✓ BHIVE GSI vector index created successfully!\n", + "\u2713 BHIVE GSI vector index created successfully!\n", "Waiting for index to become available...\n", "\n", "Testing performance with BHIVE GSI optimization...\n", @@ -875,11 +875,11 @@ "\n", "[Result 1]\n", "Vector Distance: 0.586197 (lower = more similar)\n", - "Document Content: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.\n", + "Document Content: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON\u2019s versatility, with a foundation that is extremely fast and scalable.\n", "\n", "[Result 2]\n", "Vector Distance: 0.645435 (lower = more similar)\n", - "Document Content: It’s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.\n", + "Document Content: It\u2019s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.\n", "\n", "[Result 3]\n", "Vector Distance: 0.976888 (lower = more similar)\n", @@ -897,7 +897,7 @@ " distance_metric=DistanceStrategy.COSINE,\n", " index_name=\"huggingface_bhive_index\",\n", " )\n", - " print(\"✓ BHIVE GSI vector index created successfully!\")\n", + " print(\"\u2713 BHIVE GSI vector index created successfully!\")\n", " \n", " # Wait for index to become available\n", " print(\"Waiting for index to become available...\")\n", @@ -905,14 +905,14 @@ " \n", "except Exception as e:\n", " if \"already exists\" in str(e).lower():\n", - " print(\"✓ BHIVE GSI vector index already exists, proceeding...\")\n", + " print(\"\u2713 BHIVE GSI vector index already exists, proceeding...\")\n", " else:\n", " print(f\"Error creating GSI index: {str(e)}\")\n", "\n", "# Test the same query with GSI optimization\n", "print(\"\\nTesting performance with BHIVE GSI optimization...\")\n", "gsi_time, gsi_results = search_with_performance_metrics(\n", - " test_query, \"Phase 2: GSI-Optimized Search\"\n", + " test_query, \"Phase 2: Vector Index-Optimized Search\"\n", ")" ] }, @@ -943,7 +943,7 @@ "output_type": "stream", 
"text": [ "Setting up Couchbase cache for improved performance on repeated queries...\n", - "✓ Couchbase cache enabled!\n", + "\u2713 Couchbase cache enabled!\n", "\n", "Testing cache benefits with a different query...\n", "First execution (cache miss):\n", @@ -955,11 +955,11 @@ "\n", "[Result 1]\n", "Vector Distance: 0.632770 (lower = more similar)\n", - "Document Content: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.\n", + "Document Content: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON\u2019s versatility, with a foundation that is extremely fast and scalable.\n", "\n", "[Result 2]\n", "Vector Distance: 0.677951 (lower = more similar)\n", - "Document Content: It’s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.\n", + "Document Content: It\u2019s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.\n", "\n", "Second execution (cache hit):\n", "\n", @@ -970,11 +970,11 @@ "\n", "[Result 1]\n", "Vector Distance: 0.632770 (lower = more similar)\n", - "Document Content: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.\n", + "Document Content: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON\u2019s versatility, with a foundation that is extremely fast and scalable.\n", "\n", "[Result 2]\n", "Vector Distance: 0.677951 (lower = more similar)\n", - "Document Content: It’s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.\n" + "Document Content: It\u2019s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.\n" ] } ], @@ -988,7 +988,7 @@ " collection_name=couchbase_collection,\n", ")\n", "set_llm_cache(cache)\n", - "print(\"✓ Couchbase cache enabled!\")\n", + "print(\"\u2713 Couchbase cache enabled!\")\n", "\n", "# Test cache benefits with the same query (should show improvement on second run)\n", "cache_query = \"How does a distributed database handle high-speed operations?\"\n", @@ -1036,7 +1036,7 @@ "VECTOR SEARCH PERFORMANCE OPTIMIZATION SUMMARY\n", "================================================================================\n", "Phase 1 - Baseline (Raw Search): 0.1484 seconds\n", - "Phase 2 - GSI-Optimized Search: 0.0848 seconds\n", + "Phase 2 - Vector Index-Optimized Search: 0.0848 seconds\n", "Phase 3 - Cache Benefits:\n", " First execution (cache miss): 0.1024 seconds\n", " Second execution (cache hit): 0.0289 seconds\n", @@ -1048,10 +1048,10 @@ "Cache Benefit: 3.55x faster (71.8% improvement)\n", "\n", "Key Insights:\n", - "• GSI optimization provides consistent performance benefits, especially with larger datasets\n", - "• Caching benefits apply to both raw and GSI-optimized searches\n", - "• Combined GSI + Cache provides the best performance for production applications\n", - "• BHIVE 
indexes scale to billions of vectors with optimized concurrent operations\n" + "\u2022 GSI optimization provides consistent performance benefits, especially with larger datasets\n", + "\u2022 Caching benefits apply to both raw and GSI-optimized searches\n", + "\u2022 Combined GSI + Cache provides the best performance for production applications\n", + "\u2022 BHIVE indexes scale to billions of vectors with optimized concurrent operations\n" ] } ], @@ -1061,7 +1061,7 @@ "print(\"=\"*80)\n", "\n", "print(f\"Phase 1 - Baseline (Raw Search): {baseline_time:.4f} seconds\")\n", - "print(f\"Phase 2 - GSI-Optimized Search: {gsi_time:.4f} seconds\")\n", + "print(f\"Phase 2 - Vector Index-Optimized Search: {gsi_time:.4f} seconds\")\n", "print(f\"Phase 3 - Cache Benefits:\")\n", "print(f\" First execution (cache miss): {cache_time_1:.4f} seconds\")\n", "print(f\" Second execution (cache hit): {cache_time_2:.4f} seconds\")\n", @@ -1087,10 +1087,10 @@ " print(f\"Cache Benefit: No significant improvement (results may be cached already)\")\n", "\n", "print(f\"\\nKey Insights:\")\n", - "print(f\"• GSI optimization provides consistent performance benefits, especially with larger datasets\")\n", - "print(f\"• Caching benefits apply to both raw and GSI-optimized searches\")\n", - "print(f\"• Combined GSI + Cache provides the best performance for production applications\")\n", - "print(f\"• BHIVE indexes scale to billions of vectors with optimized concurrent operations\")" + "print(f\"\u2022 GSI optimization provides consistent performance benefits, especially with larger datasets\")\n", + "print(f\"\u2022 Caching benefits apply to both raw and GSI-optimized searches\")\n", + "print(f\"\u2022 Combined GSI + Cache provides the best performance for production applications\")\n", + "print(f\"\u2022 BHIVE indexes scale to billions of vectors with optimized concurrent operations\")" ] }, { @@ -1131,11 +1131,11 @@ "\n", "[Result 2]\n", "Vector Distance: 0.860599 (lower = more similar)\n", - "Document Content: It’s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.\n", + "Document Content: It\u2019s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.\n", "\n", "[Result 3]\n", "Vector Distance: 0.909207 (lower = more similar)\n", - "Document Content: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.\n" + "Document Content: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON\u2019s versatility, with a foundation that is extremely fast and scalable.\n" ] }, { @@ -1144,9 +1144,9 @@ "(0.08118820190429688,\n", " [(Document(id='e20a8dcd8b464e8e819b87c9a0ff05c3', metadata={}, page_content='this is a sample text with the data \"hello\"'),\n", " 0.6236441411684932),\n", - " (Document(id='0442f351aec2415481138315d492ee80', metadata={}, page_content='It’s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.'),\n", + " (Document(id='0442f351aec2415481138315d492ee80', metadata={}, page_content='It\u2019s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector 
search, high-speed caching, and much more.'),\n", " 0.8605992009935179),\n", - " (Document(id='7c601881e4bf4c53b5b4c2a25628d904', metadata={}, page_content='Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.'),\n", + " (Document(id='7c601881e4bf4c53b5b4c2a25628d904', metadata={}, page_content='Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON\u2019s versatility, with a foundation that is extremely fast and scalable.'),\n", " 0.9092065785676496)])" ] }, @@ -1157,7 +1157,7 @@ ], "source": [ "custom_query = input(\"Enter your search query: \")\n", - "search_with_performance_metrics(custom_query, \"Interactive GSI-Optimized Search\")\n" + "search_with_performance_metrics(custom_query, \"Interactive Vector Index-Optimized Search\")\n" ] }, { @@ -1205,4 +1205,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} +} \ No newline at end of file diff --git a/huggingface/gsi/.env.sample b/huggingface/search_based/.env.sample similarity index 100% rename from huggingface/gsi/.env.sample rename to huggingface/search_based/.env.sample diff --git a/huggingface/fts/frontmatter.md b/huggingface/search_based/frontmatter.md similarity index 100% rename from huggingface/fts/frontmatter.md rename to huggingface/search_based/frontmatter.md diff --git a/huggingface/fts/hugging_face.ipynb b/huggingface/search_based/hugging_face.ipynb similarity index 93% rename from huggingface/fts/hugging_face.ipynb rename to huggingface/search_based/hugging_face.ipynb index 31b436a8..e9c9ff7a 100644 --- a/huggingface/fts/hugging_face.ipynb +++ b/huggingface/search_based/hugging_face.ipynb @@ -7,7 +7,7 @@ "source": [ "# Introduction\n", "\n", - "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database, [Hugging Face](https://huggingface.co/) as the AI-powered embedding Model. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system from scratch. Alternatively, if you want to perform semantic search using the GSI index, please take a look at [this.](https://developer.couchbase.com//tutorial-huggingface-couchbase-vector-search-with-global-secondary-index)" + "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database, [Hugging Face](https://huggingface.co/) as the AI-powered embedding Model. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system from scratch. For guidance on choosing the right vector index for your use case, see the [Couchbase documentation](https://docs.couchbase.com/server/current/vector-search/choose-the-right-vector-index.html). 
Alternatively, if you want to perform semantic search using Couchbase Hyperscale or Composite Vector Indexes, please take a look at [this.](https://developer.couchbase.com//tutorial-huggingface-couchbase-vector-search-with-hyperscale-or-composite-vector-index)" ] }, { @@ -252,8 +252,8 @@ "outputs": [], "source": [ "texts = [\n", - " \"Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.\",\n", - " \"It’s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.\",\n", + " \"Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON\u2019s versatility, with a foundation that is extremely fast and scalable.\",\n", + " \"It\u2019s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.\",\n", " input(\"Enter custom embedding text:\")\n", "]\n", "embeddings = []\n", @@ -307,7 +307,7 @@ "text": [ "Vector similarity search for phrase: \"name a multipurpose database with distributed capability\"\n", "Found answer: 3993ec2e-c184-4d7f-8fc3-55961afe264c; score: 0.9256534967756203\n", - "Answer text: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.\n", + "Answer text: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON\u2019s versatility, with a foundation that is extremely fast and scalable.\n", "------\n", "Vector similarity search for phrase: \"What is the data in the sample text?\"\n", "Found answer: a7748fac-b41f-4846-bebc-d89bdcd645e3; score: 1.0016003788325407\n", @@ -367,4 +367,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} +} \ No newline at end of file diff --git a/huggingface/fts/huggingface_index.json b/huggingface/search_based/huggingface_index.json similarity index 100% rename from huggingface/fts/huggingface_index.json rename to huggingface/search_based/huggingface_index.json diff --git a/jinaai/fts/RAG_with_Couchbase_and_Jina_AI.ipynb b/jinaai/fts/RAG_with_Couchbase_and_Jina_AI.ipynb deleted file mode 100644 index 434db0a8..00000000 --- a/jinaai/fts/RAG_with_Couchbase_and_Jina_AI.ipynb +++ /dev/null @@ -1,1110 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "kNdImxzypDlm" - }, - "source": [ - "# Introduction\n", - "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database and [Jina](https://jina.ai/) as the AI-powered embedding and language model provider, utilizing Full-Text Search (FTS). Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system from scratch. 
Alternatively, if you want to perform semantic search using the GSI index, please take a look at [this.](https://developer.couchbase.com/tutorial-jina-couchbase-rag-with-global-secondary-index)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# How to run this tutorial\n", - "\n", - "This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/jinaai/fts/RAG_with_Couchbase_and_Jina_AI.ipynb).\n", - "\n", - "You can either download the notebook file and run it on [Google Colab](https://colab.research.google.com/) or run it on your system by setting up the Python environment." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Before you start\n", - "\n", - "## Get Credentials for Jina AI\n", - "\n", - "* Please follow the [instructions](https://jina.ai/) to generate the Jina AI credentials.\n", - "* Please follow the [instructions](https://chat.jina.ai/api) to generate the JinaChat credentials.\n", - "\n", - "## Create and Deploy Your Free Tier Operational cluster on Capella\n", - "\n", - "To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. This account provides you with an environment where you can explore and learn about Capella with no time constraint.\n", - "\n", - "To learn more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n", - "\n", - "### Couchbase Capella Configuration\n", - "\n", - "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met.\n", - "\n", - "* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the required bucket (Read and Write) used in the application.\n", - "* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NH2o6pqa69oG" - }, - "source": [ - "# Setting the Stage: Installing Necessary Libraries\n", - "To build our semantic search engine, we need a robust set of tools. The libraries we install handle everything from connecting to databases to performing complex machine learning tasks. Each library has a specific role: Couchbase libraries manage database operations, LangChain handles AI model integrations, and Jina provides advanced AI models for generating embeddings and understanding natural language. By setting up these libraries, we ensure our environment is equipped to handle the data-intensive and computationally complex tasks required for semantic search."
- ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": { - "id": "DYhPj0Ta8l_A" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Note: you may need to restart the kernel to use updated packages.\n" - ] - } - ], - "source": [ - "# Jina doesn't support openai versions other than 0.27\n", - "%pip install --quiet datasets==3.6.0 langchain-couchbase==0.3.0 langchain-community==0.3.24 openai==0.27 python-dotenv==1.1.0 ipywidgets" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "1pp7GtNg8mB9" - }, - "source": [ - "# Importing Necessary Libraries\n", - "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, embedding generation, and dataset loading. These libraries provide essential functions for working with data, managing database connections, and processing machine learning models." - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": { - "id": "8GzS6tfL8mFP" - }, - "outputs": [], - "source": [ - "import getpass\n", - "import json\n", - "import logging\n", - "import os\n", - "import time\n", - "from datetime import timedelta\n", - "\n", - "from couchbase.auth import PasswordAuthenticator\n", - "from couchbase.cluster import Cluster\n", - "from couchbase.exceptions import (CouchbaseException,\n", - " InternalServerFailureException,\n", - " QueryIndexAlreadyExistsException,\n", - " ServiceUnavailableException)\n", - "from couchbase.management.buckets import CreateBucketSettings\n", - "from couchbase.management.search import SearchIndex\n", - "from couchbase.options import ClusterOptions\n", - "from datasets import load_dataset\n", - "from dotenv import load_dotenv\n", - "from langchain_community.chat_models import JinaChat\n", - "from langchain_community.embeddings import JinaEmbeddings\n", - "from langchain_core.globals import set_llm_cache\n", - "from langchain_core.output_parsers import StrOutputParser\n", - "from langchain_core.prompts import ChatPromptTemplate\n", - "from langchain_core.prompts.chat import ChatPromptTemplate\n", - "from langchain_core.runnables import RunnablePassthrough\n", - "from langchain_couchbase.cache import CouchbaseCache\n", - "from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pBnMp5vb8mIb" - }, - "source": [ - "# Setup Logging\n", - "Logging is configured to track the progress of the script and capture any errors or warnings. This is crucial for debugging and understanding the flow of execution. The logging output includes timestamps, log levels (e.g., INFO, ERROR), and messages that describe what is happening in the script.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": { - "id": "Yv8kWcuf8mLx" - }, - "outputs": [], - "source": [ - "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s',force=True)\n", - "\n", - "# Suppress all logs from specific loggers\n", - "logging.getLogger('openai').setLevel(logging.WARNING)\n", - "logging.getLogger('httpx').setLevel(logging.WARNING)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "K9G5a0en8mPA" - }, - "source": [ - "# Loading Sensitive Information\n", - "In this section, we load the essential configuration settings needed for integrating Couchbase with Jina's API.
These settings include sensitive information like API keys, database credentials, and specific configuration names. Instead of hardcoding these details into the script, we load them from a local `.env` file at runtime, ensuring flexibility and security.\n", - "\n", - "The script also validates that all required inputs are provided, raising an error if any crucial information is missing. This approach ensures that your integration is both secure and correctly configured without hardcoding sensitive information, enhancing the overall security and maintainability of your code." - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": { - "id": "PFGyHll18mSe" - }, - "outputs": [], - "source": [ - "load_dotenv(\"./.env\") \n", - "\n", - "JINA_API_KEY = os.getenv(\"JINA_API_KEY\")\n", - "JINACHAT_API_KEY = os.getenv(\"JINACHAT_API_KEY\")\n", - "\n", - "CB_HOST = os.getenv(\"CB_HOST\") or 'couchbase://localhost'\n", - "CB_USERNAME = os.getenv(\"CB_USERNAME\") or 'Administrator'\n", - "CB_PASSWORD = os.getenv(\"CB_PASSWORD\") or 'password'\n", - "CB_BUCKET_NAME = os.getenv(\"CB_BUCKET_NAME\") or 'vector-search-testing'\n", - "INDEX_NAME = os.getenv(\"INDEX_NAME\") or 'vector_search_jina'\n", - "\n", - "SCOPE_NAME = os.getenv(\"SCOPE_NAME\") or 'shared'\n", - "COLLECTION_NAME = os.getenv(\"COLLECTION_NAME\") or 'jina'\n", - "CACHE_COLLECTION = os.getenv(\"CACHE_COLLECTION\") or 'cache'\n", - "\n", - "# Check if the variables are correctly loaded\n", - "if not JINA_API_KEY:\n", - " raise ValueError(\"JINA_API_KEY environment variable is not set\")\n", - "if not JINACHAT_API_KEY:\n", - " raise ValueError(\"JINACHAT_API_KEY environment variable is not set\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "qtGrYzUY8mV3" - }, - "source": [ - "# Connecting to the Couchbase Cluster\n", - "Connecting to a Couchbase cluster is the foundation of our project. Couchbase will serve as our primary data store, handling all the storage and retrieval operations required for our semantic search engine. By establishing this connection, we enable our application to interact with the database, allowing us to perform operations such as storing embeddings, querying data, and managing collections. This connection is the gateway through which all data will flow, so ensuring it's set up correctly is paramount.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": { - "id": "Zb3kK-7W8mZK" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-23 10:45:51,014 - INFO - Successfully connected to Couchbase\n" - ] - } - ], - "source": [ - "try:\n", - " auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)\n", - " options = ClusterOptions(auth)\n", - " cluster = Cluster(CB_HOST, options)\n", - " cluster.wait_until_ready(timedelta(seconds=5))\n", - " logging.info(\"Successfully connected to Couchbase\")\n", - "except Exception as e:\n", - " raise ConnectionError(f\"Failed to connect to Couchbase: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "C_Gpy32N8mcZ" - }, - "source": [ - "## Setting Up Collections in Couchbase\n", - "\n", - "The setup_collection() function handles creating and configuring the hierarchical data organization in Couchbase:\n", - "\n", - "1.
Bucket Creation:\n", - " - Checks if specified bucket exists, creates it if not\n", - " - Sets bucket properties like RAM quota (1024MB) and replication (disabled)\n", - " - Note: You will not be able to create a bucket on Capella\n", - "\n", - "2. Scope Management: \n", - " - Verifies if requested scope exists within bucket\n", - " - Creates new scope if needed (unless it's the default \"_default\" scope)\n", - "\n", - "3. Collection Setup:\n", - " - Checks for collection existence within scope\n", - " - Creates collection if it doesn't exist\n", - " - Waits 2 seconds for collection to be ready\n", - "\n", - "Additional Tasks:\n", - "- Creates primary index on collection for query performance\n", - "- Clears any existing documents for clean state\n", - "- Implements comprehensive error handling and logging\n", - "\n", - "The function is called twice to set up:\n", - "1. Main collection for vector embeddings\n", - "2. Cache collection for storing results\n" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": { - "id": "ACZcwUnG8mf2" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-23 10:45:56,608 - INFO - Bucket 'vector-search-testing' exists.\n", - "2025-09-23 10:45:59,312 - INFO - Collection 'jina' already exists. Skipping creation.\n", - "2025-09-23 10:46:02,683 - INFO - Primary index present or created successfully.\n", - "2025-09-23 10:46:03,447 - INFO - All documents cleared from the collection.\n", - "2025-09-23 10:46:03,449 - INFO - Bucket 'vector-search-testing' exists.\n", - "2025-09-23 10:46:06,152 - INFO - Collection 'jina_cache' already exists. Skipping creation.\n", - "2025-09-23 10:46:09,482 - INFO - Primary index present or created successfully.\n", - "2025-09-23 10:46:09,804 - INFO - All documents cleared from the collection.\n" - ] - }, - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "def setup_collection(cluster, bucket_name, scope_name, collection_name):\n", - " try:\n", - " # Check if bucket exists, create if it doesn't\n", - " try:\n", - " bucket = cluster.bucket(bucket_name)\n", - " logging.info(f\"Bucket '{bucket_name}' exists.\")\n", - " except Exception as e:\n", - " logging.info(f\"Bucket '{bucket_name}' does not exist. Creating it...\")\n", - " bucket_settings = CreateBucketSettings(\n", - " name=bucket_name,\n", - " bucket_type='couchbase',\n", - " ram_quota_mb=1024,\n", - " flush_enabled=True,\n", - " num_replicas=0\n", - " )\n", - " cluster.buckets().create_bucket(bucket_settings)\n", - " time.sleep(2) # Wait for bucket creation to complete and become available\n", - " bucket = cluster.bucket(bucket_name)\n", - " logging.info(f\"Bucket '{bucket_name}' created successfully.\")\n", - "\n", - " bucket_manager = bucket.collections()\n", - "\n", - " # Check if scope exists, create if it doesn't\n", - " scopes = bucket_manager.get_all_scopes()\n", - " scope_exists = any(scope.name == scope_name for scope in scopes)\n", - " \n", - " if not scope_exists and scope_name != \"_default\":\n", - " logging.info(f\"Scope '{scope_name}' does not exist. 
Creating it...\")\n", - " bucket_manager.create_scope(scope_name)\n", - " logging.info(f\"Scope '{scope_name}' created successfully.\")\n", - "\n", - " # Check if collection exists, create if it doesn't\n", - " collections = bucket_manager.get_all_scopes()\n", - " collection_exists = any(\n", - " scope.name == scope_name and collection_name in [col.name for col in scope.collections]\n", - " for scope in collections\n", - " )\n", - "\n", - " if not collection_exists:\n", - " logging.info(f\"Collection '{collection_name}' does not exist. Creating it...\")\n", - " bucket_manager.create_collection(scope_name, collection_name)\n", - " logging.info(f\"Collection '{collection_name}' created successfully.\")\n", - " else:\n", - " logging.info(f\"Collection '{collection_name}' already exists. Skipping creation.\")\n", - "\n", - " # Wait for collection to be ready\n", - " collection = bucket.scope(scope_name).collection(collection_name)\n", - " time.sleep(2) # Give the collection time to be ready for queries\n", - "\n", - " # Ensure primary index exists\n", - " try:\n", - " cluster.query(f\"CREATE PRIMARY INDEX IF NOT EXISTS ON `{bucket_name}`.`{scope_name}`.`{collection_name}`\").execute()\n", - " logging.info(\"Primary index present or created successfully.\")\n", - " except Exception as e:\n", - " logging.warning(f\"Error creating primary index: {str(e)}\")\n", - "\n", - " # Clear all documents in the collection\n", - " try:\n", - " query = f\"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`\"\n", - " cluster.query(query).execute()\n", - " logging.info(\"All documents cleared from the collection.\")\n", - " except Exception as e:\n", - " logging.warning(f\"Error while clearing documents: {str(e)}. The collection might be empty.\")\n", - "\n", - " return collection\n", - " except Exception as e:\n", - " raise RuntimeError(f\"Error setting up collection: {str(e)}\")\n", - " \n", - "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME)\n", - "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, CACHE_COLLECTION)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NMJ7RRYp8mjV" - }, - "source": [ - "# Loading Couchbase Vector Search Index\n", - "\n", - "Semantic search requires an efficient way to retrieve relevant documents based on a user's query. This is where the Couchbase **Vector Search Index** comes into play. In this step, we load the Vector Search Index definition from a JSON file, which specifies how the index should be structured. This includes the fields to be indexed, the dimensions of the vectors, and other parameters that determine how the search engine processes queries based on vector similarity.\n", - "\n", - "This Jina vector search index configuration requires specific default settings to function properly. This tutorial uses the bucket named `vector-search-testing` with the scope `shared` and collection `jina`. The configuration is set up for vectors with exactly `1024 dimensions`, using dot product similarity and optimized for recall. 
If you want to use a different bucket, scope, or collection, you will need to modify the index configuration accordingly.\n", - "\n", - "For more information on creating a vector search index, please follow the [instructions](https://docs.couchbase.com/cloud/vector-search/create-vector-search-index-ui.html).\n" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": { - "id": "y7xiCrOc8mmj" - }, - "outputs": [], - "source": [ - "# If you are running this script locally (not in Google Colab), uncomment the following line\n", - "# and provide the path to your index definition file.\n", - "\n", - "# index_definition_path = '/path_to_your_index_file/jina_index.json' # Local setup: specify your file path here\n", - "\n", - "# # Version for Google Colab\n", - "# def load_index_definition_colab():\n", - "# from google.colab import files\n", - "# print(\"Upload your index definition file\")\n", - "# uploaded = files.upload()\n", - "# index_definition_path = list(uploaded.keys())[0]\n", - "\n", - "# try:\n", - "# with open(index_definition_path, 'r') as file:\n", - "# index_definition = json.load(file)\n", - "# return index_definition\n", - "# except Exception as e:\n", - "# raise ValueError(f\"Error loading index definition from {index_definition_path}: {str(e)}\")\n", - "\n", - "# Version for Local Environment\n", - "def load_index_definition_local(index_definition_path):\n", - " try:\n", - " with open(index_definition_path, 'r') as file:\n", - " index_definition = json.load(file)\n", - " return index_definition\n", - " except Exception as e:\n", - " raise ValueError(f\"Error loading index definition from {index_definition_path}: {str(e)}\")\n", - "\n", - "# Usage\n", - "# Uncomment the appropriate line based on your environment\n", - "# index_definition = load_index_definition_colab()\n", - "index_definition = load_index_definition_local('jina_index.json')" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "v_ddPQ_Y8mpm" - }, - "source": [ - "# Creating or Updating Search Indexes\n", - "\n", - "With the index definition loaded, the next step is to create or update the **Vector Search Index** in Couchbase. This step is crucial because it optimizes our database for vector similarity search operations, allowing us to perform searches based on the semantic content of documents rather than just keywords. By creating or updating a Vector Search Index, we enable our search engine to handle complex queries that involve finding semantically similar documents using vector embeddings, which is essential for a robust semantic search engine." - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": { - "id": "bHEpUu1l8msx" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-23 10:47:03,763 - INFO - Index 'vector_search_jina' found\n", - "2025-09-23 10:47:04,742 - INFO - Index 'vector_search_jina' already exists. 
Skipping creation/update.\n" - ] - } - ], - "source": [ - "try:\n", - " scope_index_manager = cluster.bucket(CB_BUCKET_NAME).scope(SCOPE_NAME).search_indexes()\n", - "\n", - " # Check if index already exists\n", - " existing_indexes = scope_index_manager.get_all_indexes()\n", - " index_name = index_definition[\"name\"]\n", - "\n", - " if index_name in [index.name for index in existing_indexes]:\n", - " logging.info(f\"Index '{index_name}' found\")\n", - " else:\n", - " logging.info(f\"Creating new index '{index_name}'...\")\n", - "\n", - " # Create SearchIndex object from JSON definition\n", - " search_index = SearchIndex.from_json(index_definition)\n", - "\n", - " # Upsert the index (create if not exists, update if exists)\n", - " scope_index_manager.upsert_index(search_index)\n", - " logging.info(f\"Index '{index_name}' successfully created/updated.\")\n", - "\n", - "except QueryIndexAlreadyExistsException:\n", - " logging.info(f\"Index '{index_name}' already exists. Skipping creation/update.\")\n", - "except ServiceUnavailableException:\n", - " raise RuntimeError(\"Search service is not available. Please ensure the Search service is enabled in your Couchbase cluster.\")\n", - "except InternalServerFailureException as e:\n", - " logging.error(f\"Internal server error: {str(e)}\")\n", - " raise" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "7FvxRsg38m3G" - }, - "source": [ - "# Creating Jina Embeddings\n", - "Embeddings are at the heart of semantic search. They are numerical representations of text that capture the semantic meaning of the words and phrases. Unlike traditional keyword-based search, which looks for exact matches, embeddings allow our search engine to understand the context and nuances of language, enabling it to retrieve documents that are semantically similar to the query, even if they don't contain the exact keywords. By creating embeddings using Jina, we equip our search engine with the ability to understand and process natural language in a way that's much closer to how humans understand language. This step transforms our raw text data into a format that the search engine can use to find and rank relevant documents.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": { - "id": "_75ZyCRh8m6m" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-23 10:47:06,326 - INFO - Successfully created JinaEmbeddings\n" - ] - } - ], - "source": [ - "try:\n", - " embeddings = JinaEmbeddings(\n", - " jina_api_key=JINA_API_KEY, model_name=\"jina-embeddings-v3\"\n", - " )\n", - " logging.info(\"Successfully created JinaEmbeddings\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Error creating JinaEmbeddings: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "8IwZMUnF8m-N" - }, - "source": [ - "# Setting Up the Couchbase Vector Store\n", - "A vector store is where we'll keep our embeddings. Unlike the FTS index, which is used for text-based search, the vector store is specifically designed to handle embeddings and perform similarity searches. When a user inputs a query, the search engine converts the query into an embedding and compares it against the embeddings stored in the vector store. This allows the engine to find documents that are semantically similar to the query, even if they don't contain the exact same words. 
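The comparison at the heart of this is plain vector arithmetic. A toy sketch with made-up three-dimensional vectors (the real embeddings in this tutorial have 1024 dimensions) shows why a larger dot product means a semantically closer document:

```python
# Toy illustration of dot-product similarity with hypothetical vectors.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

query_vec = [0.9, 0.1, 0.2]   # embedding of the query
doc_a = [0.8, 0.2, 0.1]       # semantically close document
doc_b = [0.1, 0.9, 0.4]       # unrelated document

print(dot(query_vec, doc_a))  # 0.76 -> ranked higher
print(dot(query_vec, doc_b))  # 0.26 -> ranked lower
```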
By setting up the vector store in Couchbase, we create a powerful tool that enables our search engine to understand and retrieve information based on the meaning and context of the query, rather than just the specific words used." - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": { - "id": "DwIJQjYT9RV_" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-23 10:47:12,343 - INFO - Successfully created vector store\n" - ] - } - ], - "source": [ - "try:\n", - " vector_store = CouchbaseSearchVectorStore(\n", - " cluster=cluster,\n", - " bucket_name=CB_BUCKET_NAME,\n", - " scope_name=SCOPE_NAME,\n", - " collection_name=COLLECTION_NAME,\n", - " embedding=embeddings,\n", - " index_name=INDEX_NAME,\n", - " )\n", - " logging.info(\"Successfully created vector store\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Failed to create vector store: {str(e)}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Load the BBC News Dataset\n", - "To build a search engine, we need data to search through. We use the BBC News dataset from RealTimeData, which provides real-world news articles. This dataset contains news articles from BBC covering various topics and time periods. Loading the dataset is a crucial step because it provides the raw material that our search engine will work with. The quality and diversity of the news articles make it an excellent choice for testing and refining our search engine, ensuring it can handle real-world news content effectively.\n", - "\n", - "The BBC News dataset allows us to work with authentic news articles, enabling us to build and test a search engine that can effectively process and retrieve relevant news content. The dataset is loaded using the Hugging Face datasets library, specifically accessing the \"RealTimeData/bbc_news_alltime\" dataset with the \"2024-12\" version." - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-23 10:47:18,035 - INFO - Successfully loaded the BBC News dataset with 2687 rows.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Loaded the BBC News dataset with 2687 rows\n" - ] - } - ], - "source": [ - "try:\n", - " news_dataset = load_dataset(\n", - " \"RealTimeData/bbc_news_alltime\", \"2024-12\", split=\"train\"\n", - " )\n", - " print(f\"Loaded the BBC News dataset with {len(news_dataset)} rows\")\n", - " logging.info(f\"Successfully loaded the BBC News dataset with {len(news_dataset)} rows.\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Error loading the BBC News dataset: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Cleaning up the Data\n", - "We will use the content of the news articles for our RAG system.\n", - "\n", - "The dataset contains a few duplicate records. We are removing them to avoid duplicate results in the retrieval stage of our RAG system." 
- ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "We have 1749 unique articles in our database.\n" - ] - } - ], - "source": [ - "news_articles = news_dataset[\"content\"]\n", - "unique_articles = set()\n", - "for article in news_articles:\n", - " if article:\n", - " unique_articles.add(article)\n", - "unique_news_articles = list(unique_articles)\n", - "print(f\"We have {len(unique_news_articles)} unique articles in our database.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Saving Data to the Vector Store\n", - "To efficiently handle the large number of articles, we process them in batches of 50 articles at a time. This batch processing approach helps manage memory usage and provides better control over the ingestion process.\n", - "\n", - "We first filter out any articles that exceed 50,000 characters to avoid potential issues with token limits. Then, using the vector store's add_texts method, we add the filtered articles to our vector database. The batch_size parameter controls how many articles are processed in each iteration.\n", - "\n", - "This approach offers several benefits:\n", - "1. Memory Efficiency: Processing in smaller batches prevents memory overload\n", - "2. Error Handling: If an error occurs, only the current batch is affected\n", - "3. Progress Tracking: Easier to monitor and track the ingestion progress\n", - "4. Resource Management: Better control over CPU and network resource utilization\n", - "\n", - "We use a conservative batch size of 50 to ensure reliable operation.\n", - "The optimal batch size depends on many factors including:\n", - "- Document sizes being inserted\n", - "- Available system resources\n", - "- Network conditions\n", - "- Concurrent workload\n", - "\n", - "Consider measuring performance with your specific workload before adjusting.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-23 10:50:03,866 - INFO - Document ingestion completed successfully\n" - ] - } - ], - "source": [ - "# Calculate 60% of the dataset size and round to nearest integer\n", - "dataset_size = len(unique_news_articles)\n", - "subset_size = round(dataset_size * 0.6)\n", - "\n", - "# Filter articles by length and create subset\n", - "filtered_articles = [article for article in unique_news_articles[:subset_size] \n", - " if article and len(article) <= 50000]\n", - "\n", - "# Process in batches\n", - "batch_size = 50\n", - "\n", - "try:\n", - " vector_store.add_texts(\n", - " texts=filtered_articles,\n", - " batch_size=batch_size\n", - " )\n", - " logging.info(\"Document ingestion completed successfully\")\n", - " \n", - "except CouchbaseException as e:\n", - " logging.error(f\"Couchbase error during ingestion: {str(e)}\")\n", - " raise RuntimeError(f\"Error performing document ingestion: {str(e)}\")\n", - "except Exception as e:\n", - " if \"Payment Required\" in str(e):\n", - " logging.error(\"Payment required for Jina AI API. Please check your subscription status and API key.\")\n", - " print(\"To resolve this error:\")\n", - " print(\"1. Visit 'https://jina.ai/reader/#pricing' to review subscription options\")\n", - " print(\"2. Ensure your API key is valid and has sufficient credits\") \n", - " print(\"3.
Consider upgrading your subscription plan if needed\")\n", - " else:\n", - " logging.error(f\"Unexpected error during ingestion: {str(e)}\")\n", - " raise RuntimeError(f\"Failed to save documents to vector store: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "8Pn8-dQw9RfQ" - }, - "source": [ - "# Setting Up a Couchbase Cache\n", - "To further optimize our system, we set up a Couchbase-based cache. A cache is a temporary storage layer that holds data that is frequently accessed, speeding up operations by reducing the need to repeatedly retrieve the same information from the database. In our setup, the cache will help us accelerate repetitive tasks, such as looking up similar documents. By implementing a cache, we enhance the overall performance of our search engine, ensuring that it can handle high query volumes and deliver results quickly.\n", - "\n", - "Caching is particularly valuable in scenarios where users may submit similar queries multiple times or where certain pieces of information are frequently requested. By storing these in a cache, we can significantly reduce the time it takes to respond to these queries, improving the user experience.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 24, - "metadata": { - "id": "V2y7dyjf9Rid" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-23 10:50:21,526 - INFO - Successfully created cache\n" - ] - } - ], - "source": [ - "try:\n", - " cache = CouchbaseCache(\n", - " cluster=cluster,\n", - " bucket_name=CB_BUCKET_NAME,\n", - " scope_name=SCOPE_NAME,\n", - " collection_name=CACHE_COLLECTION,\n", - " )\n", - " logging.info(\"Successfully created cache\")\n", - " set_llm_cache(cache)\n", - "except Exception as e:\n", - " raise ValueError(f\"Failed to create cache: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "uehAx36o9Rlm" - }, - "source": [ - "# Creating the Jina Language Model (LLM)\n", - "Language models are AI systems that are trained to understand and generate human language. We'll be using Jina's language model to process user queries and generate meaningful responses. This model is a key component of our semantic search engine, allowing it to go beyond simple keyword matching and truly understand the intent behind a query. By creating this language model, we equip our search engine with the ability to interpret complex queries, understand the nuances of language, and provide more accurate and contextually relevant responses.\n", - "\n", - "The language model's ability to understand context and generate coherent responses is what makes our search engine truly intelligent. It can not only find the right information but also present it in a way that is useful and understandable to the user.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 25, - "metadata": { - "id": "yRAfBRLH9RpO" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-23 10:50:22,466 - INFO - Successfully created JinaChat\n" - ] - } - ], - "source": [ - "try:\n", - " llm = JinaChat(temperature=0.1, jinachat_api_key=JINACHAT_API_KEY)\n", - " logging.info(\"Successfully created JinaChat\")\n", - "except Exception as e:\n", - " logging.error(f\"Error creating JinaChat: {str(e)}. 
Please check your API key and network connection.\")\n", - " raise" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "afOOEECGiLuQ" - }, - "source": [ - "## Perform Semantic Search\n", - "Semantic search in Couchbase involves converting queries and documents into vector representations using an embeddings model. These vectors capture the semantic meaning of the text and are stored directly in Couchbase. When a query is made, Couchbase performs a similarity search by comparing the query vector against the stored document vectors. The similarity metric used for this comparison is configurable, allowing flexibility in how the relevance of documents is determined.\n", - "\n", - "In the provided code, the search process begins by recording the start time, followed by executing the similarity_search_with_score method of the CouchbaseSearchVectorStore. This method searches Couchbase for the most relevant documents based on the vector similarity to the query. The search results include the document content and a similarity score that reflects how closely each document aligns with the query in the defined semantic space. The time taken to perform this search is then calculated and logged, and the results are displayed, showing the most relevant documents along with their similarity scores. This approach leverages Couchbase as both a storage and retrieval engine for vector data, enabling efficient and scalable semantic searches. The integration of vector storage and search capabilities within Couchbase allows for sophisticated semantic search operations without relying on external services for vector storage or comparison.\n", - "\n", - "### Note on Retry Mechanism\n", - "The search implementation includes a retry mechanism to handle rate limiting and API errors gracefully. If a rate limit error (HTTP 429) is encountered, the system will automatically retry the request up to 3 times with exponential backoff, waiting 2 seconds initially and doubling the wait time between each retry. This helps manage API usage limits while maintaining service reliability. For other types of errors, such as payment requirements or general failures, appropriate error messages and troubleshooting steps are provided to help diagnose and resolve the issue." - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "metadata": { - "id": "y3oO33_LiLxU" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-23 10:50:25,678 - INFO - Semantic search completed in 2.13 seconds\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Semantic Search Results (completed in 2.13 seconds):\n", - "--------------------------------------------------------------------------------\n", - "Score: 0.6798, Text: 'Self-doubt, errors & big changes' - inside the crisis at Man City\n", - "\n", - "Pep Guardiola has not been through a moment like this in his managerial career. Manchester City have lost nine matches in their past 12 - as many defeats as they had suffered in their previous 106 fixtures. At the end of October, City were still unbeaten at the top of the Premier League and favourites to win a fifth successive title. Now they are seventh, 12 points behind leaders Liverpool having played a game more. It has been an incredible fall from grace and left people trying to work out what has happened - and whether Guardiola can make it right. 
After discussing the situation with those who know him best, I have taken a closer look at the future - both short and long term - and how the current crisis at Man City is going to be solved.\n", - "\n", - "Pep Guardiola's Man City have lost nine of their past 12 matches\n", - "\n", - "Guardiola has also been giving it a lot of thought. He has not been sleeping very well, as he has said, and has not been himself at times when talking to the media. He has been talking to a lot of people about what is going on as he tries to work out the reasons for City's demise. Some reasons he knows, others he still doesn't. What people perhaps do not realise is Guardiola hugely doubts himself and always has. He will be thinking \"I'm not going to be able to get us out of this\" and needs the support of people close to him to push away those insecurities - and he has that. He is protected by his people who are very aware, like he is, that there are a lot of people that want City to fail. It has been a turbulent time for Guardiola. Remember those marks he had on his head after the 3-3 draw with Feyenoord in the Champions League? He always scratches his head, it is a gesture of nervousness. Normally nothing happens but on that day one of his nails was far too sharp so, after talking to the players in the changing room where he scratched his head because of his usual agitated gesturing, he went to the news conference. His right-hand man Manel Estiarte sent him photos in a message saying \"what have you got on your head?\", but by the time Guardiola returned to the coaching room there was hardly anything there again. He started that day with a cover on his nose after the same thing happened at the training ground the day before. Guardiola was having a footballing debate with Kyle Walker about positional stuff and marked his nose with that same nail. There was also that remarkable news conference after the Manchester derby when he said \"I don't know what to do\". That is partly true and partly not true. Ignore the fact Guardiola suggested he was \"not good enough\". He actually meant he was not good enough to resolve the situation with the group of players he has available and with all the other current difficulties. There are obviously logical explanations for the crisis and the first one has been talked about many times - the absence of injured midfielder Rodri. You know the game Jenga? When you take the wrong piece out, the whole tower collapses. That is what has happened here. It is normal for teams to have an over-reliance on one player if he is the best in the world in his position. And you cannot calculate the consequences of an injury that rules someone like Rodri out for the season. City are a team, like many modern ones, in which the holding midfielder is a key element to the construction. So, when you take Rodri out, it is difficult to hold it together. There were Plan Bs - John Stones, Manuel Akanji, even Nathan Ake - but injuries struck. The big injury list has been out of the ordinary and the busy calendar has also played a part in compounding the issues. However, one factor even Guardiola cannot explain is the big uncharacteristic errors in almost every game from international players. Why did Matheus Nunes make that challenge to give away the penalty against Manchester United? Jack Grealish is sent on at the end to keep the ball and cannot do that. There are errors from Walker and other defenders. These are some of the best players in the world. 
Of course the players' mindset is important, and confidence is diminishing. Wrong decisions get taken so there is almost panic on the pitch instead of calm. There are also players badly out of form who are having to play because of injuries. Walker is now unable to hide behind his pace, I'm not sure Kevin de Bruyne is ever getting back to the level he used to be at, Bernardo Silva and Ilkay Gundogan do not have time to rest, Grealish is not playing at his best. Some of these players were only meant to be playing one game a week but, because of injuries, have played 12 games in 40 days. It all has a domino effect. One consequence is that Erling Haaland isn't getting the service to score. But the Norwegian still remains City's top-scorer with 13. Defender Josko Gvardiol is next on the list with just four. The way their form has been analysed inside the City camp is there have only been three games where they deserved to lose (Liverpool, Bournemouth and Aston Villa). But of course it is time to change the dynamic.\n", - "\n", - "Guardiola has never protected his players so much. He has not criticised them and is not going to do so. They have won everything with him. Instead of doing more with them, he has tried doing less. He has sometimes given them more days off to clear their heads, so they can reset - two days this week for instance. Perhaps the time to change a team is when you are winning, but no-one was suggesting Man City were about to collapse when they were top and unbeaten after nine league games. Some people have asked how bad it has to get before City make a decision on Guardiola. The answer is that there is no decision to be made. Maybe if this was Real Madrid, Barcelona or Juventus, the pressure from outside would be massive and the argument would be made that Guardiola has to go. At City he has won the lot, so how can anyone say he is failing? Yes, this is a crisis. But given all their problems, City's renewed target is finishing in the top four. That is what is in all their heads now. The idea is to recover their essence by improving defensive concepts that are not there and re-establishing the intensity they are known for. Guardiola is planning to use the next two years of his contract, which is expected to be his last as a club manager, to prepare a new Manchester City. When he was at the end of his four years at Barcelona, he asked two managers what to do when you feel people are not responding to your instructions. Do you go or do the players go? Sir Alex Ferguson and Rafael Benitez both told him that the players need to go. Guardiola did not listen because of his emotional attachment to his players back then and he decided to leave the Camp Nou because he felt the cycle was over. He will still protect his players now but there is not the same emotional attachment - so it is the players who are going to leave this time. It is likely City will look to replace five or six regular starters. Guardiola knows it is the end of an era and the start of a new one. Changes will not be immediate and the majority of the work will be done in the summer. But they are open to any opportunities in January - and a holding midfielder is one thing they need. In the summer City might want to get Spain's Martin Zubimendi from Real Sociedad and they know 60m euros (£50m) will get him. He said no to Liverpool last summer even though everything was agreed, but he now wants to move on and the Premier League is the target. 
Even if they do not get Zubimendi, that is the calibre of footballer they are after. A new Manchester City is on its way - with changes driven by Guardiola, incoming sporting director Hugo Viana and the football department.\n", - "--------------------------------------------------------------------------------\n", - "Score: 0.6795, Text: 'Self-doubt, errors & big changes' - inside the crisis at Man City\n", - "\n", - "Pep Guardiola has not been through a moment like this in his managerial career. Manchester City have lost nine matches in their past 12 - as many defeats as they had suffered in their previous 106 fixtures. At the end of October, City were still unbeaten at the top of the Premier League and favourites to win a fifth successive title. Now they are seventh, 12 points behind leaders Liverpool having played a game more. It has been an incredible fall from grace and left people trying to work out what has happened - and whether Guardiola can make it right. After discussing the situation with those who know him best, I have taken a closer look at the future - both short and long term - and how the current crisis at Man City is going to be solved.\n", - "\n", - "Pep Guardiola's Man City have lost nine of their past 12 matches\n", - "\n", - "Guardiola has also been giving it a lot of thought. He has not been sleeping very well, as he has said, and has not been himself at times when talking to the media. He has been talking to a lot of people about what is going on as he tries to work out the reasons for City's demise. Some reasons he knows, others he still doesn't. What people perhaps do not realise is Guardiola hugely doubts himself and always has. He will be thinking \"I'm not going to be able to get us out of this\" and needs the support of people close to him to push away those insecurities - and he has that. He is protected by his people who are very aware, like he is, that there are a lot of people that want City to fail. It has been a turbulent time for Guardiola. Remember those marks he had on his head after the 3-3 draw with Feyenoord in the Champions League? He always scratches his head, it is a gesture of nervousness. Normally nothing happens but on that day one of his nails was far too sharp so, after talking to the players in the changing room where he scratched his head because of his usual agitated gesturing, he went to the news conference. His right-hand man Manel Estiarte sent him photos in a message saying \"what have you got on your head?\", but by the time Guardiola returned to the coaching room there was hardly anything there again. He started that day with a cover on his nose after the same thing happened at the training ground the day before. Guardiola was having a footballing debate with Kyle Walker about positional stuff and marked his nose with that same nail. There was also that remarkable news conference after the Manchester derby when he said \"I don't know what to do\". That is partly true and partly not true. Ignore the fact Guardiola suggested he was \"not good enough\". He actually meant he was not good enough to resolve the situation with the group of players he has available and with all the other current difficulties. There are obviously logical explanations for the crisis and the first one has been talked about many times - the absence of injured midfielder Rodri. You know the game Jenga? When you take the wrong piece out, the whole tower collapses. That is what has happened here. 
It is normal for teams to have an over-reliance on one player if he is the best in the world in his position. And you cannot calculate the consequences of an injury that rules someone like Rodri out for the season. City are a team, like many modern ones, in which the holding midfielder is a key element to the construction. So, when you take Rodri out, it is difficult to hold it together. There were Plan Bs - John Stones, Manuel Akanji, even Nathan Ake - but injuries struck. The big injury list has been out of the ordinary and the busy calendar has also played a part in compounding the issues. However, one factor even Guardiola cannot explain is the big uncharacteristic errors in almost every game from international players. Why did Matheus Nunes make that challenge to give away the penalty against Manchester United? Jack Grealish is sent on at the end to keep the ball and cannot do that. There are errors from Walker and other defenders. These are some of the best players in the world. Of course the players' mindset is important, and confidence is diminishing. Wrong decisions get taken so there is almost panic on the pitch instead of calm. There are also players badly out of form who are having to play because of injuries. Walker is now unable to hide behind his pace, I'm not sure Kevin de Bruyne is ever getting back to the level he used to be at, Bernardo Silva and Ilkay Gundogan do not have time to rest, Grealish is not playing at his best. Some of these players were only meant to be playing one game a week but, because of injuries, have played 12 games in 40 days. It all has a domino effect. One consequence is that Erling Haaland isn't getting the service to score. But the Norwegian still remains City's top-scorer with 13. Defender Josko Gvardiol is next on the list with just four. The way their form has been analysed inside the City camp is there have only been three games where they deserved to lose (Liverpool, Bournemouth and Aston Villa). But of course it is time to change the dynamic.\n", - "\n", - "Guardiola has never protected his players so much. He has not criticised them and is not going to do so. They have won everything with him. Instead of doing more with them, he has tried doing less. He has sometimes given them more days off to clear their heads, so they can reset - two days this week for instance. Perhaps the time to change a team is when you are winning, but no-one was suggesting Man City were about to collapse when they were top and unbeaten after nine league games. Some people have asked how bad it has to get before City make a decision on Guardiola. The answer is that there is no decision to be made. Maybe if this was Real Madrid, Barcelona or Juventus, the pressure from outside would be massive and the argument would be made that Guardiola has to go. At City he has won the lot, so how can anyone say he is failing? Yes, this is a crisis. But given all their problems, City's renewed target is finishing in the top four. That is what is in all their heads now. The idea is to recover their essence by improving defensive concepts that are not there and re-establishing the intensity they are known for. Guardiola is planning to use the next two years of his contract, which is expected to be his last as a club manager, to prepare a new Manchester City. When he was at the end of his four years at Barcelona, he asked two managers what to do when you feel people are not responding to your instructions. Do you go or do the players go? 
Sir Alex Ferguson and Rafael Benitez both told him that the players need to go. Guardiola did not listen because of his emotional attachment to his players back then and he decided to leave the Camp Nou because he felt the cycle was over. He will still protect his players now but there is not the same emotional attachment - so it is the players who are going to leave this time. It is likely City will look to replace five or six regular starters. Guardiola knows it is the end of an era and the start of a new one. Changes will not be immediate and the majority of the work will be done in the summer. But they are open to any opportunities in January - and a holding midfielder is one thing they need. In the summer City might want to get Spain's Martin Zubimendi from Real Sociedad and they know 60m euros (£50m) will get him. He said no to Liverpool last summer even though everything was agreed, but he now wants to move on and the Premier League is the target. Even if they do not get Zubimendi, that is the calibre of footballer they are after. A new Manchester City is on its way - with changes driven by Guardiola, incoming sporting director Hugo Viana and the football department.\n", - "--------------------------------------------------------------------------------\n", - "Score: 0.6207, Text: Manchester City boss Pep Guardiola has won 18 trophies since he arrived at the club in 2016\n", - "\n", - "\n", - "... (output truncated for brevity)\n" - ] - } - ], - "source": [ - "def perform_semantic_search(query, vector_store, max_retries=3, retry_delay=2): \n", - " for attempt in range(max_retries):\n", - " try:\n", - " start_time = time.time()\n", - " search_results = vector_store.similarity_search_with_score(query, k=5)\n", - " search_elapsed_time = time.time() - start_time\n", - " \n", - " logging.info(f\"Semantic search completed in {search_elapsed_time:.2f} seconds\")\n", - " return search_results, search_elapsed_time\n", - " \n", - " except Exception as e:\n", - " error_str = str(e)\n", - " \n", - " # Check if it's a rate limit error (HTTP 429)\n", - " if \"http_status: 429\" in error_str or \"query request rejected\" in error_str:\n", - " logging.warning(f\"Rate limit hit (attempt {attempt+1}/{max_retries}). Waiting {retry_delay} seconds...\")\n", - " time.sleep(retry_delay)\n", - " retry_delay *= 2 # Exponential backoff\n", - " \n", - " if attempt == max_retries - 1:\n", - " logging.error(\"Maximum retry attempts reached. API rate limit exceeded.\")\n", - " raise RuntimeError(\"API rate limit exceeded. Please try again later or check your subscription.\")\n", - " else:\n", - " # For other errors, don't retry\n", - " logging.error(f\"Search error: {error_str}\")\n", - " if \"Payment Required\" in error_str:\n", - " raise RuntimeError(\"Payment required for Jina AI API. 
Please check your subscription status and API key.\")\n", - " else:\n", - " raise RuntimeError(f\"Search failed: {error_str}\")\n", - "\n", - "try:\n", - " query = \"What was manchester city manager pep guardiola's reaction to the team's current form?\"\n", - " search_results, search_elapsed_time = perform_semantic_search(query, vector_store)\n", - " \n", - " # Display search results\n", - " print(f\"\\nSemantic Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", - " print(\"-\"*80)\n", - " for doc, score in search_results:\n", - " print(f\"Score: {score:.4f}, Text: {doc.page_content}\")\n", - " print(\"-\"*80)\n", - " \n", - "except RuntimeError as e:\n", - " print(f\"Error: {str(e)}\")\n", - " print(\"\\nTroubleshooting steps:\")\n", - " if \"API rate limit\" in str(e):\n", - " print(\"1. Wait a few minutes before trying again\")\n", - " print(\"2. Reduce the frequency of your requests\")\n", - " print(\"3. Consider upgrading your Jina AI plan for higher rate limits\")\n", - " elif \"Payment required\" in str(e):\n", - " print(\"1. Visit 'https://jina.ai/reader/#pricing' to review subscription options\")\n", - " print(\"2. Ensure your API key is valid and has sufficient credits\")\n", - " print(\"3. Update your API key configuration\")\n", - " else:\n", - " print(\"1. Check your network connection\")\n", - " print(\"2. Verify your Couchbase and Jina configurations\")\n", - " print(\"3. Review the vector store implementation for any bugs\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "6bp8YEEQiL0r" - }, - "source": [ - "# Retrieval-Augmented Generation (RAG) with Couchbase and Langchain\n", - "Couchbase and LangChain can be seamlessly integrated to create RAG (Retrieval-Augmented Generation) chains, enhancing the process of generating contextually relevant responses. In this setup, Couchbase serves as the vector store, where embeddings of documents are stored. When a query is made, LangChain retrieves the most relevant documents from Couchbase by comparing the query’s embedding with the stored document embeddings. These documents, which provide contextual information, are then passed to a generative language model within LangChain.\n", - "\n", - "The language model, equipped with the context from the retrieved documents, generates a response that is both informed and contextually accurate. This integration allows the RAG chain to leverage Couchbase’s efficient storage and retrieval capabilities, while LangChain handles the generation of responses based on the context provided by the retrieved documents. Together, they create a powerful system that can deliver highly relevant and accurate answers by combining the strengths of both retrieval and generation." - ] - }, - { - "cell_type": "code", - "execution_count": 27, - "metadata": { - "id": "fTolIHFpiL30" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-23 10:50:26,937 - INFO - Successfully created RAG chain\n" - ] - } - ], - "source": [ - "try:\n", - " template = \"\"\"You are a helpful bot. If you cannot answer based on the context provided, respond with a generic answer. 
Answer the question as truthfully as possible using the context below:\n", - " {context}\n", - "\n", - " Question: {question}\"\"\"\n", - " prompt = ChatPromptTemplate.from_template(template)\n", - "\n", - " rag_chain = (\n", - " {\"context\": vector_store.as_retriever(search_kwargs={\"k\": 2}), \"question\": RunnablePassthrough()}\n", - " | prompt\n", - " | llm\n", - " | StrOutputParser()\n", - " )\n", - " logging.info(\"Successfully created RAG chain\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Error creating RAG chain: {str(e)}\")" - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "metadata": { - "id": "6GbtJzTEiL7M" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-23 10:50:47,733 - INFO - RAG response generated in 17.23 seconds using k=2\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "RAG Response: Pep Guardiola has been grappling with self-doubt and seeking support to navigate Manchester City's current crisis.\n", - "Response generated in 17.23 seconds\n" - ] - } - ], - "source": [ - "try:\n", - " # Create chain with k=2\n", - " # Start with k=4 and gradually reduce if token limit exceeded\n", - " # k=4 -> k=3 -> k=2 based on token limit warnings\n", - " # Final k=2 produced valid response about Guardiola in 2.33 seconds\n", - " current_chain = (\n", - " {\n", - " \"context\": vector_store.as_retriever(search_kwargs={\"k\": 2}),\n", - " \"question\": RunnablePassthrough()\n", - " }\n", - " | prompt\n", - " | llm\n", - " | StrOutputParser()\n", - " )\n", - " \n", - " # Try to get response\n", - " start_time = time.time()\n", - " rag_response = current_chain.invoke(query)\n", - " elapsed_time = time.time() - start_time\n", - " \n", - " logging.info(f\"RAG response generated in {elapsed_time:.2f} seconds using k=2\")\n", - " print(f\"RAG Response: {rag_response}\")\n", - " print(f\"Response generated in {elapsed_time:.2f} seconds\")\n", - " \n", - "except Exception as e:\n", - " if \"Payment Required\" in str(e):\n", - " logging.error(\"Payment required for Jina AI API. Please check your subscription status and API key.\")\n", - " print(\"To resolve this error:\")\n", - " print(\"1. Visit 'https://jina.ai/reader/#pricing' to review subscription options\")\n", - " print(\"2. Ensure your API key is valid and has sufficient credits\")\n", - " print(\"3. Consider upgrading your subscription plan if needed\")\n", - " else:\n", - " raise RuntimeError(f\"Unexpected error: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "T8hCgpMyiL-J" - }, - "source": [ - "# Using Couchbase as a caching mechanism\n", - "Couchbase can be effectively used as a caching mechanism for RAG (Retrieval-Augmented Generation) responses by storing and retrieving precomputed results for specific queries. This approach enhances the system's efficiency and speed, particularly when dealing with repeated or similar queries. When a query is first processed, the RAG chain retrieves relevant documents, generates a response using the language model, and then stores this response in Couchbase, with the query serving as the key.\n", - "\n", - "For subsequent requests with the same query, the system checks Couchbase first. If a cached response is found, it is retrieved directly from Couchbase, bypassing the need to re-run the entire RAG process. This significantly reduces response time because the computationally expensive steps of document retrieval and response generation are skipped. 
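Conceptually, the cache behaves like the hand-rolled key-value pattern sketched below. This is a sketch only: `make_cache_key` is a hypothetical helper, and LangChain's `CouchbaseCache` uses its own key and document format rather than this one.

```python
import hashlib

from couchbase.exceptions import DocumentNotFoundException

def make_cache_key(prompt: str) -> str:
    # Hypothetical helper: derive a stable document key from the prompt text.
    return "rag-cache::" + hashlib.sha256(prompt.encode("utf-8")).hexdigest()

def cached_answer(collection, prompt: str, generate):
    key = make_cache_key(prompt)
    try:
        # Cache hit: return the stored answer without re-running the RAG chain.
        return collection.get(key).content_as[dict]["answer"]
    except DocumentNotFoundException:
        # Cache miss: generate the answer, store it, then return it.
        answer = generate(prompt)
        collection.upsert(key, {"answer": answer})
        return answer
```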
Couchbase's role in this setup is to provide a fast and scalable storage solution for caching these responses, ensuring that frequently asked queries can be answered more quickly and efficiently." - ] - }, - { - "cell_type": "code", - "execution_count": 29, - "metadata": { - "id": "c10Qzeq2Q8N7" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Query 1: What happened in the match between Fullham and Liverpool?\n", - "Response: Fulham and Liverpool played to a 2-2 draw at Anfield, with both teams showcasing strong performances.\n", - "Time taken: 5.13 seconds\n", - "\n", - "Query 2: What was manchester city manager pep guardiola's reaction to the team's current form?\n", - "Response: Pep Guardiola has been grappling with self-doubt and seeking support to navigate Manchester City's current crisis.\n", - "Time taken: 2.16 seconds\n", - "\n", - "Query 3: What happened in the match between Fullham and Liverpool?\n", - "Response: Fulham and Liverpool played to a 2-2 draw at Anfield, with both teams showcasing strong performances.\n", - "Time taken: 1.95 seconds\n" - ] - } - ], - "source": [ - "try:\n", - " queries = [\n", - " \"What happened in the match between Fullham and Liverpool?\",\n", - " \"What was manchester city manager pep guardiola's reaction to the team's current form?\", # Repeated query\n", - " \"What happened in the match between Fullham and Liverpool?\", # Repeated query\n", - " ]\n", - "\n", - " for i, query in enumerate(queries, 1):\n", - " print(f\"\\nQuery {i}: {query}\")\n", - " start_time = time.time()\n", - " response = rag_chain.invoke(query)\n", - " elapsed_time = time.time() - start_time\n", - " print(f\"Response: {response}\")\n", - " \n", - " print(f\"Time taken: {elapsed_time:.2f} seconds\")\n", - "except Exception as e:\n", - " if \"Payment Required\" in str(e):\n", - " logging.error(\"Payment required for Jina AI API. Please check your subscription status and API key.\")\n", - " print(\"To resolve this error:\")\n", - " print(\"1. Visit 'https://jina.ai/reader/#pricing' to review subscription options\")\n", - " print(\"2. Ensure your API key is valid and has sufficient credits\")\n", - " print(\"3. Consider upgrading your subscription plan if needed\")\n", - " else:\n", - " raise RuntimeError(f\"Unexpected error: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "yJQ5P8E29go1" - }, - "source": [ - "## Conclusion\n", - "By following these steps, you’ll have a fully functional semantic search engine that leverages the strengths of Couchbase and Jina. This guide is designed not just to show you how to build the system, but also to explain why each step is necessary, giving you a deeper understanding of the principles behind semantic search and how to implement it effectively. Whether you’re a newcomer to software development or an experienced developer looking to expand your skills, this guide will provide you with the knowledge and tools you need to create a powerful, AI-driven search engine." 
- ] - } - ], - "metadata": { - "colab": { - "provenance": [], - "toc_visible": true - }, - "kernelspec": { - "display_name": ".venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.13.7" - } - }, - "nbformat": 4, - "nbformat_minor": 0 -} diff --git a/jinaai/gsi/.env.sample b/jinaai/query_based/.env.sample similarity index 100% rename from jinaai/gsi/.env.sample rename to jinaai/query_based/.env.sample diff --git a/jinaai/gsi/RAG_with_Couchbase_and_Jina_AI.ipynb b/jinaai/query_based/RAG_with_Couchbase_and_Jina_AI.ipynb similarity index 93% rename from jinaai/gsi/RAG_with_Couchbase_and_Jina_AI.ipynb rename to jinaai/query_based/RAG_with_Couchbase_and_Jina_AI.ipynb index 6ee9b44f..265bdcea 100644 --- a/jinaai/gsi/RAG_with_Couchbase_and_Jina_AI.ipynb +++ b/jinaai/query_based/RAG_with_Couchbase_and_Jina_AI.ipynb @@ -21,7 +21,7 @@ "id": "569c4838", "metadata": {}, "source": [ - "This tutorial demonstrates building a high-performance semantic search engine using Couchbase's GSI (Global Secondary Index) vector search and Jina AI for embeddings and language models. We'll show measurable performance improvements with GSI optimization and implement a complete RAG (Retrieval-Augmented Generation) system. Alternatively if you want to perform semantic search using the FTS, please take a look at [this.](https://developer.couchbase.com/tutorial-jina-couchbase-rag-with-fts)\n", + "This tutorial demonstrates building a high-performance semantic search engine using Couchbase's GSI (Global Secondary Index) vector search and Jina AI for embeddings and language models. We'll show measurable performance improvements with GSI optimization and implement a complete RAG (Retrieval-Augmented Generation) system. Alternatively if you want to perform semantic search using the FTS, please take a look at [this.](https://developer.couchbase.com/tutorial-jina-couchbase-rag-with-search-vector-index)\n", "\n", "**Key Features:**\n", "- High-performance GSI vector search with BHIVE indexing\n", @@ -721,7 +721,7 @@ "Now let's demonstrate the performance benefits of GSI optimization by testing pure vector search performance. We'll compare three optimization levels:\n", "\n", "1. **Baseline Performance**: Vector search without GSI optimization\n", - "2. **GSI-Optimized Performance**: Same search with BHIVE GSI index\n", + "2. **Vector Index-Optimized Performance**: Same search with BHIVE GSI index\n", "3. 
**Cache Benefits**: Show how caching can be applied on top of GSI for repeated queries" ] }, @@ -995,7 +995,7 @@ "id": "d3e24394", "metadata": {}, "source": [ - "### Test 2: GSI-Optimized Performance" + "### Test 2: Vector Index-Optimized Performance" ] }, { @@ -1018,12 +1018,12 @@ "text": [ "Testing vector search performance with BHIVE GSI optimization...\n", "\n", - "[GSI-Optimized Search] Testing vector search performance\n", - "[GSI-Optimized Search] Query: 'What happened in the latest Premier League matches?'\n", - "[GSI-Optimized Search] Vector search completed in 0.6452 seconds\n", - "[GSI-Optimized Search] Found 3 documents\n", - "[GSI-Optimized Search] Top result distance: 0.394714 (lower = more similar)\n", - "[GSI-Optimized Search] Top result preview: The latest updates and analysis from the BBC.\n" + "[Vector Index-Optimized Search] Testing vector search performance\n", + "[Vector Index-Optimized Search] Query: 'What happened in the latest Premier League matches?'\n", + "[Vector Index-Optimized Search] Vector search completed in 0.6452 seconds\n", + "[Vector Index-Optimized Search] Found 3 documents\n", + "[Vector Index-Optimized Search] Top result distance: 0.394714 (lower = more similar)\n", + "[Vector Index-Optimized Search] Top result preview: The latest updates and analysis from the BBC.\n" ] } ], @@ -1031,7 +1031,7 @@ "# Test vector search performance with GSI index\n", "gsi_test_query = \"What happened in the latest Premier League matches?\"\n", "print(\"Testing vector search performance with BHIVE GSI optimization...\")\n", - "gsi_time = test_vector_search_performance(vector_store, gsi_test_query, \"GSI-Optimized Search\")" + "gsi_time = test_vector_search_performance(vector_store, gsi_test_query, \"Vector Index-Optimized Search\")" ] }, { @@ -1061,7 +1061,7 @@ "output_type": "stream", "text": [ "Setting up Couchbase cache for improved performance on repeated queries...\n", - "✓ Couchbase cache enabled!\n" + "\u2713 Couchbase cache enabled!\n" ] } ], @@ -1075,7 +1075,7 @@ " collection_name=COLLECTION_NAME,\n", ")\n", "set_llm_cache(cache)\n", - "print(\"✓ Couchbase cache enabled!\")" + "print(\"\u2713 Couchbase cache enabled!\")" ] }, { @@ -1152,7 +1152,7 @@ "VECTOR SEARCH PERFORMANCE OPTIMIZATION SUMMARY\n", "================================================================================\n", "Phase 1 - Baseline Search (No GSI): 0.8305 seconds\n", - "Phase 2 - GSI-Optimized Search: 0.6452 seconds\n", + "Phase 2 - Vector Index-Optimized Search: 0.6452 seconds\n", "Phase 3 - Cache Benefits:\n", " First execution (cache miss): 0.9695 seconds\n", " Second execution (cache hit): 0.5252 seconds\n", @@ -1164,11 +1164,11 @@ "Cache Benefit: 1.85x faster (45.8% improvement)\n", "\n", "Key Insights for Vector Search Performance:\n", - "• GSI BHIVE indexes provide significant performance improvements for vector similarity search\n", - "• Performance gains are most dramatic for complex semantic queries\n", - "• BHIVE optimization is particularly effective for high-dimensional embeddings\n", - "• Combined with proper quantization (SQ8), GSI delivers production-ready performance\n", - "• These performance improvements directly benefit any application using the vector store\n" + "\u2022 GSI BHIVE indexes provide significant performance improvements for vector similarity search\n", + "\u2022 Performance gains are most dramatic for complex semantic queries\n", + "\u2022 BHIVE optimization is particularly effective for high-dimensional embeddings\n", + "\u2022 Combined with proper 
quantization (SQ8), GSI delivers production-ready performance\n", + "\u2022 These performance improvements directly benefit any application using the vector store\n" ] } ], @@ -1178,7 +1178,7 @@ "print(\"=\"*80)\n", "\n", "print(f\"Phase 1 - Baseline Search (No GSI): {baseline_time:.4f} seconds\")\n", - "print(f\"Phase 2 - GSI-Optimized Search: {gsi_time:.4f} seconds\")\n", + "print(f\"Phase 2 - Vector Index-Optimized Search: {gsi_time:.4f} seconds\")\n", "if cache_time_1 and cache_time_2:\n", " print(f\"Phase 3 - Cache Benefits:\")\n", " print(f\" First execution (cache miss): {cache_time_1:.4f} seconds\")\n", @@ -1204,11 +1204,11 @@ " print(f\"Cache Benefit: Variable (depends on query complexity and caching mechanism)\")\n", "\n", "print(f\"\\nKey Insights for Vector Search Performance:\")\n", - "print(f\"• GSI BHIVE indexes provide significant performance improvements for vector similarity search\")\n", - "print(f\"• Performance gains are most dramatic for complex semantic queries\")\n", - "print(f\"• BHIVE optimization is particularly effective for high-dimensional embeddings\")\n", - "print(f\"• Combined with proper quantization (SQ8), GSI delivers production-ready performance\")\n", - "print(f\"• These performance improvements directly benefit any application using the vector store\")" + "print(f\"\u2022 GSI BHIVE indexes provide significant performance improvements for vector similarity search\")\n", + "print(f\"\u2022 Performance gains are most dramatic for complex semantic queries\")\n", + "print(f\"\u2022 BHIVE optimization is particularly effective for high-dimensional embeddings\")\n", + "print(f\"\u2022 Combined with proper quantization (SQ8), GSI delivers production-ready performance\")\n", + "print(f\"\u2022 These performance improvements directly benefit any application using the vector store\")" ] }, { @@ -1276,7 +1276,7 @@ "output_type": "stream", "text": [ "Setting up Jina AI language model for RAG demo...\n", - "✓ JinaChat language model created successfully\n" + "\u2713 JinaChat language model created successfully\n" ] } ], @@ -1284,10 +1284,10 @@ "print(\"Setting up Jina AI language model for RAG demo...\")\n", "try:\n", " llm = JinaChat(temperature=0.1, jinachat_api_key=JINACHAT_API_KEY)\n", - " print(\"✓ JinaChat language model created successfully\")\n", + " print(\"\u2713 JinaChat language model created successfully\")\n", " logging.info(\"Successfully created JinaChat\")\n", "except Exception as e:\n", - " print(f\"✗ Error creating JinaChat: {str(e)}\")\n", + " print(f\"\u2717 Error creating JinaChat: {str(e)}\")\n", " print(\"Please check your JINACHAT_API_KEY and network connection.\")\n", " raise" ] @@ -1319,7 +1319,7 @@ "output_type": "stream", "text": [ "Optimized RAG pipeline created successfully\n", - "Components: GSI BHIVE Vector Search → Context Assembly → Jina Language Model → Response\n" + "Components: GSI BHIVE Vector Search \u2192 Context Assembly \u2192 Jina Language Model \u2192 Response\n" ] } ], @@ -1339,7 +1339,7 @@ " \n", " prompt = ChatPromptTemplate.from_template(template)\n", "\n", - " # Build the RAG chain: GSI-Optimized Retrieval → Context → Generation → Output\n", + " # Build the RAG chain: Vector Index-Optimized Retrieval \u2192 Context \u2192 Generation \u2192 Output\n", " rag_chain = (\n", " {\n", " \"context\": vector_store.as_retriever(search_kwargs={\"k\": 2}), \n", @@ -1350,7 +1350,7 @@ " | StrOutputParser()\n", " )\n", " print(\"Optimized RAG pipeline created successfully\")\n", - " print(\"Components: GSI BHIVE Vector Search → 
Context Assembly → Jina Language Model → Response\")\n", + " print(\"Components: GSI BHIVE Vector Search \u2192 Context Assembly \u2192 Jina Language Model \u2192 Response\")\n", "except Exception as e:\n", " raise ValueError(f\"Error creating RAG pipeline: {str(e)}\")" ] @@ -1381,7 +1381,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "Testing RAG System with GSI-Optimized Vector Search\n", + "Testing RAG System with Vector Index-Optimized Vector Search\n", "============================================================\n", "User Query: What are the new eligibility rules for transgender women competing in leading women's golf tours, and what prompted these changes?\n", "\n", @@ -1398,7 +1398,7 @@ } ], "source": [ - "print(\"Testing RAG System with GSI-Optimized Vector Search\")\n", + "print(\"Testing RAG System with Vector Index-Optimized Vector Search\")\n", "print(\"=\" * 60)\n", "\n", "try:\n", @@ -1423,8 +1423,8 @@ " if \"Payment Required\" in str(e):\n", " print(\"\\nPayment required for Jina AI API.\")\n", " print(\"To resolve:\")\n", - " print(\"• Visit https://jina.ai/reader/#pricing for subscription options\")\n", - " print(\"• Ensure your API key is valid and has sufficient credits\")\n", + " print(\"\u2022 Visit https://jina.ai/reader/#pricing for subscription options\")\n", + " print(\"\u2022 Ensure your API key is valid and has sufficient credits\")\n", " else:\n", " print(f\"Error: {str(e)}\")" ] @@ -1473,9 +1473,9 @@ "- Interacted with carol singers, Christmas shoppers, and stallholders.\n", "- Explored the power station and visited stalls at the Curated Makers Market.\n", "\n", - "✅ RAG demo completed successfully!\n", - "✅ The system leverages GSI BHIVE optimization for fast document retrieval!\n", - "✅ Jina AI provides high-quality embeddings and intelligent response generation!\n" + "\u2705 RAG demo completed successfully!\n", + "\u2705 The system leverages GSI BHIVE optimization for fast document retrieval!\n", + "\u2705 Jina AI provides high-quality embeddings and intelligent response generation!\n" ] } ], @@ -1505,9 +1505,9 @@ " else:\n", " print(f\"Error: {str(e)}\")\n", "\n", - "print(f\"\\n✅ RAG demo completed successfully!\")\n", - "print(\"✅ The system leverages GSI BHIVE optimization for fast document retrieval!\")\n", - "print(\"✅ Jina AI provides high-quality embeddings and intelligent response generation!\")" + "print(f\"\\n\u2705 RAG demo completed successfully!\")\n", + "print(\"\u2705 The system leverages GSI BHIVE optimization for fast document retrieval!\")\n", + "print(\"\u2705 Jina AI provides high-quality embeddings and intelligent response generation!\")" ] }, { @@ -1558,4 +1558,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} +} \ No newline at end of file diff --git a/jinaai/gsi/frontmatter.md b/jinaai/query_based/frontmatter.md similarity index 100% rename from jinaai/gsi/frontmatter.md rename to jinaai/query_based/frontmatter.md diff --git a/jinaai/fts/.env.sample b/jinaai/search_based/.env.sample similarity index 100% rename from jinaai/fts/.env.sample rename to jinaai/search_based/.env.sample diff --git a/jinaai/search_based/RAG_with_Couchbase_and_Jina_AI.ipynb b/jinaai/search_based/RAG_with_Couchbase_and_Jina_AI.ipynb new file mode 100644 index 00000000..56d96401 --- /dev/null +++ b/jinaai/search_based/RAG_with_Couchbase_and_Jina_AI.ipynb @@ -0,0 +1,1110 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "kNdImxzypDlm" + }, + "source": [ + "# Introduction\n", + "In this guide, we will walk you through building a 
powerful semantic search engine using Couchbase as the backend database and [Jina](https://jina.ai/) as the AI-powered embedding and language model provider, utilizing a Search Vector Index (Full-Text Search). Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system from scratch. For guidance on choosing the right vector index for your use case, see the [Couchbase documentation](https://docs.couchbase.com/server/current/vector-search/choose-the-right-vector-index.html). Alternatively, if you want to perform semantic search using Couchbase Hyperscale or Composite Vector Indexes, please take a look at [this tutorial](https://developer.couchbase.com/tutorial-jina-couchbase-rag-with-hyperscale-or-composite-vector-index)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# How to run this tutorial\n", + "\n", + "This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/jinaai/search_based/RAG_with_Couchbase_and_Jina_AI.ipynb).\n", + "\n", + "You can either download the notebook file and run it on [Google Colab](https://colab.research.google.com/) or run it on your system by setting up the Python environment." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Before you start\n", + "\n", + "## Get Credentials for Jina AI\n", + "\n", + "* Please follow the [instructions](https://jina.ai/) to generate the Jina AI credentials.\n", + "* Please follow the [instructions](https://chat.jina.ai/api) to generate the JinaChat credentials.\n", + "\n", + "## Create and Deploy Your Free Tier Operational cluster on Capella\n", + "\n", + "To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. This account provides you with an environment where you can explore and learn about Capella with no time constraint.\n", + "\n", + "To learn more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n", + "\n", + "### Couchbase Capella Configuration\n", + "\n", + "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met.\n", + "\n", + "* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the required bucket (Read and Write) used in the application.\n", + "* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NH2o6pqa69oG" + }, + "source": [ + "# Setting the Stage: Installing Necessary Libraries\n", + "To build our semantic search engine, we need a robust set of tools. The libraries we install handle everything from connecting to databases to performing complex machine learning tasks. Each library has a specific role: Couchbase libraries manage database operations, LangChain handles AI model integrations, and Jina provides advanced AI models for generating embeddings and understanding natural language.
By setting up these libraries, we ensure our environment is equipped to handle the data-intensive and computationally complex tasks required for semantic search." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "id": "DYhPj0Ta8l_A" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], + "source": [ + "# Jina doesn't support openai versions other than 0.27\n", + "%pip install --quiet datasets==3.6.0 langchain-couchbase==0.3.0 langchain-community==0.3.24 openai==0.27 python-dotenv==1.1.0 ipywidgets" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1pp7GtNg8mB9" + }, + "source": [ + "# Importing Necessary Libraries\n", + "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, embedding generation, and dataset loading. These libraries provide essential functions for working with data, managing database connections, and processing machine learning models." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "8GzS6tfL8mFP" + }, + "outputs": [], + "source": [ + "import getpass\n", + "import json\n", + "import logging\n", + "import os\n", + "import time\n", + "from datetime import timedelta\n", + "\n", + "from couchbase.auth import PasswordAuthenticator\n", + "from couchbase.cluster import Cluster\n", + "from couchbase.exceptions import (CouchbaseException,\n", + " InternalServerFailureException,\n", + " QueryIndexAlreadyExistsException,\n", + " ServiceUnavailableException)\n", + "from couchbase.management.buckets import CreateBucketSettings\n", + "from couchbase.management.search import SearchIndex\n", + "from couchbase.options import ClusterOptions\n", + "from datasets import load_dataset\n", + "from dotenv import load_dotenv\n", + "from langchain_community.chat_models import JinaChat\n", + "from langchain_community.embeddings import JinaEmbeddings\n", + "from langchain_core.globals import set_llm_cache\n", + "from langchain_core.output_parsers import StrOutputParser\n", + "from langchain_core.prompts import ChatPromptTemplate\n", + "from langchain_core.runnables import RunnablePassthrough\n", + "from langchain_couchbase.cache import CouchbaseCache\n", + "from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pBnMp5vb8mIb" + }, + "source": [ + "# Setup Logging\n", + "Logging is configured to track the progress of the script and capture any errors or warnings. This is crucial for debugging and understanding the flow of execution.
The logging output includes timestamps, log levels (e.g., INFO, ERROR), and messages that describe what is happening in the script.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "id": "Yv8kWcuf8mLx" + }, + "outputs": [], + "source": [ + "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s',force=True)\n", + "\n", + "# Suppress all logs from specific loggers\n", + "logging.getLogger('openai').setLevel(logging.WARNING)\n", + "logging.getLogger('httpx').setLevel(logging.WARNING)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K9G5a0en8mPA" + }, + "source": [ + "# Loading Sensitive Information\n", + "In this section, we load the essential configuration settings needed for integrating Couchbase with Jina's API. These settings include sensitive information like API keys, database credentials, and specific configuration names. Instead of hardcoding these details into the script, we read them from environment variables (via a `.env` file) at runtime, ensuring flexibility and security.\n", + "\n", + "The script also validates that all required inputs are provided, raising an error if any crucial information is missing. This approach ensures that your integration is both secure and correctly configured without hardcoding sensitive information, enhancing the overall security and maintainability of your code." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "id": "PFGyHll18mSe" + }, + "outputs": [], + "source": [ + "load_dotenv(\"./.env\") \n", + "\n", + "JINA_API_KEY = os.getenv(\"JINA_API_KEY\")\n", + "JINACHAT_API_KEY = os.getenv(\"JINACHAT_API_KEY\")\n", + "\n", + "CB_HOST = os.getenv(\"CB_HOST\") or 'couchbase://localhost'\n", + "CB_USERNAME = os.getenv(\"CB_USERNAME\") or 'Administrator'\n", + "CB_PASSWORD = os.getenv(\"CB_PASSWORD\") or 'password'\n", + "CB_BUCKET_NAME = os.getenv(\"CB_BUCKET_NAME\") or 'vector-search-testing'\n", + "INDEX_NAME = os.getenv(\"INDEX_NAME\") or 'vector_search_jina'\n", + "\n", + "SCOPE_NAME = os.getenv(\"SCOPE_NAME\") or 'shared'\n", + "COLLECTION_NAME = os.getenv(\"COLLECTION_NAME\") or 'jina'\n", + "CACHE_COLLECTION = os.getenv(\"CACHE_COLLECTION\") or 'cache'\n", + "\n", + "# Check if the variables are correctly loaded\n", + "if not JINA_API_KEY:\n", + " raise ValueError(\"JINA_API_KEY environment variable is not set\")\n", + "if not JINACHAT_API_KEY:\n", + " raise ValueError(\"JINACHAT_API_KEY environment variable is not set\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qtGrYzUY8mV3" + }, + "source": [ + "# Connecting to the Couchbase Cluster\n", + "Connecting to a Couchbase cluster is the foundation of our project. Couchbase will serve as our primary data store, handling all the storage and retrieval operations required for our semantic search engine. By establishing this connection, we enable our application to interact with the database, allowing us to perform operations such as storing embeddings, querying data, and managing collections.
This connection is the gateway through which all data will flow, so ensuring it's set up correctly is paramount.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "id": "Zb3kK-7W8mZK" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-23 10:45:51,014 - INFO - Successfully connected to Couchbase\n" + ] + } + ], + "source": [ + "try:\n", + " auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)\n", + " options = ClusterOptions(auth)\n", + " cluster = Cluster(CB_HOST, options)\n", + " cluster.wait_until_ready(timedelta(seconds=5))\n", + " logging.info(\"Successfully connected to Couchbase\")\n", + "except Exception as e:\n", + " raise ConnectionError(f\"Failed to connect to Couchbase: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C_Gpy32N8mcZ" + }, + "source": [ + "## Setting Up Collections in Couchbase\n", + "\n", + "The setup_collection() function handles creating and configuring the hierarchical data organization in Couchbase:\n", + "\n", + "1. Bucket Creation:\n", + " - Checks if specified bucket exists, creates it if not\n", + " - Sets bucket properties like RAM quota (1024MB) and replication (disabled)\n", + " - Note: You will not be able to create a bucket on Capella\n", + "\n", + "2. Scope Management: \n", + " - Verifies if requested scope exists within bucket\n", + " - Creates new scope if needed (unless it's the default \"_default\" scope)\n", + "\n", + "3. Collection Setup:\n", + " - Checks for collection existence within scope\n", + " - Creates collection if it doesn't exist\n", + " - Waits 2 seconds for collection to be ready\n", + "\n", + "Additional Tasks:\n", + "- Creates primary index on collection for query performance\n", + "- Clears any existing documents for clean state\n", + "- Implements comprehensive error handling and logging\n", + "\n", + "The function is called twice to set up:\n", + "1. Main collection for vector embeddings\n", + "2. Cache collection for storing results\n" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "id": "ACZcwUnG8mf2" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-23 10:45:56,608 - INFO - Bucket 'vector-search-testing' exists.\n", + "2025-09-23 10:45:59,312 - INFO - Collection 'jina' already exists. Skipping creation.\n", + "2025-09-23 10:46:02,683 - INFO - Primary index present or created successfully.\n", + "2025-09-23 10:46:03,447 - INFO - All documents cleared from the collection.\n", + "2025-09-23 10:46:03,449 - INFO - Bucket 'vector-search-testing' exists.\n", + "2025-09-23 10:46:06,152 - INFO - Collection 'jina_cache' already exists. Skipping creation.\n", + "2025-09-23 10:46:09,482 - INFO - Primary index present or created successfully.\n", + "2025-09-23 10:46:09,804 - INFO - All documents cleared from the collection.\n" + ] + }, + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def setup_collection(cluster, bucket_name, scope_name, collection_name):\n", + " try:\n", + " # Check if bucket exists, create if it doesn't\n", + " try:\n", + " bucket = cluster.bucket(bucket_name)\n", + " logging.info(f\"Bucket '{bucket_name}' exists.\")\n", + " except Exception as e:\n", + " logging.info(f\"Bucket '{bucket_name}' does not exist. 
Creating it...\")\n", + " bucket_settings = CreateBucketSettings(\n", + " name=bucket_name,\n", + " bucket_type='couchbase',\n", + " ram_quota_mb=1024,\n", + " flush_enabled=True,\n", + " num_replicas=0\n", + " )\n", + " cluster.buckets().create_bucket(bucket_settings)\n", + " time.sleep(2) # Wait for bucket creation to complete and become available\n", + " bucket = cluster.bucket(bucket_name)\n", + " logging.info(f\"Bucket '{bucket_name}' created successfully.\")\n", + "\n", + " bucket_manager = bucket.collections()\n", + "\n", + " # Check if scope exists, create if it doesn't\n", + " scopes = bucket_manager.get_all_scopes()\n", + " scope_exists = any(scope.name == scope_name for scope in scopes)\n", + " \n", + " if not scope_exists and scope_name != \"_default\":\n", + " logging.info(f\"Scope '{scope_name}' does not exist. Creating it...\")\n", + " bucket_manager.create_scope(scope_name)\n", + " logging.info(f\"Scope '{scope_name}' created successfully.\")\n", + "\n", + " # Check if collection exists, create if it doesn't\n", + " collections = bucket_manager.get_all_scopes()\n", + " collection_exists = any(\n", + " scope.name == scope_name and collection_name in [col.name for col in scope.collections]\n", + " for scope in collections\n", + " )\n", + "\n", + " if not collection_exists:\n", + " logging.info(f\"Collection '{collection_name}' does not exist. Creating it...\")\n", + " bucket_manager.create_collection(scope_name, collection_name)\n", + " logging.info(f\"Collection '{collection_name}' created successfully.\")\n", + " else:\n", + " logging.info(f\"Collection '{collection_name}' already exists. Skipping creation.\")\n", + "\n", + " # Wait for collection to be ready\n", + " collection = bucket.scope(scope_name).collection(collection_name)\n", + " time.sleep(2) # Give the collection time to be ready for queries\n", + "\n", + " # Ensure primary index exists\n", + " try:\n", + " cluster.query(f\"CREATE PRIMARY INDEX IF NOT EXISTS ON `{bucket_name}`.`{scope_name}`.`{collection_name}`\").execute()\n", + " logging.info(\"Primary index present or created successfully.\")\n", + " except Exception as e:\n", + " logging.warning(f\"Error creating primary index: {str(e)}\")\n", + "\n", + " # Clear all documents in the collection\n", + " try:\n", + " query = f\"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`\"\n", + " cluster.query(query).execute()\n", + " logging.info(\"All documents cleared from the collection.\")\n", + " except Exception as e:\n", + " logging.warning(f\"Error while clearing documents: {str(e)}. The collection might be empty.\")\n", + "\n", + " return collection\n", + " except Exception as e:\n", + " raise RuntimeError(f\"Error setting up collection: {str(e)}\")\n", + " \n", + "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME)\n", + "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, CACHE_COLLECTION)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NMJ7RRYp8mjV" + }, + "source": [ + "# Loading Couchbase Vector Search Index\n", + "\n", + "Semantic search requires an efficient way to retrieve relevant documents based on a user's query. This is where the Couchbase **Vector Search Index** comes into play. In this step, we load the Vector Search Index definition from a JSON file, which specifies how the index should be structured. 
This includes the fields to be indexed, the dimensions of the vectors, and other parameters that determine how the search engine processes queries based on vector similarity.\n", + "\n", + "This Jina vector search index configuration requires specific default settings to function properly. This tutorial uses the bucket named `vector-search-testing` with the scope `shared` and collection `jina`. The configuration is set up for vectors with exactly `1024 dimensions`, using dot product similarity and optimized for recall. If you want to use a different bucket, scope, or collection, you will need to modify the index configuration accordingly.\n", + "\n", + "For more information on creating a vector search index, please follow the [instructions](https://docs.couchbase.com/cloud/vector-search/create-vector-search-index-ui.html).\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "id": "y7xiCrOc8mmj" + }, + "outputs": [], + "source": [ + "# If you are running this script locally (not in Google Colab), uncomment the following line\n", + "# and provide the path to your index definition file.\n", + "\n", + "# index_definition_path = '/path_to_your_index_file/jina_index.json' # Local setup: specify your file path here\n", + "\n", + "# # Version for Google Colab\n", + "# def load_index_definition_colab():\n", + "# from google.colab import files\n", + "# print(\"Upload your index definition file\")\n", + "# uploaded = files.upload()\n", + "# index_definition_path = list(uploaded.keys())[0]\n", + "\n", + "# try:\n", + "# with open(index_definition_path, 'r') as file:\n", + "# index_definition = json.load(file)\n", + "# return index_definition\n", + "# except Exception as e:\n", + "# raise ValueError(f\"Error loading index definition from {index_definition_path}: {str(e)}\")\n", + "\n", + "# Version for Local Environment\n", + "def load_index_definition_local(index_definition_path):\n", + " try:\n", + " with open(index_definition_path, 'r') as file:\n", + " index_definition = json.load(file)\n", + " return index_definition\n", + " except Exception as e:\n", + " raise ValueError(f\"Error loading index definition from {index_definition_path}: {str(e)}\")\n", + "\n", + "# Usage\n", + "# Uncomment the appropriate line based on your environment\n", + "# index_definition = load_index_definition_colab()\n", + "index_definition = load_index_definition_local('jina_index.json')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "v_ddPQ_Y8mpm" + }, + "source": [ + "# Creating or Updating Search Indexes\n", + "\n", + "With the index definition loaded, the next step is to create or update the **Vector Search Index** in Couchbase. This step is crucial because it optimizes our database for vector similarity search operations, allowing us to perform searches based on the semantic content of documents rather than just keywords. By creating or updating a Vector Search Index, we enable our search engine to handle complex queries that involve finding semantically similar documents using vector embeddings, which is essential for a robust semantic search engine." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "id": "bHEpUu1l8msx" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-23 10:47:03,763 - INFO - Index 'vector_search_jina' found\n", + "2025-09-23 10:47:04,742 - INFO - Index 'vector_search_jina' already exists. 
Skipping creation/update.\n" + ] + } + ], + "source": [ + "try:\n", + " scope_index_manager = cluster.bucket(CB_BUCKET_NAME).scope(SCOPE_NAME).search_indexes()\n", + "\n", + " # Check if index already exists\n", + " existing_indexes = scope_index_manager.get_all_indexes()\n", + " index_name = index_definition[\"name\"]\n", + "\n", + " if index_name in [index.name for index in existing_indexes]:\n", + " logging.info(f\"Index '{index_name}' found\")\n", + " else:\n", + " logging.info(f\"Creating new index '{index_name}'...\")\n", + "\n", + " # Create SearchIndex object from JSON definition\n", + " search_index = SearchIndex.from_json(index_definition)\n", + "\n", + " # Upsert the index (create if not exists, update if exists)\n", + " scope_index_manager.upsert_index(search_index)\n", + " logging.info(f\"Index '{index_name}' successfully created/updated.\")\n", + "\n", + "except QueryIndexAlreadyExistsException:\n", + " logging.info(f\"Index '{index_name}' already exists. Skipping creation/update.\")\n", + "except ServiceUnavailableException:\n", + " raise RuntimeError(\"Search service is not available. Please ensure the Search service is enabled in your Couchbase cluster.\")\n", + "except InternalServerFailureException as e:\n", + " logging.error(f\"Internal server error: {str(e)}\")\n", + " raise" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7FvxRsg38m3G" + }, + "source": [ + "# Creating Jina Embeddings\n", + "Embeddings are at the heart of semantic search. They are numerical representations of text that capture the semantic meaning of the words and phrases. Unlike traditional keyword-based search, which looks for exact matches, embeddings allow our search engine to understand the context and nuances of language, enabling it to retrieve documents that are semantically similar to the query, even if they don't contain the exact keywords. By creating embeddings using Jina, we equip our search engine with the ability to understand and process natural language in a way that's much closer to how humans understand language. This step transforms our raw text data into a format that the search engine can use to find and rank relevant documents.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "id": "_75ZyCRh8m6m" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-23 10:47:06,326 - INFO - Successfully created JinaEmbeddings\n" + ] + } + ], + "source": [ + "try:\n", + " embeddings = JinaEmbeddings(\n", + " jina_api_key=JINA_API_KEY, model_name=\"jina-embeddings-v3\"\n", + " )\n", + " logging.info(\"Successfully created JinaEmbeddings\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Error creating JinaEmbeddings: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8IwZMUnF8m-N" + }, + "source": [ + "# Setting Up the Couchbase Vector Store\n", + "A vector store is where we'll keep our embeddings. Unlike a traditional Search (FTS) index, which is used for keyword-based text search, the vector store is specifically designed to handle embeddings and perform similarity searches. When a user inputs a query, the search engine converts the query into an embedding and compares it against the embeddings stored in the vector store. This allows the engine to find documents that are semantically similar to the query, even if they don't contain the exact same words.
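As a quick aside to build intuition for what this comparison involves: with dot-product similarity (the metric this tutorial's index is configured for), each stored vector is scored against the query vector and the highest-scoring documents win. The toy example below is purely illustrative and independent of Couchbase; the vectors and document IDs are made up.

```python
import numpy as np

# Toy dot-product ranking: higher score = more similar (illustrative only).
query_vec = np.array([0.2, 0.9, 0.1])
stored_vectors = {
    "doc_a": np.array([0.1, 0.8, 0.2]),  # close in direction to the query
    "doc_b": np.array([0.9, 0.1, 0.3]),  # points elsewhere, scores lower
    "doc_c": np.array([0.3, 0.7, 0.0]),
}
scores = {doc_id: float(np.dot(query_vec, vec)) for doc_id, vec in stored_vectors.items()}
for doc_id, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{doc_id}: {score:.3f}")  # prints doc_a, then doc_c, then doc_b
```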
By setting up the vector store in Couchbase, we create a powerful tool that enables our search engine to understand and retrieve information based on the meaning and context of the query, rather than just the specific words used." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "id": "DwIJQjYT9RV_" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-23 10:47:12,343 - INFO - Successfully created vector store\n" + ] + } + ], + "source": [ + "try:\n", + " vector_store = CouchbaseSearchVectorStore(\n", + " cluster=cluster,\n", + " bucket_name=CB_BUCKET_NAME,\n", + " scope_name=SCOPE_NAME,\n", + " collection_name=COLLECTION_NAME,\n", + " embedding=embeddings,\n", + " index_name=INDEX_NAME,\n", + " )\n", + " logging.info(\"Successfully created vector store\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Failed to create vector store: {str(e)}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Load the BBC News Dataset\n", + "To build a search engine, we need data to search through. We use the BBC News dataset from RealTimeData, which provides real-world news articles. This dataset contains news articles from BBC covering various topics and time periods. Loading the dataset is a crucial step because it provides the raw material that our search engine will work with. The quality and diversity of the news articles make it an excellent choice for testing and refining our search engine, ensuring it can handle real-world news content effectively.\n", + "\n", + "The BBC News dataset allows us to work with authentic news articles, enabling us to build and test a search engine that can effectively process and retrieve relevant news content. The dataset is loaded using the Hugging Face datasets library, specifically accessing the \"RealTimeData/bbc_news_alltime\" dataset with the \"2024-12\" version." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-23 10:47:18,035 - INFO - Successfully loaded the BBC News dataset with 2687 rows.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loaded the BBC News dataset with 2687 rows\n" + ] + } + ], + "source": [ + "try:\n", + " news_dataset = load_dataset(\n", + " \"RealTimeData/bbc_news_alltime\", \"2024-12\", split=\"train\"\n", + " )\n", + " print(f\"Loaded the BBC News dataset with {len(news_dataset)} rows\")\n", + " logging.info(f\"Successfully loaded the BBC News dataset with {len(news_dataset)} rows.\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Error loading the BBC News dataset: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Cleaning up the Data\n", + "We will use the content of the news articles for our RAG system.\n", + "\n", + "The dataset contains a few duplicate records. We are removing them to avoid duplicate results in the retrieval stage of our RAG system." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "We have 1749 unique articles in our database.\n" + ] + } + ], + "source": [ + "news_articles = news_dataset[\"content\"]\n", + "unique_articles = set()\n", + "for article in news_articles:\n", + " if article:\n", + " unique_articles.add(article)\n", + "unique_news_articles = list(unique_articles)\n", + "print(f\"We have {len(unique_news_articles)} unique articles in our database.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Saving Data to the Vector Store\n", + "To efficiently handle the large number of articles, we process them in batches of 50 articles at a time. This batch processing approach helps manage memory usage and provides better control over the ingestion process.\n", + "\n", + "We first filter out any articles that exceed 50,000 characters to avoid potential issues with token limits. Then, using the vector store's add_texts method, we add the filtered articles to our vector database. The batch_size parameter controls how many articles are processed in each iteration.\n", + "\n", + "This approach offers several benefits:\n", + "1. Memory Efficiency: Processing in smaller batches prevents memory overload\n", + "2. Error Handling: If an error occurs, only the current batch is affected\n", + "3. Progress Tracking: Easier to monitor and track the ingestion progress\n", + "4. Resource Management: Better control over CPU and network resource utilization\n", + "\n", + "We use a conservative batch size of 50 to ensure reliable operation.\n", + "The optimal batch size depends on many factors including:\n", + "- Document sizes being inserted\n", + "- Available system resources\n", + "- Network conditions\n", + "- Concurrent workload\n", + "\n", + "Consider measuring performance with your specific workload before adjusting.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-23 10:50:03,866 - INFO - Document ingestion completed successfully\n" + ] + } + ], + "source": [ + "# Calculate 60% of the dataset size and round to nearest integer\n", + "dataset_size = len(unique_news_articles)\n", + "subset_size = round(dataset_size * 0.6)\n", + "\n", + "# Filter articles by length and create subset\n", + "filtered_articles = [article for article in unique_news_articles[:subset_size] \n", + " if article and len(article) <= 50000]\n", + "\n", + "# Process in batches\n", + "batch_size = 50\n", + "\n", + "try:\n", + " vector_store.add_texts(\n", + " texts=filtered_articles,\n", + " batch_size=batch_size\n", + " )\n", + " logging.info(\"Document ingestion completed successfully\")\n", + " \n", + "except CouchbaseException as e:\n", + " logging.error(f\"Couchbase error during ingestion: {str(e)}\")\n", + " raise RuntimeError(f\"Error performing document ingestion: {str(e)}\")\n", + "except Exception as e:\n", + " if \"Payment Required\" in str(e):\n", + " logging.error(\"Payment required for Jina AI API. Please check your subscription status and API key.\")\n", + " print(\"To resolve this error:\")\n", + " print(\"1. Visit 'https://jina.ai/reader/#pricing' to review subscription options\")\n", + " print(\"2. Ensure your API key is valid and has sufficient credits\") \n", + " print(\"3.
Consider upgrading your subscription plan if needed\")\n", + " else:\n", + " logging.error(f\"Unexpected error during ingestion: {str(e)}\")\n", + " raise RuntimeError(f\"Failed to save documents to vector store: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8Pn8-dQw9RfQ" + }, + "source": [ + "# Setting Up a Couchbase Cache\n", + "To further optimize our system, we set up a Couchbase-based cache. A cache is a temporary storage layer that holds data that is frequently accessed, speeding up operations by reducing the need to repeatedly retrieve the same information from the database. In our setup, the cache will help us accelerate repetitive tasks, such as looking up similar documents. By implementing a cache, we enhance the overall performance of our search engine, ensuring that it can handle high query volumes and deliver results quickly.\n", + "\n", + "Caching is particularly valuable in scenarios where users may submit similar queries multiple times or where certain pieces of information are frequently requested. By storing these in a cache, we can significantly reduce the time it takes to respond to these queries, improving the user experience.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": { + "id": "V2y7dyjf9Rid" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-23 10:50:21,526 - INFO - Successfully created cache\n" + ] + } + ], + "source": [ + "try:\n", + " cache = CouchbaseCache(\n", + " cluster=cluster,\n", + " bucket_name=CB_BUCKET_NAME,\n", + " scope_name=SCOPE_NAME,\n", + " collection_name=CACHE_COLLECTION,\n", + " )\n", + " logging.info(\"Successfully created cache\")\n", + " set_llm_cache(cache)\n", + "except Exception as e:\n", + " raise ValueError(f\"Failed to create cache: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uehAx36o9Rlm" + }, + "source": [ + "# Creating the Jina Language Model (LLM)\n", + "Language models are AI systems that are trained to understand and generate human language. We'll be using Jina's language model to process user queries and generate meaningful responses. This model is a key component of our semantic search engine, allowing it to go beyond simple keyword matching and truly understand the intent behind a query. By creating this language model, we equip our search engine with the ability to interpret complex queries, understand the nuances of language, and provide more accurate and contextually relevant responses.\n", + "\n", + "The language model's ability to understand context and generate coherent responses is what makes our search engine truly intelligent. It can not only find the right information but also present it in a way that is useful and understandable to the user.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": { + "id": "yRAfBRLH9RpO" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-23 10:50:22,466 - INFO - Successfully created JinaChat\n" + ] + } + ], + "source": [ + "try:\n", + " llm = JinaChat(temperature=0.1, jinachat_api_key=JINACHAT_API_KEY)\n", + " logging.info(\"Successfully created JinaChat\")\n", + "except Exception as e:\n", + " logging.error(f\"Error creating JinaChat: {str(e)}. 
Please check your API key and network connection.\")\n", + " raise" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "afOOEECGiLuQ" + }, + "source": [ + "## Perform Semantic Search\n", + "Semantic search in Couchbase involves converting queries and documents into vector representations using an embeddings model. These vectors capture the semantic meaning of the text and are stored directly in Couchbase. When a query is made, Couchbase performs a similarity search by comparing the query vector against the stored document vectors. The similarity metric used for this comparison is configurable, allowing flexibility in how the relevance of documents is determined.\n", + "\n", + "In the provided code, the search process begins by recording the start time, followed by executing the similarity_search_with_score method of the CouchbaseSearchVectorStore. This method searches Couchbase for the most relevant documents based on the vector similarity to the query. The search results include the document content and a similarity score that reflects how closely each document aligns with the query in the defined semantic space. The time taken to perform this search is then calculated and logged, and the results are displayed, showing the most relevant documents along with their similarity scores. This approach leverages Couchbase as both a storage and retrieval engine for vector data, enabling efficient and scalable semantic searches. The integration of vector storage and search capabilities within Couchbase allows for sophisticated semantic search operations without relying on external services for vector storage or comparison.\n", + "\n", + "### Note on Retry Mechanism\n", + "The search implementation includes a retry mechanism to handle rate limiting and API errors gracefully. If a rate limit error (HTTP 429) is encountered, the system will automatically retry the request up to 3 times with exponential backoff, waiting 2 seconds initially and doubling the wait time between each retry. This helps manage API usage limits while maintaining service reliability. For other types of errors, such as payment requirements or general failures, appropriate error messages and troubleshooting steps are provided to help diagnose and resolve the issue." + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": { + "id": "y3oO33_LiLxU" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-23 10:50:25,678 - INFO - Semantic search completed in 2.13 seconds\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Semantic Search Results (completed in 2.13 seconds):\n", + "--------------------------------------------------------------------------------\n", + "Score: 0.6798, Text: 'Self-doubt, errors & big changes' - inside the crisis at Man City\n", + "\n", + "Pep Guardiola has not been through a moment like this in his managerial career. Manchester City have lost nine matches in their past 12 - as many defeats as they had suffered in their previous 106 fixtures. At the end of October, City were still unbeaten at the top of the Premier League and favourites to win a fifth successive title. Now they are seventh, 12 points behind leaders Liverpool having played a game more. It has been an incredible fall from grace and left people trying to work out what has happened - and whether Guardiola can make it right. 
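To make the retry policy described in the note above concrete, here is a minimal sketch of such a wrapper: up to 3 attempts, a 2-second wait before the first retry, and the wait doubling each time. This is an illustrative assumption of how the mechanism could be written, not the notebook's actual helper, and `search_with_retry` is a hypothetical name.

```python
import time

def search_with_retry(vector_store, query, k=4, max_retries=3, initial_wait=2.0):
    # Hypothetical sketch of the retry-with-backoff policy described above.
    wait = initial_wait
    for attempt in range(1, max_retries + 1):
        try:
            return vector_store.similarity_search_with_score(query, k=k)
        except Exception as e:
            # Retry only on rate limiting (HTTP 429); re-raise everything else,
            # and give up once the final attempt has failed.
            if "429" not in str(e) or attempt == max_retries:
                raise
            time.sleep(wait)
            wait *= 2  # exponential backoff: 2s, 4s, ...
```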
After discussing the situation with those who know him best, I have taken a closer look at the future - both short and long term - and how the current crisis at Man City is going to be solved.\n", + "\n", + "Pep Guardiola's Man City have lost nine of their past 12 matches\n", + "\n", + "Guardiola has also been giving it a lot of thought. He has not been sleeping very well, as he has said, and has not been himself at times when talking to the media. He has been talking to a lot of people about what is going on as he tries to work out the reasons for City's demise. Some reasons he knows, others he still doesn't. What people perhaps do not realise is Guardiola hugely doubts himself and always has. He will be thinking \"I'm not going to be able to get us out of this\" and needs the support of people close to him to push away those insecurities - and he has that. He is protected by his people who are very aware, like he is, that there are a lot of people that want City to fail. It has been a turbulent time for Guardiola. Remember those marks he had on his head after the 3-3 draw with Feyenoord in the Champions League? He always scratches his head, it is a gesture of nervousness. Normally nothing happens but on that day one of his nails was far too sharp so, after talking to the players in the changing room where he scratched his head because of his usual agitated gesturing, he went to the news conference. His right-hand man Manel Estiarte sent him photos in a message saying \"what have you got on your head?\", but by the time Guardiola returned to the coaching room there was hardly anything there again. He started that day with a cover on his nose after the same thing happened at the training ground the day before. Guardiola was having a footballing debate with Kyle Walker about positional stuff and marked his nose with that same nail. There was also that remarkable news conference after the Manchester derby when he said \"I don't know what to do\". That is partly true and partly not true. Ignore the fact Guardiola suggested he was \"not good enough\". He actually meant he was not good enough to resolve the situation with the group of players he has available and with all the other current difficulties. There are obviously logical explanations for the crisis and the first one has been talked about many times - the absence of injured midfielder Rodri. You know the game Jenga? When you take the wrong piece out, the whole tower collapses. That is what has happened here. It is normal for teams to have an over-reliance on one player if he is the best in the world in his position. And you cannot calculate the consequences of an injury that rules someone like Rodri out for the season. City are a team, like many modern ones, in which the holding midfielder is a key element to the construction. So, when you take Rodri out, it is difficult to hold it together. There were Plan Bs - John Stones, Manuel Akanji, even Nathan Ake - but injuries struck. The big injury list has been out of the ordinary and the busy calendar has also played a part in compounding the issues. However, one factor even Guardiola cannot explain is the big uncharacteristic errors in almost every game from international players. Why did Matheus Nunes make that challenge to give away the penalty against Manchester United? Jack Grealish is sent on at the end to keep the ball and cannot do that. There are errors from Walker and other defenders. These are some of the best players in the world. 
Of course the players' mindset is important, and confidence is diminishing. Wrong decisions get taken so there is almost panic on the pitch instead of calm. There are also players badly out of form who are having to play because of injuries. Walker is now unable to hide behind his pace, I'm not sure Kevin de Bruyne is ever getting back to the level he used to be at, Bernardo Silva and Ilkay Gundogan do not have time to rest, Grealish is not playing at his best. Some of these players were only meant to be playing one game a week but, because of injuries, have played 12 games in 40 days. It all has a domino effect. One consequence is that Erling Haaland isn't getting the service to score. But the Norwegian still remains City's top-scorer with 13. Defender Josko Gvardiol is next on the list with just four. The way their form has been analysed inside the City camp is there have only been three games where they deserved to lose (Liverpool, Bournemouth and Aston Villa). But of course it is time to change the dynamic.\n", + "\n", + "Guardiola has never protected his players so much. He has not criticised them and is not going to do so. They have won everything with him. Instead of doing more with them, he has tried doing less. He has sometimes given them more days off to clear their heads, so they can reset - two days this week for instance. Perhaps the time to change a team is when you are winning, but no-one was suggesting Man City were about to collapse when they were top and unbeaten after nine league games. Some people have asked how bad it has to get before City make a decision on Guardiola. The answer is that there is no decision to be made. Maybe if this was Real Madrid, Barcelona or Juventus, the pressure from outside would be massive and the argument would be made that Guardiola has to go. At City he has won the lot, so how can anyone say he is failing? Yes, this is a crisis. But given all their problems, City's renewed target is finishing in the top four. That is what is in all their heads now. The idea is to recover their essence by improving defensive concepts that are not there and re-establishing the intensity they are known for. Guardiola is planning to use the next two years of his contract, which is expected to be his last as a club manager, to prepare a new Manchester City. When he was at the end of his four years at Barcelona, he asked two managers what to do when you feel people are not responding to your instructions. Do you go or do the players go? Sir Alex Ferguson and Rafael Benitez both told him that the players need to go. Guardiola did not listen because of his emotional attachment to his players back then and he decided to leave the Camp Nou because he felt the cycle was over. He will still protect his players now but there is not the same emotional attachment - so it is the players who are going to leave this time. It is likely City will look to replace five or six regular starters. Guardiola knows it is the end of an era and the start of a new one. Changes will not be immediate and the majority of the work will be done in the summer. But they are open to any opportunities in January - and a holding midfielder is one thing they need. In the summer City might want to get Spain's Martin Zubimendi from Real Sociedad and they know 60m euros (\u00a350m) will get him. He said no to Liverpool last summer even though everything was agreed, but he now wants to move on and the Premier League is the target. 
Even if they do not get Zubimendi, that is the calibre of footballer they are after. A new Manchester City is on its way - with changes driven by Guardiola, incoming sporting director Hugo Viana and the football department.\n", + "--------------------------------------------------------------------------------\n", + "Score: 0.6795, Text: 'Self-doubt, errors & big changes' - inside the crisis at Man City\n", + "\n", + "Pep Guardiola has not been through a moment like this in his managerial career. Manchester City have lost nine matches in their past 12 - as many defeats as they had suffered in their previous 106 fixtures. At the end of October, City were still unbeaten at the top of the Premier League and favourites to win a fifth successive title. Now they are seventh, 12 points behind leaders Liverpool having played a game more. It has been an incredible fall from grace and left people trying to work out what has happened - and whether Guardiola can make it right. After discussing the situation with those who know him best, I have taken a closer look at the future - both short and long term - and how the current crisis at Man City is going to be solved.\n", + "\n", + "Pep Guardiola's Man City have lost nine of their past 12 matches\n", + "\n", + "Guardiola has also been giving it a lot of thought. He has not been sleeping very well, as he has said, and has not been himself at times when talking to the media. He has been talking to a lot of people about what is going on as he tries to work out the reasons for City's demise. Some reasons he knows, others he still doesn't. What people perhaps do not realise is Guardiola hugely doubts himself and always has. He will be thinking \"I'm not going to be able to get us out of this\" and needs the support of people close to him to push away those insecurities - and he has that. He is protected by his people who are very aware, like he is, that there are a lot of people that want City to fail. It has been a turbulent time for Guardiola. Remember those marks he had on his head after the 3-3 draw with Feyenoord in the Champions League? He always scratches his head, it is a gesture of nervousness. Normally nothing happens but on that day one of his nails was far too sharp so, after talking to the players in the changing room where he scratched his head because of his usual agitated gesturing, he went to the news conference. His right-hand man Manel Estiarte sent him photos in a message saying \"what have you got on your head?\", but by the time Guardiola returned to the coaching room there was hardly anything there again. He started that day with a cover on his nose after the same thing happened at the training ground the day before. Guardiola was having a footballing debate with Kyle Walker about positional stuff and marked his nose with that same nail. There was also that remarkable news conference after the Manchester derby when he said \"I don't know what to do\". That is partly true and partly not true. Ignore the fact Guardiola suggested he was \"not good enough\". He actually meant he was not good enough to resolve the situation with the group of players he has available and with all the other current difficulties. There are obviously logical explanations for the crisis and the first one has been talked about many times - the absence of injured midfielder Rodri. You know the game Jenga? When you take the wrong piece out, the whole tower collapses. That is what has happened here. 
It is normal for teams to have an over-reliance on one player if he is the best in the world in his position. And you cannot calculate the consequences of an injury that rules someone like Rodri out for the season. City are a team, like many modern ones, in which the holding midfielder is a key element to the construction. So, when you take Rodri out, it is difficult to hold it together. There were Plan Bs - John Stones, Manuel Akanji, even Nathan Ake - but injuries struck. The big injury list has been out of the ordinary and the busy calendar has also played a part in compounding the issues. However, one factor even Guardiola cannot explain is the big uncharacteristic errors in almost every game from international players. Why did Matheus Nunes make that challenge to give away the penalty against Manchester United? Jack Grealish is sent on at the end to keep the ball and cannot do that. There are errors from Walker and other defenders. These are some of the best players in the world. Of course the players' mindset is important, and confidence is diminishing. Wrong decisions get taken so there is almost panic on the pitch instead of calm. There are also players badly out of form who are having to play because of injuries. Walker is now unable to hide behind his pace, I'm not sure Kevin de Bruyne is ever getting back to the level he used to be at, Bernardo Silva and Ilkay Gundogan do not have time to rest, Grealish is not playing at his best. Some of these players were only meant to be playing one game a week but, because of injuries, have played 12 games in 40 days. It all has a domino effect. One consequence is that Erling Haaland isn't getting the service to score. But the Norwegian still remains City's top-scorer with 13. Defender Josko Gvardiol is next on the list with just four. The way their form has been analysed inside the City camp is there have only been three games where they deserved to lose (Liverpool, Bournemouth and Aston Villa). But of course it is time to change the dynamic.\n", + "\n", + "Guardiola has never protected his players so much. He has not criticised them and is not going to do so. They have won everything with him. Instead of doing more with them, he has tried doing less. He has sometimes given them more days off to clear their heads, so they can reset - two days this week for instance. Perhaps the time to change a team is when you are winning, but no-one was suggesting Man City were about to collapse when they were top and unbeaten after nine league games. Some people have asked how bad it has to get before City make a decision on Guardiola. The answer is that there is no decision to be made. Maybe if this was Real Madrid, Barcelona or Juventus, the pressure from outside would be massive and the argument would be made that Guardiola has to go. At City he has won the lot, so how can anyone say he is failing? Yes, this is a crisis. But given all their problems, City's renewed target is finishing in the top four. That is what is in all their heads now. The idea is to recover their essence by improving defensive concepts that are not there and re-establishing the intensity they are known for. Guardiola is planning to use the next two years of his contract, which is expected to be his last as a club manager, to prepare a new Manchester City. When he was at the end of his four years at Barcelona, he asked two managers what to do when you feel people are not responding to your instructions. Do you go or do the players go? 
Sir Alex Ferguson and Rafael Benitez both told him that the players need to go. Guardiola did not listen because of his emotional attachment to his players back then and he decided to leave the Camp Nou because he felt the cycle was over. He will still protect his players now but there is not the same emotional attachment - so it is the players who are going to leave this time. It is likely City will look to replace five or six regular starters. Guardiola knows it is the end of an era and the start of a new one. Changes will not be immediate and the majority of the work will be done in the summer. But they are open to any opportunities in January - and a holding midfielder is one thing they need. In the summer City might want to get Spain's Martin Zubimendi from Real Sociedad and they know 60m euros (\u00a350m) will get him. He said no to Liverpool last summer even though everything was agreed, but he now wants to move on and the Premier League is the target. Even if they do not get Zubimendi, that is the calibre of footballer they are after. A new Manchester City is on its way - with changes driven by Guardiola, incoming sporting director Hugo Viana and the football department.\n", + "--------------------------------------------------------------------------------\n", + "Score: 0.6207, Text: Manchester City boss Pep Guardiola has won 18 trophies since he arrived at the club in 2016\n", + "\n", + "\n", + "... (output truncated for brevity)\n" + ] + } + ], + "source": [ + "def perform_semantic_search(query, vector_store, max_retries=3, retry_delay=2): \n", + " for attempt in range(max_retries):\n", + " try:\n", + " start_time = time.time()\n", + " search_results = vector_store.similarity_search_with_score(query, k=5)\n", + " search_elapsed_time = time.time() - start_time\n", + " \n", + " logging.info(f\"Semantic search completed in {search_elapsed_time:.2f} seconds\")\n", + " return search_results, search_elapsed_time\n", + " \n", + " except Exception as e:\n", + " error_str = str(e)\n", + " \n", + " # Check if it's a rate limit error (HTTP 429)\n", + " if \"http_status: 429\" in error_str or \"query request rejected\" in error_str:\n", + " logging.warning(f\"Rate limit hit (attempt {attempt+1}/{max_retries}). Waiting {retry_delay} seconds...\")\n", + " time.sleep(retry_delay)\n", + " retry_delay *= 2 # Exponential backoff\n", + " \n", + " if attempt == max_retries - 1:\n", + " logging.error(\"Maximum retry attempts reached. API rate limit exceeded.\")\n", + " raise RuntimeError(\"API rate limit exceeded. Please try again later or check your subscription.\")\n", + " else:\n", + " # For other errors, don't retry\n", + " logging.error(f\"Search error: {error_str}\")\n", + " if \"Payment Required\" in error_str:\n", + " raise RuntimeError(\"Payment required for Jina AI API. 
Please check your subscription status and API key.\")\n", + " else:\n", + " raise RuntimeError(f\"Search failed: {error_str}\")\n", + "\n", + "try:\n", + " query = \"What was manchester city manager pep guardiola's reaction to the team's current form?\"\n", + " search_results, search_elapsed_time = perform_semantic_search(query, vector_store)\n", + " \n", + " # Display search results\n", + " print(f\"\\nSemantic Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", + " print(\"-\"*80)\n", + " for doc, score in search_results:\n", + " print(f\"Score: {score:.4f}, Text: {doc.page_content}\")\n", + " print(\"-\"*80)\n", + " \n", + "except RuntimeError as e:\n", + " print(f\"Error: {str(e)}\")\n", + " print(\"\\nTroubleshooting steps:\")\n", + " if \"API rate limit\" in str(e):\n", + " print(\"1. Wait a few minutes before trying again\")\n", + " print(\"2. Reduce the frequency of your requests\")\n", + " print(\"3. Consider upgrading your Jina AI plan for higher rate limits\")\n", + " elif \"Payment required\" in str(e):\n", + " print(\"1. Visit 'https://jina.ai/reader/#pricing' to review subscription options\")\n", + " print(\"2. Ensure your API key is valid and has sufficient credits\")\n", + " print(\"3. Update your API key configuration\")\n", + " else:\n", + " print(\"1. Check your network connection\")\n", + " print(\"2. Verify your Couchbase and Jina configurations\")\n", + " print(\"3. Review the vector store implementation for any bugs\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6bp8YEEQiL0r" + }, + "source": [ + "# Retrieval-Augmented Generation (RAG) with Couchbase and Langchain\n", + "Couchbase and LangChain can be seamlessly integrated to create RAG (Retrieval-Augmented Generation) chains, enhancing the process of generating contextually relevant responses. In this setup, Couchbase serves as the vector store, where embeddings of documents are stored. When a query is made, LangChain retrieves the most relevant documents from Couchbase by comparing the query\u2019s embedding with the stored document embeddings. These documents, which provide contextual information, are then passed to a generative language model within LangChain.\n", + "\n", + "The language model, equipped with the context from the retrieved documents, generates a response that is both informed and contextually accurate. This integration allows the RAG chain to leverage Couchbase\u2019s efficient storage and retrieval capabilities, while LangChain handles the generation of responses based on the context provided by the retrieved documents. Together, they create a powerful system that can deliver highly relevant and accurate answers by combining the strengths of both retrieval and generation." + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": { + "id": "fTolIHFpiL30" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-23 10:50:26,937 - INFO - Successfully created RAG chain\n" + ] + } + ], + "source": [ + "try:\n", + " template = \"\"\"You are a helpful bot. If you cannot answer based on the context provided, respond with a generic answer. 
Answer the question as truthfully as possible using the context below:\n", + " {context}\n", + "\n", + " Question: {question}\"\"\"\n", + " prompt = ChatPromptTemplate.from_template(template)\n", + "\n", + " rag_chain = (\n", + " {\"context\": vector_store.as_retriever(search_kwargs={\"k\": 2}), \"question\": RunnablePassthrough()}\n", + " | prompt\n", + " | llm\n", + " | StrOutputParser()\n", + " )\n", + " logging.info(\"Successfully created RAG chain\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Error creating RAG chain: {str(e)}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": { + "id": "6GbtJzTEiL7M" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-23 10:50:47,733 - INFO - RAG response generated in 17.23 seconds using k=2\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "RAG Response: Pep Guardiola has been grappling with self-doubt and seeking support to navigate Manchester City's current crisis.\n", + "Response generated in 17.23 seconds\n" + ] + } + ], + "source": [ + "try:\n", + " # Create chain with k=2\n", + " # Start with k=4 and gradually reduce if token limit exceeded\n", + " # k=4 -> k=3 -> k=2 based on token limit warnings\n", + " # Final k=2 produced a valid response within the model's token limit\n", + " current_chain = (\n", + " {\n", + " \"context\": vector_store.as_retriever(search_kwargs={\"k\": 2}),\n", + " \"question\": RunnablePassthrough()\n", + " }\n", + " | prompt\n", + " | llm\n", + " | StrOutputParser()\n", + " )\n", + " \n", + " # Try to get response\n", + " start_time = time.time()\n", + " rag_response = current_chain.invoke(query)\n", + " elapsed_time = time.time() - start_time\n", + " \n", + " logging.info(f\"RAG response generated in {elapsed_time:.2f} seconds using k=2\")\n", + " print(f\"RAG Response: {rag_response}\")\n", + " print(f\"Response generated in {elapsed_time:.2f} seconds\")\n", + " \n", + "except Exception as e:\n", + " if \"Payment Required\" in str(e):\n", + " logging.error(\"Payment required for Jina AI API. Please check your subscription status and API key.\")\n", + " print(\"To resolve this error:\")\n", + " print(\"1. Visit 'https://jina.ai/reader/#pricing' to review subscription options\")\n", + " print(\"2. Ensure your API key is valid and has sufficient credits\")\n", + " print(\"3. Consider upgrading your subscription plan if needed\")\n", + " else:\n", + " raise RuntimeError(f\"Unexpected error: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T8hCgpMyiL-J" + }, + "source": [ + "# Using Couchbase as a caching mechanism\n", + "Couchbase can be effectively used as a caching mechanism for RAG (Retrieval-Augmented Generation) responses by storing and retrieving precomputed results for specific queries. This approach enhances the system's efficiency and speed, particularly when dealing with repeated or similar queries. When a query is first processed, the RAG chain retrieves relevant documents, generates a response using the language model, and then stores this response in Couchbase, with the query serving as the key.\n", + "\n", + "For subsequent requests with the same query, the system checks Couchbase first. If a cached response is found, it is retrieved directly from Couchbase, bypassing the need to re-run the entire RAG process. This significantly reduces response time because the computationally expensive steps of document retrieval and response generation are skipped.
Couchbase's role in this setup is to provide a fast and scalable storage solution for caching these responses, ensuring that frequently asked queries can be answered more quickly and efficiently." + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": { + "id": "c10Qzeq2Q8N7" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Query 1: What happened in the match between Fulham and Liverpool?\n", + "Response: Fulham and Liverpool played to a 2-2 draw at Anfield, with both teams showcasing strong performances.\n", + "Time taken: 5.13 seconds\n", + "\n", + "Query 2: What was manchester city manager pep guardiola's reaction to the team's current form?\n", + "Response: Pep Guardiola has been grappling with self-doubt and seeking support to navigate Manchester City's current crisis.\n", + "Time taken: 2.16 seconds\n", + "\n", + "Query 3: What happened in the match between Fulham and Liverpool?\n", + "Response: Fulham and Liverpool played to a 2-2 draw at Anfield, with both teams showcasing strong performances.\n", + "Time taken: 1.95 seconds\n" + ] + } + ], + "source": [ + "try:\n", + " queries = [\n", + " \"What happened in the match between Fulham and Liverpool?\",\n", + " \"What was manchester city manager pep guardiola's reaction to the team's current form?\", # Repeated query\n", + " \"What happened in the match between Fulham and Liverpool?\", # Repeated query\n", + " ]\n", + "\n", + " for i, query in enumerate(queries, 1):\n", + " print(f\"\\nQuery {i}: {query}\")\n", + " start_time = time.time()\n", + " response = rag_chain.invoke(query)\n", + " elapsed_time = time.time() - start_time\n", + " print(f\"Response: {response}\")\n", + " \n", + " print(f\"Time taken: {elapsed_time:.2f} seconds\")\n", + "except Exception as e:\n", + " if \"Payment Required\" in str(e):\n", + " logging.error(\"Payment required for Jina AI API. Please check your subscription status and API key.\")\n", + " print(\"To resolve this error:\")\n", + " print(\"1. Visit 'https://jina.ai/reader/#pricing' to review subscription options\")\n", + " print(\"2. Ensure your API key is valid and has sufficient credits\")\n", + " print(\"3. Consider upgrading your subscription plan if needed\")\n", + " else:\n", + " raise RuntimeError(f\"Unexpected error: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yJQ5P8E29go1" + }, + "source": [ + "## Conclusion\n", + "By following these steps, you\u2019ll have a fully functional semantic search engine that leverages the strengths of Couchbase and Jina. This guide is designed not just to show you how to build the system, but also to explain why each step is necessary, giving you a deeper understanding of the principles behind semantic search and how to implement it effectively. Whether you\u2019re a newcomer to software development or an experienced developer looking to expand your skills, this guide will provide you with the knowledge and tools you need to create a powerful, AI-driven search engine.\n",
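+ "\n", + "As a recap of the caching mechanism used in this tutorial, the response cache can be wired up in a few lines. The following is a minimal sketch, assuming the `CouchbaseCache` class from `langchain_couchbase` and the connection objects configured earlier in this notebook; `CACHE_COLLECTION` is a hypothetical variable holding the name of the collection reserved for cached responses:\n", + "\n", + "```python\n", + "from langchain_core.globals import set_llm_cache\n", + "from langchain_couchbase.cache import CouchbaseCache\n", + "\n", + "# Register Couchbase as the LLM cache: repeated prompts are answered\n", + "# from the cached response instead of re-running the generation step.\n", + "set_llm_cache(\n", + "    CouchbaseCache(\n", + "        cluster=cluster,\n", + "        bucket_name=CB_BUCKET_NAME,\n", + "        scope_name=SCOPE_NAME,\n", + "        collection_name=CACHE_COLLECTION,  # hypothetical cache collection name\n", + "    )\n", + ")\n", + "```\n", + "\n", + "With a cache like this registered, repeated prompts are served from Couchbase, which is why the repeated queries above return noticeably faster than their first runs."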
+ ] + } + ], + "metadata": { + "colab": { + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.7" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/jinaai/fts/frontmatter.md b/jinaai/search_based/frontmatter.md similarity index 100% rename from jinaai/fts/frontmatter.md rename to jinaai/search_based/frontmatter.md diff --git a/jinaai/fts/jina_index.json b/jinaai/search_based/jina_index.json similarity index 100% rename from jinaai/fts/jina_index.json rename to jinaai/search_based/jina_index.json diff --git a/lamaindex/gsi/RAG_with_Couchbase_Capella_and_OpenAI.ipynb b/lamaindex/query_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb similarity index 98% rename from lamaindex/gsi/RAG_with_Couchbase_Capella_and_OpenAI.ipynb rename to lamaindex/query_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb index 586042e3..eda8f129 100644 --- a/lamaindex/gsi/RAG_with_Couchbase_Capella_and_OpenAI.ipynb +++ b/lamaindex/query_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb @@ -16,7 +16,7 @@ "\n", "We leverage Couchbase's Global Secondary Index (GSI) vector search capabilities to create and manage vector indexes, enabling efficient semantic search capabilities. GSI provides high-performance vector search with support for both Hyperscale Vector Indexes (BHIVE) and Composite Vector Indexes, designed to scale to billions of vectors with low memory footprint and optimized concurrent operations.\n", "\n", - "Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial will equip you with the knowledge to create a fully functional RAG system using OpenAI Services and LlamaIndex with Couchbase's advanced GSI vector search." + "Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial will equip you with the knowledge to create a fully functional RAG system using OpenAI Services and LlamaIndex with Couchbase's advanced GSI vector search. For guidance on choosing the right vector index for your use case, see the [Couchbase documentation](https://docs.couchbase.com/server/current/vector-search/choose-the-right-vector-index.html)." ] }, { @@ -262,7 +262,7 @@ "metadata": {}, "source": [ "# Setting Up GSI Vector Search\n", - "In this section, we'll set up the Couchbase vector store using GSI (Global Secondary Index) for high-performance vector search. Unlike FTS-based vector search, GSI vector search provides optimized performance for pure vector similarity operations and can scale to billions of vectors with low memory footprint.\n", + "In this section, we'll set up the Couchbase vector store using Couchbase Hyperscale and Composite Vector Indexes for high-performance vector search. 
Unlike FTS-based vector search, GSI vector search provides optimized performance for pure vector similarity operations and can scale to billions of vectors with low memory footprint.\n", "\n", "GSI vector search supports two main index types:\n", "- **Hyperscale Vector Indexes (BHIVE)**: Best for pure vector searches with high performance and concurrent operations\n", @@ -609,7 +609,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Optimizing Vector Search with Global Secondary Index (GSI)\n", + "# Optimizing Vector Search with Hyperscale and Composite Vector Indexes\n", "\n", "While the above RAG system works effectively, we can significantly improve query performance by leveraging Couchbase's advanced GSI vector search capabilities.\n", "\n", @@ -659,7 +659,7 @@ "\n", "For detailed configuration options, see the [Quantization & Centroid Settings](https://docs.couchbase.com/server/current/vector-index/hyperscale-vector-index.html#algo_settings).\n", "\n", - "In the code below, we demonstrate creating a BHIVE index for optimal performance. This method takes an index type (BHIVE or COMPOSITE) and description parameter for optimization settings. Alternatively, GSI indexes can be created manually from the Couchbase UI. " + "In the code below, we demonstrate creating a BHIVE index for optimal performance. This method takes an index type (BHIVE or COMPOSITE) and description parameter for optimization settings. Alternatively, Hyperscale and Composite Vector indexes can be created manually from the Couchbase UI. " ] }, { @@ -762,4 +762,4 @@ }, "nbformat": 4, "nbformat_minor": 2 -} +} \ No newline at end of file diff --git a/lamaindex/gsi/frontmatter.md b/lamaindex/query_based/frontmatter.md similarity index 100% rename from lamaindex/gsi/frontmatter.md rename to lamaindex/query_based/frontmatter.md diff --git a/lamaindex/fts/RAG_with_Couchbase_Capella_and_OpenAI.ipynb b/lamaindex/search_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb similarity index 99% rename from lamaindex/fts/RAG_with_Couchbase_Capella_and_OpenAI.ipynb rename to lamaindex/search_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb index 800f4f78..719e2cd5 100644 --- a/lamaindex/fts/RAG_with_Couchbase_Capella_and_OpenAI.ipynb +++ b/lamaindex/search_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb @@ -16,7 +16,7 @@ "\n", "We leverage Couchbase's Full Text Search (FTS) service to create and manage vector indexes, enabling efficient semantic search capabilities. FTS provides the infrastructure for storing, indexing, and querying high-dimensional vector embeddings alongside traditional text search functionality.\n", "\n", - "Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial will equip you with the knowledge to create a fully functional RAG system using OpenAI Services and LlamaIndex." + "Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial will equip you with the knowledge to create a fully functional RAG system using OpenAI Services and LlamaIndex. For guidance on choosing the right vector index for your use case, see the [Couchbase documentation](https://docs.couchbase.com/server/current/vector-search/choose-the-right-vector-index.html)." 
] }, { @@ -667,4 +667,4 @@ }, "nbformat": 4, "nbformat_minor": 2 -} +} \ No newline at end of file diff --git a/lamaindex/fts/frontmatter.md b/lamaindex/search_based/frontmatter.md similarity index 100% rename from lamaindex/fts/frontmatter.md rename to lamaindex/search_based/frontmatter.md diff --git a/lamaindex/fts/fts_index.json b/lamaindex/search_based/search_vector_index.json similarity index 100% rename from lamaindex/fts/fts_index.json rename to lamaindex/search_based/search_vector_index.json diff --git a/mistralai/gsi/mistralai.ipynb b/mistralai/gsi/mistralai.ipynb deleted file mode 100644 index a5b36e2a..00000000 --- a/mistralai/gsi/mistralai.ipynb +++ /dev/null @@ -1,800 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Introduction\n", - "\n", - "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database, [Mistral AI](https://mistral.ai/) as the AI-powered embedding Model. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system from scratch. Alternatively, if you want to perform semantic search using the FTS, please take a look at [this.](https://developer.couchbase.com/tutorial-mistralai-couchbase-vector-search-with-fts)\n", - "\n", - "Couchbase is a NoSQL distributed document database (JSON) with many of the best features of a relational DBMS: SQL, distributed ACID transactions, and much more. [Couchbase Capella™](https://cloud.couchbase.com/sign-up) is the easiest way to get started, but you can also download and run [Couchbase Server](http://couchbase.com/downloads) on-premises.\n", - "\n", - "Mistral AI is a research lab building the best open source models in the world. La Plateforme enables developers and enterprises to build new products and applications, powered by Mistral's open source and commercial LLMs. \n", - "\n", - "The [Mistral AI APIs](https://console.mistral.ai/) empower LLM applications via:\n", - "\n", - "- [Text generation](https://docs.mistral.ai/capabilities/completion/), enables streaming and provides the ability to display partial model results in real-time\n", - "- [Code generation](https://docs.mistral.ai/capabilities/code_generation/), empowers code generation tasks, including fill-in-the-middle and code completion\n", - "- [Embeddings](https://docs.mistral.ai/capabilities/embeddings/), useful for RAG where it represents the meaning of text as a list of numbers\n", - "- [Function calling](https://docs.mistral.ai/capabilities/function_calling/), enables Mistral models to connect to external tools\n", - "- [Fine-tuning](https://docs.mistral.ai/capabilities/finetuning/), enables developers to create customized and specialized models\n", - "- [JSON mode](https://docs.mistral.ai/capabilities/json_mode/), enables developers to set the response format to json_object\n", - "- [Guardrailing](https://docs.mistral.ai/capabilities/guardrailing/), enables developers to enforce policies at the system level of Mistral models\n", - "\n", - "This tutorial demonstrates how to use Mistral AI's embedding capabilities with Couchbase's **Global Secondary Index (GSI)** for optimized vector search operations. 
GSI provides superior performance for vector operations compared to traditional search methods, especially for large-scale applications.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Before you start\n", - "\n", - "## Get Credentials for Mistral AI\n", - "\n", - "Please follow the [instructions](https://console.mistral.ai/api-keys/) to generate the Mistral AI credentials.\n", - "\n", - "## Create and Deploy Your Free Tier Operational cluster on Capella\n", - "\n", - "To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. This account provides you with a environment where you can explore and learn about Capella with no time constraint.\n", - "\n", - "To know more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n", - "\n", - "**Note: To run this tutorial, you will need Capella with Couchbase Server version 8.0 or above as GSI vector search is supported only from version 8.0.**\n", - "\n", - "### Couchbase Capella Configuration\n", - "\n", - "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met.\n", - "\n", - "* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the travel-sample bucket (Read and Write) used in the application.\n", - "* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Install necessary libraries\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%pip install couchbase==4.4.0 mistralai==1.9.10 langchain-couchbase==0.5.0 langchain-core==0.3.76 python-dotenv==1.1.1\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Imports\n" - ] - }, - { - "cell_type": "code", - "execution_count": 27, - "metadata": {}, - "outputs": [], - "source": [ - "from datetime import timedelta\n", - "from mistralai import Mistral\n", - "from couchbase.auth import PasswordAuthenticator\n", - "from couchbase.cluster import Cluster\n", - "from couchbase.options import ClusterOptions\n", - "from langchain_couchbase.vectorstores import CouchbaseQueryVectorStore\n", - "from langchain_couchbase.vectorstores import DistanceStrategy, IndexType\n", - "from langchain_core.embeddings import Embeddings\n", - "from typing import List\n", - "from dotenv import load_dotenv\n", - "import os\n", - "import time" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Prerequisites\n" - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "metadata": {}, - "outputs": [], - "source": [ - "import getpass\n", - "\n", - "# Load environment variables from .env file if it exists\n", - "load_dotenv()\n", - "\n", - "# Load from environment variables or prompt for input\n", - "CB_HOST = os.getenv('CB_HOST') or input(\"Cluster URL:\")\n", - "CB_USERNAME = os.getenv('CB_USERNAME') or input(\"Couchbase username:\")\n", - "CB_PASSWORD = os.getenv('CB_PASSWORD') or getpass.getpass(\"Couchbase password:\")\n", - "CB_BUCKET_NAME = os.getenv('CB_BUCKET_NAME') or input(\"Couchbase bucket:\")\n", - "SCOPE_NAME = os.getenv('SCOPE_NAME') or input(\"Couchbase scope:\")\n", - "COLLECTION_NAME = os.getenv('COLLECTION_NAME') or input(\"Couchbase collection:\")\n" - ] - }, - { - 
"cell_type": "markdown", - "metadata": {}, - "source": [ - "# Couchbase Connection\n" - ] - }, - { - "cell_type": "code", - "execution_count": 29, - "metadata": {}, - "outputs": [], - "source": [ - "auth = PasswordAuthenticator(\n", - " CB_USERNAME,\n", - " CB_PASSWORD\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": 30, - "metadata": {}, - "outputs": [], - "source": [ - "cluster = Cluster(CB_HOST, ClusterOptions(auth))\n", - "cluster.wait_until_ready(timedelta(seconds=5))\n", - "\n", - "bucket = cluster.bucket(CB_BUCKET_NAME)\n", - "scope = bucket.scope(SCOPE_NAME)\n", - "collection = scope.collection(COLLECTION_NAME)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Setting Up Collections in Couchbase\n", - "\n", - "The setup_collection() function handles creating and configuring the hierarchical data organization in Couchbase:\n", - "\n", - "1. Bucket Creation:\n", - " - Checks if specified bucket exists, creates it if not\n", - " - Sets bucket properties like RAM quota (1024MB) and replication (disabled)\n", - " - Note: You will not be able to create a bucket on Capella\n", - "\n", - "2. Scope Management: \n", - " - Verifies if requested scope exists within bucket\n", - " - Creates new scope if needed (unless it's the default \"_default\" scope)\n", - "\n", - "3. Collection Setup:\n", - " - Checks for collection existence within scope\n", - " - Creates collection if it doesn't exist\n", - " - Waits 2 seconds for collection to be ready\n", - "\n", - "Additional Tasks:\n", - "- Clears any existing documents for clean state\n", - "- Implements comprehensive error handling and logging\n" - ] - }, - { - "cell_type": "code", - "execution_count": 31, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 31, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "def setup_collection(cluster, bucket_name, scope_name, collection_name):\n", - " try:\n", - " # Check if bucket exists, create if it doesn't\n", - " try:\n", - " bucket = cluster.bucket(bucket_name)\n", - " except Exception as e:\n", - " bucket_settings = CreateBucketSettings(\n", - " name=bucket_name,\n", - " bucket_type='couchbase',\n", - " ram_quota_mb=1024,\n", - " flush_enabled=True,\n", - " num_replicas=0\n", - " )\n", - " cluster.buckets().create_bucket(bucket_settings)\n", - " time.sleep(2) # Wait for bucket creation to complete and become available\n", - " bucket = cluster.bucket(bucket_name)\n", - "\n", - " bucket_manager = bucket.collections()\n", - "\n", - " # Check if scope exists, create if it doesn't\n", - " scopes = bucket_manager.get_all_scopes()\n", - " scope_exists = any(scope.name == scope_name for scope in scopes)\n", - " \n", - " if not scope_exists and scope_name != \"_default\":\n", - " bucket_manager.create_scope(scope_name)\n", - "\n", - " # Check if collection exists, create if it doesn't\n", - " collections = bucket_manager.get_all_scopes()\n", - " collection_exists = any(\n", - " scope.name == scope_name and collection_name in [col.name for col in scope.collections]\n", - " for scope in collections\n", - " )\n", - "\n", - " if not collection_exists:\n", - " bucket_manager.create_collection(scope_name, collection_name)\n", - "\n", - " # Wait for collection to be ready\n", - " collection = bucket.scope(scope_name).collection(collection_name)\n", - " time.sleep(2) # Give the collection time to be ready for queries\n", - "\n", - " # Clear all documents in the collection\n", - " try:\n", - " 
query = f\"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`\"\n", - " cluster.query(query).execute()\n", - " except Exception as e:\n", - " print(f\"Error while clearing documents: {str(e)}. The collection might be empty.\")\n", - "\n", - " return collection\n", - " except Exception as e:\n", - " raise RuntimeError(f\"Error setting up collection: {str(e)}\")\n", - " \n", - "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Creating Mistral AI Embeddings Wrapper\n", - "\n", - "Since Mistral AI doesn't have native LangChain integration, we need to create a custom wrapper class that implements the LangChain Embeddings interface. This will allow us to use Mistral AI's embedding model with Couchbase's GSI vector store.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 32, - "metadata": {}, - "outputs": [], - "source": [ - "class MistralAIEmbeddings(Embeddings):\n", - " \"\"\"Custom Mistral AI Embeddings wrapper for LangChain compatibility.\"\"\"\n", - " \n", - " def __init__(self, api_key: str, model: str = \"mistral-embed\"):\n", - " self.client = Mistral(api_key=api_key)\n", - " self.model = model\n", - " \n", - " def embed_documents(self, texts: List[str]) -> List[List[float]]:\n", - " \"\"\"Embed search docs.\"\"\"\n", - " try:\n", - " response = self.client.embeddings.create(\n", - " model=self.model,\n", - " inputs=texts,\n", - " )\n", - " return [embedding.embedding for embedding in response.data]\n", - " except Exception as e:\n", - " raise ValueError(f\"Error generating embeddings: {str(e)}\")\n", - " \n", - " def embed_query(self, text: str) -> List[float]:\n", - " \"\"\"Embed query text.\"\"\"\n", - " try:\n", - " response = self.client.embeddings.create(\n", - " model=self.model,\n", - " inputs=[text],\n", - " )\n", - " return response.data[0].embedding\n", - " except Exception as e:\n", - " raise ValueError(f\"Error generating query embedding: {str(e)}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Mistral Connection\n" - ] - }, - { - "cell_type": "code", - "execution_count": 33, - "metadata": {}, - "outputs": [], - "source": [ - "MISTRAL_API_KEY = os.getenv('MISTRAL_API_KEY') or getpass.getpass(\"Mistral API Key:\")\n", - "embeddings = MistralAIEmbeddings(api_key=MISTRAL_API_KEY, model=\"mistral-embed\")\n", - "mistral_client = Mistral(api_key=MISTRAL_API_KEY)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Understanding GSI Vector Search\n", - "\n", - "### Optimizing Vector Search with Global Secondary Index (GSI)\n", - "\n", - "With Couchbase 8.0+, you can leverage the power of GSI-based vector search, which offers significant performance improvements over traditional Full-Text Search (FTS) approaches for vector-first workloads. 
GSI vector search provides high-performance vector similarity search with advanced filtering capabilities and is designed to scale to billions of vectors.\n", - "\n", - "#### GSI vs FTS: Choosing the Right Approach\n", - "\n", - "| Feature | GSI Vector Search | FTS Vector Search |\n", - "| --------------------- | --------------------------------------------------------------- | ----------------------------------------- |\n", - "| **Best For** | Vector-first workloads, complex filtering, high QPS performance| Hybrid search and high recall rates |\n", - "| **Couchbase Version** | 8.0.0+ | 7.6+ |\n", - "| **Filtering** | Pre-filtering with `WHERE` clauses (Composite) or post-filtering (BHIVE) | Pre-filtering with flexible ordering |\n", - "| **Scalability** | Up to billions of vectors (BHIVE) | Up to 10 million vectors |\n", - "| **Performance** | Optimized for concurrent operations with low memory footprint | Good for mixed text and vector queries |\n", - "\n", - "\n", - "#### GSI Vector Index Types\n", - "\n", - "Couchbase offers two distinct GSI vector index types, each optimized for different use cases:\n", - "\n", - "##### Hyperscale Vector Indexes (BHIVE)\n", - "\n", - "- **Best for**: Pure vector searches like content discovery, recommendations, and semantic search\n", - "- **Use when**: You primarily perform vector-only queries without complex scalar filtering\n", - "- **Features**: \n", - " - High performance with low memory footprint\n", - " - Optimized for concurrent operations\n", - " - Designed to scale to billions of vectors\n", - " - Supports post-scan filtering for basic metadata filtering\n", - "\n", - "##### Composite Vector Indexes\n", - "\n", - " - **Best for**: Filtered vector searches that combine vector similarity with scalar value filtering\n", - "- **Use when**: Your queries combine vector similarity with scalar filters that eliminate large portions of data\n", - "- **Features**: \n", - " - Efficient pre-filtering where scalar attributes reduce the vector comparison scope\n", - " - Best for well-defined workloads requiring complex filtering using GSI features\n", - " - Supports range lookups combined with vector search\n", - "\n", - "#### Index Type Selection for This Tutorial\n", - "\n", - "In this tutorial, we'll demonstrate creating a **BHIVE index** and running vector similarity queries using GSI. BHIVE is ideal for semantic search scenarios where you want:\n", - "\n", - "1. **High-performance vector search** across large datasets\n", - "2. **Low latency** for real-time applications\n", - "3. **Scalability** to handle growing vector collections\n", - "4. 
**Concurrent operations** for multi-user environments\n", - "\n", - "The BHIVE index will provide optimal performance for our OpenAI embedding-based semantic search implementation.\n", - "\n", - "#### Alternative: Composite Vector Index\n", - "\n", - "If your use case requires complex filtering with scalar attributes, you may want to consider using a **Composite Vector Index** instead:\n", - "\n", - "```python\n", - "# Alternative: Create a Composite index for filtered searches\n", - "vector_store.create_index(\n", - " index_type=IndexType.COMPOSITE,\n", - " index_description=\"IVF,SQ8\",\n", - " distance_metric=DistanceStrategy.COSINE,\n", - " index_name=\"pydantic_composite_index\",\n", - ")\n", - "```\n", - "\n", - "**Use Composite indexes when:**\n", - "- You need to filter by document metadata or attributes before vector similarity\n", - "- Your queries combine vector search with WHERE clauses\n", - "- You have well-defined filtering requirements that can reduce the search space\n", - "\n", - "**Note**: Composite indexes enable pre-filtering with scalar attributes, making them ideal for applications where you need to search within specific categories, date ranges, or user-specific data segments.\n", - "\n", - "#### Understanding GSI Index Configuration (Couchbase 8.0 Feature)\n", - "\n", - "Before creating our BHIVE index, it's important to understand the configuration parameters that optimize vector storage and search performance. The `index_description` parameter controls how Couchbase optimizes vector storage through centroids and quantization.\n", - "\n", - "##### Index Description Format: `'IVF[],{PQ|SQ}'`\n", - "\n", - "##### Centroids (IVF - Inverted File)\n", - "\n", - "- Controls how the dataset is subdivided for faster searches\n", - "- **More centroids** = faster search, slower training time\n", - "- **Fewer centroids** = slower search, faster training time\n", - "- If omitted (like `IVF,SQ8`), Couchbase auto-selects based on dataset size\n", - "\n", - "###### Quantization Options\n", - "\n", - "**Scalar Quantization (SQ):**\n", - "- `SQ4`, `SQ6`, `SQ8` (4, 6, or 8 bits per dimension)\n", - "- Lower memory usage, faster search, slightly reduced accuracy\n", - "\n", - "**Product Quantization (PQ):**\n", - "- Format: `PQx` (e.g., `PQ32x8`)\n", - "- Better compression for very large datasets\n", - "- More complex but can maintain accuracy with smaller index size\n", - "\n", - "##### Common Configuration Examples\n", - "\n", - "- **`IVF,SQ8`** - Auto centroids, 8-bit scalar quantization (good default)\n", - "- **`IVF1000,SQ6`** - 1000 centroids, 6-bit scalar quantization\n", - "- **`IVF,PQ32x8`** - Auto centroids, 32 subquantizers with 8 bits\n", - "\n", - "For detailed configuration options, see the [Quantization & Centroid Settings](https://docs.couchbase.com/cloud/vector-index/hyperscale-vector-index.html#algo_settings).\n", - "\n", - "For more information on GSI vector indexes, see [Couchbase GSI Vector Documentation](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html).\n", - "\n", - "##### Our Configuration Choice\n", - "\n", - "In this tutorial, we use `IVF,SQ8` which provides:\n", - "- **Auto-selected centroids** optimized for our dataset size\n", - "- **8-bit scalar quantization** for good balance of speed, memory usage, and accuracy\n", - "- **COSINE distance metric** ideal for semantic similarity search\n", - "- **Optimal performance** for most semantic search use cases" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 
Setting Up Couchbase GSI Vector Store\n", - "\n", - "Instead of using FTS (Full-Text Search), we'll use Couchbase's GSI (Global Secondary Index) for vector operations. GSI provides better performance for vector search operations and supports advanced index types like BHIVE and COMPOSITE indexes.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 34, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "GSI Vector Store created successfully!\n" - ] - } - ], - "source": [ - "vector_store = CouchbaseQueryVectorStore(\n", - " cluster=cluster,\n", - " bucket_name=CB_BUCKET_NAME,\n", - " scope_name=SCOPE_NAME,\n", - " collection_name=COLLECTION_NAME,\n", - " embedding=embeddings,\n", - " distance_metric=DistanceStrategy.COSINE\n", - ")\n", - "\n", - "print(\"GSI Vector Store created successfully!\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Embedding Documents\n", - "\n", - "Mistral client can be used to generate vector embeddings for given text fragments. These embeddings represent the sentiment of corresponding fragments and can be stored in Couchbase for further retrieval. A custom embedding text can also be added into the embedding texts array by running this code block:\n" - ] - }, - { - "cell_type": "code", - "execution_count": 35, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-11-07 15:50:09,439 - INFO - HTTP Request: POST https://api.mistral.ai/v1/embeddings \"HTTP/1.1 200 OK\"\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Documents added to GSI vector store successfully!\n" - ] - } - ], - "source": [ - "texts = [\n", - " \"Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON's versatility, with a foundation that is extremely fast and scalable.\",\n", - " \"It's used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.\",\n", - " input(\"custom embedding text\")\n", - "]\n", - "\n", - "# Store documents in the GSI vector store\n", - "vector_store.add_texts(texts)\n", - "\n", - "print(\"Documents added to GSI vector store successfully!\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Understanding Semantic Search in Couchbase\n", - "\n", - "Semantic search goes beyond traditional keyword matching by understanding the meaning and context behind queries. Here's how it works in Couchbase:\n", - "\n", - "## How Semantic Search Works\n", - "\n", - "1. **Vector Embeddings**: Documents and queries are converted into high-dimensional vectors using an embeddings model (in our case, Mistral AI's mistral-embed)\n", - "\n", - "2. **Similarity Calculation**: When a query is made, Couchbase compares the query vector against stored document vectors using the COSINE distance metric\n", - "\n", - "3. **Result Ranking**: Documents are ranked by their vector distance (lower distance = more similar meaning)\n", - "\n", - "4. **Flexible Configuration**: Different distance metrics (cosine, euclidean, dot product) and embedding models can be used based on your needs\n", - "\n", - "The `similarity_search_with_score` method performs this entire process, returning documents along with their similarity scores. 
This enables you to find semantically related content even when exact keywords don't match.\n", - "\n", - "Now let's see semantic search in action and measure its performance with different optimization strategies.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Vector Search Performance Optimization\n", - "\n", - "Now let's measure and compare the performance benefits of different optimization strategies. We'll conduct a comprehensive performance analysis across two phases:\n", - "\n", - "## Performance Testing Phases\n", - "\n", - "1. **Phase 1 - Baseline Performance**: Test vector search without GSI indexes to establish baseline metrics\n", - "\n", - "2. **Phase 2 - GSI-Optimized Search**: Create BHIVE index and measure performance improvements\n", - "\n", - "**Important Context:**\n", - "\n", - "- GSI performance benefits scale with dataset size and concurrent load\n", - "- With our dataset (~3 documents), improvements may be modest\n", - "- Production environments with millions of vectors show significant GSI advantages\n", - "- The combination of GSI + embeddings provides optimal semantic search performance\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Phase 1: Baseline Performance (Without GSI Index)\n", - "\n", - "First, let's test the search performance without any GSI indexes. This will help us establish a baseline for comparison.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import logging\n", - "\n", - "# Configure logging\n", - "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')\n", - "\n", - "# Phase 1: Baseline Performance (Without GSI Index)\n", - "print(\"=\"*80)\n", - "print(\"PHASE 1: BASELINE PERFORMANCE (NO GSI INDEX)\")\n", - "print(\"=\"*80)\n", - "\n", - "query = \"name a multipurpose database with distributed capability\"\n", - "\n", - "try:\n", - " # Perform the semantic search\n", - " start_time = time.time()\n", - " search_results = vector_store.similarity_search_with_score(query, k=3)\n", - " baseline_time = time.time() - start_time\n", - "\n", - " logging.info(f\"Baseline search completed in {baseline_time:.2f} seconds\")\n", - "\n", - " # Display search results\n", - " print(f\"\\nBaseline Search Results (completed in {baseline_time:.4f} seconds):\")\n", - " print(\"-\" * 80)\n", - " for i, (doc, distance) in enumerate(search_results, 1):\n", - " print(f\"[Result {i}] Vector Distance: {distance:.4f}\")\n", - " # Truncate for readability\n", - " content_preview = doc.page_content[:150] + \"...\" if len(doc.page_content) > 150 else doc.page_content\n", - " print(f\"Text: {content_preview}\")\n", - " print(\"-\" * 80)\n", - "\n", - "except Exception as e:\n", - " raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Phase 2: GSI-Optimized Performance (With BHIVE Index)\n", - "\n", - "Now let's create a BHIVE index and measure the performance improvements when searching with GSI optimization.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Create a BHIVE index for optimal vector search performance\n", - "print(\"\\nCreating BHIVE index for GSI optimization...\")\n", - "vector_store.create_index(\n", - " index_type=IndexType.BHIVE, \n", - " index_name=\"mistral_bhive_index_optimized\",\n", - " 
index_description=\"IVF,SQ8\"\n", - ")\n", - "print(\"BHIVE index created successfully!\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Note: To create a COMPOSITE index, the below code can be used.\n", - "Choose based on your specific use case and query patterns. For this tutorial's news search scenario, either index type would work, but BHIVE might be more efficient for pure semantic search across news articles.\n", - "\n", - "vector_store.create_index(index_type=IndexType.COMPOSITE, index_name=\"pydantic_ai_composite_index\", index_description=\"IVF,SQ8\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Phase 2: GSI-Optimized Performance (With BHIVE Index)\n", - "print(\"\\n\" + \"=\"*80)\n", - "print(\"PHASE 2: GSI-OPTIMIZED PERFORMANCE (WITH BHIVE INDEX)\")\n", - "print(\"=\"*80)\n", - "\n", - "query = \"name a multipurpose database with distributed capability\"\n", - "\n", - "try:\n", - " # Perform the semantic search with GSI\n", - " start_time = time.time()\n", - " search_results = vector_store.similarity_search_with_score(query, k=3)\n", - " gsi_time = time.time() - start_time\n", - "\n", - " logging.info(f\"GSI-optimized search completed in {gsi_time:.2f} seconds\")\n", - "\n", - " # Display search results\n", - " print(f\"\\nGSI-Optimized Search Results (completed in {gsi_time:.4f} seconds):\")\n", - " print(\"-\" * 80)\n", - " for i, (doc, distance) in enumerate(search_results, 1):\n", - " print(f\"[Result {i}] Vector Distance: {distance:.4f}\")\n", - " # Truncate for readability\n", - " content_preview = doc.page_content[:150] + \"...\" if len(doc.page_content) > 150 else doc.page_content\n", - " print(f\"Text: {content_preview}\")\n", - " print(\"-\" * 80)\n", - "\n", - "except Exception as e:\n", - " raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Performance Summary\n", - "\n", - "Let's analyze the performance improvements achieved through GSI optimization.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(\"\\n\" + \"=\"*80)\n", - "print(\"VECTOR SEARCH PERFORMANCE OPTIMIZATION SUMMARY\")\n", - "print(\"=\"*80)\n", - "\n", - "print(f\"\\n📊 Performance Comparison:\")\n", - "print(f\"{'Optimization Level':<35} {'Time (seconds)':<20} {'Status'}\")\n", - "print(\"-\" * 80)\n", - "print(f\"{'Phase 1 - Baseline (No Index)':<35} {baseline_time:.4f}{'':16} ⚪ Baseline\")\n", - "print(f\"{'Phase 2 - GSI-Optimized (BHIVE)':<35} {gsi_time:.4f}{'':16} ✅ Optimized\")\n", - "\n", - "# Calculate improvement\n", - "if baseline_time > gsi_time:\n", - " speedup = baseline_time / gsi_time\n", - " improvement = ((baseline_time - gsi_time) / baseline_time) * 100\n", - " print(f\"\\n✨ GSI Performance Gain: {speedup:.2f}x faster ({improvement:.1f}% improvement)\")\n", - "elif gsi_time > baseline_time:\n", - " slowdown_pct = ((gsi_time - baseline_time) / baseline_time) * 100\n", - " print(f\"\\n⚠️ Note: GSI was {slowdown_pct:.1f}% slower than baseline in this run\")\n", - " print(f\" This can happen with small datasets. GSI benefits emerge with scale.\")\n", - "else:\n", - " print(f\"\\n⚖️ Performance: Comparable to baseline\")\n", - "\n", - "print(\"\\n\" + \"-\"*80)\n", - "print(\"KEY INSIGHTS:\")\n", - "print(\"-\"*80)\n", - "print(\"1. 
🚀 GSI Optimization:\")\n", - "print(\" • BHIVE indexes excel with large-scale datasets (millions+ vectors)\")\n", - "print(\" • Performance gains increase with dataset size and concurrent queries\")\n", - "print(\" • Optimal for production workloads with sustained traffic patterns\")\n", - "\n", - "print(\"\\n2. 📦 Dataset Size Impact:\")\n", - "print(f\" • Current dataset: ~3 sample documents\")\n", - "print(\" • At this scale, performance differences may be minimal or variable\")\n", - "print(\" • Significant gains typically seen with 10M+ vectors\")\n", - "\n", - "print(\"\\n3. 🎯 When to Use GSI:\")\n", - "print(\" • Large-scale vector search applications\")\n", - "print(\" • High query-per-second (QPS) requirements\")\n", - "print(\" • Multi-user concurrent access scenarios\")\n", - "print(\" • Production environments requiring scalability\")\n", - "\n", - "print(\"\\n\" + \"=\"*80)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Conclusion\n", - "\n", - "This tutorial demonstrated how to use Mistral AI's embedding capabilities with Couchbase's GSI vector search, including comprehensive performance analysis. Key takeaways include:\n", - "\n", - "## What We Covered\n", - "\n", - "1. **Semantic Search Fundamentals**: Understanding how vector embeddings enable meaning-based search\n", - "2. **Mistral AI Integration**: Creating a custom LangChain wrapper for Mistral AI's powerful mistral-embed model\n", - "3. **Performance Testing**: Conducting baseline vs GSI-optimized performance comparisons\n", - "4. **GSI Index Types**: Understanding BHIVE (pure vector search) and COMPOSITE (filtered searches) indexes\n", - "5. **Index Configuration**: Learning about centroids, quantization, and optimization settings\n", - "\n", - "## Key Benefits of This Approach\n", - "\n", - "1. **High-Performance Vector Search**: GSI provides optimized vector operations with low latency\n", - "2. **Scalability**: BHIVE indexes designed to handle billions of vectors efficiently\n", - "3. **Production-Ready**: Optimal for applications requiring high QPS and concurrent access\n", - "4. **Flexible Configuration**: Customizable index settings for different use cases\n", - "5. 
**Advanced Filtering**: COMPOSITE indexes enable complex scalar + vector queries\n", - "\n", - "## Performance Insights\n", - "\n", - "- GSI benefits scale with dataset size and query load\n", - "- Small datasets may show modest improvements\n", - "- Production environments with millions of vectors see significant performance gains\n", - "- Consider your specific use case when choosing between BHIVE and COMPOSITE indexes\n", - "\n", - "## Next Steps\n", - "\n", - "- Scale your dataset to explore GSI performance at higher volumes\n", - "- Experiment with different index configurations (IVF centroids, quantization settings)\n", - "- Try COMPOSITE indexes for filtered search scenarios\n", - "- Integrate this solution into your production RAG or semantic search applications\n", - "\n", - "The combination of Mistral AI's embeddings and Couchbase's GSI vector search provides a powerful, scalable foundation for building intelligent search applications.\n" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": ".venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.13.3" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/mistralai/fts/.env.sample b/mistralai/query_based/.env.sample similarity index 100% rename from mistralai/fts/.env.sample rename to mistralai/query_based/.env.sample diff --git a/mistralai/gsi/frontmatter.md b/mistralai/query_based/frontmatter.md similarity index 100% rename from mistralai/gsi/frontmatter.md rename to mistralai/query_based/frontmatter.md diff --git a/mistralai/query_based/mistralai.ipynb b/mistralai/query_based/mistralai.ipynb new file mode 100644 index 00000000..e5396bf6 --- /dev/null +++ b/mistralai/query_based/mistralai.ipynb @@ -0,0 +1,800 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Introduction\n", + "\n", + "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database and [Mistral AI](https://mistral.ai/) as the AI-powered embedding model. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system from scratch. For guidance on choosing the right vector index for your use case, see the [Couchbase documentation](https://docs.couchbase.com/server/current/vector-search/choose-the-right-vector-index.html). Alternatively, if you want to perform semantic search using a Search Vector Index, please take a look at [this.](https://developer.couchbase.com/tutorial-mistralai-couchbase-vector-search-with-search-vector-index)\n", + "\n", + "Couchbase is a NoSQL distributed document database (JSON) with many of the best features of a relational DBMS: SQL, distributed ACID transactions, and much more. [Couchbase Capella\u2122](https://cloud.couchbase.com/sign-up) is the easiest way to get started, but you can also download and run [Couchbase Server](http://couchbase.com/downloads) on-premises.\n", + "\n", + "Mistral AI is a research lab building the best open source models in the world. 
La Plateforme enables developers and enterprises to build new products and applications, powered by Mistral's open source and commercial LLMs. \n", + "\n", + "The [Mistral AI APIs](https://console.mistral.ai/) empower LLM applications via:\n", + "\n", + "- [Text generation](https://docs.mistral.ai/capabilities/completion/), enables streaming and provides the ability to display partial model results in real-time\n", + "- [Code generation](https://docs.mistral.ai/capabilities/code_generation/), empowers code generation tasks, including fill-in-the-middle and code completion\n", + "- [Embeddings](https://docs.mistral.ai/capabilities/embeddings/), useful for RAG where it represents the meaning of text as a list of numbers\n", + "- [Function calling](https://docs.mistral.ai/capabilities/function_calling/), enables Mistral models to connect to external tools\n", + "- [Fine-tuning](https://docs.mistral.ai/capabilities/finetuning/), enables developers to create customized and specialized models\n", + "- [JSON mode](https://docs.mistral.ai/capabilities/json_mode/), enables developers to set the response format to json_object\n", + "- [Guardrailing](https://docs.mistral.ai/capabilities/guardrailing/), enables developers to enforce policies at the system level of Mistral models\n", + "\n", + "This tutorial demonstrates how to use Mistral AI's embedding capabilities with Couchbase's **Global Secondary Index (GSI)** for optimized vector search operations. GSI provides superior performance for vector operations compared to traditional search methods, especially for large-scale applications.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Before you start\n", + "\n", + "## Get Credentials for Mistral AI\n", + "\n", + "Please follow the [instructions](https://console.mistral.ai/api-keys/) to generate the Mistral AI credentials.\n", + "\n", + "## Create and Deploy Your Free Tier Operational cluster on Capella\n", + "\n", + "To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. 
This account provides you with an environment where you can explore and learn about Capella with no time constraint.\n", + "\n", + "To learn more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n", + "\n", + "**Note: To run this tutorial, you will need Capella with Couchbase Server version 8.0 or above, as GSI vector search is supported only from version 8.0.**\n", + "\n", + "### Couchbase Capella Configuration\n", + "\n", + "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met.\n", + "\n", + "* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the bucket (Read and Write) used in the application.\n", + "* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Install necessary libraries\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install couchbase==4.4.0 mistralai==1.9.10 langchain-couchbase==0.5.0 langchain-core==0.3.76 python-dotenv==1.1.1\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Imports\n" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [], + "source": [ + "from datetime import timedelta\n", + "from mistralai import Mistral\n", + "from couchbase.auth import PasswordAuthenticator\n", + "from couchbase.cluster import Cluster\n", + "from couchbase.options import ClusterOptions\n", + "from couchbase.management.buckets import CreateBucketSettings\n", + "from langchain_couchbase.vectorstores import CouchbaseQueryVectorStore\n", + "from langchain_couchbase.vectorstores import DistanceStrategy, IndexType\n", + "from langchain_core.embeddings import Embeddings\n", + "from typing import List\n", + "from dotenv import load_dotenv\n", + "import os\n", + "import time" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Prerequisites\n" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [], + "source": [ + "import getpass\n", + "\n", + "# Load environment variables from .env file if it exists\n", + "load_dotenv()\n", + "\n", + "# Load from environment variables or prompt for input\n", + "CB_HOST = os.getenv('CB_HOST') or input(\"Cluster URL:\")\n", + "CB_USERNAME = os.getenv('CB_USERNAME') or input(\"Couchbase username:\")\n", + "CB_PASSWORD = os.getenv('CB_PASSWORD') or getpass.getpass(\"Couchbase password:\")\n", + "CB_BUCKET_NAME = os.getenv('CB_BUCKET_NAME') or input(\"Couchbase bucket:\")\n", + "SCOPE_NAME = os.getenv('SCOPE_NAME') or input(\"Couchbase scope:\")\n", + "COLLECTION_NAME = os.getenv('COLLECTION_NAME') or input(\"Couchbase collection:\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Couchbase Connection\n" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [], + "source": [ + "auth = PasswordAuthenticator(\n", + " CB_USERNAME,\n", + " CB_PASSWORD\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [], + "source": [ + "cluster = Cluster(CB_HOST, ClusterOptions(auth))\n", + "cluster.wait_until_ready(timedelta(seconds=5))\n", + "\n", + "bucket = cluster.bucket(CB_BUCKET_NAME)\n", + "scope = bucket.scope(SCOPE_NAME)\n", + "collection = scope.collection(COLLECTION_NAME)\n" + ] + },
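Before moving on, it can help to verify that the cluster services are actually reachable from your environment. The following is an optional, minimal sketch that assumes the `cluster` object created above; `ping()` is part of the Couchbase Python SDK, though the exact report fields may vary slightly across SDK versions.

```python
# Optional: verify connectivity to the cluster services (assumes `cluster` from above).
ping_result = cluster.ping()

# The ping report maps each service type to the endpoints the SDK could reach.
for service_type, endpoints in ping_result.endpoints.items():
    for endpoint in endpoints:
        print(f"{service_type}: {endpoint.remote} -> {endpoint.state}")
```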
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setting Up Collections in Couchbase\n", + "\n", + "The setup_collection() function handles creating and configuring the hierarchical data organization in Couchbase:\n", + "\n", + "1. Bucket Creation:\n", + " - Checks if specified bucket exists, creates it if not\n", + " - Sets bucket properties like RAM quota (1024MB) and replication (disabled)\n", + " - Note: On Capella, buckets cannot be created programmatically; create the bucket from the Capella UI first\n", + "\n", + "2. Scope Management: \n", + " - Verifies if requested scope exists within bucket\n", + " - Creates new scope if needed (unless it's the default \"_default\" scope)\n", + "\n", + "3. Collection Setup:\n", + " - Checks for collection existence within scope\n", + " - Creates collection if it doesn't exist\n", + " - Waits 2 seconds for collection to be ready\n", + "\n", + "Additional Tasks:\n", + "- Clears any existing documents for clean state\n", + "- Implements comprehensive error handling and logging\n" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def setup_collection(cluster, bucket_name, scope_name, collection_name):\n", + " try:\n", + " # Check if bucket exists, create if it doesn't\n", + " try:\n", + " bucket = cluster.bucket(bucket_name)\n", + " except Exception as e:\n", + " bucket_settings = CreateBucketSettings(\n", + " name=bucket_name,\n", + " bucket_type='couchbase',\n", + " ram_quota_mb=1024,\n", + " flush_enabled=True,\n", + " num_replicas=0\n", + " )\n", + " cluster.buckets().create_bucket(bucket_settings)\n", + " time.sleep(2) # Wait for bucket creation to complete and become available\n", + " bucket = cluster.bucket(bucket_name)\n", + "\n", + " bucket_manager = bucket.collections()\n", + "\n", + " # Check if scope exists, create if it doesn't\n", + " scopes = bucket_manager.get_all_scopes()\n", + " scope_exists = any(scope.name == scope_name for scope in scopes)\n", + " \n", + " if not scope_exists and scope_name != \"_default\":\n", + " bucket_manager.create_scope(scope_name)\n", + "\n", + " # Check if collection exists, create if it doesn't\n", + " collections = bucket_manager.get_all_scopes()\n", + " collection_exists = any(\n", + " scope.name == scope_name and collection_name in [col.name for col in scope.collections]\n", + " for scope in collections\n", + " )\n", + "\n", + " if not collection_exists:\n", + " bucket_manager.create_collection(scope_name, collection_name)\n", + "\n", + " # Wait for collection to be ready\n", + " collection = bucket.scope(scope_name).collection(collection_name)\n", + " time.sleep(2) # Give the collection time to be ready for queries\n", + "\n", + " # Clear all documents in the collection\n", + " try:\n", + " query = f\"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`\"\n", + " cluster.query(query).execute()\n", + " except Exception as e:\n", + " print(f\"Error while clearing documents: {str(e)}. 
The collection might be empty.\")\n", + "\n", + " return collection\n", + " except Exception as e:\n", + " raise RuntimeError(f\"Error setting up collection: {str(e)}\")\n", + " \n", + "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Creating Mistral AI Embeddings Wrapper\n", + "\n", + "To keep dependencies minimal, we create a custom wrapper class that implements the LangChain Embeddings interface on top of the official mistralai client. This allows us to use Mistral AI's embedding model with Couchbase's GSI vector store.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [], + "source": [ + "class MistralAIEmbeddings(Embeddings):\n", + " \"\"\"Custom Mistral AI Embeddings wrapper for LangChain compatibility.\"\"\"\n", + " \n", + " def __init__(self, api_key: str, model: str = \"mistral-embed\"):\n", + " self.client = Mistral(api_key=api_key)\n", + " self.model = model\n", + " \n", + " def embed_documents(self, texts: List[str]) -> List[List[float]]:\n", + " \"\"\"Embed search docs.\"\"\"\n", + " try:\n", + " response = self.client.embeddings.create(\n", + " model=self.model,\n", + " inputs=texts,\n", + " )\n", + " return [embedding.embedding for embedding in response.data]\n", + " except Exception as e:\n", + " raise ValueError(f\"Error generating embeddings: {str(e)}\")\n", + " \n", + " def embed_query(self, text: str) -> List[float]:\n", + " \"\"\"Embed query text.\"\"\"\n", + " try:\n", + " response = self.client.embeddings.create(\n", + " model=self.model,\n", + " inputs=[text],\n", + " )\n", + " return response.data[0].embedding\n", + " except Exception as e:\n", + " raise ValueError(f\"Error generating query embedding: {str(e)}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Mistral Connection\n" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [], + "source": [ + "MISTRAL_API_KEY = os.getenv('MISTRAL_API_KEY') or getpass.getpass(\"Mistral API Key:\")\n", + "embeddings = MistralAIEmbeddings(api_key=MISTRAL_API_KEY, model=\"mistral-embed\")\n", + "mistral_client = Mistral(api_key=MISTRAL_API_KEY)\n" + ] + },
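Before wiring the wrapper into a vector store, it is worth a quick sanity check that it returns vectors of a consistent size. Below is a minimal sketch using the `embeddings` object defined above (mistral-embed returned 1024-dimensional vectors at the time of writing; verify against Mistral's documentation):

```python
# Sanity-check the custom wrapper by embedding a single query.
sample_vector = embeddings.embed_query("What is a distributed NoSQL database?")

# Every stored document must share this dimension for similarity search to work.
print(f"Embedding dimension: {len(sample_vector)}")
print(f"First five values: {sample_vector[:5]}")
```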
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Understanding GSI Vector Search\n", + "\n", + "### Optimizing Vector Search with Hyperscale and Composite Vector Indexes\n", + "\n", + "With Couchbase 8.0+, you can leverage the power of GSI-based vector search, which offers significant performance improvements over traditional Full-Text Search (FTS) approaches for vector-first workloads. GSI vector search provides high-performance vector similarity search with advanced filtering capabilities and is designed to scale to billions of vectors.\n", + "\n", + "#### GSI vs FTS: Choosing the Right Approach\n", + "\n", + "| Feature | GSI Vector Search | FTS Vector Search |\n", + "| --------------------- | --------------------------------------------------------------- | ----------------------------------------- |\n", + "| **Best For** | Vector-first workloads, complex filtering, high QPS performance | Hybrid search and high recall rates |\n", + "| **Couchbase Version** | 8.0.0+ | 7.6+ |\n", + "| **Filtering** | Pre-filtering with `WHERE` clauses (Composite) or post-filtering (BHIVE) | Pre-filtering with flexible ordering |\n", + "| **Scalability** | Up to billions of vectors (BHIVE) | Up to 10 million vectors |\n", + "| **Performance** | Optimized for concurrent operations with low memory footprint | Good for mixed text and vector queries |\n", + "\n", + "\n", + "#### GSI Vector Index Types\n", + "\n", + "Couchbase offers two distinct GSI vector index types, each optimized for different use cases:\n", + "\n", + "##### Hyperscale Vector Indexes (BHIVE)\n", + "\n", + "- **Best for**: Pure vector searches like content discovery, recommendations, and semantic search\n", + "- **Use when**: You primarily perform vector-only queries without complex scalar filtering\n", + "- **Features**: \n", + " - High performance with low memory footprint\n", + " - Optimized for concurrent operations\n", + " - Designed to scale to billions of vectors\n", + " - Supports post-scan filtering for basic metadata filtering\n", + "\n", + "##### Composite Vector Indexes\n", + "\n", + "- **Best for**: Filtered vector searches that combine vector similarity with scalar value filtering\n", + "- **Use when**: Your queries combine vector similarity with scalar filters that eliminate large portions of data\n", + "- **Features**: \n", + " - Efficient pre-filtering where scalar attributes reduce the vector comparison scope\n", + " - Best for well-defined workloads requiring complex filtering\n", + " - Supports range lookups combined with vector search\n", + "\n", + "#### Index Type Selection for This Tutorial\n", + "\n", + "In this tutorial, we'll demonstrate creating a **BHIVE (Hyperscale) index** and running vector similarity queries against it. BHIVE is ideal for semantic search scenarios where you want:\n", + "\n", + "1. **High-performance vector search** across large datasets\n", + "2. **Low latency** for real-time applications\n", + "3. **Scalability** to handle growing vector collections\n", + "4. 
**Concurrent operations** for multi-user environments\n", + "\n", + "The BHIVE index will provide optimal performance for our Mistral AI embedding-based semantic search implementation.\n", + "\n", + "#### Alternative: Composite Vector Index\n", + "\n", + "If your use case requires complex filtering with scalar attributes, you may want to consider using a **Composite Vector Index** instead:\n", + "\n", + "```python\n", + "# Alternative: Create a Composite index for filtered searches\n", + "vector_store.create_index(\n", + " index_type=IndexType.COMPOSITE,\n", + " index_description=\"IVF,SQ8\",\n", + " distance_metric=DistanceStrategy.COSINE,\n", + " index_name=\"mistral_composite_index\",\n", + ")\n", + "```\n", + "\n", + "**Use Composite indexes when:**\n", + "- You need to filter by document metadata or attributes before vector similarity\n", + "- Your queries combine vector search with WHERE clauses\n", + "- You have well-defined filtering requirements that can reduce the search space\n", + "\n", + "**Note**: Composite indexes enable pre-filtering with scalar attributes, making them ideal for applications where you need to search within specific categories, date ranges, or user-specific data segments.\n", + "\n", + "#### Understanding GSI Index Configuration (Couchbase 8.0 Feature)\n", + "\n", + "Before creating our BHIVE index, it's important to understand the configuration parameters that optimize vector storage and search performance. The `index_description` parameter controls how Couchbase optimizes vector storage through centroids and quantization.\n", + "\n", + "##### Index Description Format: `'IVF[<centroids>],{SQ<bits>|PQ<subquantizers>x<bits>}'`\n", + "\n", + "##### Centroids (IVF - Inverted File)\n", + "\n", + "- Controls how the dataset is subdivided for faster searches\n", + "- **More centroids** = faster search, slower training time\n", + "- **Fewer centroids** = slower search, faster training time\n", + "- If omitted (like `IVF,SQ8`), Couchbase auto-selects based on dataset size\n", + "\n", + "##### Quantization Options\n", + "\n", + "**Scalar Quantization (SQ):**\n", + "- `SQ4`, `SQ6`, `SQ8` (4, 6, or 8 bits per dimension)\n", + "- Lower memory usage, faster search, slightly reduced accuracy\n", + "\n", + "**Product Quantization (PQ):**\n", + "- Format: `PQ<subquantizers>x<bits>` (e.g., `PQ32x8`)\n", + "- Better compression for very large datasets\n", + "- More complex but can maintain accuracy with smaller index size\n", + "\n", + "##### Common Configuration Examples\n", + "\n", + "- **`IVF,SQ8`** - Auto centroids, 8-bit scalar quantization (good default)\n", + "- **`IVF1000,SQ6`** - 1000 centroids, 6-bit scalar quantization\n", + "- **`IVF,PQ32x8`** - Auto centroids, 32 subquantizers with 8 bits\n", + "\n", + "For detailed configuration options, see the [Quantization & Centroid Settings](https://docs.couchbase.com/cloud/vector-index/hyperscale-vector-index.html#algo_settings).\n", + "\n", + "For more information on GSI vector indexes, see [Couchbase GSI Vector Documentation](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html).\n", + "\n", + "##### Our Configuration Choice\n", + "\n", + "In this tutorial, we use `IVF,SQ8` which provides:\n", + "- **Auto-selected centroids** optimized for our dataset size\n", + "- **8-bit scalar quantization** for good balance of speed, memory usage, and accuracy\n", + "- **COSINE distance metric** ideal for semantic similarity search\n", + "- **Optimal performance** for most semantic search use cases" + ] + },
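To make the index discussion concrete, here is roughly what a GSI vector query looks like at the SQL++ level once documents and an index exist. This is a hedged sketch rather than a required step: it assumes the `APPROX_VECTOR_DISTANCE` function available in Couchbase 8.0+ and the default `embedding`/`text` field names that `CouchbaseQueryVectorStore` uses when writing documents.

```python
from couchbase.options import QueryOptions

# Sketch of the SQL++ that GSI vector search runs under the hood.
# Assumes documents were written by CouchbaseQueryVectorStore with its
# default field names ("embedding" and "text"); adjust if customized.
query_vector = embeddings.embed_query("distributed JSON database")

stmt = f"""
SELECT t.text,
       APPROX_VECTOR_DISTANCE(t.embedding, $qvec, "COSINE") AS dist
FROM `{CB_BUCKET_NAME}`.`{SCOPE_NAME}`.`{COLLECTION_NAME}` AS t
ORDER BY dist
LIMIT 3
"""

for row in cluster.query(stmt, QueryOptions(named_parameters={"qvec": query_vector})):
    print(f"{row['dist']:.4f}  {row['text'][:80]}")
```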
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Setting Up Couchbase GSI Vector Store\n", + "\n", + "Instead of using FTS (Full-Text Search), we'll use Couchbase's GSI (Global Secondary Index) for vector operations. GSI provides better performance for vector search operations and supports advanced index types like BHIVE and COMPOSITE indexes.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "GSI Vector Store created successfully!\n" + ] + } + ], + "source": [ + "vector_store = CouchbaseQueryVectorStore(\n", + " cluster=cluster,\n", + " bucket_name=CB_BUCKET_NAME,\n", + " scope_name=SCOPE_NAME,\n", + " collection_name=COLLECTION_NAME,\n", + " embedding=embeddings,\n", + " distance_metric=DistanceStrategy.COSINE\n", + ")\n", + "\n", + "print(\"GSI Vector Store created successfully!\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Embedding Documents\n", + "\n", + "Mistral AI's embedding model converts text fragments into vectors that capture their meaning, and the vector store's add_texts method generates and stores these embeddings in Couchbase for later retrieval. When you run this code block, you will also be prompted to add a custom text of your own to the array:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-11-07 15:50:09,439 - INFO - HTTP Request: POST https://api.mistral.ai/v1/embeddings \"HTTP/1.1 200 OK\"\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Documents added to GSI vector store successfully!\n" + ] + } + ], + "source": [ + "texts = [\n", + " \"Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON's versatility, with a foundation that is extremely fast and scalable.\",\n", + " \"It's used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.\",\n", + " input(\"custom embedding text\")\n", + "]\n", + "\n", + "# Store documents in the GSI vector store\n", + "vector_store.add_texts(texts)\n", + "\n", + "print(\"Documents added to GSI vector store successfully!\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Understanding Semantic Search in Couchbase\n", + "\n", + "Semantic search goes beyond traditional keyword matching by understanding the meaning and context behind queries. Here's how it works in Couchbase:\n", + "\n", + "## How Semantic Search Works\n", + "\n", + "1. **Vector Embeddings**: Documents and queries are converted into high-dimensional vectors using an embeddings model (in our case, Mistral AI's mistral-embed)\n", + "\n", + "2. **Similarity Calculation**: When a query is made, Couchbase compares the query vector against stored document vectors using the COSINE distance metric\n", + "\n", + "3. **Result Ranking**: Documents are ranked by their vector distance (lower distance = more similar meaning)\n", + "\n", + "4. **Flexible Configuration**: Different distance metrics (cosine, euclidean, dot product) and embedding models can be used based on your needs\n", + "\n", + "The `similarity_search_with_score` method performs this entire process, returning documents along with their similarity scores. 
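To build intuition for the distance values reported below, the following sketch computes a cosine distance by hand between two embeddings from our wrapper; `similarity_search_with_score` returns distances computed server-side in the same spirit (lower means more similar).

```python
import math

# Hand-computed cosine distance between two Mistral embeddings.
vec_a = embeddings.embed_query("Couchbase is a distributed NoSQL database")
vec_b = embeddings.embed_query("a scalable document store for JSON data")

dot = sum(a * b for a, b in zip(vec_a, vec_b))
norm_a = math.sqrt(sum(a * a for a in vec_a))
norm_b = math.sqrt(sum(b * b for b in vec_b))

# Cosine distance = 1 - cosine similarity; 0 means identical direction.
cosine_distance = 1 - dot / (norm_a * norm_b)
print(f"Cosine distance: {cosine_distance:.4f}")
```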
This enables you to find semantically related content even when exact keywords don't match.\n", + "\n", + "Now let's see semantic search in action and measure its performance with different optimization strategies.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Vector Search Performance Optimization\n", + "\n", + "Now let's measure and compare the performance benefits of different optimization strategies. We'll conduct a comprehensive performance analysis across two phases:\n", + "\n", + "## Performance Testing Phases\n", + "\n", + "1. **Phase 1 - Baseline Performance**: Test vector search without Hyperscale or Composite Vector Indexes to establish baseline metrics\n", + "\n", + "2. **Phase 2 - Vector Index-Optimized Search**: Create BHIVE index and measure performance improvements\n", + "\n", + "**Important Context:**\n", + "\n", + "- GSI performance benefits scale with dataset size and concurrent load\n", + "- With our dataset (~3 documents), improvements may be modest\n", + "- Production environments with millions of vectors show significant GSI advantages\n", + "- The combination of GSI + embeddings provides optimal semantic search performance\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Phase 1: Baseline Performance (Without GSI Index)\n", + "\n", + "First, let's test the search performance without any GSI indexes. This will help us establish a baseline for comparison.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import logging\n", + "\n", + "# Configure logging\n", + "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')\n", + "\n", + "# Phase 1: Baseline Performance (Without GSI Index)\n", + "print(\"=\"*80)\n", + "print(\"PHASE 1: BASELINE PERFORMANCE (NO GSI INDEX)\")\n", + "print(\"=\"*80)\n", + "\n", + "query = \"name a multipurpose database with distributed capability\"\n", + "\n", + "try:\n", + " # Perform the semantic search\n", + " start_time = time.time()\n", + " search_results = vector_store.similarity_search_with_score(query, k=3)\n", + " baseline_time = time.time() - start_time\n", + "\n", + " logging.info(f\"Baseline search completed in {baseline_time:.2f} seconds\")\n", + "\n", + " # Display search results\n", + " print(f\"\\nBaseline Search Results (completed in {baseline_time:.4f} seconds):\")\n", + " print(\"-\" * 80)\n", + " for i, (doc, distance) in enumerate(search_results, 1):\n", + " print(f\"[Result {i}] Vector Distance: {distance:.4f}\")\n", + " # Truncate for readability\n", + " content_preview = doc.page_content[:150] + \"...\" if len(doc.page_content) > 150 else doc.page_content\n", + " print(f\"Text: {content_preview}\")\n", + " print(\"-\" * 80)\n", + "\n", + "except Exception as e:\n", + " raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Phase 2: Vector Index-Optimized Performance (With BHIVE Index)\n", + "\n", + "Now let's create a BHIVE index and measure the performance improvements when searching with GSI optimization.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Create a BHIVE index for optimal vector search performance\n", + "print(\"\\nCreating BHIVE index for GSI optimization...\")\n", + "vector_store.create_index(\n", + " index_type=IndexType.BHIVE, \n", + " 
index_name=\"mistral_bhive_index_optimized\",\n", + " index_description=\"IVF,SQ8\"\n", + ")\n", + "print(\"BHIVE index created successfully!\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note: To create a COMPOSITE index, the below code can be used.\n", + "Choose based on your specific use case and query patterns. For this tutorial's news search scenario, either index type would work, but BHIVE might be more efficient for pure semantic search across news articles.\n", + "\n", + "vector_store.create_index(index_type=IndexType.COMPOSITE, index_name=\"pydantic_ai_composite_index\", index_description=\"IVF,SQ8\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Phase 2: Vector Index-Optimized Performance (With BHIVE Index)\n", + "print(\"\\n\" + \"=\"*80)\n", + "print(\"PHASE 2: GSI-OPTIMIZED PERFORMANCE (WITH BHIVE INDEX)\")\n", + "print(\"=\"*80)\n", + "\n", + "query = \"name a multipurpose database with distributed capability\"\n", + "\n", + "try:\n", + " # Perform the semantic search with GSI\n", + " start_time = time.time()\n", + " search_results = vector_store.similarity_search_with_score(query, k=3)\n", + " gsi_time = time.time() - start_time\n", + "\n", + " logging.info(f\"GSI-optimized search completed in {gsi_time:.2f} seconds\")\n", + "\n", + " # Display search results\n", + " print(f\"\\nVector Index-Optimized Search Results (completed in {gsi_time:.4f} seconds):\")\n", + " print(\"-\" * 80)\n", + " for i, (doc, distance) in enumerate(search_results, 1):\n", + " print(f\"[Result {i}] Vector Distance: {distance:.4f}\")\n", + " # Truncate for readability\n", + " content_preview = doc.page_content[:150] + \"...\" if len(doc.page_content) > 150 else doc.page_content\n", + " print(f\"Text: {content_preview}\")\n", + " print(\"-\" * 80)\n", + "\n", + "except Exception as e:\n", + " raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Performance Summary\n", + "\n", + "Let's analyze the performance improvements achieved through GSI optimization.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(\"\\n\" + \"=\"*80)\n", + "print(\"VECTOR SEARCH PERFORMANCE OPTIMIZATION SUMMARY\")\n", + "print(\"=\"*80)\n", + "\n", + "print(f\"\\n\ud83d\udcca Performance Comparison:\")\n", + "print(f\"{'Optimization Level':<35} {'Time (seconds)':<20} {'Status'}\")\n", + "print(\"-\" * 80)\n", + "print(f\"{'Phase 1 - Baseline (No Index)':<35} {baseline_time:.4f}{'':16} \u26aa Baseline\")\n", + "print(f\"{'Phase 2 - Vector Index-Optimized (BHIVE)':<35} {gsi_time:.4f}{'':16} \u2705 Optimized\")\n", + "\n", + "# Calculate improvement\n", + "if baseline_time > gsi_time:\n", + " speedup = baseline_time / gsi_time\n", + " improvement = ((baseline_time - gsi_time) / baseline_time) * 100\n", + " print(f\"\\n\u2728 GSI Performance Gain: {speedup:.2f}x faster ({improvement:.1f}% improvement)\")\n", + "elif gsi_time > baseline_time:\n", + " slowdown_pct = ((gsi_time - baseline_time) / baseline_time) * 100\n", + " print(f\"\\n\u26a0\ufe0f Note: GSI was {slowdown_pct:.1f}% slower than baseline in this run\")\n", + " print(f\" This can happen with small datasets. 
GSI benefits emerge with scale.\")\n", + "else:\n", + " print(f\"\\n\u2696\ufe0f Performance: Comparable to baseline\")\n", + "\n", + "print(\"\\n\" + \"-\"*80)\n", + "print(\"KEY INSIGHTS:\")\n", + "print(\"-\"*80)\n", + "print(\"1. \ud83d\ude80 GSI Optimization:\")\n", + "print(\" \u2022 BHIVE indexes excel with large-scale datasets (millions+ vectors)\")\n", + "print(\" \u2022 Performance gains increase with dataset size and concurrent queries\")\n", + "print(\" \u2022 Optimal for production workloads with sustained traffic patterns\")\n", + "\n", + "print(\"\\n2. \ud83d\udce6 Dataset Size Impact:\")\n", + "print(f\" \u2022 Current dataset: ~3 sample documents\")\n", + "print(\" \u2022 At this scale, performance differences may be minimal or variable\")\n", + "print(\" \u2022 Significant gains typically seen with 10M+ vectors\")\n", + "\n", + "print(\"\\n3. \ud83c\udfaf When to Use GSI:\")\n", + "print(\" \u2022 Large-scale vector search applications\")\n", + "print(\" \u2022 High query-per-second (QPS) requirements\")\n", + "print(\" \u2022 Multi-user concurrent access scenarios\")\n", + "print(\" \u2022 Production environments requiring scalability\")\n", + "\n", + "print(\"\\n\" + \"=\"*80)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Conclusion\n", + "\n", + "This tutorial demonstrated how to use Mistral AI's embedding capabilities with Couchbase's GSI vector search, including comprehensive performance analysis. Key takeaways include:\n", + "\n", + "## What We Covered\n", + "\n", + "1. **Semantic Search Fundamentals**: Understanding how vector embeddings enable meaning-based search\n", + "2. **Mistral AI Integration**: Creating a custom LangChain wrapper for Mistral AI's powerful mistral-embed model\n", + "3. **Performance Testing**: Conducting baseline vs GSI-optimized performance comparisons\n", + "4. **GSI Index Types**: Understanding BHIVE (pure vector search) and COMPOSITE (filtered searches) indexes\n", + "5. **Index Configuration**: Learning about centroids, quantization, and optimization settings\n", + "\n", + "## Key Benefits of This Approach\n", + "\n", + "1. **High-Performance Vector Search**: GSI provides optimized vector operations with low latency\n", + "2. **Scalability**: BHIVE indexes designed to handle billions of vectors efficiently\n", + "3. **Production-Ready**: Optimal for applications requiring high QPS and concurrent access\n", + "4. **Flexible Configuration**: Customizable index settings for different use cases\n", + "5. 
**Advanced Filtering**: COMPOSITE indexes enable complex scalar + vector queries\n", + "\n", + "## Performance Insights\n", + "\n", + "- GSI benefits scale with dataset size and query load\n", + "- Small datasets may show modest improvements\n", + "- Production environments with millions of vectors see significant performance gains\n", + "- Consider your specific use case when choosing between BHIVE and COMPOSITE indexes\n", + "\n", + "## Next Steps\n", + "\n", + "- Scale your dataset to explore GSI performance at higher volumes\n", + "- Experiment with different index configurations (IVF centroids, quantization settings)\n", + "- Try COMPOSITE indexes for filtered search scenarios\n", + "- Integrate this solution into your production RAG or semantic search applications\n", + "\n", + "The combination of Mistral AI's embeddings and Couchbase's GSI vector search provides a powerful, scalable foundation for building intelligent search applications.\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} \ No newline at end of file diff --git a/mistralai/gsi/.env.sample b/mistralai/search_based/.env.sample similarity index 100% rename from mistralai/gsi/.env.sample rename to mistralai/search_based/.env.sample diff --git a/mistralai/fts/frontmatter.md b/mistralai/search_based/frontmatter.md similarity index 100% rename from mistralai/fts/frontmatter.md rename to mistralai/search_based/frontmatter.md diff --git a/mistralai/fts/mistralai.ipynb b/mistralai/search_based/mistralai.ipynb similarity index 92% rename from mistralai/fts/mistralai.ipynb rename to mistralai/search_based/mistralai.ipynb index 83f2360a..a3085781 100644 --- a/mistralai/fts/mistralai.ipynb +++ b/mistralai/search_based/mistralai.ipynb @@ -7,11 +7,11 @@ "source": [ "# Introduction\n", "\n", - "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database, [Mistral AI](https://mistral.ai/) as the AI-powered embedding Model. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system from scratch. Alternatively, if you want to perform semantic search using the GSI index, please take a look at [this.](https://developer.couchbase.com/tutorial-mistralai-couchbase-vector-search-with-global-secondary-index)\n", + "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database, [Mistral AI](https://mistral.ai/) as the AI-powered embedding Model. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. 
This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system from scratch. For guidance on choosing the right vector index for your use case, see the [Couchbase documentation](https://docs.couchbase.com/server/current/vector-search/choose-the-right-vector-index.html). Alternatively, if you want to perform semantic search using Couchbase Hyperscale or Composite Vector Indexes, please take a look at [this.](https://developer.couchbase.com/tutorial-mistralai-couchbase-vector-search-with-hyperscale-or-composite-vector-index)\n", "\n", - "Couchbase is a NoSQL distributed document database (JSON) with many of the best features of a relational DBMS: SQL, distributed ACID transactions, and much more. [Couchbase Capella™](https://cloud.couchbase.com/sign-up) is the easiest way to get started, but you can also download and run [Couchbase Server](http://couchbase.com/downloads) on-premises.\n", + "Couchbase is a NoSQL distributed document database (JSON) with many of the best features of a relational DBMS: SQL, distributed ACID transactions, and much more. [Couchbase Capella\u2122](https://cloud.couchbase.com/sign-up) is the easiest way to get started, but you can also download and run [Couchbase Server](http://couchbase.com/downloads) on-premises.\n", "\n", - "Mistral AI is a research lab building the best open source models in the world. La Plateforme enables developers and enterprises to build new products and applications, powered by Mistral’s open source and commercial LLMs. \n", + "Mistral AI is a research lab building the best open source models in the world. La Plateforme enables developers and enterprises to build new products and applications, powered by Mistral\u2019s open source and commercial LLMs. 
\n", "\n", "The [Mistral AI APIs](https://console.mistral.ai/) empower LLM applications via:\n", "\n", @@ -151,7 +151,7 @@ "text": [ "Cluster URL: localhost\n", "Couchbase username: Administrator\n", - "Couchbase password: ········\n", + "Couchbase password: \u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\n", "Couchbase bucket: mistralai\n", "Couchbase scope: _default\n", "Couchbase collection: mistralai\n" @@ -260,8 +260,8 @@ "outputs": [], "source": [ "texts = [\n", - " \"Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.\",\n", - " \"It’s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.\",\n", + " \"Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON\u2019s versatility, with a foundation that is extremely fast and scalable.\",\n", + " \"It\u2019s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.\",\n", " input(\"custom embedding text\")\n", "]\n", "embeddings = mistral_client.embeddings.create(\n", @@ -335,7 +335,7 @@ "output_type": "stream", "text": [ "Found answer: 7a4c24dd-393f-4f08-ae42-69ea7009dcda; score: 1.7320726542316662\n", - "Answer text: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.\n" + "Answer text: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON\u2019s versatility, with a foundation that is extremely fast and scalable.\n" ] } ], @@ -389,4 +389,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} +} \ No newline at end of file diff --git a/mistralai/fts/mistralai_index.json b/mistralai/search_based/mistralai_index.json similarity index 100% rename from mistralai/fts/mistralai_index.json rename to mistralai/search_based/mistralai_index.json diff --git a/openrouter-deepseek/gsi/RAG_with_Couchbase_and_Openrouter_Deepseek.ipynb b/openrouter-deepseek/gsi/RAG_with_Couchbase_and_Openrouter_Deepseek.ipynb deleted file mode 100644 index cbde7f8c..00000000 --- a/openrouter-deepseek/gsi/RAG_with_Couchbase_and_Openrouter_Deepseek.ipynb +++ /dev/null @@ -1,1089 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Introduction \n", - "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database and [Deepseek V3 as the language model provider (via OpenRouter or direct API)](https://deepseek.ai/) and OpenAI for embeddings. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system using GSI( Global Secondary Index) from scratch. 
Alternatively if you want to perform semantic search using the FTS index, please take a look at [this.](https://developer.couchbase.com/tutorial-openrouter-deepseek-with-fts/)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# How to run this tutorial\n", - "\n", - "This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/openrouter-deepseek/gsi/RAG_with_Couchbase_and_Openrouter_Deepseek.ipynb).\n", - "\n", - "You can either download the notebook file and run it on [Google Colab](https://colab.research.google.com/) or run it on your system by setting up the Python environment." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Before you start\n", - "\n", - "## Get Credentials for OpenRouter and Deepseek\n", - "* Sign up for an account at [OpenRouter](https://openrouter.ai/) to get your API key\n", - "* OpenRouter provides access to Deepseek models, so no separate Deepseek credentials are needed\n", - "* Store your OpenRouter API key securely as it will be used to access the models\n", - "* For [Deepseek](https://deepseek.ai/) models, you can use the default models provided by OpenRouter\n", - "\n", - "## Create and Deploy Your Free Tier Operational cluster on Capella\n", - "\n", - "To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. This account provides you with an environment where you can explore and learn about Capella with no time constraint.\n", - "\n", - "To learn more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n", - "\n", - "Note: To run this this tutorial, you will need Capella with Couchbase Server version 8.0 or above as GSI vector search is supported only from version 8.0\n", - "\n", - "### Couchbase Capella Configuration\n", - "\n", - "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met.\n", - "\n", - "* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the required bucket (Read and Write) used in the application.\n", - "* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Setting the Stage: Installing Necessary Libraries\n", - "\n", - "To build our semantic search engine, we need a robust set of tools. The libraries we install handle everything from connecting to databases to performing complex machine learning tasks." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Note: you may need to restart the kernel to use updated packages.\n" - ] - } - ], - "source": [ - "%pip install --quiet datasets==3.5.0 langchain-couchbase==0.5.0 langchain-deepseek==0.1.3 langchain-openai==0.3.13 python-dotenv==1.1.1" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Importing Necessary Libraries\n", - "\n", - "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, embedding generation, and dataset loading." 
- ] - }, - { - "cell_type": "code", - "execution_count": 22, - "metadata": {}, - "outputs": [], - "source": [ - "import getpass\n", - "import json\n", - "import logging\n", - "import os\n", - "import time\n", - "from datetime import timedelta\n", - "\n", - "from couchbase.auth import PasswordAuthenticator\n", - "from couchbase.cluster import Cluster\n", - "from couchbase.exceptions import (CouchbaseException,\n", - " InternalServerFailureException,\n", - " QueryIndexAlreadyExistsException,ServiceUnavailableException)\n", - "from couchbase.management.buckets import CreateBucketSettings\n", - "from couchbase.management.search import SearchIndex\n", - "from couchbase.options import ClusterOptions\n", - "from datasets import load_dataset\n", - "from dotenv import load_dotenv\n", - "from langchain_core.globals import set_llm_cache\n", - "from langchain_core.output_parsers import StrOutputParser\n", - "from langchain_core.prompts.chat import ChatPromptTemplate\n", - "from langchain_core.runnables import RunnablePassthrough\n", - "from langchain_couchbase.cache import CouchbaseCache\n", - "from langchain_couchbase.vectorstores import CouchbaseQueryVectorStore\n", - "from langchain_couchbase.vectorstores import DistanceStrategy\n", - "from langchain_couchbase.vectorstores import IndexType\n", - "from langchain_openai import OpenAIEmbeddings" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Setup Logging\n", - "Logging is configured to track the progress of the script and capture any errors or warnings." - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [], - "source": [ - "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', force=True)\n", - "\n", - "# Suppress httpx logging\n", - "logging.getLogger('httpx').setLevel(logging.CRITICAL)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Environment Variables and Configuration\n", - "\n", - "This section handles loading and validating environment variables and configuration settings:\n", - "#\n", - "1. API Keys:\n", - " - Supports either direct Deepseek API or OpenRouter API access\n", - " - Prompts for API key input if not found in environment\n", - " - Requires OpenAI API key for embeddings\n", - "#\n", - "2. 
Couchbase Settings:\n", - " - Connection details (host, username, password)\n", - " - Bucket, scope and collection names\n", - " - Vector search index configuration\n", - " - Cache collection settings\n", - "#\n", - "The code validates that all required credentials are present before proceeding.\n", - "It allows flexible configuration through environment variables or interactive prompts,\n", - "with sensible defaults for local development.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [], - "source": [ - "# Load environment variables from .env file if it exists\n", - "load_dotenv(override= True)\n", - "\n", - "# API Keys\n", - "# Allow either Deepseek API directly or via OpenRouter\n", - "DEEPSEEK_API_KEY = os.getenv('DEEPSEEK_API_KEY')\n", - "OPENROUTER_API_KEY = os.getenv('OPENROUTER_API_KEY')\n", - "\n", - "if not DEEPSEEK_API_KEY and not OPENROUTER_API_KEY:\n", - " api_choice = input('Choose API (1 for Deepseek direct, 2 for OpenRouter): ')\n", - " if api_choice == '1':\n", - " DEEPSEEK_API_KEY = getpass.getpass('Enter your Deepseek API Key: ')\n", - " else:\n", - " OPENROUTER_API_KEY = getpass.getpass('Enter your OpenRouter API Key: ')\n", - "\n", - "OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or getpass.getpass('Enter your OpenAI API Key: ')\n", - "\n", - "# Couchbase Settings\n", - "CB_HOST = os.getenv('CB_HOST') or input('Enter your Couchbase host (default: couchbase://localhost): ') or 'couchbase://localhost'\n", - "CB_USERNAME = os.getenv('CB_USERNAME') or input('Enter your Couchbase username (default: Administrator): ') or 'Administrator'\n", - "CB_PASSWORD = os.getenv('CB_PASSWORD') or getpass.getpass('Enter your Couchbase password (default: password): ') or 'password'\n", - "CB_BUCKET_NAME = os.getenv('CB_BUCKET_NAME') or input('Enter your Couchbase bucket name (default: query-vector-search-testing): ') or 'query-vector-search-testing'\n", - "SCOPE_NAME = os.getenv('SCOPE_NAME') or input('Enter your scope name (default: shared): ') or 'shared'\n", - "COLLECTION_NAME = os.getenv('COLLECTION_NAME') or input('Enter your collection name (default: deepseek): ') or 'deepseek'\n", - "CACHE_COLLECTION = os.getenv('CACHE_COLLECTION') or input('Enter your cache collection name (default: cache): ') or 'cache'\n", - "\n", - "# Check if required credentials are set\n", - "required_creds = {\n", - " 'OPENAI_API_KEY': OPENAI_API_KEY,\n", - " 'CB_HOST': CB_HOST,\n", - " 'CB_USERNAME': CB_USERNAME,\n", - " 'CB_PASSWORD': CB_PASSWORD,\n", - " 'CB_BUCKET_NAME': CB_BUCKET_NAME\n", - "}\n", - "\n", - "# Add the API key that was chosen\n", - "if DEEPSEEK_API_KEY:\n", - " required_creds['DEEPSEEK_API_KEY'] = DEEPSEEK_API_KEY\n", - "elif OPENROUTER_API_KEY:\n", - " required_creds['OPENROUTER_API_KEY'] = OPENROUTER_API_KEY\n", - "else:\n", - " raise ValueError(\"Either Deepseek API Key or OpenRouter API Key must be provided\")\n", - "\n", - "for cred_name, cred_value in required_creds.items():\n", - " if not cred_value:\n", - " raise ValueError(f\"{cred_name} is not set\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Connecting to the Couchbase Cluster\n", - "Connecting to a Couchbase cluster is the foundation of our project. Couchbase will serve as our primary data store, handling all the storage and retrieval operations required for our semantic search engine. 
By establishing this connection, we enable our application to interact with the database, allowing us to perform operations such as storing embeddings, querying data, and managing collections. This connection is the gateway through which all data will flow, so ensuring it's set up correctly is paramount.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-17 15:40:27,133 - INFO - Successfully connected to Couchbase\n" - ] - } - ], - "source": [ - "try:\n", - " auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)\n", - " options = ClusterOptions(auth)\n", - " cluster = Cluster(CB_HOST, options)\n", - " cluster.wait_until_ready(timedelta(seconds=5))\n", - " logging.info(\"Successfully connected to Couchbase\")\n", - "except Exception as e:\n", - " raise ConnectionError(f\"Failed to connect to Couchbase: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Setting Up Collections in Couchbase\n", - "\n", - "The setup_collection() function handles creating and configuring the hierarchical data organization in Couchbase:\n", - "\n", - "1. Bucket Creation:\n", - " - Checks if specified bucket exists, creates it if not\n", - " - Sets bucket properties like RAM quota (1024MB) and replication (disabled)\n", - " - Note: If you are using Capella, create a bucket manually called vector-search-testing(or any name you prefer) with the same properties.\n", - "\n", - "2. Scope Management: \n", - " - Verifies if requested scope exists within bucket\n", - " - Creates new scope if needed (unless it's the default \"_default\" scope)\n", - "\n", - "3. Collection Setup:\n", - " - Checks for collection existence within scope\n", - " - Creates collection if it doesn't exist\n", - " - Waits 2 seconds for collection to be ready\n", - "\n", - "Additional Tasks:\n", - "- Clears any existing documents for clean state\n", - "- Implements comprehensive error handling and logging\n", - "\n", - "The function is called twice to set up:\n", - "1. Main collection for vector embeddings\n", - "2. Cache collection for storing results\n" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-17 15:41:01,398 - INFO - Bucket 'query-vector-search-testing' exists.\n", - "2025-09-17 15:41:01,410 - INFO - Collection 'deepseek' does not exist. Creating it...\n", - "2025-09-17 15:41:01,453 - INFO - Collection 'deepseek' created successfully.\n", - "2025-09-17 15:41:03,712 - INFO - All documents cleared from the collection.\n", - "2025-09-17 15:41:03,713 - INFO - Bucket 'query-vector-search-testing' exists.\n", - "2025-09-17 15:41:03,728 - INFO - Collection 'cache' already exists. Skipping creation.\n", - "2025-09-17 15:41:05,821 - INFO - All documents cleared from the collection.\n" - ] - }, - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "def setup_collection(cluster, bucket_name, scope_name, collection_name):\n", - " try:\n", - " # Check if bucket exists, create if it doesn't\n", - " try:\n", - " bucket = cluster.bucket(bucket_name)\n", - " logging.info(f\"Bucket '{bucket_name}' exists.\")\n", - " except Exception as e:\n", - " logging.info(f\"Bucket '{bucket_name}' does not exist. 
Creating it...\")\n", - " bucket_settings = CreateBucketSettings(\n", - " name=bucket_name,\n", - " bucket_type='couchbase',\n", - " ram_quota_mb=1024,\n", - " flush_enabled=True,\n", - " num_replicas=0\n", - " )\n", - " cluster.buckets().create_bucket(bucket_settings)\n", - " time.sleep(2) # Wait for bucket creation to complete and become available\n", - " bucket = cluster.bucket(bucket_name)\n", - " logging.info(f\"Bucket '{bucket_name}' created successfully.\")\n", - "\n", - " bucket_manager = bucket.collections()\n", - "\n", - " # Check if scope exists, create if it doesn't\n", - " scopes = bucket_manager.get_all_scopes()\n", - " scope_exists = any(scope.name == scope_name for scope in scopes)\n", - " \n", - " if not scope_exists and scope_name != \"_default\":\n", - " logging.info(f\"Scope '{scope_name}' does not exist. Creating it...\")\n", - " bucket_manager.create_scope(scope_name)\n", - " logging.info(f\"Scope '{scope_name}' created successfully.\")\n", - "\n", - " # Check if collection exists, create if it doesn't\n", - " collections = bucket_manager.get_all_scopes()\n", - " collection_exists = any(\n", - " scope.name == scope_name and collection_name in [col.name for col in scope.collections]\n", - " for scope in collections\n", - " )\n", - "\n", - " if not collection_exists:\n", - " logging.info(f\"Collection '{collection_name}' does not exist. Creating it...\")\n", - " bucket_manager.create_collection(scope_name, collection_name)\n", - " logging.info(f\"Collection '{collection_name}' created successfully.\")\n", - " else:\n", - " logging.info(f\"Collection '{collection_name}' already exists. Skipping creation.\")\n", - "\n", - " # Wait for collection to be ready\n", - " collection = bucket.scope(scope_name).collection(collection_name)\n", - " time.sleep(2) # Give the collection time to be ready for queries\n", - "\n", - " # Clear all documents in the collection\n", - " try:\n", - " query = f\"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`\"\n", - " cluster.query(query).execute()\n", - " logging.info(\"All documents cleared from the collection.\")\n", - " except Exception as e:\n", - " logging.warning(f\"Error while clearing documents: {str(e)}. The collection might be empty.\")\n", - "\n", - " return collection\n", - " except Exception as e:\n", - " raise RuntimeError(f\"Error setting up collection: {str(e)}\")\n", - " \n", - "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME)\n", - "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, CACHE_COLLECTION)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Creating the Embeddings client\n", - "This section creates an OpenAI embeddings client using the OpenAI API key.\n", - "The embeddings client is configured to use the \"text-embedding-3-small\" model,\n", - "which converts text into numerical vector representations.\n", - "These vector embeddings are essential for semantic search and similarity matching.\n", - "The client will be used by the vector store to generate embeddings for documents." 
- ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-17 15:41:27,149 - INFO - Successfully created OpenAI embeddings client\n" - ] - } - ], - "source": [ - "try:\n", - "    embeddings = OpenAIEmbeddings(\n", - "        api_key=OPENAI_API_KEY,\n", - "        model=\"text-embedding-3-small\"\n", - "    )\n", - "    logging.info(\"Successfully created OpenAI embeddings client\")\n", - "except Exception as e:\n", - "    raise ValueError(f\"Error creating OpenAI embeddings client: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Setting Up the Couchbase Vector Store\n", - "A vector store is where we'll keep our embeddings. Unlike a Search (FTS) index, which is used for text-based search, the vector store is specifically designed to handle embeddings and perform similarity searches. When a user inputs a query, the search engine converts the query into an embedding and compares it against the embeddings stored in the vector store. This allows the engine to find documents that are semantically similar to the query, even if they don't contain the exact same words. By setting up the vector store in Couchbase, we create a powerful tool that enables our search engine to understand and retrieve information based on the meaning and context of the query, rather than just the specific words used." - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-17 15:41:55,394 - INFO - Successfully created vector store\n" - ] - } - ], - "source": [ - "try:\n", - "    vector_store = CouchbaseQueryVectorStore(\n", - "        cluster=cluster,\n", - "        bucket_name=CB_BUCKET_NAME,\n", - "        scope_name=SCOPE_NAME,\n", - "        collection_name=COLLECTION_NAME,\n", - "        embedding=embeddings,\n", - "        distance_metric=DistanceStrategy.COSINE\n", - "    )\n", - "    logging.info(\"Successfully created vector store\")\n", - "except Exception as e:\n", - "    raise ValueError(f\"Failed to create vector store: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Load the BBC News Dataset\n", - "To build a search engine, we need data to search through. We use the BBC News dataset from RealTimeData, which provides real-world news articles. This dataset contains news articles from BBC covering various topics and time periods. Loading the dataset is a crucial step because it provides the raw material that our search engine will work with. The quality and diversity of the news articles make it an excellent choice for testing and refining our search engine, ensuring it can handle real-world news content effectively.\n", - "\n", - "The BBC News dataset allows us to work with authentic news articles, enabling us to build and test a search engine that can effectively process and retrieve relevant news content. The dataset is loaded using the Hugging Face datasets library, specifically accessing the \"RealTimeData/bbc_news_alltime\" dataset with the \"2024-12\" version.",
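- "\n", - "Once the loading cell below has run, each record exposes the article text under the \"content\" field. A small illustrative peek at the data (assumes the dataset is already loaded):\n", - "\n", - "```python\n", - "sample = news_dataset[0]\n", - "print(sample[\"content\"][:200])  # first 200 characters of the first article\n", - "```"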
- ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-17 15:42:04,530 - INFO - Successfully loaded the BBC News dataset with 2687 rows.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Loaded the BBC News dataset with 2687 rows\n" - ] - } - ], - "source": [ - "try:\n", - "    news_dataset = load_dataset(\n", - "        \"RealTimeData/bbc_news_alltime\", \"2024-12\", split=\"train\"\n", - "    )\n", - "    print(f\"Loaded the BBC News dataset with {len(news_dataset)} rows\")\n", - "    logging.info(f\"Successfully loaded the BBC News dataset with {len(news_dataset)} rows.\")\n", - "except Exception as e:\n", - "    raise ValueError(f\"Error loading the BBC News dataset: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Cleaning up the Data\n", - "We will use the content of the news articles for our RAG system.\n", - "\n", - "The dataset contains a few duplicate records. We are removing them to avoid duplicate results in the retrieval stage of our RAG system." - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "We have 1749 unique articles in our database.\n" - ] - } - ], - "source": [ - "news_articles = news_dataset[\"content\"]\n", - "unique_articles = set()\n", - "for article in news_articles:\n", - "    if article:\n", - "        unique_articles.add(article)\n", - "unique_news_articles = list(unique_articles)\n", - "print(f\"We have {len(unique_news_articles)} unique articles in our database.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Saving Data to the Vector Store\n", - "To efficiently handle the large number of articles, we process them in batches of 50 articles at a time. This batch processing approach helps manage memory usage and provides better control over the ingestion process.\n", - "\n", - "We first filter out any articles that exceed 50,000 characters to avoid potential issues with token limits. Then, using the vector store's add_texts method, we add the filtered articles to our vector database. The batch_size parameter controls how many articles are processed in each iteration.\n", - "\n", - "This approach offers several benefits:\n", - "1. Memory Efficiency: Processing in smaller batches prevents memory overload\n", - "2. Progress Tracking: Easier to monitor and track the ingestion progress\n", - "3. 
Resource Management: Better control over CPU and network resource utilization\n", - "\n", - "We use a conservative batch size of 50 to ensure reliable operation.\n", - "The optimal batch size depends on many factors including:\n", - "- Document sizes being inserted\n", - "- Available system resources\n", - "- Network conditions\n", - "- Concurrent workload\n", - "\n", - "Consider measuring performance with your specific workload before adjusting.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-17 16:08:51,054 - INFO - Document ingestion completed successfully.\n" - ] - } - ], - "source": [ - "batch_size = 50\n", - "\n", - "# Automatic Batch Processing\n", - "articles = [article for article in unique_news_articles if article and len(article) <= 50000]\n", - "\n", - "try:\n", - " vector_store.add_texts(\n", - " texts=articles,\n", - " batch_size=batch_size\n", - " )\n", - " logging.info(\"Document ingestion completed successfully.\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Failed to save documents to vector store: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Setting Up the LLM Model\n", - "In this section, we set up the Large Language Model (LLM) for our RAG system. We're using the Deepseek model, which can be accessed through two different methods:\n", - "\n", - "1. **Deepseek API Key**: This is obtained directly from Deepseek's platform (https://deepseek.ai) by creating an account and subscribing to their API services. With this key, you can access Deepseek's models directly using the `ChatDeepSeek` class from the `langchain_deepseek` package.\n", - "\n", - "2. **OpenRouter API Key**: OpenRouter (https://openrouter.ai) is a service that provides unified access to multiple LLM providers, including Deepseek. You can obtain an API key by creating an account on OpenRouter's website. This approach uses the `ChatOpenAI` class from `langchain_openai` but with a custom base URL pointing to OpenRouter's API endpoint.\n", - "\n", - "The key difference is that OpenRouter acts as an intermediary service that can route your requests to various LLM providers, while the Deepseek API gives you direct access to only Deepseek's models. OpenRouter can be useful if you want to switch between different LLM providers without changing your code significantly.\n", - "\n", - "In our implementation, we check for both keys and prioritize using the Deepseek API directly if available, falling back to OpenRouter if not. 
The model is configured with temperature=0 to ensure deterministic, focused responses suitable for RAG applications.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 37, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-18 11:18:25,192 - INFO - Successfully created Deepseek LLM client through OpenRouter\n" - ] - } - ], - "source": [ - "from langchain_deepseek import ChatDeepSeek\n", - "from langchain_openai import ChatOpenAI\n", - "\n", - "if DEEPSEEK_API_KEY:\n", - "    try:\n", - "        llm = ChatDeepSeek(\n", - "            api_key=DEEPSEEK_API_KEY,\n", - "            model_name=\"deepseek-chat\",\n", - "            temperature=0\n", - "        )\n", - "        logging.info(\"Successfully created Deepseek LLM client\")\n", - "    except Exception as e:\n", - "        raise ValueError(f\"Error creating Deepseek LLM client: {str(e)}\")\n", - "elif OPENROUTER_API_KEY:\n", - "    try:\n", - "        llm = ChatOpenAI(\n", - "            api_key=OPENROUTER_API_KEY,\n", - "            base_url=\"https://openrouter.ai/api/v1\",\n", - "            model=\"deepseek/deepseek-chat-v3.1\",\n", - "            temperature=0,\n", - "        )\n", - "        logging.info(\"Successfully created Deepseek LLM client through OpenRouter\")\n", - "    except Exception as e:\n", - "        raise ValueError(f\"Error creating Deepseek LLM client: {str(e)}\")\n", - "else:\n", - "    raise ValueError(\"Either Deepseek API Key or OpenRouter API Key must be provided\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Perform Semantic Search\n", - "Semantic search in Couchbase involves converting queries and documents into vector representations using an embeddings model. These vectors capture the semantic meaning of the text and are stored directly in Couchbase. When a query is made, Couchbase performs a similarity search by comparing the query vector against the stored document vectors. The similarity metric used for this comparison is configurable, allowing flexibility in how the relevance of documents is determined. Common metrics include cosine similarity, Euclidean distance, or dot product, but other metrics can be implemented based on specific use cases. Different embedding models like BERT, Word2Vec, or GloVe can also be used depending on the application's needs, with the vectors generated by these models stored and searched within Couchbase itself.\n", - "\n", - "In the provided code, the search process begins by recording the start time, followed by executing the `similarity_search_with_score` method of the `CouchbaseQueryVectorStore`. This method searches Couchbase for the most relevant documents based on the vector similarity to the query. The search results include the document content and a distance that reflects how closely each document aligns with the query in the defined semantic space. The time taken to perform this search is then calculated and logged, and the results are displayed, showing the most relevant documents along with their distances. This approach leverages Couchbase as both a storage and retrieval engine for vector data, enabling efficient and scalable semantic searches. The integration of vector storage and search capabilities within Couchbase allows for sophisticated semantic search operations without relying on external services for vector storage or comparison.",
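- "\n", - "One detail worth keeping in mind: because the vector store was configured with DistanceStrategy.COSINE, the score returned by `similarity_search_with_score` is a distance, so lower values mean closer matches. A minimal helper to view it as a similarity instead (assuming cosine distance = 1 - cosine similarity):\n", - "\n", - "```python\n", - "def similarity_from_distance(distance: float) -> float:\n", - "    # For cosine distance, similarity = 1 - distance\n", - "    return 1.0 - distance\n", - "```"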
- ] - }, - { - "cell_type": "code", - "execution_count": 21, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-17 16:11:07,177 - INFO - Semantic search completed in 2.46 seconds\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Semantic Search Results (completed in 2.46 seconds):\n", - "--------------------------------------------------------------------------------\n", - "Distance: 0.3693, Text: The Littler effect - how darts hit the bullseye\n", - "\n", - "Teenager Luke Littler began his bid to win the 2025 PDC World Darts Championship with a second-round win against Ryan Meikle. Here we assess Littler's impact after a remarkable rise which saw him named BBC Young Sports Personality of the Year and runner-up in the main award to athlete Keely Hodgkinson.\n", - "\n", - "One year ago, he was barely a household name in his own home. Now he is a sporting phenomenon. After emerging from obscurity aged 16 to reach the World Championship final, the life of Luke Littler and the sport he loves has been transformed. Viewing figures, ticket sales and social media interest have rocketed. Darts has hit the bullseye. This Christmas more than 100,000 children are expected to be opening Littler-branded magnetic dartboards as presents. His impact has helped double the number of junior academies, prompted plans to expand the World Championship and generated interest in darts from Saudi Arabian backers.\n", - "\n", - "Just months after taking his GCSE exams and ranked 164th in the world, Littler beat former champions Raymond van Barneveld and Rob Cross en route to the PDC World Championship final in January, before his run ended with a 7-4 loss to Luke Humphries. With his nickname 'The Nuke' on his purple and yellow shirt and the Alexandra Palace crowd belting out his walk-on song, Pitbull's tune Greenlight, he became an instant hit. Electric on the stage, calm off it. The down-to-earth teenager celebrated with a kebab and computer games. \"We've been watching his progress since he was about seven. He was on our radar, but we never anticipated what would happen. The next thing we know 'Littlermania' is spreading everywhere,\" PDC president Barry Hearn told BBC Sport. A peak TV audience of 3.7 million people watched the final - easily Sky's biggest figure for a non-football sporting event. The teenager from Warrington in Cheshire was too young to legally drive or drink alcohol, but earned £200,000 for finishing second - part of £1m prize money in his first year as a professional - and an invitation to the elite Premier League competition. He turned 17 later in January but was he too young for the demanding event over 17 Thursday nights in 17 locations? He ended up winning the whole thing, and hit a nine-dart finish against Humphries in the final. From Bahrain to Wolverhampton, Littler claimed 10 titles in 2024 and is now eyeing the World Championship.\n", - "\n", - "As he progressed at the Ally Pally, the Manchester United fan was sent a good luck message by the club's former midfielder and ex-England captain David Beckham. In 12 months, Littler's Instagram followers have risen from 4,000 to 1.3m. Commercial backers include a clothing range, cereal firm and train company and he will appear in a reboot of the TV darts show Bullseye. Google say he was the most searched-for athlete online in the UK during 2024. On the back of his success, Littler darts, boards, cabinets, shirts are being snapped up in big numbers. 
\"This Christmas the junior magnetic dartboard is selling out, we're talking over 100,000. They're 20 quid and a great introduction for young children,\" said Garry Plummer, the boss of sponsors Target Darts, who first signed a deal with Littler's family when he was aged 12. \"All the toy shops want it, they all want him - 17, clean, doesn't drink, wonderful.\"\n", - "\n", - "Littler beat Luke Humphries to win the Premier League title in May\n", - "\n", - "The number of academies for children under the age of 16 has doubled in the last year, says Junior Darts Corporation chairman Steve Brown. There are 115 dedicated groups offering youngsters equipment, tournaments and a place to develop, with bases including Australia, Bulgaria, Greece, Norway, USA and Mongolia. \"We've seen so many inquiries from around the world, it's been such a boom. It took us 14 years to get 1,600 members and within 12 months we have over 3,000, and waiting lists,\" said Brown. \"When I played darts as a child, I was quite embarrassed to tell my friends what my hobby was. All these kids playing darts now are pretty popular at school. It's a bit rock 'n roll and recognised as a cool thing to do.\" Plans are being hatched to extend the World Championship by four days and increase the number of players from 96 to 128. That will boost the number of tickets available by 25,000 to 115,000 but Hearn reckons he could sell three times as many. He says Saudi Arabia wants to host a tournament, which is likely to happen if no-alcohol regulations are relaxed. \"They will change their rules in the next 12 months probably for certain areas having alcohol, and we'll take darts there and have a party in Saudi,\" he said. \"When I got involved in darts, the total prize money was something like £300,000 for the year. This year it will go to £20m. I expect in five years' time, we'll be playing for £40m.\"\n", - "\n", - "Former electrician Cross charged to the 2018 world title in his first full season, while Adrian Lewis and Michael van Gerwen were multiple victors in their 20s and 16-time champion Phil ‘The Power’ Taylor is widely considered the greatest of all time. Littler is currently fourth in the world rankings, although that is based on a two-year Order of Merit. There have been suggestions from others the spotlight on the teenager means world number one Humphries, 29, has been denied the coverage he deserves, but no darts player has made a mark at such a young age as Littler. \"Luke Humphries is another fabulous player who is going to be around for years. Sport is a very brutal world. It is about winning and claiming the high ground. There will be envy around,\" Hearn said. \"Luke Littler is the next Tiger Woods for darts so they better get used to it, and the only way to compete is to get better.\" World number 38 Martin Lukeman was awestruck as he described facing a peak Littler after being crushed 16-3 in the Grand Slam final, with the teenager winning 15 consecutive legs. \"I can't compete with that, it was like Godly. He was relentless, he is so good it's ridiculous,\" he said. Lukeman can still see the benefits he brings, adding: \"What he's done for the sport is brilliant. If it wasn't for him, our wages wouldn't be going up. There's more sponsors, more money coming in, all good.\" Hearn feels future competition may come from players even younger than Littler. \"I watched a 10-year-old a few months ago who averaged 104.89 and checked out a 4-3 win with a 136 finish. 
They smell the money, the fame and put the hard work in,\" he said. How much better Littler can get is guesswork, although Plummer believes he wants to reach new heights. \"He never says 'how good was I?' But I think he wants to break records and beat Phil Taylor's 16 World Championships and 16 World Matchplay titles,\" he said. \"He's young enough to do it.\" A version of this article was originally published on 29 November.\n", - "• None Know a lot about Littler? Take our quiz\n", - "--------------------------------------------------------------------------------\n", - "Distance: 0.3900, Text: Luke Littler has risen from 164th to fourth in the rankings in a year\n", - "\n", - "A tearful Luke Littler hit a tournament record 140.91 set average as he started his bid for the PDC World Championship title with a dramatic 3-1 win over Ryan Meikle. The 17-year-old made headlines around the world when he reached the tournament final in January, where he lost to Luke Humphries. Starting this campaign on Saturday, Littler was millimetres away from a nine-darter when he missed double 12 as he blew Meikle away in the fourth and final set of the second-round match. Littler was overcome with emotion at the end, cutting short his on-stage interview. \"It was probably the toughest game I've ever played. I had to fight until the end,\" he said later in a news conference. \"As soon as the question came on stage and then boom, the tears came. It was just a bit too much to speak on stage. \"It is the worst game I have played. I have never felt anything like that tonight.\" Admitting to nerves during the match, he told Sky Sports: \"Yes, probably the biggest time it's hit me. Coming into it I was fine, but as soon as [referee] George Noble said 'game on', I couldn't throw them.\" Littler started slowly against Meikle, who had two darts for the opening set, but he took the lead by twice hitting double 20. Meikle did not look overawed against his fellow Englishman and levelled, but Littler won the third set and exploded into life in the fourth. The tournament favourite hit four maximum 180s as he clinched three straight legs in 11, 10 and 11 darts for a record set average, and 100.85 overall. Meanwhile, two seeds crashed out on Saturday night – five-time world champion Raymond van Barneveld lost to Welshman Nick Kenny, while England's Ryan Joyce beat Danny Noppert. Australian Damon Heta was another to narrowly miss out on a nine-darter, just failing on double 12 when throwing for the match in a 3-1 win over Connor Scutt. Ninth seed Heta hit four 100-plus checkouts to come from a set down against Scutt in a match in which both men averaged more than 97.\n", - "\n", - "Littler was hugged by his parents after victory over Meikle\n", - "\n", - "... 
(output truncated for brevity)\n" - ] - } - ], - "source": [ - "query = \"What were Luke Littler's key achievements and records in his recent PDC World Championship match?\"\n", - "\n", - "try:\n", - "    # Perform the semantic search\n", - "    start_time = time.time()\n", - "    search_results = vector_store.similarity_search_with_score(query, k=10)\n", - "    search_elapsed_time = time.time() - start_time\n", - "\n", - "    logging.info(f\"Semantic search completed in {search_elapsed_time:.2f} seconds\")\n", - "\n", - "    # Display search results\n", - "    print(f\"\\nSemantic Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", - "    print(\"-\" * 80)\n", - "\n", - "    for doc, score in search_results:\n", - "        print(f\"Distance: {score:.4f}, Text: {doc.page_content}\")\n", - "        print(\"-\" * 80)\n", - "\n", - "except CouchbaseException as e:\n", - "    raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n", - "except Exception as e:\n", - "    raise RuntimeError(f\"Unexpected error: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Optimizing Vector Search with Global Secondary Index (GSI)\n", - "\n", - "While the above semantic search using similarity_search_with_score works effectively, we can significantly improve query performance by leveraging Global Secondary Index (GSI) in Couchbase.\n", - "\n", - "Couchbase offers three types of vector indexes, but for GSI-based vector search we focus on two main types:\n", - "\n", - "Hyperscale Vector Indexes (BHIVE)\n", - "- Best for pure vector searches - content discovery, recommendations, semantic search\n", - "- High performance with low memory footprint - designed to scale to billions of vectors\n", - "- Optimized for concurrent operations - supports simultaneous searches and inserts\n", - "- Use when: You primarily perform vector-only queries without complex scalar filtering\n", - "- Ideal for: Large-scale semantic search, recommendation systems, content discovery\n", - "\n", - "Composite Vector Indexes \n", - "- Best for filtered vector searches - combines vector search with scalar value filtering\n", - "- Efficient pre-filtering - scalar attributes reduce the vector comparison scope\n", - "- Use when: Your queries combine vector similarity with scalar filters that eliminate large portions of data\n", - "- Ideal for: Compliance-based filtering, user-specific searches, time-bounded queries\n", - "\n", - "Choosing the Right Index Type\n", - "- Start with Hyperscale Vector Index for pure vector searches and large datasets\n", - "- Use Composite Vector Index when scalar filters significantly reduce your search space\n", - "- Consider your dataset size: Hyperscale scales to billions, Composite works well for tens of millions to billions\n", - "\n", - "For more information on GSI vector indexes, see [Couchbase GSI Vector Documentation](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html).\n", - "\n", - "\n", - "## Understanding Index Configuration (Couchbase 8.0 Feature)\n", - "\n", - "The index_description parameter controls how Couchbase optimizes vector storage and search performance through centroids and quantization:\n", - "\n", - "Format: `'IVF[<centroids>],{PQ|SQ}<settings>'`, where the centroid count is optional\n", - "\n", - "Centroids (IVF - Inverted File):\n", - "- Controls how the dataset is subdivided for faster searches\n", - "- More centroids = faster search, slower training \n", - "- Fewer centroids = slower search, faster training\n", - "- If omitted (like IVF,SQ8), Couchbase auto-selects based on dataset size\n", -
"\n", - "Quantization Options:\n", - "- SQ (Scalar Quantization): SQ4, SQ6, SQ8 (4, 6, or 8 bits per dimension)\n", - "- PQ (Product Quantization): PQ<subquantizers>x<bits> (e.g., PQ32x8)\n", - "- Higher values = better accuracy, larger index size\n", - "\n", - "Common Examples:\n", - "- IVF,SQ8 - Auto centroids, 8-bit scalar quantization (good default)\n", - "- IVF1000,SQ6 - 1000 centroids, 6-bit scalar quantization \n", - "- IVF,PQ32x8 - Auto centroids, 32 subquantizers with 8 bits\n", - "\n", - "For detailed configuration options, see the [Quantization & Centroid Settings](https://docs.couchbase.com/cloud/vector-index/hyperscale-vector-index.html#algo_settings).\n", - "\n", - "In the code below, we demonstrate creating a BHIVE index. The create_index() method takes an index type (BHIVE or COMPOSITE) and a description parameter for optimization settings. Alternatively, GSI indexes can be created manually from the Couchbase UI." - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "metadata": {}, - "outputs": [], - "source": [ - "vector_store.create_index(index_type=IndexType.BHIVE, index_name=\"openrouterdeepseek_bhive_index\", index_description=\"IVF,SQ8\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The example below shows running the same similarity search, but now using the BHIVE GSI index we created above. You'll notice improved performance as the index efficiently retrieves data.\n", - "\n", - "**Important**: When using Composite indexes, scalar filters take precedence over vector similarity, which can improve performance for filtered searches but may miss some semantically relevant results that don't match the scalar criteria.\n", - "\n", - "Note: In GSI vector search, the distance represents the vector distance between the query and document embeddings. Lower distances indicate higher similarity, while higher distances indicate lower similarity." - ] - }, - { - "cell_type": "code", - "execution_count": 31, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-18 11:17:19,626 - INFO - Semantic search completed in 0.88 seconds\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Semantic Search Results (completed in 0.88 seconds):\n", - "--------------------------------------------------------------------------------\n", - "Distance: 0.3694, Text: The Littler effect - how darts hit the bullseye\n", - "\n", - "Teenager Luke Littler began his bid to win the 2025 PDC World Darts Championship with a second-round win against Ryan Meikle. Here we assess Littler's impact after a remarkable rise which saw him named BBC Young Sports Personality of the Year and runner-up in the main award to athlete Keely Hodgkinson.\n", - "\n", - "One year ago, he was barely a household name in his own home. Now he is a sporting phenomenon. After emerging from obscurity aged 16 to reach the World Championship final, the life of Luke Littler and the sport he loves has been transformed. Viewing figures, ticket sales and social media interest have rocketed. Darts has hit the bullseye. This Christmas more than 100,000 children are expected to be opening Littler-branded magnetic dartboards as presents. 
His impact has helped double the number of junior academies, prompted plans to expand the World Championship and generated interest in darts from Saudi Arabian backers.\n", - "\n", - "Just months after taking his GCSE exams and ranked 164th in the world, Littler beat former champions Raymond van Barneveld and Rob Cross en route to the PDC World Championship final in January, before his run ended with a 7-4 loss to Luke Humphries. With his nickname 'The Nuke' on his purple and yellow shirt and the Alexandra Palace crowd belting out his walk-on song, Pitbull's tune Greenlight, he became an instant hit. Electric on the stage, calm off it. The down-to-earth teenager celebrated with a kebab and computer games. \"We've been watching his progress since he was about seven. He was on our radar, but we never anticipated what would happen. The next thing we know 'Littlermania' is spreading everywhere,\" PDC president Barry Hearn told BBC Sport. A peak TV audience of 3.7 million people watched the final - easily Sky's biggest figure for a non-football sporting event. The teenager from Warrington in Cheshire was too young to legally drive or drink alcohol, but earned £200,000 for finishing second - part of £1m prize money in his first year as a professional - and an invitation to the elite Premier League competition. He turned 17 later in January but was he too young for the demanding event over 17 Thursday nights in 17 locations? He ended up winning the whole thing, and hit a nine-dart finish against Humphries in the final. From Bahrain to Wolverhampton, Littler claimed 10 titles in 2024 and is now eyeing the World Championship.\n", - "\n", - "As he progressed at the Ally Pally, the Manchester United fan was sent a good luck message by the club's former midfielder and ex-England captain David Beckham. In 12 months, Littler's Instagram followers have risen from 4,000 to 1.3m. Commercial backers include a clothing range, cereal firm and train company and he will appear in a reboot of the TV darts show Bullseye. Google say he was the most searched-for athlete online in the UK during 2024. On the back of his success, Littler darts, boards, cabinets, shirts are being snapped up in big numbers. \"This Christmas the junior magnetic dartboard is selling out, we're talking over 100,000. They're 20 quid and a great introduction for young children,\" said Garry Plummer, the boss of sponsors Target Darts, who first signed a deal with Littler's family when he was aged 12. \"All the toy shops want it, they all want him - 17, clean, doesn't drink, wonderful.\"\n", - "\n", - "Littler beat Luke Humphries to win the Premier League title in May\n", - "\n", - "The number of academies for children under the age of 16 has doubled in the last year, says Junior Darts Corporation chairman Steve Brown. There are 115 dedicated groups offering youngsters equipment, tournaments and a place to develop, with bases including Australia, Bulgaria, Greece, Norway, USA and Mongolia. \"We've seen so many inquiries from around the world, it's been such a boom. It took us 14 years to get 1,600 members and within 12 months we have over 3,000, and waiting lists,\" said Brown. \"When I played darts as a child, I was quite embarrassed to tell my friends what my hobby was. All these kids playing darts now are pretty popular at school. It's a bit rock 'n roll and recognised as a cool thing to do.\" Plans are being hatched to extend the World Championship by four days and increase the number of players from 96 to 128. 
That will boost the number of tickets available by 25,000 to 115,000 but Hearn reckons he could sell three times as many. He says Saudi Arabia wants to host a tournament, which is likely to happen if no-alcohol regulations are relaxed. \"They will change their rules in the next 12 months probably for certain areas having alcohol, and we'll take darts there and have a party in Saudi,\" he said. \"When I got involved in darts, the total prize money was something like £300,000 for the year. This year it will go to £20m. I expect in five years' time, we'll be playing for £40m.\"\n", - "\n", - "Former electrician Cross charged to the 2018 world title in his first full season, while Adrian Lewis and Michael van Gerwen were multiple victors in their 20s and 16-time champion Phil ‘The Power’ Taylor is widely considered the greatest of all time. Littler is currently fourth in the world rankings, although that is based on a two-year Order of Merit. There have been suggestions from others the spotlight on the teenager means world number one Humphries, 29, has been denied the coverage he deserves, but no darts player has made a mark at such a young age as Littler. \"Luke Humphries is another fabulous player who is going to be around for years. Sport is a very brutal world. It is about winning and claiming the high ground. There will be envy around,\" Hearn said. \"Luke Littler is the next Tiger Woods for darts so they better get used to it, and the only way to compete is to get better.\" World number 38 Martin Lukeman was awestruck as he described facing a peak Littler after being crushed 16-3 in the Grand Slam final, with the teenager winning 15 consecutive legs. \"I can't compete with that, it was like Godly. He was relentless, he is so good it's ridiculous,\" he said. Lukeman can still see the benefits he brings, adding: \"What he's done for the sport is brilliant. If it wasn't for him, our wages wouldn't be going up. There's more sponsors, more money coming in, all good.\" Hearn feels future competition may come from players even younger than Littler. \"I watched a 10-year-old a few months ago who averaged 104.89 and checked out a 4-3 win with a 136 finish. They smell the money, the fame and put the hard work in,\" he said. How much better Littler can get is guesswork, although Plummer believes he wants to reach new heights. \"He never says 'how good was I?' But I think he wants to break records and beat Phil Taylor's 16 World Championships and 16 World Matchplay titles,\" he said. \"He's young enough to do it.\" A version of this article was originally published on 29 November.\n", - "• None Know a lot about Littler? Take our quiz\n", - "--------------------------------------------------------------------------------\n", - "Distance: 0.3901, Text: Luke Littler has risen from 164th to fourth in the rankings in a year\n", - "\n", - "A tearful Luke Littler hit a tournament record 140.91 set average as he started his bid for the PDC World Championship title with a dramatic 3-1 win over Ryan Meikle. The 17-year-old made headlines around the world when he reached the tournament final in January, where he lost to Luke Humphries. Starting this campaign on Saturday, Littler was millimetres away from a nine-darter when he missed double 12 as he blew Meikle away in the fourth and final set of the second-round match. Littler was overcome with emotion at the end, cutting short his on-stage interview. \"It was probably the toughest game I've ever played. 
I had to fight until the end,\" he said later in a news conference. \"As soon as the question came on stage and then boom, the tears came. It was just a bit too much to speak on stage. \"It is the worst game I have played. I have never felt anything like that tonight.\" Admitting to nerves during the match, he told Sky Sports: \"Yes, probably the biggest time it's hit me. Coming into it I was fine, but as soon as [referee] George Noble said 'game on', I couldn't throw them.\" Littler started slowly against Meikle, who had two darts for the opening set, but he took the lead by twice hitting double 20. Meikle did not look overawed against his fellow Englishman and levelled, but Littler won the third set and exploded into life in the fourth. The tournament favourite hit four maximum 180s as he clinched three straight legs in 11, 10 and 11 darts for a record set average, and 100.85 overall. Meanwhile, two seeds crashed out on Saturday night – five-time world champion Raymond van Barneveld lost to Welshman Nick Kenny, while England's Ryan Joyce beat Danny Noppert. Australian Damon Heta was another to narrowly miss out on a nine-darter, just failing on double 12 when throwing for the match in a 3-1 win over Connor Scutt. Ninth seed Heta hit four 100-plus checkouts to come from a set down against Scutt in a match in which both men averaged more than 97.\n", - "\n", - "Littler was hugged by his parents after victory over Meikle\n", - "\n", - "... (output truncated for brevity)\n" - ] - } - ], - "source": [ - "\n", - "query = \"What were Luke Littler's key achievements and records in his recent PDC World Championship match?\"\n", - "\n", - "try:\n", - " # Perform the semantic search\n", - " start_time = time.time()\n", - " search_results = vector_store.similarity_search_with_score(query, k=10)\n", - " search_elapsed_time = time.time() - start_time\n", - "\n", - " logging.info(f\"Semantic search completed in {search_elapsed_time:.2f} seconds\")\n", - "\n", - " # Display search results\n", - " print(f\"\\nSemantic Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", - " print(\"-\" * 80)\n", - "\n", - " for doc, score in search_results:\n", - " print(f\"Distance: {score:.4f}, Text: {doc.page_content}\")\n", - " print(\"-\" * 80)\n", - "\n", - "except CouchbaseException as e:\n", - " raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n", - "except Exception as e:\n", - " raise RuntimeError(f\"Unexpected error: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Note: To create a COMPOSITE index, the below code can be used.\n", - "Choose based on your specific use case and query patterns. For this tutorial's news search scenario, either index type would work, but BHIVE might be more efficient for pure semantic search across news articles." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vector_store.create_index(index_type=IndexType.COMPOSITE, index_name=\"openrouterdeepseek_composite_index\", index_description=\"IVF,SQ8\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Setting Up a Couchbase Cache\n", - "To further optimize our system, we set up a Couchbase-based cache. A cache is a temporary storage layer that holds data that is frequently accessed, speeding up operations by reducing the need to repeatedly retrieve the same information from the database. 
In our setup, the cache will help us accelerate repetitive tasks, such as serving responses to repeated queries. By implementing a cache, we enhance the overall performance of our search engine, ensuring that it can handle high query volumes and deliver results quickly.\n", - "\n", - "Caching is particularly valuable in scenarios where users may submit similar queries multiple times or where certain pieces of information are frequently requested. By storing these in a cache, we can significantly reduce the time it takes to respond to these queries, improving the user experience.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-17 16:10:11,473 - INFO - Successfully created cache\n" - ] - } - ], - "source": [ - "try:\n", - "    cache = CouchbaseCache(\n", - "        cluster=cluster,\n", - "        bucket_name=CB_BUCKET_NAME,\n", - "        scope_name=SCOPE_NAME,\n", - "        collection_name=CACHE_COLLECTION,\n", - "    )\n", - "    logging.info(\"Successfully created cache\")\n", - "    set_llm_cache(cache)\n", - "except Exception as e:\n", - "    raise ValueError(f\"Failed to create cache: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Retrieval-Augmented Generation (RAG) with Couchbase and LangChain\n", - "Couchbase and LangChain can be seamlessly integrated to create RAG (Retrieval-Augmented Generation) chains, enhancing the process of generating contextually relevant responses. In this setup, Couchbase serves as the vector store, where embeddings of documents are stored. When a query is made, LangChain retrieves the most relevant documents from Couchbase by comparing the query’s embedding with the stored document embeddings. These documents, which provide contextual information, are then passed to a generative language model within LangChain.\n", - "\n", - "The language model, equipped with the context from the retrieved documents, generates a response that is both informed and contextually accurate. This integration allows the RAG chain to leverage Couchbase’s efficient storage and retrieval capabilities, while LangChain handles the generation of responses based on the context provided by the retrieved documents. Together, they create a powerful system that can deliver highly relevant and accurate answers by combining the strengths of both retrieval and generation.",
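- "\n", - "The chain below uses `vector_store.as_retriever()` with its defaults. If you want the LLM to see more or less context, the number of retrieved documents can be tuned through the standard LangChain retriever kwargs (a sketch; the exact default for k depends on the vector store):\n", - "\n", - "```python\n", - "# Retrieve 4 documents per query instead of the default\n", - "retriever = vector_store.as_retriever(search_kwargs={\"k\": 4})\n", - "```"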
- ] - }, - { - "cell_type": "code", - "execution_count": 38, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-09-18 11:18:34,032 - INFO - Successfully created RAG chain\n" - ] - } - ], - "source": [ - "# Create RAG prompt template\n", - "rag_prompt = ChatPromptTemplate.from_messages([\n", - " (\"system\", \"You are a helpful assistant that answers questions based on the provided context.\"),\n", - " (\"human\", \"Context: {context}\\n\\nQuestion: {question}\")\n", - "])\n", - "\n", - "# Create RAG chain\n", - "rag_chain = (\n", - " {\"context\": vector_store.as_retriever(), \"question\": RunnablePassthrough()}\n", - " | rag_prompt\n", - " | llm\n", - " | StrOutputParser()\n", - ")\n", - "logging.info(\"Successfully created RAG chain\")" - ] - }, - { - "cell_type": "code", - "execution_count": 39, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "RAG Response: Based on the provided context, Luke Littler's key achievements and records in his recent PDC World Championship match (second-round win against Ryan Meikle) were:\n", - "\n", - "* **Tournament Record Set Average:** He hit a tournament record 140.91 set average during the match.\n", - "* **Near Nine-Darter:** He was \"millimetres away from a nine-darter\" when he missed double 12.\n", - "* **Dominant Final Set:** He won the fourth and final set in just 32 darts (the minimum possible is 27), which included hitting four maximum 180s and clinching three straight legs in 11, 10, and 11 darts.\n", - "* **Overall High Average:** He maintained a high overall match average of 100.85.\n", - "RAG response generated in 0.49 seconds\n" - ] - } - ], - "source": [ - "try:\n", - " start_time = time.time()\n", - " rag_response = rag_chain.invoke(query)\n", - " rag_elapsed_time = time.time() - start_time\n", - "\n", - " print(f\"RAG Response: {rag_response}\")\n", - " print(f\"RAG response generated in {rag_elapsed_time:.2f} seconds\")\n", - "except InternalServerFailureException as e:\n", - " if \"query request rejected\" in str(e):\n", - " print(\"Error: Search request was rejected due to rate limiting. Please try again later.\")\n", - " else:\n", - " print(f\"Internal server error occurred: {str(e)}\")\n", - "except Exception as e:\n", - " print(f\"Unexpected error occurred: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Using Couchbase as a caching mechanism\n", - "Couchbase can be effectively used as a caching mechanism for RAG (Retrieval-Augmented Generation) responses by storing and retrieving precomputed results for specific queries. This approach enhances the system's efficiency and speed, particularly when dealing with repeated or similar queries. When a query is first processed, the RAG chain retrieves relevant documents, generates a response using the language model, and then stores this response in Couchbase, with the query serving as the key.\n", - "\n", - "For subsequent requests with the same query, the system checks Couchbase first. If a cached response is found, it is retrieved directly from Couchbase, bypassing the need to re-run the entire RAG process. This significantly reduces response time because the computationally expensive steps of document retrieval and response generation are skipped. 
Couchbase's role in this setup is to provide a fast and scalable storage solution for caching these responses, ensuring that frequently asked queries can be answered more quickly and efficiently.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 42, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Query 1: What happened in the match between Fulham and Liverpool?\n", - "Response: In the match between Fulham and Liverpool, Liverpool played the majority of the game with 10 men after Andy Robertson received a red card in the 17th minute. Despite being a player down, Liverpool came from behind twice to secure a 2-2 draw. Diogo Jota scored an 86th-minute equalizer to earn Liverpool a point. The performance was praised for its resilience, with Fulham's Antonee Robinson noting that Liverpool \"didn't feel like they had 10 men at all.\" Liverpool maintained over 60% possession and led in attacking metrics such as shots and chances. Both managers acknowledged the strong efforts of their teams in what was described as an enthralling encounter.\n", - "Time taken: 4.65 seconds\n", - "\n", - "Query 2: What were Luke Littler's key achievements and records in his recent PDC World Championship match?\n", - "Response: Based on the provided context, Luke Littler's key achievements and records in his recent PDC World Championship match (second-round win against Ryan Meikle) were:\n", - "\n", - "* **Tournament Record Set Average:** He hit a tournament record 140.91 set average during the match.\n", - "* **Near Nine-Darter:** He was \"millimetres away from a nine-darter\" when he missed double 12.\n", - "* **Dominant Final Set:** He won the fourth and final set in just 32 darts (the minimum possible is 27), which included hitting four maximum 180s and clinching three straight legs in 11, 10, and 11 darts.\n", - "* **Overall High Average:** He maintained a high overall match average of 100.85.\n", - "Time taken: 0.45 seconds\n", - "\n", - "Query 3: What happened in the match between Fulham and Liverpool?\n", - "Response: In the match between Fulham and Liverpool, Liverpool played the majority of the game with 10 men after Andy Robertson received a red card in the 17th minute. Despite being a player down, Liverpool came from behind twice to secure a 2-2 draw. Diogo Jota scored an 86th-minute equalizer to earn Liverpool a point. The performance was praised for its resilience, with Fulham's Antonee Robinson noting that Liverpool \"didn't feel like they had 10 men at all.\" Liverpool maintained over 60% possession and led in attacking metrics such as shots and chances. 
Both managers acknowledged the strong efforts of their teams in what was described as an enthralling encounter.\n", - "Time taken: 1.15 seconds\n" - ] - } - ], - "source": [ - "try:\n", - "    queries = [\n", - "        \"What happened in the match between Fulham and Liverpool?\",\n", - "        \"What were Luke Littler's key achievements and records in his recent PDC World Championship match?\", # Repeated query\n", - "        \"What happened in the match between Fulham and Liverpool?\", # Repeated query\n", - "    ]\n", - "\n", - "    for i, query in enumerate(queries, 1):\n", - "        print(f\"\\nQuery {i}: {query}\")\n", - "        start_time = time.time()\n", - "\n", - "        response = rag_chain.invoke(query)\n", - "        elapsed_time = time.time() - start_time\n", - "        print(f\"Response: {response}\")\n", - "        print(f\"Time taken: {elapsed_time:.2f} seconds\")\n", - "\n", - "except InternalServerFailureException as e:\n", - "    if \"query request rejected\" in str(e):\n", - "        print(\"Error: Search request was rejected due to rate limiting. Please try again later.\")\n", - "    else:\n", - "        print(f\"Internal server error occurred: {str(e)}\")\n", - "except Exception as e:\n", - "    print(f\"Unexpected error occurred: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Conclusion\n", - "By following these steps, you'll have a fully functional semantic search engine that leverages the strengths of Couchbase and Deepseek (via OpenRouter). This guide is designed not just to show you how to build the system, but also to explain why each step is necessary, giving you a deeper understanding of the principles behind semantic search and how to implement it effectively. Whether you're a newcomer to software development or an experienced developer looking to expand your skills, this guide will provide you with the knowledge and tools you need to create a powerful, AI-driven search engine." - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": ".venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.13.3" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/openrouter-deepseek/gsi/.env.sample b/openrouter-deepseek/query_based/.env.sample similarity index 100% rename from openrouter-deepseek/gsi/.env.sample rename to openrouter-deepseek/query_based/.env.sample diff --git a/openrouter-deepseek/query_based/RAG_with_Couchbase_and_Openrouter_Deepseek.ipynb b/openrouter-deepseek/query_based/RAG_with_Couchbase_and_Openrouter_Deepseek.ipynb new file mode 100644 index 00000000..7b3caf65 --- /dev/null +++ b/openrouter-deepseek/query_based/RAG_with_Couchbase_and_Openrouter_Deepseek.ipynb @@ -0,0 +1,1089 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Introduction \n", + "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database, [Deepseek V3 as the language model provider (via OpenRouter or direct API)](https://deepseek.ai/), and OpenAI for embeddings. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. 
This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system using Couchbase Hyperscale and Composite Vector Indexes from scratch. For guidance on choosing the right vector index for your use case, see the [Couchbase documentation](https://docs.couchbase.com/server/current/vector-search/choose-the-right-vector-index.html). Alternatively, if you want to perform semantic search using a Couchbase Search Vector Index, please take a look at [this](https://developer.couchbase.com/tutorial-openrouter-deepseek-with-search-vector-index/)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# How to run this tutorial\n", + "\n", + "This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/openrouter-deepseek/query_based/RAG_with_Couchbase_and_Openrouter_Deepseek.ipynb).\n", + "\n", + "You can either download the notebook file and run it on [Google Colab](https://colab.research.google.com/) or run it on your system by setting up the Python environment." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Before you start\n", + "\n", + "## Get Credentials for OpenRouter and Deepseek\n", + "* Sign up for an account at [OpenRouter](https://openrouter.ai/) to get your API key\n", + "* OpenRouter provides access to Deepseek models, so no separate Deepseek credentials are needed\n", + "* Store your OpenRouter API key securely as it will be used to access the models\n", + "* For [Deepseek](https://deepseek.ai/) models, you can use the default models provided by OpenRouter\n", + "\n", + "## Create and Deploy Your Free Tier Operational cluster on Capella\n", + "\n", + "To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. This account provides you with an environment where you can explore and learn about Capella with no time constraint.\n", + "\n", + "To learn more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n", + "\n", + "Note: To run this tutorial, you will need Capella with Couchbase Server version 8.0 or above, as GSI vector search is supported only from version 8.0.\n", + "\n", + "### Couchbase Capella Configuration\n", + "\n", + "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met.\n", + "\n", + "* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the required bucket (Read and Write) used in the application.\n", + "* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setting the Stage: Installing Necessary Libraries\n", + "\n", + "To build our semantic search engine, we need a robust set of tools. The libraries we install handle everything from connecting to databases to performing complex machine learning tasks." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], + "source": [ + "%pip install --quiet datasets==3.5.0 langchain-couchbase==0.5.0 langchain-deepseek==0.1.3 langchain-openai==0.3.13 python-dotenv==1.1.1" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Importing Necessary Libraries\n", + "\n", + "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, embedding generation, and dataset loading." + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [], + "source": [ + "import getpass\n", + "import json\n", + "import logging\n", + "import os\n", + "import time\n", + "from datetime import timedelta\n", + "\n", + "from couchbase.auth import PasswordAuthenticator\n", + "from couchbase.cluster import Cluster\n", + "from couchbase.exceptions import (CouchbaseException,\n", + " InternalServerFailureException,\n", + " QueryIndexAlreadyExistsException,ServiceUnavailableException)\n", + "from couchbase.management.buckets import CreateBucketSettings\n", + "from couchbase.management.search import SearchIndex\n", + "from couchbase.options import ClusterOptions\n", + "from datasets import load_dataset\n", + "from dotenv import load_dotenv\n", + "from langchain_core.globals import set_llm_cache\n", + "from langchain_core.output_parsers import StrOutputParser\n", + "from langchain_core.prompts.chat import ChatPromptTemplate\n", + "from langchain_core.runnables import RunnablePassthrough\n", + "from langchain_couchbase.cache import CouchbaseCache\n", + "from langchain_couchbase.vectorstores import CouchbaseQueryVectorStore\n", + "from langchain_couchbase.vectorstores import DistanceStrategy\n", + "from langchain_couchbase.vectorstores import IndexType\n", + "from langchain_openai import OpenAIEmbeddings" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup Logging\n", + "Logging is configured to track the progress of the script and capture any errors or warnings." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', force=True)\n", + "\n", + "# Suppress httpx logging\n", + "logging.getLogger('httpx').setLevel(logging.CRITICAL)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Environment Variables and Configuration\n", + "\n", + "This section handles loading and validating environment variables and configuration settings:\n", + "#\n", + "1. API Keys:\n", + " - Supports either direct Deepseek API or OpenRouter API access\n", + " - Prompts for API key input if not found in environment\n", + " - Requires OpenAI API key for embeddings\n", + "#\n", + "2. 
Couchbase Settings:\n", + " - Connection details (host, username, password)\n", + " - Bucket, scope and collection names\n", + " - Vector search index configuration\n", + " - Cache collection settings\n", + "\n", + "The code validates that all required credentials are present before proceeding.\n", + "It allows flexible configuration through environment variables or interactive prompts,\n", + "with sensible defaults for local development.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [ + "# Load environment variables from .env file if it exists\n", + "load_dotenv(override=True)\n", + "\n", + "# API Keys\n", + "# Allow either Deepseek API directly or via OpenRouter\n", + "DEEPSEEK_API_KEY = os.getenv('DEEPSEEK_API_KEY')\n", + "OPENROUTER_API_KEY = os.getenv('OPENROUTER_API_KEY')\n", + "\n", + "if not DEEPSEEK_API_KEY and not OPENROUTER_API_KEY:\n", + " api_choice = input('Choose API (1 for Deepseek direct, 2 for OpenRouter): ')\n", + " if api_choice == '1':\n", + " DEEPSEEK_API_KEY = getpass.getpass('Enter your Deepseek API Key: ')\n", + " else:\n", + " OPENROUTER_API_KEY = getpass.getpass('Enter your OpenRouter API Key: ')\n", + "\n", + "OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or getpass.getpass('Enter your OpenAI API Key: ')\n", + "\n", + "# Couchbase Settings\n", + "CB_HOST = os.getenv('CB_HOST') or input('Enter your Couchbase host (default: couchbase://localhost): ') or 'couchbase://localhost'\n", + "CB_USERNAME = os.getenv('CB_USERNAME') or input('Enter your Couchbase username (default: Administrator): ') or 'Administrator'\n", + "CB_PASSWORD = os.getenv('CB_PASSWORD') or getpass.getpass('Enter your Couchbase password (default: password): ') or 'password'\n", + "CB_BUCKET_NAME = os.getenv('CB_BUCKET_NAME') or input('Enter your Couchbase bucket name (default: query-vector-search-testing): ') or 'query-vector-search-testing'\n", + "SCOPE_NAME = os.getenv('SCOPE_NAME') or input('Enter your scope name (default: shared): ') or 'shared'\n", + "COLLECTION_NAME = os.getenv('COLLECTION_NAME') or input('Enter your collection name (default: deepseek): ') or 'deepseek'\n", + "CACHE_COLLECTION = os.getenv('CACHE_COLLECTION') or input('Enter your cache collection name (default: cache): ') or 'cache'\n", + "\n", + "# Check if required credentials are set\n", + "required_creds = {\n", + " 'OPENAI_API_KEY': OPENAI_API_KEY,\n", + " 'CB_HOST': CB_HOST,\n", + " 'CB_USERNAME': CB_USERNAME,\n", + " 'CB_PASSWORD': CB_PASSWORD,\n", + " 'CB_BUCKET_NAME': CB_BUCKET_NAME\n", + "}\n", + "\n", + "# Add the API key that was chosen\n", + "if DEEPSEEK_API_KEY:\n", + " required_creds['DEEPSEEK_API_KEY'] = DEEPSEEK_API_KEY\n", + "elif OPENROUTER_API_KEY:\n", + " required_creds['OPENROUTER_API_KEY'] = OPENROUTER_API_KEY\n", + "else:\n", + " raise ValueError(\"Either Deepseek API Key or OpenRouter API Key must be provided\")\n", + "\n", + "for cred_name, cred_value in required_creds.items():\n", + " if not cred_value:\n", + " raise ValueError(f\"{cred_name} is not set\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Connecting to the Couchbase Cluster\n", + "Connecting to a Couchbase cluster is the foundation of our project. Couchbase will serve as our primary data store, handling all the storage and retrieval operations required for our semantic search engine. 
By establishing this connection, we enable our application to interact with the database, allowing us to perform operations such as storing embeddings, querying data, and managing collections. This connection is the gateway through which all data will flow, so ensuring it's set up correctly is paramount.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-17 15:40:27,133 - INFO - Successfully connected to Couchbase\n" + ] + } + ], + "source": [ + "try:\n", + " auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)\n", + " options = ClusterOptions(auth)\n", + " cluster = Cluster(CB_HOST, options)\n", + " cluster.wait_until_ready(timedelta(seconds=5))\n", + " logging.info(\"Successfully connected to Couchbase\")\n", + "except Exception as e:\n", + " raise ConnectionError(f\"Failed to connect to Couchbase: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setting Up Collections in Couchbase\n", + "\n", + "The setup_collection() function handles creating and configuring the hierarchical data organization in Couchbase:\n", + "\n", + "1. Bucket Creation:\n", + " - Checks if specified bucket exists, creates it if not\n", + " - Sets bucket properties like RAM quota (1024MB) and replication (disabled)\n", + " - Note: If you are using Capella, create a bucket manually called query-vector-search-testing (or any name you prefer) with the same properties.\n", + "\n", + "2. Scope Management: \n", + " - Verifies if requested scope exists within bucket\n", + " - Creates new scope if needed (unless it's the default \"_default\" scope)\n", + "\n", + "3. Collection Setup:\n", + " - Checks for collection existence within scope\n", + " - Creates collection if it doesn't exist\n", + " - Waits 2 seconds for collection to be ready\n", + "\n", + "Additional Tasks:\n", + "- Clears any existing documents for clean state\n", + "- Implements comprehensive error handling and logging\n", + "\n", + "The function is called twice to set up:\n", + "1. Main collection for vector embeddings\n", + "2. Cache collection for storing results\n" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-17 15:41:01,398 - INFO - Bucket 'query-vector-search-testing' exists.\n", + "2025-09-17 15:41:01,410 - INFO - Collection 'deepseek' does not exist. Creating it...\n", + "2025-09-17 15:41:01,453 - INFO - Collection 'deepseek' created successfully.\n", + "2025-09-17 15:41:03,712 - INFO - All documents cleared from the collection.\n", + "2025-09-17 15:41:03,713 - INFO - Bucket 'query-vector-search-testing' exists.\n", + "2025-09-17 15:41:03,728 - INFO - Collection 'cache' already exists. Skipping creation.\n", + "2025-09-17 15:41:05,821 - INFO - All documents cleared from the collection.\n" + ] + }, + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def setup_collection(cluster, bucket_name, scope_name, collection_name):\n", + " try:\n", + " # Check if bucket exists, create if it doesn't\n", + " try:\n", + " bucket = cluster.bucket(bucket_name)\n", + " logging.info(f\"Bucket '{bucket_name}' exists.\")\n", + " except Exception as e:\n", + " logging.info(f\"Bucket '{bucket_name}' does not exist. 
Creating it...\")\n", + " bucket_settings = CreateBucketSettings(\n", + " name=bucket_name,\n", + " bucket_type='couchbase',\n", + " ram_quota_mb=1024,\n", + " flush_enabled=True,\n", + " num_replicas=0\n", + " )\n", + " cluster.buckets().create_bucket(bucket_settings)\n", + " time.sleep(2) # Wait for bucket creation to complete and become available\n", + " bucket = cluster.bucket(bucket_name)\n", + " logging.info(f\"Bucket '{bucket_name}' created successfully.\")\n", + "\n", + " bucket_manager = bucket.collections()\n", + "\n", + " # Check if scope exists, create if it doesn't\n", + " scopes = bucket_manager.get_all_scopes()\n", + " scope_exists = any(scope.name == scope_name for scope in scopes)\n", + " \n", + " if not scope_exists and scope_name != \"_default\":\n", + " logging.info(f\"Scope '{scope_name}' does not exist. Creating it...\")\n", + " bucket_manager.create_scope(scope_name)\n", + " logging.info(f\"Scope '{scope_name}' created successfully.\")\n", + "\n", + " # Check if collection exists, create if it doesn't\n", + " collections = bucket_manager.get_all_scopes()\n", + " collection_exists = any(\n", + " scope.name == scope_name and collection_name in [col.name for col in scope.collections]\n", + " for scope in collections\n", + " )\n", + "\n", + " if not collection_exists:\n", + " logging.info(f\"Collection '{collection_name}' does not exist. Creating it...\")\n", + " bucket_manager.create_collection(scope_name, collection_name)\n", + " logging.info(f\"Collection '{collection_name}' created successfully.\")\n", + " else:\n", + " logging.info(f\"Collection '{collection_name}' already exists. Skipping creation.\")\n", + "\n", + " # Wait for collection to be ready\n", + " collection = bucket.scope(scope_name).collection(collection_name)\n", + " time.sleep(2) # Give the collection time to be ready for queries\n", + "\n", + " # Clear all documents in the collection\n", + " try:\n", + " query = f\"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`\"\n", + " cluster.query(query).execute()\n", + " logging.info(\"All documents cleared from the collection.\")\n", + " except Exception as e:\n", + " logging.warning(f\"Error while clearing documents: {str(e)}. The collection might be empty.\")\n", + "\n", + " return collection\n", + " except Exception as e:\n", + " raise RuntimeError(f\"Error setting up collection: {str(e)}\")\n", + " \n", + "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME)\n", + "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, CACHE_COLLECTION)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Creating the Embeddings client\n", + "This section creates an OpenAI embeddings client using the OpenAI API key.\n", + "The embeddings client is configured to use the \"text-embedding-3-small\" model,\n", + "which converts text into numerical vector representations.\n", + "These vector embeddings are essential for semantic search and similarity matching.\n", + "The client will be used by the vector store to generate embeddings for documents." 
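+ , + "\n", + "Once the client is created in the cell below, you can optionally sanity-check it before moving on. A minimal sketch (the sample text is arbitrary):\n", + "\n", + "```python\n", + "sample_vector = embeddings.embed_query(\"semantic search with Couchbase\")\n", + "print(len(sample_vector))  # text-embedding-3-small returns 1536-dimensional vectors\n", + "```\n"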
+ ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-17 15:41:27,149 - INFO - Successfully created OpenAI embeddings client\n" + ] + } + ], + "source": [ + "try:\n", + " embeddings = OpenAIEmbeddings(\n", + " api_key=OPENAI_API_KEY,\n", + " model=\"text-embedding-3-small\"\n", + " )\n", + " logging.info(\"Successfully created OpenAI embeddings client\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Error creating OpenAI embeddings client: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setting Up the Couchbase Vector Store\n", + "A vector store is where we'll keep our embeddings. Unlike a traditional Search (FTS) index, which is used for text-based search, the vector store is specifically designed to handle embeddings and perform similarity searches. When a user inputs a query, the search engine converts the query into an embedding and compares it against the embeddings stored in the vector store. This allows the engine to find documents that are semantically similar to the query, even if they don't contain the exact same words. By setting up the vector store in Couchbase, we create a powerful tool that enables our search engine to understand and retrieve information based on the meaning and context of the query, rather than just the specific words used." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-17 15:41:55,394 - INFO - Successfully created vector store\n" + ] + } + ], + "source": [ + "try:\n", + " vector_store = CouchbaseQueryVectorStore(\n", + " cluster=cluster,\n", + " bucket_name=CB_BUCKET_NAME,\n", + " scope_name=SCOPE_NAME,\n", + " collection_name=COLLECTION_NAME,\n", + " embedding=embeddings,\n", + " distance_metric=DistanceStrategy.COSINE\n", + " )\n", + " logging.info(\"Successfully created vector store\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Failed to create vector store: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load the BBC News Dataset\n", + "To build a search engine, we need data to search through. We use the BBC News dataset from RealTimeData, which provides real-world news articles. This dataset contains news articles from BBC covering various topics and time periods. Loading the dataset is a crucial step because it provides the raw material that our search engine will work with. The quality and diversity of the news articles make it an excellent choice for testing and refining our search engine, ensuring it can handle real-world news content effectively.\n", + "\n", + "The BBC News dataset allows us to work with authentic news articles, enabling us to build and test a search engine that can effectively process and retrieve relevant news content. The dataset is loaded using the Hugging Face datasets library, specifically accessing the \"RealTimeData/bbc_news_alltime\" dataset with the \"2024-12\" version.",
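+ "\n", + "Once the dataset loads in the cell below, you can peek at a single record to see the fields available. A quick sketch (this tutorial relies only on the \"content\" field):\n", + "\n", + "```python\n", + "sample = news_dataset[0]\n", + "print(sample[\"content\"][:200])  # first 200 characters of one article\n", + "```\n"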
+ ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-17 15:42:04,530 - INFO - Successfully loaded the BBC News dataset with 2687 rows.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loaded the BBC News dataset with 2687 rows\n" + ] + } + ], + "source": [ + "try:\n", + " news_dataset = load_dataset(\n", + " \"RealTimeData/bbc_news_alltime\", \"2024-12\", split=\"train\"\n", + " )\n", + " print(f\"Loaded the BBC News dataset with {len(news_dataset)} rows\")\n", + " logging.info(f\"Successfully loaded the BBC News dataset with {len(news_dataset)} rows.\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Error loading the BBC News dataset: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Cleaning up the Data\n", + "We will use the content of the news articles for our RAG system.\n", + "\n", + "The dataset contains a few duplicate records. We are removing them to avoid duplicate results in the retrieval stage of our RAG system." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "We have 1749 unique articles in our database.\n" + ] + } + ], + "source": [ + "news_articles = news_dataset[\"content\"]\n", + "unique_articles = set()\n", + "for article in news_articles:\n", + " if article:\n", + " unique_articles.add(article)\n", + "unique_news_articles = list(unique_articles)\n", + "print(f\"We have {len(unique_news_articles)} unique articles in our database.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Saving Data to the Vector Store\n", + "To efficiently handle the large number of articles, we process them in batches of 50 articles at a time. This batch processing approach helps manage memory usage and provides better control over the ingestion process.\n", + "\n", + "We first filter out any articles that exceed 50,000 characters to avoid potential issues with token limits. Then, using the vector store's add_texts method, we add the filtered articles to our vector database. The batch_size parameter controls how many articles are processed in each iteration.\n", + "\n", + "This approach offers several benefits:\n", + "1. Memory Efficiency: Processing in smaller batches prevents memory overload\n", + "2. Progress Tracking: Easier to monitor and track the ingestion progress\n", + "3. 
Resource Management: Better control over CPU and network resource utilization\n", + "\n", + "We use a conservative batch size of 50 to ensure reliable operation.\n", + "The optimal batch size depends on many factors including:\n", + "- Document sizes being inserted\n", + "- Available system resources\n", + "- Network conditions\n", + "- Concurrent workload\n", + "\n", + "Consider measuring performance with your specific workload before adjusting.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-17 16:08:51,054 - INFO - Document ingestion completed successfully.\n" + ] + } + ], + "source": [ + "batch_size = 50\n", + "\n", + "# Automatic Batch Processing\n", + "articles = [article for article in unique_news_articles if article and len(article) <= 50000]\n", + "\n", + "try:\n", + " vector_store.add_texts(\n", + " texts=articles,\n", + " batch_size=batch_size\n", + " )\n", + " logging.info(\"Document ingestion completed successfully.\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Failed to save documents to vector store: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setting Up the LLM Model\n", + "In this section, we set up the Large Language Model (LLM) for our RAG system. We're using the Deepseek model, which can be accessed through two different methods:\n", + "\n", + "1. **Deepseek API Key**: This is obtained directly from Deepseek's platform (https://deepseek.ai) by creating an account and subscribing to their API services. With this key, you can access Deepseek's models directly using the `ChatDeepSeek` class from the `langchain_deepseek` package.\n", + "\n", + "2. **OpenRouter API Key**: OpenRouter (https://openrouter.ai) is a service that provides unified access to multiple LLM providers, including Deepseek. You can obtain an API key by creating an account on OpenRouter's website. This approach uses the `ChatOpenAI` class from `langchain_openai` but with a custom base URL pointing to OpenRouter's API endpoint.\n", + "\n", + "The key difference is that OpenRouter acts as an intermediary service that can route your requests to various LLM providers, while the Deepseek API gives you direct access to only Deepseek's models. OpenRouter can be useful if you want to switch between different LLM providers without changing your code significantly.\n", + "\n", + "In our implementation, we check for both keys and prioritize using the Deepseek API directly if available, falling back to OpenRouter if not. 
The model is configured with temperature=0 to ensure deterministic, focused responses suitable for RAG applications.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-18 11:18:25,192 - INFO - Successfully created Deepseek LLM client through OpenRouter\n" + ] + } + ], + "source": [ + "from langchain_deepseek import ChatDeepSeek\n", + "from langchain_openai import ChatOpenAI\n", + "\n", + "if DEEPSEEK_API_KEY:\n", + " try:\n", + " llm = ChatDeepSeek(\n", + " api_key=DEEPSEEK_API_KEY,\n", + " model_name=\"deepseek-chat\",\n", + " temperature=0\n", + " )\n", + " logging.info(\"Successfully created Deepseek LLM client\")\n", + " except Exception as e:\n", + " raise ValueError(f\"Error creating Deepseek LLM client: {str(e)}\")\n", + "elif OPENROUTER_API_KEY:\n", + " try:\n", + " llm = ChatOpenAI(\n", + " api_key=OPENROUTER_API_KEY,\n", + " base_url=\"https://openrouter.ai/api/v1\",\n", + " model=\"deepseek/deepseek-chat-v3.1\", \n", + " temperature=0,\n", + " )\n", + " logging.info(\"Successfully created Deepseek LLM client through OpenRouter\")\n", + " except Exception as e:\n", + " raise ValueError(f\"Error creating Deepseek LLM client: {str(e)}\")\n", + "else:\n", + " raise ValueError(\"Either Deepseek API Key or OpenRouter API Key must be provided\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Perform Semantic Search\n", + "Semantic search in Couchbase involves converting queries and documents into vector representations using an embeddings model. These vectors capture the semantic meaning of the text and are stored directly in Couchbase. When a query is made, Couchbase performs a similarity search by comparing the query vector against the stored document vectors. The similarity metric used for this comparison is configurable, allowing flexibility in how the relevance of documents is determined. Common metrics include cosine similarity, Euclidean distance, or dot product, but other metrics can be implemented based on specific use cases. Different embedding models like BERT, Word2Vec, or GloVe can also be used depending on the application's needs, with the vectors generated by these models stored and searched within Couchbase itself.\n", + "\n", + "In the provided code, the search process begins by recording the start time, followed by executing the `similarity_search_with_score` method of the `CouchbaseQueryVectorStore`. This method searches Couchbase for the most relevant documents based on the vector similarity to the query. The search results include the document content and the distance that reflects how closely each document aligns with the query in the defined semantic space. The time taken to perform this search is then calculated and logged, and the results are displayed, showing the most relevant documents along with their similarity scores. This approach leverages Couchbase as both a storage and retrieval engine for vector data, enabling efficient and scalable semantic searches. The integration of vector storage and search capabilities within Couchbase allows for sophisticated semantic search operations without relying on external services for vector storage or comparison." 
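+ , + "\n", + "If you only need the documents and not their distances, the standard LangChain `similarity_search` method returns just the `Document` objects. A minimal sketch (the query string here is illustrative):\n", + "\n", + "```python\n", + "docs = vector_store.similarity_search(\"Luke Littler PDC World Championship\", k=3)\n", + "for doc in docs:\n", + "    print(doc.page_content[:80])\n", + "```\n"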
+ ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-17 16:11:07,177 - INFO - Semantic search completed in 2.46 seconds\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Semantic Search Results (completed in 2.46 seconds):\n", + "--------------------------------------------------------------------------------\n", + "Distance: 0.3693, Text: The Littler effect - how darts hit the bullseye\n", + "\n", + "Teenager Luke Littler began his bid to win the 2025 PDC World Darts Championship with a second-round win against Ryan Meikle. Here we assess Littler's impact after a remarkable rise which saw him named BBC Young Sports Personality of the Year and runner-up in the main award to athlete Keely Hodgkinson.\n", + "\n", + "One year ago, he was barely a household name in his own home. Now he is a sporting phenomenon. After emerging from obscurity aged 16 to reach the World Championship final, the life of Luke Littler and the sport he loves has been transformed. Viewing figures, ticket sales and social media interest have rocketed. Darts has hit the bullseye. This Christmas more than 100,000 children are expected to be opening Littler-branded magnetic dartboards as presents. His impact has helped double the number of junior academies, prompted plans to expand the World Championship and generated interest in darts from Saudi Arabian backers.\n", + "\n", + "Just months after taking his GCSE exams and ranked 164th in the world, Littler beat former champions Raymond van Barneveld and Rob Cross en route to the PDC World Championship final in January, before his run ended with a 7-4 loss to Luke Humphries. With his nickname 'The Nuke' on his purple and yellow shirt and the Alexandra Palace crowd belting out his walk-on song, Pitbull's tune Greenlight, he became an instant hit. Electric on the stage, calm off it. The down-to-earth teenager celebrated with a kebab and computer games. \"We've been watching his progress since he was about seven. He was on our radar, but we never anticipated what would happen. The next thing we know 'Littlermania' is spreading everywhere,\" PDC president Barry Hearn told BBC Sport. A peak TV audience of 3.7 million people watched the final - easily Sky's biggest figure for a non-football sporting event. The teenager from Warrington in Cheshire was too young to legally drive or drink alcohol, but earned \u00a3200,000 for finishing second - part of \u00a31m prize money in his first year as a professional - and an invitation to the elite Premier League competition. He turned 17 later in January but was he too young for the demanding event over 17 Thursday nights in 17 locations? He ended up winning the whole thing, and hit a nine-dart finish against Humphries in the final. From Bahrain to Wolverhampton, Littler claimed 10 titles in 2024 and is now eyeing the World Championship.\n", + "\n", + "As he progressed at the Ally Pally, the Manchester United fan was sent a good luck message by the club's former midfielder and ex-England captain David Beckham. In 12 months, Littler's Instagram followers have risen from 4,000 to 1.3m. Commercial backers include a clothing range, cereal firm and train company and he will appear in a reboot of the TV darts show Bullseye. Google say he was the most searched-for athlete online in the UK during 2024. 
On the back of his success, Littler darts, boards, cabinets, shirts are being snapped up in big numbers. \"This Christmas the junior magnetic dartboard is selling out, we're talking over 100,000. They're 20 quid and a great introduction for young children,\" said Garry Plummer, the boss of sponsors Target Darts, who first signed a deal with Littler's family when he was aged 12. \"All the toy shops want it, they all want him - 17, clean, doesn't drink, wonderful.\"\n", + "\n", + "Littler beat Luke Humphries to win the Premier League title in May\n", + "\n", + "The number of academies for children under the age of 16 has doubled in the last year, says Junior Darts Corporation chairman Steve Brown. There are 115 dedicated groups offering youngsters equipment, tournaments and a place to develop, with bases including Australia, Bulgaria, Greece, Norway, USA and Mongolia. \"We've seen so many inquiries from around the world, it's been such a boom. It took us 14 years to get 1,600 members and within 12 months we have over 3,000, and waiting lists,\" said Brown. \"When I played darts as a child, I was quite embarrassed to tell my friends what my hobby was. All these kids playing darts now are pretty popular at school. It's a bit rock 'n roll and recognised as a cool thing to do.\" Plans are being hatched to extend the World Championship by four days and increase the number of players from 96 to 128. That will boost the number of tickets available by 25,000 to 115,000 but Hearn reckons he could sell three times as many. He says Saudi Arabia wants to host a tournament, which is likely to happen if no-alcohol regulations are relaxed. \"They will change their rules in the next 12 months probably for certain areas having alcohol, and we'll take darts there and have a party in Saudi,\" he said. \"When I got involved in darts, the total prize money was something like \u00a3300,000 for the year. This year it will go to \u00a320m. I expect in five years' time, we'll be playing for \u00a340m.\"\n", + "\n", + "Former electrician Cross charged to the 2018 world title in his first full season, while Adrian Lewis and Michael van Gerwen were multiple victors in their 20s and 16-time champion Phil \u2018The Power\u2019 Taylor is widely considered the greatest of all time. Littler is currently fourth in the world rankings, although that is based on a two-year Order of Merit. There have been suggestions from others the spotlight on the teenager means world number one Humphries, 29, has been denied the coverage he deserves, but no darts player has made a mark at such a young age as Littler. \"Luke Humphries is another fabulous player who is going to be around for years. Sport is a very brutal world. It is about winning and claiming the high ground. There will be envy around,\" Hearn said. \"Luke Littler is the next Tiger Woods for darts so they better get used to it, and the only way to compete is to get better.\" World number 38 Martin Lukeman was awestruck as he described facing a peak Littler after being crushed 16-3 in the Grand Slam final, with the teenager winning 15 consecutive legs. \"I can't compete with that, it was like Godly. He was relentless, he is so good it's ridiculous,\" he said. Lukeman can still see the benefits he brings, adding: \"What he's done for the sport is brilliant. If it wasn't for him, our wages wouldn't be going up. There's more sponsors, more money coming in, all good.\" Hearn feels future competition may come from players even younger than Littler. 
\"I watched a 10-year-old a few months ago who averaged 104.89 and checked out a 4-3 win with a 136 finish. They smell the money, the fame and put the hard work in,\" he said. How much better Littler can get is guesswork, although Plummer believes he wants to reach new heights. \"He never says 'how good was I?' But I think he wants to break records and beat Phil Taylor's 16 World Championships and 16 World Matchplay titles,\" he said. \"He's young enough to do it.\" A version of this article was originally published on 29 November.\n", + "\u2022 None Know a lot about Littler? Take our quiz\n", + "--------------------------------------------------------------------------------\n", + "Distance: 0.3900, Text: Luke Littler has risen from 164th to fourth in the rankings in a year\n", + "\n", + "A tearful Luke Littler hit a tournament record 140.91 set average as he started his bid for the PDC World Championship title with a dramatic 3-1 win over Ryan Meikle. The 17-year-old made headlines around the world when he reached the tournament final in January, where he lost to Luke Humphries. Starting this campaign on Saturday, Littler was millimetres away from a nine-darter when he missed double 12 as he blew Meikle away in the fourth and final set of the second-round match. Littler was overcome with emotion at the end, cutting short his on-stage interview. \"It was probably the toughest game I've ever played. I had to fight until the end,\" he said later in a news conference. \"As soon as the question came on stage and then boom, the tears came. It was just a bit too much to speak on stage. \"It is the worst game I have played. I have never felt anything like that tonight.\" Admitting to nerves during the match, he told Sky Sports: \"Yes, probably the biggest time it's hit me. Coming into it I was fine, but as soon as [referee] George Noble said 'game on', I couldn't throw them.\" Littler started slowly against Meikle, who had two darts for the opening set, but he took the lead by twice hitting double 20. Meikle did not look overawed against his fellow Englishman and levelled, but Littler won the third set and exploded into life in the fourth. The tournament favourite hit four maximum 180s as he clinched three straight legs in 11, 10 and 11 darts for a record set average, and 100.85 overall. Meanwhile, two seeds crashed out on Saturday night \u2013 five-time world champion Raymond van Barneveld lost to Welshman Nick Kenny, while England's Ryan Joyce beat Danny Noppert. Australian Damon Heta was another to narrowly miss out on a nine-darter, just failing on double 12 when throwing for the match in a 3-1 win over Connor Scutt. Ninth seed Heta hit four 100-plus checkouts to come from a set down against Scutt in a match in which both men averaged more than 97.\n", + "\n", + "Littler was hugged by his parents after victory over Meikle\n", + "\n", + "... 
(output truncated for brevity)\n" + ] + } + ], + "source": [ + "query = \"What were Luke Littler's key achievements and records in his recent PDC World Championship match?\"\n", + "\n", + "try:\n", + " # Perform the semantic search\n", + " start_time = time.time()\n", + " search_results = vector_store.similarity_search_with_score(query, k=10)\n", + " search_elapsed_time = time.time() - start_time\n", + "\n", + " logging.info(f\"Semantic search completed in {search_elapsed_time:.2f} seconds\")\n", + "\n", + " # Display search results\n", + " print(f\"\\nSemantic Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", + " print(\"-\" * 80)\n", + "\n", + " for doc, score in search_results:\n", + " print(f\"Distance: {score:.4f}, Text: {doc.page_content}\")\n", + " print(\"-\" * 80)\n", + "\n", + "except CouchbaseException as e:\n", + " raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n", + "except Exception as e:\n", + " raise RuntimeError(f\"Unexpected error: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Optimizing Vector Search with Hyperscale and Composite Vector Indexes\n", + "\n", + "While the above semantic search using similarity_search_with_score works effectively, we can significantly improve query performance by leveraging Hyperscale and Composite Vector Indexes in Couchbase.\n", + "\n", + "Couchbase offers three types of vector indexes, but for query-based vector search we focus on two main types:\n", + "\n", + "Hyperscale Vector Indexes (BHIVE)\n", + "- Best for pure vector searches - content discovery, recommendations, semantic search\n", + "- High performance with low memory footprint - designed to scale to billions of vectors\n", + "- Optimized for concurrent operations - supports simultaneous searches and inserts\n", + "- Use when: You primarily perform vector-only queries without complex scalar filtering\n", + "- Ideal for: Large-scale semantic search, recommendation systems, content discovery\n", + "\n", + "Composite Vector Indexes \n", + "- Best for filtered vector searches - combines vector search with scalar value filtering\n", + "- Efficient pre-filtering - scalar attributes reduce the vector comparison scope\n", + "- Use when: Your queries combine vector similarity with scalar filters that eliminate large portions of data\n", + "- Ideal for: Compliance-based filtering, user-specific searches, time-bounded queries\n", + "\n", + "Choosing the Right Index Type\n", + "- Start with Hyperscale Vector Index for pure vector searches and large datasets\n", + "- Use Composite Vector Index when scalar filters significantly reduce your search space\n", + "- Consider your dataset size: Hyperscale scales to billions, Composite works well for tens of millions to billions\n", + "\n", + "For more information on query-based vector indexes, see the [Couchbase Vector Index documentation](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html).\n", + "\n", + "\n", + "## Understanding Index Configuration (Couchbase 8.0 Feature)\n", + "\n", + "The index_description parameter controls how Couchbase optimizes vector storage and search performance through centroids and quantization:\n", + "\n", + "Format: `'IVF[<centroids>],{SQ<bits>|PQ<subquantizers>x<bits>}'`\n", + "\n", + "Centroids (IVF - Inverted File):\n", + "- Controls how the dataset is subdivided for faster searches\n", + "- More centroids = faster search, slower training \n", + "- Fewer centroids = slower search, faster training\n", + "- If omitted (like IVF,SQ8), Couchbase auto-selects 
based on dataset size\n", + "\n", + "Quantization Options:\n", + "- SQ (Scalar Quantization): SQ4, SQ6, SQ8 (4, 6, or 8 bits per dimension)\n", + "- PQ (Product Quantization): PQ<subquantizers>x<bits> (e.g., PQ32x8)\n", + "- Higher values = better accuracy, larger index size\n", + "\n", + "Common Examples:\n", + "- IVF,SQ8 - Auto centroids, 8-bit scalar quantization (good default)\n", + "- IVF1000,SQ6 - 1000 centroids, 6-bit scalar quantization \n", + "- IVF,PQ32x8 - Auto centroids, 32 subquantizers with 8 bits\n", + "\n", + "For detailed configuration options, see the [Quantization & Centroid Settings](https://docs.couchbase.com/cloud/vector-index/hyperscale-vector-index.html#algo_settings).\n", + "\n", + "In the code below, we demonstrate creating a BHIVE index with the vector store's create_index method, which takes an index type (BHIVE or COMPOSITE) and a description parameter for optimization settings. Alternatively, Hyperscale and Composite Vector indexes can be created manually from the Couchbase UI." + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [], + "source": [ + "vector_store.create_index(index_type=IndexType.BHIVE, index_name=\"openrouterdeepseek_bhive_index\", index_description=\"IVF,SQ8\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The example below shows running the same similarity search, but now using the BHIVE index we created above. You'll notice improved performance as the index efficiently retrieves data.\n", + "\n", + "**Important**: When using Composite indexes, scalar filters take precedence over vector similarity, which can improve performance for filtered searches but may miss some semantically relevant results that don't match the scalar criteria.\n", + "\n", + "Note: In query-based vector search, the score represents the vector distance between the query and document embeddings. A lower distance indicates higher similarity; a higher distance indicates lower similarity." + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-18 11:17:19,626 - INFO - Semantic search completed in 0.88 seconds\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Semantic Search Results (completed in 0.88 seconds):\n", + "--------------------------------------------------------------------------------\n", + "Distance: 0.3694, Text: The Littler effect - how darts hit the bullseye\n", + "\n", + "Teenager Luke Littler began his bid to win the 2025 PDC World Darts Championship with a second-round win against Ryan Meikle. Here we assess Littler's impact after a remarkable rise which saw him named BBC Young Sports Personality of the Year and runner-up in the main award to athlete Keely Hodgkinson.\n", + "\n", + "One year ago, he was barely a household name in his own home. Now he is a sporting phenomenon. After emerging from obscurity aged 16 to reach the World Championship final, the life of Luke Littler and the sport he loves has been transformed. Viewing figures, ticket sales and social media interest have rocketed. Darts has hit the bullseye. This Christmas more than 100,000 children are expected to be opening Littler-branded magnetic dartboards as presents. 
His impact has helped double the number of junior academies, prompted plans to expand the World Championship and generated interest in darts from Saudi Arabian backers.\n", + "\n", + "Just months after taking his GCSE exams and ranked 164th in the world, Littler beat former champions Raymond van Barneveld and Rob Cross en route to the PDC World Championship final in January, before his run ended with a 7-4 loss to Luke Humphries. With his nickname 'The Nuke' on his purple and yellow shirt and the Alexandra Palace crowd belting out his walk-on song, Pitbull's tune Greenlight, he became an instant hit. Electric on the stage, calm off it. The down-to-earth teenager celebrated with a kebab and computer games. \"We've been watching his progress since he was about seven. He was on our radar, but we never anticipated what would happen. The next thing we know 'Littlermania' is spreading everywhere,\" PDC president Barry Hearn told BBC Sport. A peak TV audience of 3.7 million people watched the final - easily Sky's biggest figure for a non-football sporting event. The teenager from Warrington in Cheshire was too young to legally drive or drink alcohol, but earned \u00a3200,000 for finishing second - part of \u00a31m prize money in his first year as a professional - and an invitation to the elite Premier League competition. He turned 17 later in January but was he too young for the demanding event over 17 Thursday nights in 17 locations? He ended up winning the whole thing, and hit a nine-dart finish against Humphries in the final. From Bahrain to Wolverhampton, Littler claimed 10 titles in 2024 and is now eyeing the World Championship.\n", + "\n", + "As he progressed at the Ally Pally, the Manchester United fan was sent a good luck message by the club's former midfielder and ex-England captain David Beckham. In 12 months, Littler's Instagram followers have risen from 4,000 to 1.3m. Commercial backers include a clothing range, cereal firm and train company and he will appear in a reboot of the TV darts show Bullseye. Google say he was the most searched-for athlete online in the UK during 2024. On the back of his success, Littler darts, boards, cabinets, shirts are being snapped up in big numbers. \"This Christmas the junior magnetic dartboard is selling out, we're talking over 100,000. They're 20 quid and a great introduction for young children,\" said Garry Plummer, the boss of sponsors Target Darts, who first signed a deal with Littler's family when he was aged 12. \"All the toy shops want it, they all want him - 17, clean, doesn't drink, wonderful.\"\n", + "\n", + "Littler beat Luke Humphries to win the Premier League title in May\n", + "\n", + "The number of academies for children under the age of 16 has doubled in the last year, says Junior Darts Corporation chairman Steve Brown. There are 115 dedicated groups offering youngsters equipment, tournaments and a place to develop, with bases including Australia, Bulgaria, Greece, Norway, USA and Mongolia. \"We've seen so many inquiries from around the world, it's been such a boom. It took us 14 years to get 1,600 members and within 12 months we have over 3,000, and waiting lists,\" said Brown. \"When I played darts as a child, I was quite embarrassed to tell my friends what my hobby was. All these kids playing darts now are pretty popular at school. It's a bit rock 'n roll and recognised as a cool thing to do.\" Plans are being hatched to extend the World Championship by four days and increase the number of players from 96 to 128. 
That will boost the number of tickets available by 25,000 to 115,000 but Hearn reckons he could sell three times as many. He says Saudi Arabia wants to host a tournament, which is likely to happen if no-alcohol regulations are relaxed. \"They will change their rules in the next 12 months probably for certain areas having alcohol, and we'll take darts there and have a party in Saudi,\" he said. \"When I got involved in darts, the total prize money was something like \u00a3300,000 for the year. This year it will go to \u00a320m. I expect in five years' time, we'll be playing for \u00a340m.\"\n", + "\n", + "Former electrician Cross charged to the 2018 world title in his first full season, while Adrian Lewis and Michael van Gerwen were multiple victors in their 20s and 16-time champion Phil \u2018The Power\u2019 Taylor is widely considered the greatest of all time. Littler is currently fourth in the world rankings, although that is based on a two-year Order of Merit. There have been suggestions from others the spotlight on the teenager means world number one Humphries, 29, has been denied the coverage he deserves, but no darts player has made a mark at such a young age as Littler. \"Luke Humphries is another fabulous player who is going to be around for years. Sport is a very brutal world. It is about winning and claiming the high ground. There will be envy around,\" Hearn said. \"Luke Littler is the next Tiger Woods for darts so they better get used to it, and the only way to compete is to get better.\" World number 38 Martin Lukeman was awestruck as he described facing a peak Littler after being crushed 16-3 in the Grand Slam final, with the teenager winning 15 consecutive legs. \"I can't compete with that, it was like Godly. He was relentless, he is so good it's ridiculous,\" he said. Lukeman can still see the benefits he brings, adding: \"What he's done for the sport is brilliant. If it wasn't for him, our wages wouldn't be going up. There's more sponsors, more money coming in, all good.\" Hearn feels future competition may come from players even younger than Littler. \"I watched a 10-year-old a few months ago who averaged 104.89 and checked out a 4-3 win with a 136 finish. They smell the money, the fame and put the hard work in,\" he said. How much better Littler can get is guesswork, although Plummer believes he wants to reach new heights. \"He never says 'how good was I?' But I think he wants to break records and beat Phil Taylor's 16 World Championships and 16 World Matchplay titles,\" he said. \"He's young enough to do it.\" A version of this article was originally published on 29 November.\n", + "\u2022 None Know a lot about Littler? Take our quiz\n", + "--------------------------------------------------------------------------------\n", + "Distance: 0.3901, Text: Luke Littler has risen from 164th to fourth in the rankings in a year\n", + "\n", + "A tearful Luke Littler hit a tournament record 140.91 set average as he started his bid for the PDC World Championship title with a dramatic 3-1 win over Ryan Meikle. The 17-year-old made headlines around the world when he reached the tournament final in January, where he lost to Luke Humphries. Starting this campaign on Saturday, Littler was millimetres away from a nine-darter when he missed double 12 as he blew Meikle away in the fourth and final set of the second-round match. Littler was overcome with emotion at the end, cutting short his on-stage interview. \"It was probably the toughest game I've ever played. 
I had to fight until the end,\" he said later in a news conference. \"As soon as the question came on stage and then boom, the tears came. It was just a bit too much to speak on stage. \"It is the worst game I have played. I have never felt anything like that tonight.\" Admitting to nerves during the match, he told Sky Sports: \"Yes, probably the biggest time it's hit me. Coming into it I was fine, but as soon as [referee] George Noble said 'game on', I couldn't throw them.\" Littler started slowly against Meikle, who had two darts for the opening set, but he took the lead by twice hitting double 20. Meikle did not look overawed against his fellow Englishman and levelled, but Littler won the third set and exploded into life in the fourth. The tournament favourite hit four maximum 180s as he clinched three straight legs in 11, 10 and 11 darts for a record set average, and 100.85 overall. Meanwhile, two seeds crashed out on Saturday night \u2013 five-time world champion Raymond van Barneveld lost to Welshman Nick Kenny, while England's Ryan Joyce beat Danny Noppert. Australian Damon Heta was another to narrowly miss out on a nine-darter, just failing on double 12 when throwing for the match in a 3-1 win over Connor Scutt. Ninth seed Heta hit four 100-plus checkouts to come from a set down against Scutt in a match in which both men averaged more than 97.\n", + "\n", + "Littler was hugged by his parents after victory over Meikle\n", + "\n", + "... (output truncated for brevity)\n" + ] + } + ], + "source": [ + "\n", + "query = \"What were Luke Littler's key achievements and records in his recent PDC World Championship match?\"\n", + "\n", + "try:\n", + " # Perform the semantic search\n", + " start_time = time.time()\n", + " search_results = vector_store.similarity_search_with_score(query, k=10)\n", + " search_elapsed_time = time.time() - start_time\n", + "\n", + " logging.info(f\"Semantic search completed in {search_elapsed_time:.2f} seconds\")\n", + "\n", + " # Display search results\n", + " print(f\"\\nSemantic Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", + " print(\"-\" * 80)\n", + "\n", + " for doc, score in search_results:\n", + " print(f\"Distance: {score:.4f}, Text: {doc.page_content}\")\n", + " print(\"-\" * 80)\n", + "\n", + "except CouchbaseException as e:\n", + " raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n", + "except Exception as e:\n", + " raise RuntimeError(f\"Unexpected error: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note: To create a COMPOSITE index, the below code can be used.\n", + "Choose based on your specific use case and query patterns. For this tutorial's news search scenario, either index type would work, but BHIVE might be more efficient for pure semantic search across news articles." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "vector_store.create_index(index_type=IndexType.COMPOSITE, index_name=\"openrouterdeepseek_composite_index\", index_description=\"IVF,SQ8\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setting Up a Couchbase Cache\n", + "To further optimize our system, we set up a Couchbase-based cache. A cache is a temporary storage layer that holds data that is frequently accessed, speeding up operations by reducing the need to repeatedly retrieve the same information from the database. 
In our setup, the cache will help us accelerate repetitive tasks, such as looking up similar documents. By implementing a cache, we enhance the overall performance of our search engine, ensuring that it can handle high query volumes and deliver results quickly.\n", + "\n", + "Caching is particularly valuable in scenarios where users may submit similar queries multiple times or where certain pieces of information are frequently requested. By storing these in a cache, we can significantly reduce the time it takes to respond to these queries, improving the user experience.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-17 16:10:11,473 - INFO - Successfully created cache\n" + ] + } + ], + "source": [ + "try:\n", + " cache = CouchbaseCache(\n", + " cluster=cluster,\n", + " bucket_name=CB_BUCKET_NAME,\n", + " scope_name=SCOPE_NAME,\n", + " collection_name=CACHE_COLLECTION,\n", + " )\n", + " logging.info(\"Successfully created cache\")\n", + " set_llm_cache(cache)\n", + "except Exception as e:\n", + " raise ValueError(f\"Failed to create cache: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Retrieval-Augmented Generation (RAG) with Couchbase and LangChain\n", + "Couchbase and LangChain can be seamlessly integrated to create RAG (Retrieval-Augmented Generation) chains, enhancing the process of generating contextually relevant responses. In this setup, Couchbase serves as the vector store, where embeddings of documents are stored. When a query is made, LangChain retrieves the most relevant documents from Couchbase by comparing the query\u2019s embedding with the stored document embeddings. These documents, which provide contextual information, are then passed to a generative language model within LangChain.\n", + "\n", + "The language model, equipped with the context from the retrieved documents, generates a response that is both informed and contextually accurate. This integration allows the RAG chain to leverage Couchbase\u2019s efficient storage and retrieval capabilities, while LangChain handles the generation of responses based on the context provided by the retrieved documents. Together, they create a powerful system that can deliver highly relevant and accurate answers by combining the strengths of both retrieval and generation." 
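+ , + "\n", + "The chain below wires the vector store in as a retriever via `vector_store.as_retriever()`. You can also configure the retriever explicitly, for example to control how many documents are fetched per query; a sketch (`k=4` and the query are illustrative):\n", + "\n", + "```python\n", + "retriever = vector_store.as_retriever(search_kwargs={\"k\": 4})\n", + "docs = retriever.invoke(\"Who won the Premier League darts title?\")\n", + "print(f\"Retrieved {len(docs)} documents\")\n", + "```\n"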
+ ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-09-18 11:18:34,032 - INFO - Successfully created RAG chain\n" + ] + } + ], + "source": [ + "# Create RAG prompt template\n", + "rag_prompt = ChatPromptTemplate.from_messages([\n", + " (\"system\", \"You are a helpful assistant that answers questions based on the provided context.\"),\n", + " (\"human\", \"Context: {context}\\n\\nQuestion: {question}\")\n", + "])\n", + "\n", + "# Create RAG chain\n", + "rag_chain = (\n", + " {\"context\": vector_store.as_retriever(), \"question\": RunnablePassthrough()}\n", + " | rag_prompt\n", + " | llm\n", + " | StrOutputParser()\n", + ")\n", + "logging.info(\"Successfully created RAG chain\")" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "RAG Response: Based on the provided context, Luke Littler's key achievements and records in his recent PDC World Championship match (second-round win against Ryan Meikle) were:\n", + "\n", + "* **Tournament Record Set Average:** He hit a tournament record 140.91 set average during the match.\n", + "* **Near Nine-Darter:** He was \"millimetres away from a nine-darter\" when he missed double 12.\n", + "* **Dominant Final Set:** He won the fourth and final set in just 32 darts (the minimum possible is 27), which included hitting four maximum 180s and clinching three straight legs in 11, 10, and 11 darts.\n", + "* **Overall High Average:** He maintained a high overall match average of 100.85.\n", + "RAG response generated in 0.49 seconds\n" + ] + } + ], + "source": [ + "try:\n", + " start_time = time.time()\n", + " rag_response = rag_chain.invoke(query)\n", + " rag_elapsed_time = time.time() - start_time\n", + "\n", + " print(f\"RAG Response: {rag_response}\")\n", + " print(f\"RAG response generated in {rag_elapsed_time:.2f} seconds\")\n", + "except InternalServerFailureException as e:\n", + " if \"query request rejected\" in str(e):\n", + " print(\"Error: Search request was rejected due to rate limiting. Please try again later.\")\n", + " else:\n", + " print(f\"Internal server error occurred: {str(e)}\")\n", + "except Exception as e:\n", + " print(f\"Unexpected error occurred: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using Couchbase as a caching mechanism\n", + "Couchbase can be effectively used as a caching mechanism for RAG (Retrieval-Augmented Generation) responses by storing and retrieving precomputed results for specific queries. This approach enhances the system's efficiency and speed, particularly when dealing with repeated or similar queries. When a query is first processed, the RAG chain retrieves relevant documents, generates a response using the language model, and then stores this response in Couchbase, with the query serving as the key.\n", + "\n", + "For subsequent requests with the same query, the system checks Couchbase first. If a cached response is found, it is retrieved directly from Couchbase, bypassing the need to re-run the entire RAG process. This significantly reduces response time because the computationally expensive steps of document retrieval and response generation are skipped. 
Couchbase's role in this setup is to provide a fast and scalable storage solution for caching these responses, ensuring that frequently asked queries can be answered more quickly and efficiently.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Query 1: What happened in the match between Fulham and Liverpool?\n", + "Response: In the match between Fulham and Liverpool, Liverpool played the majority of the game with 10 men after Andy Robertson received a red card in the 17th minute. Despite being a player down, Liverpool came from behind twice to secure a 2-2 draw. Diogo Jota scored an 86th-minute equalizer to earn Liverpool a point. The performance was praised for its resilience, with Fulham's Antonee Robinson noting that Liverpool \"didn't feel like they had 10 men at all.\" Liverpool maintained over 60% possession and led in attacking metrics such as shots and chances. Both managers acknowledged the strong efforts of their teams in what was described as an enthralling encounter.\n", + "Time taken: 4.65 seconds\n", + "\n", + "Query 2: What were Luke Littler's key achievements and records in his recent PDC World Championship match?\n", + "Response: Based on the provided context, Luke Littler's key achievements and records in his recent PDC World Championship match (second-round win against Ryan Meikle) were:\n", + "\n", + "* **Tournament Record Set Average:** He hit a tournament record 140.91 set average during the match.\n", + "* **Near Nine-Darter:** He was \"millimetres away from a nine-darter\" when he missed double 12.\n", + "* **Dominant Final Set:** He won the fourth and final set in just 32 darts (the minimum possible is 27), which included hitting four maximum 180s and clinching three straight legs in 11, 10, and 11 darts.\n", + "* **Overall High Average:** He maintained a high overall match average of 100.85.\n", + "Time taken: 0.45 seconds\n", + "\n", + "Query 3: What happened in the match between Fulham and Liverpool?\n", + "Response: In the match between Fulham and Liverpool, Liverpool played the majority of the game with 10 men after Andy Robertson received a red card in the 17th minute. Despite being a player down, Liverpool came from behind twice to secure a 2-2 draw. Diogo Jota scored an 86th-minute equalizer to earn Liverpool a point. The performance was praised for its resilience, with Fulham's Antonee Robinson noting that Liverpool \"didn't feel like they had 10 men at all.\" Liverpool maintained over 60% possession and led in attacking metrics such as shots and chances. 
Both managers acknowledged the strong efforts of their teams in what was described as an enthralling encounter.\n", + "Time taken: 1.15 seconds\n" + ] + } + ], + "source": [ + "try:\n", + " queries = [\n", + " \"What happened in the match between Fulham and Liverpool?\",\n", + " \"What were Luke Littler's key achievements and records in his recent PDC World Championship match?\", # Repeated query\n", + " \"What happened in the match between Fulham and Liverpool?\", # Repeated query\n", + " ]\n", + "\n", + " for i, query in enumerate(queries, 1):\n", + " print(f\"\\nQuery {i}: {query}\")\n", + " start_time = time.time()\n", + "\n", + " response = rag_chain.invoke(query)\n", + " elapsed_time = time.time() - start_time\n", + " print(f\"Response: {response}\")\n", + " print(f\"Time taken: {elapsed_time:.2f} seconds\")\n", + "\n", + "except InternalServerFailureException as e:\n", + " if \"query request rejected\" in str(e):\n", + " print(\"Error: Search request was rejected due to rate limiting. Please try again later.\")\n", + " else:\n", + " print(f\"Internal server error occurred: {str(e)}\")\n", + "except Exception as e:\n", + " print(f\"Unexpected error occurred: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusion\n", + "By following these steps, you'll have a fully functional semantic search engine that leverages the strengths of Couchbase and Deepseek (via OpenRouter). This guide is designed not just to show you how to build the system, but also to explain why each step is necessary, giving you a deeper understanding of the principles behind semantic search and how to implement it effectively. Whether you're a newcomer to software development or an experienced developer looking to expand your skills, this guide will provide you with the knowledge and tools you need to create a powerful, AI-driven search engine."
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} \ No newline at end of file diff --git a/openrouter-deepseek/gsi/frontmatter.md b/openrouter-deepseek/query_based/frontmatter.md similarity index 100% rename from openrouter-deepseek/gsi/frontmatter.md rename to openrouter-deepseek/query_based/frontmatter.md diff --git a/openrouter-deepseek/fts/.env.sample b/openrouter-deepseek/search_based/.env.sample similarity index 100% rename from openrouter-deepseek/fts/.env.sample rename to openrouter-deepseek/search_based/.env.sample diff --git a/openrouter-deepseek/fts/RAG_with_Couchbase_and_Openrouter_Deepseek.ipynb b/openrouter-deepseek/search_based/RAG_with_Couchbase_and_Openrouter_Deepseek.ipynb similarity index 89% rename from openrouter-deepseek/fts/RAG_with_Couchbase_and_Openrouter_Deepseek.ipynb rename to openrouter-deepseek/search_based/RAG_with_Couchbase_and_Openrouter_Deepseek.ipynb index 387bb2bd..8b45e6ba 100644 --- a/openrouter-deepseek/fts/RAG_with_Couchbase_and_Openrouter_Deepseek.ipynb +++ b/openrouter-deepseek/search_based/RAG_with_Couchbase_and_Openrouter_Deepseek.ipynb @@ -5,7 +5,7 @@ "metadata": {}, "source": [ "# Introduction \n", - "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database and [Deepseek V3 as the language model provider (via OpenRouter or direct API)](https://deepseek.ai/) and OpenAI for embeddings. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system using the FTS service from scratch. Alternatively if you want to perform semantic search using the GSI index, please take a look at [this.](https://developer.couchbase.com/tutorial-openrouter-deepseek-with-global-secondary-index/)" + "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database and [Deepseek V3 as the language model provider (via OpenRouter or direct API)](https://deepseek.ai/) and OpenAI for embeddings. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system using Couchbase Search Vector Index from scratch. For guidance on choosing the right vector index for your use case, see the [Couchbase documentation](https://docs.couchbase.com/server/current/vector-search/choose-the-right-vector-index.html). 
Alternatively if you want to perform semantic search using Couchbase Hyperscale or Composite Vector Indexes, please take a look at [this.](https://developer.couchbase.com/tutorial-openrouter-deepseek-with-hyperscale-or-composite-vector-index/)" ] }, { @@ -786,20 +786,20 @@ "\n", "One year ago, he was barely a household name in his own home. Now he is a sporting phenomenon. After emerging from obscurity aged 16 to reach the World Championship final, the life of Luke Littler and the sport he loves has been transformed. Viewing figures, ticket sales and social media interest have rocketed. Darts has hit the bullseye. This Christmas more than 100,000 children are expected to be opening Littler-branded magnetic dartboards as presents. His impact has helped double the number of junior academies, prompted plans to expand the World Championship and generated interest in darts from Saudi Arabian backers.\n", "\n", - "Just months after taking his GCSE exams and ranked 164th in the world, Littler beat former champions Raymond van Barneveld and Rob Cross en route to the PDC World Championship final in January, before his run ended with a 7-4 loss to Luke Humphries. With his nickname 'The Nuke' on his purple and yellow shirt and the Alexandra Palace crowd belting out his walk-on song, Pitbull's tune Greenlight, he became an instant hit. Electric on the stage, calm off it. The down-to-earth teenager celebrated with a kebab and computer games. \"We've been watching his progress since he was about seven. He was on our radar, but we never anticipated what would happen. The next thing we know 'Littlermania' is spreading everywhere,\" PDC president Barry Hearn told BBC Sport. A peak TV audience of 3.7 million people watched the final - easily Sky's biggest figure for a non-football sporting event. The teenager from Warrington in Cheshire was too young to legally drive or drink alcohol, but earned £200,000 for finishing second - part of £1m prize money in his first year as a professional - and an invitation to the elite Premier League competition. He turned 17 later in January but was he too young for the demanding event over 17 Thursday nights in 17 locations? He ended up winning the whole thing, and hit a nine-dart finish against Humphries in the final. From Bahrain to Wolverhampton, Littler claimed 10 titles in 2024 and is now eyeing the World Championship.\n", + "Just months after taking his GCSE exams and ranked 164th in the world, Littler beat former champions Raymond van Barneveld and Rob Cross en route to the PDC World Championship final in January, before his run ended with a 7-4 loss to Luke Humphries. With his nickname 'The Nuke' on his purple and yellow shirt and the Alexandra Palace crowd belting out his walk-on song, Pitbull's tune Greenlight, he became an instant hit. Electric on the stage, calm off it. The down-to-earth teenager celebrated with a kebab and computer games. \"We've been watching his progress since he was about seven. He was on our radar, but we never anticipated what would happen. The next thing we know 'Littlermania' is spreading everywhere,\" PDC president Barry Hearn told BBC Sport. A peak TV audience of 3.7 million people watched the final - easily Sky's biggest figure for a non-football sporting event. The teenager from Warrington in Cheshire was too young to legally drive or drink alcohol, but earned \u00a3200,000 for finishing second - part of \u00a31m prize money in his first year as a professional - and an invitation to the elite Premier League competition. 
He turned 17 later in January but was he too young for the demanding event over 17 Thursday nights in 17 locations? He ended up winning the whole thing, and hit a nine-dart finish against Humphries in the final. From Bahrain to Wolverhampton, Littler claimed 10 titles in 2024 and is now eyeing the World Championship.\n", "\n", "As he progressed at the Ally Pally, the Manchester United fan was sent a good luck message by the club's former midfielder and ex-England captain David Beckham. In 12 months, Littler's Instagram followers have risen from 4,000 to 1.3m. Commercial backers include a clothing range, cereal firm and train company and he will appear in a reboot of the TV darts show Bullseye. Google say he was the most searched-for athlete online in the UK during 2024. On the back of his success, Littler darts, boards, cabinets, shirts are being snapped up in big numbers. \"This Christmas the junior magnetic dartboard is selling out, we're talking over 100,000. They're 20 quid and a great introduction for young children,\" said Garry Plummer, the boss of sponsors Target Darts, who first signed a deal with Littler's family when he was aged 12. \"All the toy shops want it, they all want him - 17, clean, doesn't drink, wonderful.\"\n", "\n", "Littler beat Luke Humphries to win the Premier League title in May\n", "\n", - "The number of academies for children under the age of 16 has doubled in the last year, says Junior Darts Corporation chairman Steve Brown. There are 115 dedicated groups offering youngsters equipment, tournaments and a place to develop, with bases including Australia, Bulgaria, Greece, Norway, USA and Mongolia. \"We've seen so many inquiries from around the world, it's been such a boom. It took us 14 years to get 1,600 members and within 12 months we have over 3,000, and waiting lists,\" said Brown. \"When I played darts as a child, I was quite embarrassed to tell my friends what my hobby was. All these kids playing darts now are pretty popular at school. It's a bit rock 'n roll and recognised as a cool thing to do.\" Plans are being hatched to extend the World Championship by four days and increase the number of players from 96 to 128. That will boost the number of tickets available by 25,000 to 115,000 but Hearn reckons he could sell three times as many. He says Saudi Arabia wants to host a tournament, which is likely to happen if no-alcohol regulations are relaxed. \"They will change their rules in the next 12 months probably for certain areas having alcohol, and we'll take darts there and have a party in Saudi,\" he said. \"When I got involved in darts, the total prize money was something like £300,000 for the year. This year it will go to £20m. I expect in five years' time, we'll be playing for £40m.\"\n", + "The number of academies for children under the age of 16 has doubled in the last year, says Junior Darts Corporation chairman Steve Brown. There are 115 dedicated groups offering youngsters equipment, tournaments and a place to develop, with bases including Australia, Bulgaria, Greece, Norway, USA and Mongolia. \"We've seen so many inquiries from around the world, it's been such a boom. It took us 14 years to get 1,600 members and within 12 months we have over 3,000, and waiting lists,\" said Brown. \"When I played darts as a child, I was quite embarrassed to tell my friends what my hobby was. All these kids playing darts now are pretty popular at school. 
It's a bit rock 'n roll and recognised as a cool thing to do.\" Plans are being hatched to extend the World Championship by four days and increase the number of players from 96 to 128. That will boost the number of tickets available by 25,000 to 115,000 but Hearn reckons he could sell three times as many. He says Saudi Arabia wants to host a tournament, which is likely to happen if no-alcohol regulations are relaxed. \"They will change their rules in the next 12 months probably for certain areas having alcohol, and we'll take darts there and have a party in Saudi,\" he said. \"When I got involved in darts, the total prize money was something like \u00a3300,000 for the year. This year it will go to \u00a320m. I expect in five years' time, we'll be playing for \u00a340m.\"\n", "\n", - "Former electrician Cross charged to the 2018 world title in his first full season, while Adrian Lewis and Michael van Gerwen were multiple victors in their 20s and 16-time champion Phil ‘The Power’ Taylor is widely considered the greatest of all time. Littler is currently fourth in the world rankings, although that is based on a two-year Order of Merit. There have been suggestions from others the spotlight on the teenager means world number one Humphries, 29, has been denied the coverage he deserves, but no darts player has made a mark at such a young age as Littler. \"Luke Humphries is another fabulous player who is going to be around for years. Sport is a very brutal world. It is about winning and claiming the high ground. There will be envy around,\" Hearn said. \"Luke Littler is the next Tiger Woods for darts so they better get used to it, and the only way to compete is to get better.\" World number 38 Martin Lukeman was awestruck as he described facing a peak Littler after being crushed 16-3 in the Grand Slam final, with the teenager winning 15 consecutive legs. \"I can't compete with that, it was like Godly. He was relentless, he is so good it's ridiculous,\" he said. Lukeman can still see the benefits he brings, adding: \"What he's done for the sport is brilliant. If it wasn't for him, our wages wouldn't be going up. There's more sponsors, more money coming in, all good.\" Hearn feels future competition may come from players even younger than Littler. \"I watched a 10-year-old a few months ago who averaged 104.89 and checked out a 4-3 win with a 136 finish. They smell the money, the fame and put the hard work in,\" he said. How much better Littler can get is guesswork, although Plummer believes he wants to reach new heights. \"He never says 'how good was I?' But I think he wants to break records and beat Phil Taylor's 16 World Championships and 16 World Matchplay titles,\" he said. \"He's young enough to do it.\" A version of this article was originally published on 29 November.\n", - "• None Know a lot about Littler? Take our quiz\n", + "Former electrician Cross charged to the 2018 world title in his first full season, while Adrian Lewis and Michael van Gerwen were multiple victors in their 20s and 16-time champion Phil \u2018The Power\u2019 Taylor is widely considered the greatest of all time. Littler is currently fourth in the world rankings, although that is based on a two-year Order of Merit. There have been suggestions from others the spotlight on the teenager means world number one Humphries, 29, has been denied the coverage he deserves, but no darts player has made a mark at such a young age as Littler. \"Luke Humphries is another fabulous player who is going to be around for years. 
Sport is a very brutal world. It is about winning and claiming the high ground. There will be envy around,\" Hearn said. \"Luke Littler is the next Tiger Woods for darts so they better get used to it, and the only way to compete is to get better.\" World number 38 Martin Lukeman was awestruck as he described facing a peak Littler after being crushed 16-3 in the Grand Slam final, with the teenager winning 15 consecutive legs. \"I can't compete with that, it was like Godly. He was relentless, he is so good it's ridiculous,\" he said. Lukeman can still see the benefits he brings, adding: \"What he's done for the sport is brilliant. If it wasn't for him, our wages wouldn't be going up. There's more sponsors, more money coming in, all good.\" Hearn feels future competition may come from players even younger than Littler. \"I watched a 10-year-old a few months ago who averaged 104.89 and checked out a 4-3 win with a 136 finish. They smell the money, the fame and put the hard work in,\" he said. How much better Littler can get is guesswork, although Plummer believes he wants to reach new heights. \"He never says 'how good was I?' But I think he wants to break records and beat Phil Taylor's 16 World Championships and 16 World Matchplay titles,\" he said. \"He's young enough to do it.\" A version of this article was originally published on 29 November.\n", + "\u2022 None Know a lot about Littler? Take our quiz\n", "--------------------------------------------------------------------------------\n", "Score: 0.6099, Text: Luke Littler has risen from 164th to fourth in the rankings in a year\n", "\n", - "A tearful Luke Littler hit a tournament record 140.91 set average as he started his bid for the PDC World Championship title with a dramatic 3-1 win over Ryan Meikle. The 17-year-old made headlines around the world when he reached the tournament final in January, where he lost to Luke Humphries. Starting this campaign on Saturday, Littler was millimetres away from a nine-darter when he missed double 12 as he blew Meikle away in the fourth and final set of the second-round match. Littler was overcome with emotion at the end, cutting short his on-stage interview. \"It was probably the toughest game I've ever played. I had to fight until the end,\" he said later in a news conference. \"As soon as the question came on stage and then boom, the tears came. It was just a bit too much to speak on stage. \"It is the worst game I have played. I have never felt anything like that tonight.\" Admitting to nerves during the match, he told Sky Sports: \"Yes, probably the biggest time it's hit me. Coming into it I was fine, but as soon as [referee] George Noble said 'game on', I couldn't throw them.\" Littler started slowly against Meikle, who had two darts for the opening set, but he took the lead by twice hitting double 20. Meikle did not look overawed against his fellow Englishman and levelled, but Littler won the third set and exploded into life in the fourth. The tournament favourite hit four maximum 180s as he clinched three straight legs in 11, 10 and 11 darts for a record set average, and 100.85 overall. Meanwhile, two seeds crashed out on Saturday night – five-time world champion Raymond van Barneveld lost to Welshman Nick Kenny, while England's Ryan Joyce beat Danny Noppert. Australian Damon Heta was another to narrowly miss out on a nine-darter, just failing on double 12 when throwing for the match in a 3-1 win over Connor Scutt. 
Ninth seed Heta hit four 100-plus checkouts to come from a set down against Scutt in a match in which both men averaged more than 97.\n", + "A tearful Luke Littler hit a tournament record 140.91 set average as he started his bid for the PDC World Championship title with a dramatic 3-1 win over Ryan Meikle. The 17-year-old made headlines around the world when he reached the tournament final in January, where he lost to Luke Humphries. Starting this campaign on Saturday, Littler was millimetres away from a nine-darter when he missed double 12 as he blew Meikle away in the fourth and final set of the second-round match. Littler was overcome with emotion at the end, cutting short his on-stage interview. \"It was probably the toughest game I've ever played. I had to fight until the end,\" he said later in a news conference. \"As soon as the question came on stage and then boom, the tears came. It was just a bit too much to speak on stage. \"It is the worst game I have played. I have never felt anything like that tonight.\" Admitting to nerves during the match, he told Sky Sports: \"Yes, probably the biggest time it's hit me. Coming into it I was fine, but as soon as [referee] George Noble said 'game on', I couldn't throw them.\" Littler started slowly against Meikle, who had two darts for the opening set, but he took the lead by twice hitting double 20. Meikle did not look overawed against his fellow Englishman and levelled, but Littler won the third set and exploded into life in the fourth. The tournament favourite hit four maximum 180s as he clinched three straight legs in 11, 10 and 11 darts for a record set average, and 100.85 overall. Meanwhile, two seeds crashed out on Saturday night \u2013 five-time world champion Raymond van Barneveld lost to Welshman Nick Kenny, while England's Ryan Joyce beat Danny Noppert. Australian Damon Heta was another to narrowly miss out on a nine-darter, just failing on double 12 when throwing for the match in a 3-1 win over Connor Scutt. Ninth seed Heta hit four 100-plus checkouts to come from a set down against Scutt in a match in which both men averaged more than 97.\n", "\n", "Littler was hugged by his parents after victory over Meikle\n", "\n", @@ -819,11 +819,11 @@ "\n", "Darts player Luke Littler has been named BBC Young Sports Personality of the Year 2024. The 17-year-old has enjoyed a breakthrough year after finishing runner-up at the 2024 PDC World Darts Championship in January. The Englishman, who has won 10 senior titles on the Professional Darts Corporation tour this year, is the first darts player to claim the award. \"It shows how well I have done this year, not only for myself, but I have changed the sport of darts,\" Littler told BBC One. \"I know the amount of academies that have been brought up in different locations, tickets selling out at Ally Pally in hours and the Premier League selling out - it just shows how much I have changed it.\"\n", "\n", - "He was presented with the trophy by Harry Aikines-Aryeetey - a former sprinter who won the award in 2005 - and ex-rugby union player Jodie Ounsley, both of whom are stars of the BBC television show Gladiators. Skateboarder Sky Brown, 16, and Para-swimmer William Ellard, 18, were also shortlisted for the award. Littler became a household name at the start of 2024 by reaching the World Championship final aged just 16 years and 347 days. 
That achievement was just the start of a trophy-laden year, with Littler winning the Premier League Darts, Grand Slam and World Series of Darts Finals among his haul of titles. Littler has gone from 164th to fourth in the world rankings and earned more than £1m in prize money in 2024. The judging panel for Young Sports Personality of the Year included Paralympic gold medallist Sammi Kinghorn, Olympic silver medal-winning BMX freestyler Keiran Reilly, television presenter Qasa Alom and Radio 1 DJ Jeremiah Asiamah, as well as representatives from the Youth Sport Trust, Blue Peter and BBC Sport.\n", + "He was presented with the trophy by Harry Aikines-Aryeetey - a former sprinter who won the award in 2005 - and ex-rugby union player Jodie Ounsley, both of whom are stars of the BBC television show Gladiators. Skateboarder Sky Brown, 16, and Para-swimmer William Ellard, 18, were also shortlisted for the award. Littler became a household name at the start of 2024 by reaching the World Championship final aged just 16 years and 347 days. That achievement was just the start of a trophy-laden year, with Littler winning the Premier League Darts, Grand Slam and World Series of Darts Finals among his haul of titles. Littler has gone from 164th to fourth in the world rankings and earned more than \u00a31m in prize money in 2024. The judging panel for Young Sports Personality of the Year included Paralympic gold medallist Sammi Kinghorn, Olympic silver medal-winning BMX freestyler Keiran Reilly, television presenter Qasa Alom and Radio 1 DJ Jeremiah Asiamah, as well as representatives from the Youth Sport Trust, Blue Peter and BBC Sport.\n", "--------------------------------------------------------------------------------\n", "Score: 0.5414, Text: Wright is the 17th seed at the World Championship\n", "\n", - "Two-time champion Peter Wright won his opening game at the PDC World Championship, while Ryan Meikle edged out Fallon Sherrock to set up a match against teenage prodigy Luke Littler. Scotland's Wright, the 2020 and 2022 winner, has been out of form this year, but overcame Wesley Plaisier 3-1 in the second round at Alexandra Palace in London. \"It was this crowd that got me through, they wanted me to win. I thank you all,\" said Wright. Meikle came from a set down to claim a 3-2 victory in his first-round match against Sherrock, who was the first woman to win matches at the tournament five years ago. The 28-year-old will now play on Saturday against Littler, who was named BBC Young Sports Personality of the Year and runner-up in the main award to athlete Keely Hodgkinson on Tuesday night. Littler, 17, will be competing on the Ally Pally stage for the first time since his rise to stardom when finishing runner-up in January's world final to Luke Humphries. Earlier on Tuesday, World Grand Prix champion Mike de Decker – the 24th seed - suffered a surprise defeat to Luke Woodhouse in the second round. He is the second seed to exit following 16th seed James Wade's defeat on Monday to Jermaine Wattimena, who meets Wright in round three. Kevin Doets recovered from a set down to win 3-1 against Noa-Lynn van Leuven, who was making history as the first transgender woman to compete in the tournament.\n", + "Two-time champion Peter Wright won his opening game at the PDC World Championship, while Ryan Meikle edged out Fallon Sherrock to set up a match against teenage prodigy Luke Littler. 
Scotland's Wright, the 2020 and 2022 winner, has been out of form this year, but overcame Wesley Plaisier 3-1 in the second round at Alexandra Palace in London. \"It was this crowd that got me through, they wanted me to win. I thank you all,\" said Wright. Meikle came from a set down to claim a 3-2 victory in his first-round match against Sherrock, who was the first woman to win matches at the tournament five years ago. The 28-year-old will now play on Saturday against Littler, who was named BBC Young Sports Personality of the Year and runner-up in the main award to athlete Keely Hodgkinson on Tuesday night. Littler, 17, will be competing on the Ally Pally stage for the first time since his rise to stardom when finishing runner-up in January's world final to Luke Humphries. Earlier on Tuesday, World Grand Prix champion Mike de Decker \u2013 the 24th seed - suffered a surprise defeat to Luke Woodhouse in the second round. He is the second seed to exit following 16th seed James Wade's defeat on Monday to Jermaine Wattimena, who meets Wright in round three. Kevin Doets recovered from a set down to win 3-1 against Noa-Lynn van Leuven, who was making history as the first transgender woman to compete in the tournament.\n", "\n", "Sherrock drew level at 2-2 but lost the final set to Meikle\n", "\n", @@ -869,15 +869,15 @@ "\n", "Two-time champion Gary Anderson has been dumped out of the PDC World Championship on his 54th birthday by Jeffrey de Graaf. The Scot, winner in 2015 and 2016, lost 3-0 to the Swede in a second-round shock at Alexandra Palace in London. \"Gary didn't really show up as he usually does. I'm very happy with the win,\" said De Graaf, 34, who had a 75% checkout success and began with an 11-dart finish. \"It's a dream come true for me. He's been my idol since I was 14 years old.\" Anderson, ranked 14th, became the 11th seed to be knocked out from the 24 who have played so far, and the fifth to fall on Sunday.\n", "\n", - "He came into the competition with the year's highest overall three-dart average of 99.66 but hit just three of his 20 checkout attempts to lose his opening match of the tournament for the first time. De Graaf will now meet Filipino qualifier Paolo Nebrida after he stunned England's Ross Smith, the 19th seed, in straight sets. Ritchie Edhouse, Dirk van Duijvenbode and Martin Schindler were the other seeds beaten on day eight. England's Callan Rydz, who hit a record first-round average of 107.06 on Thursday, followed up with a 3-0 win over 23rd seed Schindler on Sunday. The German missed double 12 for a nine-darter in the first set – the third player to do so in 24 hours after Luke Littler and Damon Heta – and ended up losing the leg. Rydz next meets Belgian Dimitri van den Bergh, who hit six 180s and averaged 96 in a 3-0 win over Irishman Dylan Slevin.\n", + "He came into the competition with the year's highest overall three-dart average of 99.66 but hit just three of his 20 checkout attempts to lose his opening match of the tournament for the first time. De Graaf will now meet Filipino qualifier Paolo Nebrida after he stunned England's Ross Smith, the 19th seed, in straight sets. Ritchie Edhouse, Dirk van Duijvenbode and Martin Schindler were the other seeds beaten on day eight. England's Callan Rydz, who hit a record first-round average of 107.06 on Thursday, followed up with a 3-0 win over 23rd seed Schindler on Sunday. 
The German missed double 12 for a nine-darter in the first set \u2013 the third player to do so in 24 hours after Luke Littler and Damon Heta \u2013 and ended up losing the leg. Rydz next meets Belgian Dimitri van den Bergh, who hit six 180s and averaged 96 in a 3-0 win over Irishman Dylan Slevin.\n", "\n", "England's Joe Cullen abruptly left his post-match news conference and accused the media of not showing him respect after his 3-0 win over Dutchman Wessel Nijman. Nijman, who has previously served a ban for breaching betting and anti-corruption rules, had been billed as favourite beforehand to beat 23rd seed Cullen. \"Honestly, the media attention that Wessel's got, again this is not a reflection on him,\" Cullen said. \"He seems like a fantastic kid, he's been caught up in a few things beforehand, but he's served his time and he's held his hands up, like a lot haven't. \"I think the way I've been treated probably with the media and things like that - I know you guys have no control over the bookies - I've been shown no respect, so I won't be showing any respect to any of you guys tonight. \"I'm going to go home. Cheers.\" Ian 'Diamond' White beat European champion and 29th seed Edhouse 3-1 and will face teenage star Littler in the next round. White, born in the same Cheshire town as the 17-year-old, acknowledged he would need to up his game in round three. Asked if he knew who was waiting for him, White joked: \"Yeah, Runcorn's number two. I'm from Runcorn and I'm number one.\" Ryan Searle started Sunday afternoon's action off with a 10-dart leg and went on to beat Matt Campbell 3-0, while Latvian Madars Razma defeated 25th seed Van Duijvenbode 3-1. Seventh seed Jonny Clayton and 2018 champion Rob Cross are among the players in action on Monday as the second round concludes. The third round will start on Friday after a three-day break for Christmas.\n", "--------------------------------------------------------------------------------\n", "Score: 0.5105, Text: Christian Kist was sealing his first televised nine-darter\n", "\n", - "Christian Kist hit a nine-darter but lost his PDC World Championship first-round match to Madars Razma. The Dutchman became the first player to seal a perfect leg in the tournament since Michael Smith did so on the way to beating Michael van Gerwen in the 2023 final. Kist, the 2012 BDO world champion at Lakeside, collects £60,000 for the feat, with the same amount being awarded by sponsors to a charity and to one spectator inside Alexandra Palace in London. The 38-year-old's brilliant finish sealed the opening set, but his Latvian opponent bounced back to win 3-1. Darts is one of the few sports that can measure perfection; snooker has the 147 maximum break, golf has the hole-in-one, darts has the nine-dart finish. Kist scored two maximum 180s to leave a 141 checkout which he completed with a double 12, to the delight of more than 3,000 spectators. The English 12th seed, who has been troubled by wrist and back injuries, could next play Andrew Gilding in the third round - which begins on 27 December - should Gilding beat the winner of Martin Lukeman's match against qualifier Nitin Kumar. Aspinall faces a tough task to reach the last four again, with 2018 champion Rob Cross and 2024 runner-up Luke Littler both in his side of the draw.\n", + "Christian Kist hit a nine-darter but lost his PDC World Championship first-round match to Madars Razma. 
The Dutchman became the first player to seal a perfect leg in the tournament since Michael Smith did so on the way to beating Michael van Gerwen in the 2023 final. Kist, the 2012 BDO world champion at Lakeside, collects \u00a360,000 for the feat, with the same amount being awarded by sponsors to a charity and to one spectator inside Alexandra Palace in London. The 38-year-old's brilliant finish sealed the opening set, but his Latvian opponent bounced back to win 3-1. Darts is one of the few sports that can measure perfection; snooker has the 147 maximum break, golf has the hole-in-one, darts has the nine-dart finish. Kist scored two maximum 180s to leave a 141 checkout which he completed with a double 12, to the delight of more than 3,000 spectators. The English 12th seed, who has been troubled by wrist and back injuries, could next play Andrew Gilding in the third round - which begins on 27 December - should Gilding beat the winner of Martin Lukeman's match against qualifier Nitin Kumar. Aspinall faces a tough task to reach the last four again, with 2018 champion Rob Cross and 2024 runner-up Luke Littler both in his side of the draw.\n", "\n", - "Kist - who was knocked out of last year's tournament by teenager Littler - will still earn a bigger cheque than he would have got for a routine run to the quarter-finals. His nine-darter was the 15th in the history of the championship and first since the greatest leg in darts history when Smith struck, moments after Van Gerwen just missed his attempt. Darts fan Kris, a railway worker from Sutton in south London, was the random spectator picked out to receive £60,000, with Prostate Cancer UK getting the same sum from tournament sponsors Paddy Power. \"I'm speechless to be honest. I didn't expect it to happen to me,\" Kris said. \"This was a birthday present so it makes it even better. My grandad got me tickets. It was just a normal day - I came here after work.\" Kist said: \"Hitting the double 12 felt amazing. It was a lovely moment for everyone and I hope Kris enjoys the money. Maybe I will go on vacation next month.\" Earlier, Jim Williams was favourite against Paolo Nebrida but lost 3-2 in an epic lasting more than an hour. The Filipino took a surprise 2-1 lead and Williams only went ahead for the first time in the opening leg of the deciding set. The Welshman looked on course for victory but missed five match darts. UK Open semi-finalist Ricky Evans set up a second-round match against Dave Chisnall, checking out on 109 to edge past Gordon Mathers 3-2.\n", + "Kist - who was knocked out of last year's tournament by teenager Littler - will still earn a bigger cheque than he would have got for a routine run to the quarter-finals. His nine-darter was the 15th in the history of the championship and first since the greatest leg in darts history when Smith struck, moments after Van Gerwen just missed his attempt. Darts fan Kris, a railway worker from Sutton in south London, was the random spectator picked out to receive \u00a360,000, with Prostate Cancer UK getting the same sum from tournament sponsors Paddy Power. \"I'm speechless to be honest. I didn't expect it to happen to me,\" Kris said. \"This was a birthday present so it makes it even better. My grandad got me tickets. It was just a normal day - I came here after work.\" Kist said: \"Hitting the double 12 felt amazing. It was a lovely moment for everyone and I hope Kris enjoys the money. 
Maybe I will go on vacation next month.\" Earlier, Jim Williams was favourite against Paolo Nebrida but lost 3-2 in an epic lasting more than an hour. The Filipino took a surprise 2-1 lead and Williams only went ahead for the first time in the opening leg of the deciding set. The Welshman looked on course for victory but missed five match darts. UK Open semi-finalist Ricky Evans set up a second-round match against Dave Chisnall, checking out on 109 to edge past Gordon Mathers 3-2.\n", "--------------------------------------------------------------------------------\n" ] } @@ -912,9 +912,9 @@ "metadata": {}, "source": [ "## Retrieval-Augmented Generation (RAG) with Couchbase and LangChain\n", - "Couchbase and LangChain can be seamlessly integrated to create RAG (Retrieval-Augmented Generation) chains, enhancing the process of generating contextually relevant responses. In this setup, Couchbase serves as the vector store, where embeddings of documents are stored. When a query is made, LangChain retrieves the most relevant documents from Couchbase by comparing the query’s embedding with the stored document embeddings. These documents, which provide contextual information, are then passed to a generative language model within LangChain.\n", + "Couchbase and LangChain can be seamlessly integrated to create RAG (Retrieval-Augmented Generation) chains, enhancing the process of generating contextually relevant responses. In this setup, Couchbase serves as the vector store, where embeddings of documents are stored. When a query is made, LangChain retrieves the most relevant documents from Couchbase by comparing the query\u2019s embedding with the stored document embeddings. These documents, which provide contextual information, are then passed to a generative language model within LangChain.\n", "\n", - "The language model, equipped with the context from the retrieved documents, generates a response that is both informed and contextually accurate. This integration allows the RAG chain to leverage Couchbase’s efficient storage and retrieval capabilities, while LangChain handles the generation of responses based on the context provided by the retrieved documents. Together, they create a powerful system that can deliver highly relevant and accurate answers by combining the strengths of both retrieval and generation." + "The language model, equipped with the context from the retrieved documents, generates a response that is both informed and contextually accurate. This integration allows the RAG chain to leverage Couchbase\u2019s efficient storage and retrieval capabilities, while LangChain handles the generation of responses based on the context provided by the retrieved documents. Together, they create a powerful system that can deliver highly relevant and accurate answers by combining the strengths of both retrieval and generation." ] }, { @@ -968,7 +968,7 @@ " - Despite a slow start and admitted nerves, he secured a **3-1 victory** with a dominant fourth set, hitting **four maximum 180s** and maintaining an overall match average of **100.85**.\n", "\n", "4. 
**Emotional Impact**: \n", - " - The 17-year-old became emotional post-match, cutting short his on-stage interview due to the intensity of the moment, later calling it the \"toughest game\" he’d ever played.\n", + " - The 17-year-old became emotional post-match, cutting short his on-stage interview due to the intensity of the moment, later calling it the \"toughest game\" he\u2019d ever played.\n", "\n", "These achievements highlight his resilience and skill, further cementing his status as a rising star in darts.\n", "RAG response generated in 21.84 seconds\n" @@ -1017,17 +1017,17 @@ "\n", "1. **Red Card Incident**: Liverpool played most of the match with 10 men after Andy Robertson received a red card in the 17th minute for denying a goalscoring opportunity. He had earlier been injured in a tackle by Fulham's Issa Diop.\n", "\n", - "2. **Comeback Resilience**: Despite the numerical disadvantage, Liverpool twice came from behind. Diogo Jota scored an 86th-minute equalizer to secure a point. Fulham's Antonee Robinson praised Liverpool, noting it \"didn’t feel like they had 10 men\" due to their aggressive, high-pressing approach.\n", + "2. **Comeback Resilience**: Despite the numerical disadvantage, Liverpool twice came from behind. Diogo Jota scored an 86th-minute equalizer to secure a point. Fulham's Antonee Robinson praised Liverpool, noting it \"didn\u2019t feel like they had 10 men\" due to their aggressive, high-pressing approach.\n", "\n", "3. **Performance Metrics**: Liverpool dominated possession (over 60%) and led in key attacking stats (shots, big chances, touches in the opposition box), showcasing their determination even with a player deficit.\n", "\n", "4. **Manager & Player Reactions**: \n", - " - Manager Arne Slot commended his team’s \"outstanding\" character and resilience, particularly highlighting Robertson’s effort despite the red card.\n", - " - Captain Virgil van Dijk emphasized the team’s ability to \"stay calm\" and fight back under pressure.\n", + " - Manager Arne Slot commended his team\u2019s \"outstanding\" character and resilience, particularly highlighting Robertson\u2019s effort despite the red card.\n", + " - Captain Virgil van Dijk emphasized the team\u2019s ability to \"stay calm\" and fight back under pressure.\n", "\n", - "5. **League Impact**: The draw extended Liverpool’s lead at the top of the Premier League to five points, as rivals Arsenal also dropped points. Pundits, including Chris Sutton, lauded Liverpool’s \"phenomenal\" response to adversity. \n", + "5. **League Impact**: The draw extended Liverpool\u2019s lead at the top of the Premier League to five points, as rivals Arsenal also dropped points. Pundits, including Chris Sutton, lauded Liverpool\u2019s \"phenomenal\" response to adversity. \n", "\n", - "Fulham’s strong performance, described as \"brave,\" was also acknowledged, making the match a thrilling encounter between both sides.\n", + "Fulham\u2019s strong performance, described as \"brave,\" was also acknowledged, making the match a thrilling encounter between both sides.\n", "Time taken: 14.14 seconds\n", "\n", "Query 2: What were Luke Littler's key achievements and records in his recent PDC World Championship match?\n", @@ -1043,7 +1043,7 @@ " - Despite a slow start and admitted nerves, he secured a **3-1 victory** with a dominant fourth set, hitting **four maximum 180s** and maintaining an overall match average of **100.85**.\n", "\n", "4. 
**Emotional Impact**: \n", - " - The 17-year-old became emotional post-match, cutting short his on-stage interview due to the intensity of the moment, later calling it the \"toughest game\" he’d ever played.\n", + " - The 17-year-old became emotional post-match, cutting short his on-stage interview due to the intensity of the moment, later calling it the \"toughest game\" he\u2019d ever played.\n", "\n", "These achievements highlight his resilience and skill, further cementing his status as a rising star in darts.\n", "Time taken: 1.82 seconds\n", @@ -1053,17 +1053,17 @@ "\n", "1. **Red Card Incident**: Liverpool played most of the match with 10 men after Andy Robertson received a red card in the 17th minute for denying a goalscoring opportunity. He had earlier been injured in a tackle by Fulham's Issa Diop.\n", "\n", - "2. **Comeback Resilience**: Despite the numerical disadvantage, Liverpool twice came from behind. Diogo Jota scored an 86th-minute equalizer to secure a point. Fulham's Antonee Robinson praised Liverpool, noting it \"didn’t feel like they had 10 men\" due to their aggressive, high-pressing approach.\n", + "2. **Comeback Resilience**: Despite the numerical disadvantage, Liverpool twice came from behind. Diogo Jota scored an 86th-minute equalizer to secure a point. Fulham's Antonee Robinson praised Liverpool, noting it \"didn\u2019t feel like they had 10 men\" due to their aggressive, high-pressing approach.\n", "\n", "3. **Performance Metrics**: Liverpool dominated possession (over 60%) and led in key attacking stats (shots, big chances, touches in the opposition box), showcasing their determination even with a player deficit.\n", "\n", "4. **Manager & Player Reactions**: \n", - " - Manager Arne Slot commended his team’s \"outstanding\" character and resilience, particularly highlighting Robertson’s effort despite the red card.\n", - " - Captain Virgil van Dijk emphasized the team’s ability to \"stay calm\" and fight back under pressure.\n", + " - Manager Arne Slot commended his team\u2019s \"outstanding\" character and resilience, particularly highlighting Robertson\u2019s effort despite the red card.\n", + " - Captain Virgil van Dijk emphasized the team\u2019s ability to \"stay calm\" and fight back under pressure.\n", "\n", - "5. **League Impact**: The draw extended Liverpool’s lead at the top of the Premier League to five points, as rivals Arsenal also dropped points. Pundits, including Chris Sutton, lauded Liverpool’s \"phenomenal\" response to adversity. \n", + "5. **League Impact**: The draw extended Liverpool\u2019s lead at the top of the Premier League to five points, as rivals Arsenal also dropped points. Pundits, including Chris Sutton, lauded Liverpool\u2019s \"phenomenal\" response to adversity. 
\n", "\n", - "Fulham’s strong performance, described as \"brave,\" was also acknowledged, making the match a thrilling encounter between both sides.\n", + "Fulham\u2019s strong performance, described as \"brave,\" was also acknowledged, making the match a thrilling encounter between both sides.\n", "Time taken: 1.52 seconds\n" ] } @@ -1124,4 +1124,4 @@ }, "nbformat": 4, "nbformat_minor": 2 -} +} \ No newline at end of file diff --git a/openrouter-deepseek/fts/deepseek_index.json b/openrouter-deepseek/search_based/deepseek_index.json similarity index 100% rename from openrouter-deepseek/fts/deepseek_index.json rename to openrouter-deepseek/search_based/deepseek_index.json diff --git a/openrouter-deepseek/fts/frontmatter.md b/openrouter-deepseek/search_based/frontmatter.md similarity index 100% rename from openrouter-deepseek/fts/frontmatter.md rename to openrouter-deepseek/search_based/frontmatter.md diff --git a/pydantic_ai/fts/RAG_with_Couchbase_and_PydanticAI.ipynb b/pydantic_ai/fts/RAG_with_Couchbase_and_PydanticAI.ipynb deleted file mode 100644 index 53fb2d7d..00000000 --- a/pydantic_ai/fts/RAG_with_Couchbase_and_PydanticAI.ipynb +++ /dev/null @@ -1,899 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "kNdImxzypDlm" - }, - "source": [ - "# Introduction\n", - "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database, [OpenAI](https://openai.com) as the embedding and LLM provider, and [PydanticAI](https://ai.pydantic.dev) as an agent orchestrator. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system from scratch. Alternatively if you want to perform semantic search using the GSI index, please take a look at [this.](https://developer.couchbase.com/tutorial-pydantic-ai-couchbase-rag-with-global-secondary-index)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# How to run this tutorial\n", - "\n", - "This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively.\n", - "\n", - "You can either download the notebook file and run it on [Google Colab](https://colab.research.google.com/) or run it on your system by setting up the Python environment." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Before you start\n", - "## Create and Deploy Your Free Tier Operational cluster on Capella\n", - "\n", - "To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. 
This account provides you with an environment where you can explore and learn about Capella with no time constraint.\n", - "\n", - "To know more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n", - "\n", - "### Couchbase Capella Configuration\n", - "\n", - "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met.\n", - "\n", - "* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the travel-sample bucket (Read and Write) used in the application.\n", - "* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NH2o6pqa69oG" - }, - "source": [ - "# Setting the Stage: Installing Necessary Libraries\n", - "To build our semantic search engine, we need a robust set of tools. The libraries we install handle everything from connecting to databases to performing complex machine learning tasks. Each library has a specific role: Couchbase libraries manage database operations, LangChain handles AI model integrations, and OpenAI provides advanced AI models for generating embeddings and understanding natural language. By setting up these libraries, we ensure our environment is equipped to handle the data-intensive and computationally complex tasks required for semantic search." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": { - "id": "DYhPj0Ta8l_A" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Note: you may need to restart the kernel to use updated packages.\n" - ] - } - ], - "source": [ - "%pip install --quiet -U datasets==3.5.0 langchain-couchbase==0.3.0 langchain-openai==0.3.13 python-dotenv==1.1.0 pydantic-ai==0.1.1 ipywidgets==8.1.6" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "1pp7GtNg8mB9" - }, - "source": [ - "# Importing Necessary Libraries\n", - "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, embedding generation, and dataset loading. These libraries provide essential functions for working with data, managing database connections, and processing machine learning models." 
- ] - }, - { - "cell_type": "code", - "execution_count": 28, - "metadata": { - "id": "8GzS6tfL8mFP" - }, - "outputs": [], - "source": [ - "import getpass\n", - "import json\n", - "import logging\n", - "import os\n", - "import time\n", - "from uuid import uuid4\n", - "from datetime import timedelta\n", - "\n", - "from couchbase.auth import PasswordAuthenticator\n", - "from couchbase.cluster import Cluster\n", - "from couchbase.exceptions import (InternalServerFailureException,\n", - " QueryIndexAlreadyExistsException)\n", - "from couchbase.management.buckets import CreateBucketSettings\n", - "from couchbase.management.search import SearchIndex\n", - "from couchbase.options import ClusterOptions\n", - "from datasets import load_dataset\n", - "from dotenv import load_dotenv\n", - "from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore\n", - "from langchain_openai import OpenAIEmbeddings\n", - "from tqdm import tqdm\n", - "\n", - "from dataclasses import dataclass\n", - "from pydantic_ai import Agent, RunContext" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pBnMp5vb8mIb" - }, - "source": [ - "# Setup Logging\n", - "Logging is configured to track the progress of the script and capture any errors or warnings. This is crucial for debugging and understanding the flow of execution. The logging output includes timestamps, log levels (e.g., INFO, ERROR), and messages that describe what is happening in the script.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 29, - "metadata": { - "id": "Yv8kWcuf8mLx" - }, - "outputs": [], - "source": [ - "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', force=True)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "K9G5a0en8mPA" - }, - "source": [ - "# Loading Sensitive Information\n", - "In this section, we prompt the user to input essential configuration settings needed. These settings include sensitive information like API keys, database credentials, and specific configuration names. Instead of hardcoding these details into the script, we request the user to provide them at runtime, ensuring flexibility and security.\n", - "\n", - "The script also validates that all required inputs are provided, raising an error if any crucial information is missing. This approach ensures that your integration is both secure and correctly configured without hardcoding sensitive information, enhancing the overall security and maintainability of your code." 
- ] - }, - { - "cell_type": "code", - "execution_count": 30, - "metadata": { - "id": "PFGyHll18mSe" - }, - "outputs": [], - "source": [ - "load_dotenv()\n", - "\n", - "OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or getpass.getpass('Enter your OpenAI API Key: ')\n", - "\n", - "CB_HOST = os.getenv('CB_HOST') or input('Enter your Couchbase host (default: couchbase://localhost): ') or 'couchbase://localhost'\n", - "CB_USERNAME = os.getenv('CB_USERNAME') or input('Enter your Couchbase username (default: Administrator): ') or 'Administrator'\n", - "CB_PASSWORD = os.getenv('CB_PASSWORD') or getpass.getpass('Enter your Couchbase password (default: password): ') or 'password'\n", - "CB_BUCKET_NAME = os.getenv('CB_BUCKET_NAME') or input('Enter your Couchbase bucket name (default: vector-search-testing): ') or 'vector-search-testing'\n", - "INDEX_NAME = os.getenv('INDEX_NAME') or input('Enter your index name (default: vector_search_pydantic_ai): ') or 'vector_search_pydantic_ai'\n", - "SCOPE_NAME = os.getenv('SCOPE_NAME') or input('Enter your scope name (default: shared): ') or 'shared'\n", - "COLLECTION_NAME = os.getenv('COLLECTION_NAME') or input('Enter your collection name (default: pydantic_ai): ') or 'pydantic_ai'\n", - "\n", - "# Check if the variables are correctly loaded\n", - "if not OPENAI_API_KEY:\n", - " raise ValueError(\"Missing OpenAI API Key\")\n", - "\n", - "if 'OPENAI_API_KEY' not in os.environ:\n", - " os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "qtGrYzUY8mV3" - }, - "source": [ - "# Connecting to the Couchbase Cluster\n", - "Connecting to a Couchbase cluster is the foundation of our project. Couchbase will serve as our primary data store, handling all the storage and retrieval operations required for our semantic search engine. By establishing this connection, we enable our application to interact with the database, allowing us to perform operations such as storing embeddings, querying data, and managing collections. This connection is the gateway through which all data will flow, so ensuring it's set up correctly is paramount.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 31, - "metadata": { - "id": "Zb3kK-7W8mZK" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-04-11 13:54:19,537 - INFO - Successfully connected to Couchbase\n" - ] - } - ], - "source": [ - "try:\n", - " auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)\n", - " options = ClusterOptions(auth)\n", - " cluster = Cluster(CB_HOST, options)\n", - " cluster.wait_until_ready(timedelta(seconds=5))\n", - " logging.info(\"Successfully connected to Couchbase\")\n", - "except Exception as e:\n", - " raise ConnectionError(f\"Failed to connect to Couchbase: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "C_Gpy32N8mcZ" - }, - "source": [ - "# Setting Up Collections in Couchbase\n", - "\n", - "The setup_collection() function handles creating and configuring the hierarchical data organization in Couchbase:\n", - "\n", - "1. Bucket Creation:\n", - " - Checks if specified bucket exists, creates it if not\n", - " - Sets bucket properties like RAM quota (1024MB) and replication (disabled)\n", - " - Note: You will not be able to create a bucket on Capella\n", - "\n", - "2. Scope Management: \n", - " - Verifies if requested scope exists within bucket\n", - " - Creates new scope if needed (unless it's the default \"_default\" scope)\n", - "\n", - "3. 
Collection Setup:\n", - " - Checks for collection existence within scope\n", - " - Creates collection if it doesn't exist\n", - " - Waits 2 seconds for collection to be ready\n", - "\n", - "Additional Tasks:\n", - "- Creates primary index on collection for query performance\n", - "- Clears any existing documents for clean state\n", - "- Implements comprehensive error handling and logging" - ] - }, - { - "cell_type": "code", - "execution_count": 32, - "metadata": { - "id": "ACZcwUnG8mf2" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-04-11 13:54:23,668 - INFO - Bucket 'vector-search-testing' does not exist. Creating it...\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-04-11 13:54:25,721 - INFO - Bucket 'vector-search-testing' created successfully.\n", - "2025-04-11 13:54:25,728 - INFO - Scope 'shared' does not exist. Creating it...\n", - "2025-04-11 13:54:25,777 - INFO - Scope 'shared' created successfully.\n", - "2025-04-11 13:54:25,796 - INFO - Collection 'pydantic_ai' does not exist. Creating it...\n", - "2025-04-11 13:54:27,843 - INFO - Collection 'pydantic_ai' created successfully.\n", - "2025-04-11 13:54:28,120 - INFO - Primary index present or created successfully.\n", - "2025-04-11 13:54:28,133 - INFO - All documents cleared from the collection.\n" - ] - }, - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 32, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "def setup_collection(cluster, bucket_name, scope_name, collection_name):\n", - " try:\n", - " # Check if bucket exists, create if it doesn't\n", - " try:\n", - " bucket = cluster.bucket(bucket_name)\n", - " logging.info(f\"Bucket '{bucket_name}' exists.\")\n", - " except Exception as e:\n", - " logging.info(f\"Bucket '{bucket_name}' does not exist. Creating it...\")\n", - " bucket_settings = CreateBucketSettings(\n", - " name=bucket_name,\n", - " bucket_type='couchbase',\n", - " ram_quota_mb=1024,\n", - " flush_enabled=True,\n", - " num_replicas=0\n", - " )\n", - " cluster.buckets().create_bucket(bucket_settings)\n", - " time.sleep(2) # Wait for bucket creation to complete and become available\n", - " bucket = cluster.bucket(bucket_name)\n", - " logging.info(f\"Bucket '{bucket_name}' created successfully.\")\n", - "\n", - " bucket_manager = bucket.collections()\n", - "\n", - " # Check if scope exists, create if it doesn't\n", - " scopes = bucket_manager.get_all_scopes()\n", - " scope_exists = any(scope.name == scope_name for scope in scopes)\n", - " \n", - " if not scope_exists and scope_name != \"_default\":\n", - " logging.info(f\"Scope '{scope_name}' does not exist. Creating it...\")\n", - " bucket_manager.create_scope(scope_name)\n", - " logging.info(f\"Scope '{scope_name}' created successfully.\")\n", - "\n", - " # Check if collection exists, create if it doesn't\n", - " collections = bucket_manager.get_all_scopes()\n", - " collection_exists = any(\n", - " scope.name == scope_name and collection_name in [col.name for col in scope.collections]\n", - " for scope in collections\n", - " )\n", - "\n", - " if not collection_exists:\n", - " logging.info(f\"Collection '{collection_name}' does not exist. 
Creating it...\")\n", - " bucket_manager.create_collection(scope_name, collection_name)\n", - " time.sleep(2)\n", - " logging.info(f\"Collection '{collection_name}' created successfully.\")\n", - " else:\n", - " logging.info(f\"Collection '{collection_name}' already exists.Skipping creation.\")\n", - "\n", - " collection = bucket.scope(scope_name).collection(collection_name)\n", - " time.sleep(2) # Give the collection time to be ready for queries\n", - "\n", - " # Ensure primary index exists\n", - " try:\n", - " cluster.query(f\"CREATE PRIMARY INDEX IF NOT EXISTS ON `{bucket_name}`.`{scope_name}`.`{collection_name}`\").execute()\n", - " logging.info(\"Primary index present or created successfully.\")\n", - " except Exception as e:\n", - " logging.warning(f\"Error creating primary index: {str(e)}\")\n", - "\n", - " # Clear all documents in the collection\n", - " try:\n", - " query = f\"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`\"\n", - " cluster.query(query).execute()\n", - " logging.info(\"All documents cleared from the collection.\")\n", - " except Exception as e:\n", - " logging.warning(f\"Error while clearing documents: {str(e)}. The collection might be empty.\")\n", - "\n", - " return collection\n", - " except Exception as e:\n", - " raise RuntimeError(f\"Error setting up collection: {str(e)}\")\n", - "\n", - "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NMJ7RRYp8mjV" - }, - "source": [ - "# Loading Couchbase Vector Search Index\n", - "\n", - "Semantic search requires an efficient way to retrieve relevant documents based on a user's query. This is where the Couchbase **Vector Search Index** comes into play. In this step, we load the Vector Search Index definition from a JSON file, which specifies how the index should be structured. This includes the fields to be indexed, the dimensions of the vectors, and other parameters that determine how the search engine processes queries based on vector similarity.\n", - "\n", - "This vector search index configuration requires specific default settings to function properly. This tutorial uses the bucket named `vector-search-testing` with the scope `shared` and collection `pydantic_ai`. The configuration is set up for vectors with exactly `1536 dimensions`, using dot product similarity and optimized for recall. 
If you want to use a different bucket, scope, or collection, you will need to modify the index configuration accordingly.\n", - "\n", - "For more information on creating a vector search index, please follow the [instructions](https://docs.couchbase.com/cloud/vector-search/create-vector-search-index-ui.html).\n" - ] - }, - { - "cell_type": "code", - "execution_count": 34, - "metadata": { - "id": "y7xiCrOc8mmj" - }, - "outputs": [], - "source": [ - "# If you are running this script locally (not in Google Colab), uncomment the following line\n", - "# and provide the path to your index definition file.\n", - "\n", - "# index_definition_path = '/path_to_your_index_file/pydantic_ai_index.json' # Local setup: specify your file path here\n", - "\n", - "# # Version for Google Colab\n", - "# def load_index_definition_colab():\n", - "# from google.colab import files\n", - "# print(\"Upload your index definition file\")\n", - "# uploaded = files.upload()\n", - "# index_definition_path = list(uploaded.keys())[0]\n", - "\n", - "# try:\n", - "# with open(index_definition_path, 'r') as file:\n", - "# index_definition = json.load(file)\n", - "# return index_definition\n", - "# except Exception as e:\n", - "# raise ValueError(f\"Error loading index definition from {index_definition_path}: {str(e)}\")\n", - "\n", - "# Version for Local Environment\n", - "def load_index_definition_local(index_definition_path):\n", - " try:\n", - " with open(index_definition_path, 'r') as file:\n", - " index_definition = json.load(file)\n", - " return index_definition\n", - " except Exception as e:\n", - " raise ValueError(f\"Error loading index definition from {index_definition_path}: {str(e)}\")\n", - "\n", - "# Usage\n", - "# Uncomment the appropriate line based on your environment\n", - "# index_definition = load_index_definition_colab()\n", - "index_definition = load_index_definition_local('pydantic_ai_index.json')" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "v_ddPQ_Y8mpm" - }, - "source": [ - "# Creating or Updating Search Indexes\n", - "\n", - "With the index definition loaded, the next step is to create or update the **Vector Search Index** in Couchbase. This step is crucial because it optimizes our database for vector similarity search operations, allowing us to perform searches based on the semantic content of documents rather than just keywords. By creating or updating a Vector Search Index, we enable our search engine to handle complex queries that involve finding semantically similar documents using vector embeddings, which is essential for a robust semantic search engine." 
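For orientation, here is a trimmed sketch of the shape a definition like `pydantic_ai_index.json` typically takes, shown as a Python dict. The field names and nesting are illustrative rather than the exact shipped file; only the dimensions, similarity metric, and recall optimization are taken from the description above:

```python
# Illustrative shape of a Search vector index definition (not the shipped file).
index_definition = {
    "name": "vector_search_pydantic_ai",
    "type": "fulltext-index",
    "sourceName": "vector-search-testing",
    "params": {
        "doc_config": {"mode": "scope.collection.type_field"},
        "mapping": {
            "types": {
                "shared.pydantic_ai": {  # the scope.collection this index covers
                    "enabled": True,
                    "properties": {
                        "embedding": {
                            "fields": [
                                {
                                    "name": "embedding",
                                    "type": "vector",
                                    "dims": 1536,  # matches the embedding model
                                    "similarity": "dot_product",
                                    "vector_index_optimized_for": "recall",
                                }
                            ]
                        }
                    },
                }
            }
        },
    },
}
```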
- ] - }, - { - "cell_type": "code", - "execution_count": 35, - "metadata": { - "id": "bHEpUu1l8msx" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-04-11 13:54:41,157 - INFO - Creating new index 'vector-search-testing.shared.vector_search_pydantic_ai'...\n", - "2025-04-11 13:54:41,316 - INFO - Index 'vector-search-testing.shared.vector_search_pydantic_ai' successfully created/updated.\n" - ] - } - ], - "source": [ - "try:\n", - " scope_index_manager = cluster.bucket(CB_BUCKET_NAME).scope(SCOPE_NAME).search_indexes()\n", - "\n", - " # Check if index already exists\n", - " existing_indexes = scope_index_manager.get_all_indexes()\n", - " index_name = index_definition[\"name\"]\n", - "\n", - " if index_name in [index.name for index in existing_indexes]:\n", - " logging.info(f\"Index '{index_name}' found\")\n", - " else:\n", - " logging.info(f\"Creating new index '{index_name}'...\")\n", - "\n", - " # Create SearchIndex object from JSON definition\n", - " search_index = SearchIndex.from_json(index_definition)\n", - "\n", - " # Upsert the index (create if not exists, update if exists)\n", - " scope_index_manager.upsert_index(search_index)\n", - " logging.info(f\"Index '{index_name}' successfully created/updated.\")\n", - "\n", - "except QueryIndexAlreadyExistsException:\n", - " logging.info(f\"Index '{index_name}' already exists. Skipping creation/update.\")\n", - "\n", - "except InternalServerFailureException as e:\n", - " error_message = str(e)\n", - " logging.error(f\"InternalServerFailureException raised: {error_message}\")\n", - "\n", - " try:\n", - " # Accessing the response_body attribute from the context\n", - " error_context = e.context\n", - " response_body = error_context.response_body\n", - " if response_body:\n", - " error_details = json.loads(response_body)\n", - " error_message = error_details.get('error', '')\n", - "\n", - " if \"collection: 'pydantic_ai' doesn't belong to scope: 'shared'\" in error_message:\n", - " raise ValueError(\"Collection 'pydantic_ai' does not belong to scope 'shared'. Please check the collection and scope names.\")\n", - "\n", - " except ValueError as ve:\n", - " logging.error(str(ve))\n", - " raise\n", - "\n", - " except Exception as json_error:\n", - " logging.error(f\"Failed to parse the error message: {json_error}\")\n", - " raise RuntimeError(f\"Internal server error while creating/updating search index: {error_message}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "7FvxRsg38m3G" - }, - "source": [ - "# Creating OpenAI Embeddings\n", - "Embeddings are at the heart of semantic search. They are numerical representations of text that capture the semantic meaning of the words and phrases. Unlike traditional keyword-based search, which looks for exact matches, embeddings allow our search engine to understand the context and nuances of language, enabling it to retrieve documents that are semantically similar to the query, even if they don't contain the exact keywords. By creating embeddings using OpenAI, we equip our search engine with the ability to understand and process natural language in a way that's much closer to how humans understand language. This step transforms our raw text data into a format that the search engine can use to find and rank relevant documents." 
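Once the `embeddings` object from the next cell exists, a quick sanity check can confirm that the model's output dimension matches the `1536 dimensions` the index expects. A two-line sketch:

```python
# Assumes `embeddings` has been created by the following cell.
vector = embeddings.embed_query("Pep Guardiola's reaction to recent form")
print(len(vector))  # text-embedding-3-small returns 1536-dimensional vectors
```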
- ] - }, - { - "cell_type": "code", - "execution_count": 36, - "metadata": { - "id": "_75ZyCRh8m6m" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-04-11 13:55:10,426 - INFO - Successfully created OpenAIEmbeddings\n" - ] - } - ], - "source": [ - "try:\n", - " embeddings = OpenAIEmbeddings(\n", - " model=\"text-embedding-3-small\",\n", - " api_key=OPENAI_API_KEY,\n", - " )\n", - " logging.info(\"Successfully created OpenAIEmbeddings\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Error creating OpenAIEmbeddings: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "8IwZMUnF8m-N" - }, - "source": [ - "# Setting Up the Couchbase Vector Store\n", - "The vector store is set up to manage the embeddings created in the previous step. The vector store is essentially a database optimized for storing and retrieving high-dimensional vectors. In this case, the vector store is built on top of Couchbase, allowing the script to store the embeddings in a way that can be efficiently searched." - ] - }, - { - "cell_type": "code", - "execution_count": 37, - "metadata": { - "id": "DwIJQjYT9RV_" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-04-11 13:55:12,849 - INFO - Successfully created vector store\n" - ] - } - ], - "source": [ - "try:\n", - " vector_store = CouchbaseSearchVectorStore(\n", - " cluster=cluster,\n", - " bucket_name=CB_BUCKET_NAME,\n", - " scope_name=SCOPE_NAME,\n", - " collection_name=COLLECTION_NAME,\n", - " embedding=embeddings,\n", - " index_name=INDEX_NAME,\n", - " )\n", - " logging.info(\"Successfully created vector store\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Failed to create vector store: {str(e)}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Load the BBC News Dataset\n", - "To build a search engine, we need data to search through. We use the BBC News dataset from RealTimeData, which provides real-world news articles. This dataset contains news articles from BBC covering various topics and time periods. Loading the dataset is a crucial step because it provides the raw material that our search engine will work with. The quality and diversity of the news articles make it an excellent choice for testing and refining our search engine, ensuring it can handle real-world news content effectively.\n", - "\n", - "The BBC News dataset allows us to work with authentic news articles, enabling us to build and test a search engine that can effectively process and retrieve relevant news content. The dataset is loaded using the Hugging Face datasets library, specifically accessing the \"RealTimeData/bbc_news_alltime\" dataset with the \"2024-12\" version." 
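After the dataset loads in the next cell, it can help to inspect a record before embedding anything. A small sketch (only the `content` field is used later in this tutorial):

```python
# Assumes `news_dataset` from the following cell; prints the start of one article.
print(news_dataset[0]["content"][:300])
```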
- ] - }, - { - "cell_type": "code", - "execution_count": 38, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-04-11 13:55:22,967 - INFO - Successfully loaded the BBC News dataset with 2687 rows.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Loaded the BBC News dataset with 2687 rows\n" - ] - } - ], - "source": [ - "try:\n", - "    news_dataset = load_dataset(\n", - "        \"RealTimeData/bbc_news_alltime\", \"2024-12\", split=\"train\"\n", - "    )\n", - "    print(f\"Loaded the BBC News dataset with {len(news_dataset)} rows\")\n", - "    logging.info(f\"Successfully loaded the BBC News dataset with {len(news_dataset)} rows.\")\n", - "except Exception as e:\n", - "    raise ValueError(f\"Error loading the BBC News dataset: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Cleaning up the Data\n", - "We will use the content of the news articles for our RAG system.\n", - "\n", - "The dataset contains a few duplicate records. We are removing them to avoid duplicate results in the retrieval stage of our RAG system." - ] - }, - { - "cell_type": "code", - "execution_count": 39, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "We have 1749 unique articles in our database.\n" - ] - } - ], - "source": [ - "news_articles = news_dataset[\"content\"]\n", - "unique_articles = set()\n", - "for article in news_articles:\n", - "    if article:\n", - "        unique_articles.add(article)\n", - "unique_news_articles = list(unique_articles)\n", - "print(f\"We have {len(unique_news_articles)} unique articles in our database.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Saving Data to the Vector Store\n", - "With the vector store set up, the next step is to populate it with data. We save the BBC News articles to the vector store; for each article, LangChain generates the embedding used for semantic search. One of the articles is larger than the maximum number of tokens our embedding model accepts. If we wanted to ingest that document, we could split it and ingest it in parts. However, since it is only a single document, for simplicity we exclude it from the ingestion process." - ] - }, - { - "cell_type": "code", - "execution_count": 40, - "metadata": {}, - "outputs": [], - "source": [ - "# Save the current logging level\n", - "current_logging_level = logging.getLogger().getEffectiveLevel()\n", - "\n", - "# Set logging level to CRITICAL to suppress lower level logs\n", - "logging.getLogger().setLevel(logging.CRITICAL)\n", - "\n", - "articles = [article for article in unique_news_articles if article and len(article) <= 50000]\n", - "\n", - "try:\n", - "    vector_store.add_texts(\n", - "        texts=articles\n", - "    )\n", - "except Exception as e:\n", - "    raise ValueError(f\"Failed to save documents to vector store: {str(e)}\")\n", - "\n", - "# Restore the original logging level\n", - "logging.getLogger().setLevel(current_logging_level)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# PydanticAI: An Introduction\n", - "From [PydanticAI](https://ai.pydantic.dev/)'s website:\n", - "\n", - "> PydanticAI is a Python agent framework designed to make it less painful to build production grade applications with Generative AI.\n", - "\n", - "PydanticAI allows us to define agents and tools easily to create Gen-AI apps in an innovative and painless manner. 
Some of its features are:\n", - "- Built by the Pydantic Team: the team behind Pydantic (the validation layer of the OpenAI SDK, the Anthropic SDK, LangChain, LlamaIndex, AutoGPT, Transformers, CrewAI, Instructor and many more).\n", - "\n", - "- Model-agnostic: Supports OpenAI, Anthropic, Gemini, Deepseek, Ollama, Groq, Cohere, and Mistral, and there is a simple interface to implement support for other models.\n", - "\n", - "- Type-safe: Designed to make type checking as powerful and informative as possible for you.\n", - "\n", - "- Python-centric Design: Leverages Python's familiar control flow and agent composition to build your AI-driven projects, making it easy to apply standard Python best practices you'd use in any other (non-AI) project.\n", - "\n", - "- Structured Responses: Harnesses the power of Pydantic to validate and structure model outputs, ensuring responses are consistent across runs.\n", - "\n", - "- Dependency Injection System: Offers an optional dependency injection system to provide data and services to your agent's system prompts, tools and result validators. This is useful for testing and eval-driven iterative development.\n", - "\n", - "- Streamed Responses: Provides the ability to stream LLM outputs continuously, with immediate validation, ensuring rapid and accurate results.\n", - "\n", - "- Graph Support: Pydantic Graph provides a powerful way to define graphs using typing hints, which is useful in complex applications where standard control flow can degrade to spaghetti code.\n", - "\n", - "# Building a RAG Agent using PydanticAI\n", - "\n", - "PydanticAI makes heavy use of dependency injection to provide data and services to your agent's system prompts and tools. We define dependencies using a `dataclass`, which serves as a container for our dependencies.\n", - "\n", - "In our case, the only dependency our agent needs is the `CouchbaseSearchVectorStore` instance. However, we will still use a `dataclass` as it is good practice. In the future, if we wish to add more dependencies, we can just add more fields to the `dataclass` `Deps`.\n", - "\n", - "We also initialize an agent that uses GPT-4o as its model. PydanticAI supports many different LLM providers, including Anthropic, Google, and Cohere, which can also be used here. While initializing the agent, we also pass the type of the dependencies. This is mainly used for type checking and is not actually used at runtime." - ] - }, - { - "cell_type": "code", - "execution_count": 41, - "metadata": {}, - "outputs": [], - "source": [ - "@dataclass\n", - "class Deps:\n", - "    vector_store: CouchbaseSearchVectorStore\n", - "\n", - "agent = Agent(\"openai:gpt-4o\", deps_type=Deps)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Defining the Vector Store as a Tool\n", - "PydanticAI has the concept of `function tools`, which are functions that can be called by LLMs to retrieve extra information that can help form a better response.\n", - "\n", - "We can perform RAG by creating a tool which retrieves documents that are semantically similar to the query, and allowing the agent to call the tool when required. We can add the function as a tool using the `@agent.tool` decorator.\n", - "\n", - "Notice that we also add the `context` parameter, which contains the dependencies that are passed to the tool (in this case, the only dependency is the vector store)."
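Before wiring the tool up, you can run the same call the tool will make, `similarity_search_with_score` against the vector store, on its own to see what the agent will receive. A sketch with an assumed query:

```python
# Standalone version of the lookup the `retrieve` tool below performs.
results = vector_store.similarity_search_with_score("Manchester City recent form", k=2)
for doc, score in results:
    print(f"{score:.4f}  {doc.page_content[:100]}")
```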
- ] - }, - { - "cell_type": "code", - "execution_count": 42, - "metadata": {}, - "outputs": [], - "source": [ - "@agent.tool\n", - "async def retrieve(context: RunContext[Deps], search_query: str) -> str:\n", - " \"\"\"Retrieve news data based on a search query.\n", - "\n", - " Args:\n", - " context: The call context\n", - " search_query: The search query\n", - " \"\"\"\n", - " search_results = context.deps.vector_store.similarity_search_with_score(search_query, k=5)\n", - " return \"\\n\\n\".join(\n", - " f\"# Documents:\\n{doc.page_content}\"\n", - " for doc, score in search_results\n", - " )" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Finally, we create a function that allows us to define our dependencies and run our agent." - ] - }, - { - "cell_type": "code", - "execution_count": 43, - "metadata": {}, - "outputs": [], - "source": [ - "async def run_agent(question: str):\n", - " deps = Deps(\n", - " vector_store=vector_store,\n", - " )\n", - " answer = await agent.run(question, deps=deps)\n", - " return answer" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Running our Agent\n", - "We have now finished setting up our vector store and agent! The system is now ready to accept queries." - ] - }, - { - "cell_type": "code", - "execution_count": 44, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-04-11 13:56:53,839 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n", - "2025-04-11 13:56:54,485 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n", - "2025-04-11 13:57:01,928 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "==================== Agent Output ====================\n", - "Pep Guardiola has expressed a mix of determination and concern regarding Manchester City's current form. He acknowledged the personal impact of the team's downturn, admitting that the situation has affected his sleep and diet due to the worst run of results he has ever faced in his managerial career. Guardiola described his state of mind as \"ugly,\" noting the team's precarious position in competitions and the need to defend better and avoid mistakes.\n", - "\n", - "Despite these challenges, Guardiola remains committed to finding solutions, emphasizing the need to improve defensive concepts and restore the team's intensity and form. He acknowledged the errors from some of the best players in the world and expressed a need for the team to stay positive and for players to have the necessary support to overcome their current struggles.\n", - "\n", - "Moreover, Guardiola expressed a pragmatic view of the situation, accepting that the team must \"survive\" the season and acknowledging a potential need for a significant rebuild to address the challenges they're facing. As a testament to his commitment, he noted his intention to continue shaping the club during his newly extended contract period. 
Throughout, he reiterated his belief in the team and emphasized the need to find a way forward.\n" - ] - } - ], - "source": [ - "query = \"What was manchester city manager pep guardiola's reaction to the team's current form?\"\n", - "output = await run_agent(query)\n", - "\n", - "print(\"=\" * 20, \"Agent Output\", \"=\" * 20)\n", - "print(output.data)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Inspecting the Agent\n", - "We can use the `all_messages()` method on the output object to observe how the agent and tools work.\n", - "\n", - "In the cell below, we see an extremely detailed list of all the model's messages and tool calls, which proceed step by step:\n", - "1. The `UserPromptPart`, which consists of the query the user sends to the agent.\n", - "2. The agent calls the `retrieve` tool in the `ToolCallPart` message. This includes the `search_query` argument. Couchbase uses this `search_query` to perform semantic search over all the ingested news articles.\n", - "3. The `retrieve` tool returns a `ToolReturnPart` object with all the context required for the model to answer the user's query. The retrieved documents are truncated here because a large amount of context was retrieved. \n", - "4. The final message is the LLM-generated response with the added context, which is sent back to the user." - ] - }, - { - "cell_type": "code", - "execution_count": 45, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Step 1:\n", - "('ModelRequest(parts=[UserPromptPart(content=\"What was manchester city manager '\n", - " 'pep guardiola\\'s reaction to the team\\'s current form?\", '\n", - " 'timestamp=datetime.datetime(2025, 4, 11, 8, 26, 52, 836357, '\n", - " \"tzinfo=datetime.timezone.utc), part_kind='user-prompt')], kind='request')\")\n", - "==================================================\n", - "Step 2:\n", - "(\"ModelResponse(parts=[ToolCallPart(tool_name='retrieve', \"\n", - " 'args=\\'{\"search_query\":\"Pep Guardiola reaction to Manchester City current '\n", - " 'form\"}\\', tool_call_id=\\'call_oo4Jjn93VkRJ3q9PnAwkt3xm\\', '\n", - " \"part_kind='tool-call')], model_name='gpt-4o-2024-08-06', \"\n", - " 'timestamp=datetime.datetime(2025, 4, 11, 8, 26, 53, '\n", - " \"tzinfo=datetime.timezone.utc), kind='response')\")\n", - "==================================================\n", - "Step 3:\n", - "(\"ModelRequest(parts=[ToolReturnPart(tool_name='retrieve', content='# \"\n", - " 'Documents:\\\\nManchester City boss Pep Guardiola has won 18 trophies since he '\n", - " 'arrived at the club in 2016\\\\n\\\\nManchester City boss Pep Guardiola says he '\n", - " 'is \"fine\" despite admitting his sleep and diet are being affected by the '\n", - " 'worst run of results in his entire managerial career. In an interview with '\n", - " 'former Italy international Luca Toni for Amazon Prime Sport before '\n", - " \"Wednesday\\\\'s Champions League defeat by Juventus, Guardiola touched on the \"\n", - " \"personal impact City\\\\'s sudden downturn in form has had. Guardiola said his \"\n", - " 'state of mind was \"ugly\", that his sleep was \"worse\" and he was eating '\n", - " \"lighter as his digestion had suffered. City go into Sunday\\\\'s derby against \"\n", -\n... 
(output truncated for brevity)\n" - ] - } - ], - "source": [ - "from pprint import pprint\n", - "\n", - "for idx, message in enumerate(output.all_messages(), start=1):\n", - " print(f\"Step {idx}:\")\n", - " pprint(message.__repr__())\n", - " print(\"=\" * 50)" - ] - } - ], - "metadata": { - "colab": { - "provenance": [], - "toc_visible": true - }, - "kernelspec": { - "display_name": ".venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.11" - } - }, - "nbformat": 4, - "nbformat_minor": 0 -} \ No newline at end of file diff --git a/pydantic_ai/gsi/.env.sample b/pydantic_ai/query_based/.env.sample similarity index 100% rename from pydantic_ai/gsi/.env.sample rename to pydantic_ai/query_based/.env.sample diff --git a/pydantic_ai/gsi/RAG_with_Couchbase_and_PydanticAI.ipynb b/pydantic_ai/query_based/RAG_with_Couchbase_and_PydanticAI.ipynb similarity index 94% rename from pydantic_ai/gsi/RAG_with_Couchbase_and_PydanticAI.ipynb rename to pydantic_ai/query_based/RAG_with_Couchbase_and_PydanticAI.ipynb index eead50ce..10489b5e 100644 --- a/pydantic_ai/gsi/RAG_with_Couchbase_and_PydanticAI.ipynb +++ b/pydantic_ai/query_based/RAG_with_Couchbase_and_PydanticAI.ipynb @@ -6,7 +6,7 @@ "metadata": {}, "source": [ "# Introduction\n", - "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database, [OpenAI](https://openai.com) as the embedding and LLM provider, and [PydanticAI](https://ai.pydantic.dev) as an agent orchestrator. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system using GSI (Global Secondary Index) from scratch. Alternatively if you want to perform semantic search using the FTS index, please take a look at [this.](https://developer.couchbase.com/tutorial-pydantic-ai-couchbase-rag-with-fts/)\n" + "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database, [OpenAI](https://openai.com) as the embedding and LLM provider, and [PydanticAI](https://ai.pydantic.dev) as an agent orchestrator. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system using Couchbase Hyperscale and Composite Vector Indexes from scratch. For guidance on choosing the right vector index for your use case, see the [Couchbase documentation](https://docs.couchbase.com/server/current/vector-search/choose-the-right-vector-index.html). 
Alternatively, if you want to perform semantic search using a Couchbase Search Vector Index, please take a look at [this tutorial](https://developer.couchbase.com/tutorial-pydantic-ai-couchbase-rag-with-search-vector-index/).\n" ] }, { @@ -363,7 +363,7 @@ "source": [ "# Understanding GSI Vector Search\n", "\n", - "### Optimizing Vector Search with Global Secondary Index (GSI)\n", + "### Optimizing Vector Search with Hyperscale and Composite Vector Indexes\n", "\n", "With Couchbase 8.0+, you can leverage the power of GSI-based vector search, which offers significant performance improvements over traditional Full-Text Search (FTS) approaches for vector-first workloads. GSI vector search provides high-performance vector similarity search with advanced filtering capabilities and is designed to scale to billions of vectors.\n", @@ -398,12 +398,12 @@ "- **Use when**: Your queries combine vector similarity with scalar filters that eliminate large portions of data\n", "- **Features**: \n", "  - Efficient pre-filtering where scalar attributes reduce the vector comparison scope\n", - "  - Best for well-defined workloads requiring complex filtering using GSI features\n", + "  - Best for well-defined workloads requiring complex filtering using Hyperscale and Composite Vector Index features\n", "  - Supports range lookups combined with vector search\n", "\n", "#### Index Type Selection for This Tutorial\n", "\n", - "In this tutorial, we'll demonstrate creating a **BHIVE index** and running vector similarity queries using GSI. BHIVE is ideal for semantic search scenarios where you want:\n", + "In this tutorial, we'll demonstrate creating a **BHIVE index** and running vector similarity queries using Hyperscale and Composite Vector Indexes. BHIVE is ideal for semantic search scenarios where you want:\n", "\n", "1. **High-performance vector search** across large datasets\n", "2. **Low latency** for real-time applications\n", @@ -684,8 +684,8 @@ "\n", "## Performance Testing Phases\n", "\n", - "1. **Phase 1 - Baseline Performance**: Test vector search without GSI indexes to establish baseline metrics\n", - "2. **Phase 2 - GSI-Optimized Search**: Create BHIVE index and measure performance improvements\n", + "1. **Phase 1 - Baseline Performance**: Test vector search without Hyperscale or Composite Vector Indexes to establish baseline metrics\n", + "2. 
**Phase 2 - Vector Index-Optimized Search**: Create BHIVE index and measure performance improvements\n", "\n", "**Important Context:**\n", "- GSI performance benefits scale with dataset size and concurrent load\n", @@ -816,9 +816,9 @@ "metadata": {}, "source": [ "\n", - "# Optimizing Vector Search with Global Secondary Index (GSI)\n", + "# Optimizing Vector Search with Hyperscale and Composite Vector Indexes\n", "\n", - "While the above semantic search using similarity_search_with_score works effectively, we can significantly improve query performance by leveraging Global Secondary Index (GSI) in Couchbase.\n", + "While the above semantic search using similarity_search_with_score works effectively, we can significantly improve query performance by leveraging Couchbase Hyperscale and Composite Vector Indexes.\n", "\n", "Couchbase offers three types of vector indexes, but for GSI-based vector search we focus on two main types:\n", @@ -867,7 +867,7 @@ "\n", "For detailed configuration options, see the [Quantization & Centroid Settings](https://preview.docs-test.couchbase.com/docs-server-DOC-12565_vector_search_concepts/server/current/vector-index/hyperscale-vector-index.html#algo_settings).\n", "\n", - "In the code below, we demonstrate creating a BHIVE index. This method takes an index type (BHIVE or COMPOSITE) and description parameter for optimization settings. Alternatively, GSI indexes can be created manually from the Couchbase UI." + "In the code below, we demonstrate creating a BHIVE index. This method takes an index type (BHIVE or COMPOSITE) and a description parameter for optimization settings. Alternatively, Hyperscale and Composite Vector Indexes can be created manually from the Couchbase UI." ] }, { "cell_type": "code", @@ -919,7 +919,7 @@ "output_type": "stream", "text": [ "\n", - "GSI-Optimized Search Results (completed in 0.4124 seconds):\n", + "Vector Index-Optimized Search Results (completed in 0.4124 seconds):\n", "--------------------------------------------------------------------------------\n", "[Result 1] Vector Distance: 0.2956\n", "Text: Manchester City boss Pep Guardiola has won 18 trophies since he arrived at the club in 2016\n", @@ -940,7 +940,7 @@ } ], "source": [ - "# Phase 2: GSI-Optimized Performance (With BHIVE Index)\n", + "# Phase 2: Vector Index-Optimized Performance (With BHIVE Index)\n", "print(\"\\n\" + \"=\"*80)\n", "print(\"PHASE 2: GSI-OPTIMIZED PERFORMANCE (WITH BHIVE INDEX)\")\n", "print(\"=\"*80)\n", @@ -956,7 +956,7 @@ "    logging.info(f\"GSI-optimized search completed in {gsi_time:.2f} seconds\")\n", "\n", "    # Display search results\n", - "    print(f\"\\nGSI-Optimized Search Results (completed in {gsi_time:.4f} seconds):\")\n", + "    print(f\"\\nVector Index-Optimized Search Results (completed in {gsi_time:.4f} seconds):\")\n", "    print(\"-\" * 80)\n", "    for i, (doc, distance) in enumerate(search_results, 1):\n", "        print(f\"[Result {i}] Vector Distance: {distance:.4f}\")\n", @@ -996,32 +996,32 @@ "VECTOR SEARCH PERFORMANCE OPTIMIZATION SUMMARY\n", "================================================================================\n", "\n", - "📊 Performance Comparison:\n", + "\ud83d\udcca Performance Comparison:\n", "Optimization Level                  Time (seconds)       Status\n", "--------------------------------------------------------------------------------\n", - "Phase 1 - Baseline (No Index)       1.3612                ⚪ Baseline\n", - "Phase 2 - GSI-Optimized (BHIVE)     0.4124                ✅ Optimized\n", + "Phase 1 - Baseline (No Index)       1.3612                \u26aa Baseline\n", + "Phase 2 - Vector Index-Optimized (BHIVE) 
0.4124 \u2705 Optimized\n", "\n", - "✨ GSI Performance Gain: 3.30x faster (69.7% improvement)\n", + "\u2728 GSI Performance Gain: 3.30x faster (69.7% improvement)\n", "\n", "--------------------------------------------------------------------------------\n", "KEY INSIGHTS:\n", "--------------------------------------------------------------------------------\n", - "1. 🚀 GSI Optimization:\n", - " • BHIVE indexes excel with large-scale datasets (millions+ vectors)\n", - " • Performance gains increase with dataset size and concurrent queries\n", - " • Optimal for production workloads with sustained traffic patterns\n", + "1. \ud83d\ude80 GSI Optimization:\n", + " \u2022 BHIVE indexes excel with large-scale datasets (millions+ vectors)\n", + " \u2022 Performance gains increase with dataset size and concurrent queries\n", + " \u2022 Optimal for production workloads with sustained traffic patterns\n", "\n", - "2. 📦 Dataset Size Impact:\n", - " • Current dataset: ~1,700 articles\n", - " • At this scale, performance differences may be minimal or variable\n", - " • Significant gains typically seen with 10M+ vectors\n", + "2. \ud83d\udce6 Dataset Size Impact:\n", + " \u2022 Current dataset: ~1,700 articles\n", + " \u2022 At this scale, performance differences may be minimal or variable\n", + " \u2022 Significant gains typically seen with 10M+ vectors\n", "\n", - "3. 🎯 When to Use GSI:\n", - " • Large-scale vector search applications\n", - " • High query-per-second (QPS) requirements\n", - " • Multi-user concurrent access scenarios\n", - " • Production environments requiring scalability\n", + "3. \ud83c\udfaf When to Use GSI:\n", + " \u2022 Large-scale vector search applications\n", + " \u2022 High query-per-second (QPS) requirements\n", + " \u2022 Multi-user concurrent access scenarios\n", + " \u2022 Production environments requiring scalability\n", "\n", "================================================================================\n" ] @@ -1032,42 +1032,42 @@ "print(\"VECTOR SEARCH PERFORMANCE OPTIMIZATION SUMMARY\")\n", "print(\"=\"*80)\n", "\n", - "print(f\"\\n📊 Performance Comparison:\")\n", + "print(f\"\\n\ud83d\udcca Performance Comparison:\")\n", "print(f\"{'Optimization Level':<35} {'Time (seconds)':<20} {'Status'}\")\n", "print(\"-\" * 80)\n", - "print(f\"{'Phase 1 - Baseline (No Index)':<35} {baseline_time:.4f}{'':16} ⚪ Baseline\")\n", - "print(f\"{'Phase 2 - GSI-Optimized (BHIVE)':<35} {gsi_time:.4f}{'':16} ✅ Optimized\")\n", + "print(f\"{'Phase 1 - Baseline (No Index)':<35} {baseline_time:.4f}{'':16} \u26aa Baseline\")\n", + "print(f\"{'Phase 2 - Vector Index-Optimized (BHIVE)':<35} {gsi_time:.4f}{'':16} \u2705 Optimized\")\n", "\n", "# Calculate improvement\n", "if baseline_time > gsi_time:\n", " speedup = baseline_time / gsi_time\n", " improvement = ((baseline_time - gsi_time) / baseline_time) * 100\n", - " print(f\"\\n✨ GSI Performance Gain: {speedup:.2f}x faster ({improvement:.1f}% improvement)\")\n", + " print(f\"\\n\u2728 GSI Performance Gain: {speedup:.2f}x faster ({improvement:.1f}% improvement)\")\n", "elif gsi_time > baseline_time:\n", " slowdown_pct = ((gsi_time - baseline_time) / baseline_time) * 100\n", - " print(f\"\\n⚠️ Note: GSI was {slowdown_pct:.1f}% slower than baseline in this run\")\n", + " print(f\"\\n\u26a0\ufe0f Note: GSI was {slowdown_pct:.1f}% slower than baseline in this run\")\n", " print(f\" This can happen with small datasets. 
GSI benefits emerge with scale.\")\n", "else:\n", - " print(f\"\\n⚖️ Performance: Comparable to baseline\")\n", + " print(f\"\\n\u2696\ufe0f Performance: Comparable to baseline\")\n", "\n", "print(\"\\n\" + \"-\"*80)\n", "print(\"KEY INSIGHTS:\")\n", "print(\"-\"*80)\n", - "print(\"1. 🚀 GSI Optimization:\")\n", - "print(\" • BHIVE indexes excel with large-scale datasets (millions+ vectors)\")\n", - "print(\" • Performance gains increase with dataset size and concurrent queries\")\n", - "print(\" • Optimal for production workloads with sustained traffic patterns\")\n", - "\n", - "print(\"\\n2. 📦 Dataset Size Impact:\")\n", - "print(f\" • Current dataset: ~1,700 articles\")\n", - "print(\" • At this scale, performance differences may be minimal or variable\")\n", - "print(\" • Significant gains typically seen with 10M+ vectors\")\n", - "\n", - "print(\"\\n3. 🎯 When to Use GSI:\")\n", - "print(\" • Large-scale vector search applications\")\n", - "print(\" • High query-per-second (QPS) requirements\")\n", - "print(\" • Multi-user concurrent access scenarios\")\n", - "print(\" • Production environments requiring scalability\")\n", + "print(\"1. \ud83d\ude80 GSI Optimization:\")\n", + "print(\" \u2022 BHIVE indexes excel with large-scale datasets (millions+ vectors)\")\n", + "print(\" \u2022 Performance gains increase with dataset size and concurrent queries\")\n", + "print(\" \u2022 Optimal for production workloads with sustained traffic patterns\")\n", + "\n", + "print(\"\\n2. \ud83d\udce6 Dataset Size Impact:\")\n", + "print(f\" \u2022 Current dataset: ~1,700 articles\")\n", + "print(\" \u2022 At this scale, performance differences may be minimal or variable\")\n", + "print(\" \u2022 Significant gains typically seen with 10M+ vectors\")\n", + "\n", + "print(\"\\n3. \ud83c\udfaf When to Use GSI:\")\n", + "print(\" \u2022 Large-scale vector search applications\")\n", + "print(\" \u2022 High query-per-second (QPS) requirements\")\n", + "print(\" \u2022 Multi-user concurrent access scenarios\")\n", + "print(\" \u2022 Production environments requiring scalability\")\n", "\n", "print(\"\\n\" + \"=\"*80)\n" ] @@ -1355,7 +1355,7 @@ " \"becomes more obvious - and daunting. Manchester City\\\\'s fans did their best \"\n", " 'to reassure Guardiola of their faith in him with a giant Barcelona-inspired '\n", " 'banner draped from the stands before kick-off emblazoned with his image '\n", - " 'reading \"Més que un entrenador\" - \"More Than A Coach\". And Guardiola will '\n", + " 'reading \"M\u00e9s que un entrenador\" - \"More Than A Coach\". And Guardiola will '\n", " 'now need to be more than a coach than at any time in his career. He will '\n", " \"have the finances but it will be done with City\\\\'s challengers also \"\n", " 'strengthening. Kevin de Bruyne, 34 in June, lasted 68 minutes here before he '\n", @@ -1437,7 +1437,7 @@ " 'a half-fit Manuel Akanji and John Stones at Villa Park but that does not '\n", " 'account for City looking a shadow of their former selves. That does not '\n", " 'justify the error Josko Gvardiol made to gift Jhon Duran a golden chance '\n", - " 'inside the first 20 seconds, or £100m man Jack Grealish again failing to '\n", + " 'inside the first 20 seconds, or \u00a3100m man Jack Grealish again failing to '\n", " \"have an impact on a game. 
There may be legitimate reasons for City\\'s drop \" - 'off, whether that be injuries, mental fatigue or just simply a team coming ' - 'to the end of its lifecycle, but their form, which has plunged off a cliff ' @@ -1579,7 +1579,7 @@ " 'will be done in the summer. But they are open to any opportunities in ' " 'January - and a holding midfielder is one thing they need. In the summer ' " \"City might want to get Spain\\'s Martin Zubimendi from Real Sociedad and \" - " 'they know 60m euros (£50m) will get him. He said no to Liverpool last summer ' + " 'they know 60m euros (\u00a350m) will get him. He said no to Liverpool last summer ' " 'even though everything was agreed, but he now wants to move on and the ' " 'Premier League is the target. Even if they do not get Zubimendi, that is the ' " 'calibre of footballer they are after. A new Manchester City is on its way - ' @@ -1629,7 +1629,7 @@ "metadata": {}, "source": [ "## Conclusion\n", - "By following these steps, you'll have a fully functional semantic search engine that leverages the strengths of Couchbase and PydanticAI. This guide is designed not just to show you how to build the system, but also to explain why each step is necessary, giving you a deeper understanding of the principles behind semantic search and how it improves querying data more efficiently using GSI which can significantly improve your RAG performance. Whether you're a newcomer to software development or an experienced developer looking to expand your skills, this guide will provide you with the knowledge and tools you need to create a powerful, AI-driven search engine using PydanticAI's agent-based approach." + "By following these steps, you'll have a fully functional semantic search engine that leverages the strengths of Couchbase and PydanticAI. This guide is designed not just to show you how to build the system, but also to explain why each step is necessary, giving you a deeper understanding of the principles behind semantic search and how it makes querying data more efficient using Hyperscale and Composite Vector Indexes, which can significantly improve your RAG performance. Whether you're a newcomer to software development or an experienced developer looking to expand your skills, this guide will provide you with the knowledge and tools you need to create a powerful, AI-driven search engine using PydanticAI's agent-based approach." 
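As a companion to the BHIVE discussion in the hunks above: a Hyperscale vector index can also be declared directly in SQL++ rather than through the helper method the notebook uses. The statement below is a sketch only: the index name is hypothetical, and the `WITH` option names and values are assumptions modeled on this tutorial's configuration (1536-dimension embeddings, dot-product similarity, `IVF,SQ8` quantization); verify them against the Couchbase 8.0 documentation before use.

```python
# Sketch: SQL++ DDL for a BHIVE (Hyperscale) vector index. The index name and
# WITH option values are assumptions; check the Couchbase 8.0 docs for your build.
ddl = """
CREATE VECTOR INDEX pydantic_ai_bhive_idx
ON `vector-search-testing`.`shared`.`pydantic_ai`(embedding VECTOR)
WITH {"dimension": 1536, "similarity": "DOT", "description": "IVF,SQ8"}
"""
cluster.query(ddl).execute()
```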
] } ], @@ -1654,4 +1654,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} +} \ No newline at end of file diff --git a/pydantic_ai/gsi/frontmatter.md b/pydantic_ai/query_based/frontmatter.md similarity index 100% rename from pydantic_ai/gsi/frontmatter.md rename to pydantic_ai/query_based/frontmatter.md diff --git a/pydantic_ai/fts/.env.sample b/pydantic_ai/search_based/.env.sample similarity index 100% rename from pydantic_ai/fts/.env.sample rename to pydantic_ai/search_based/.env.sample diff --git a/pydantic_ai/search_based/RAG_with_Couchbase_and_PydanticAI.ipynb b/pydantic_ai/search_based/RAG_with_Couchbase_and_PydanticAI.ipynb new file mode 100644 index 00000000..c45e8c1c --- /dev/null +++ b/pydantic_ai/search_based/RAG_with_Couchbase_and_PydanticAI.ipynb @@ -0,0 +1,899 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "kNdImxzypDlm" + }, + "source": [ + "# Introduction\n", + "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database, [OpenAI](https://openai.com) as the embedding and LLM provider, and [PydanticAI](https://ai.pydantic.dev) as an agent orchestrator. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system from scratch. For guidance on choosing the right vector index for your use case, see the [Couchbase documentation](https://docs.couchbase.com/server/current/vector-search/choose-the-right-vector-index.html). Alternatively, if you want to perform semantic search using Couchbase Hyperscale or Composite Vector Indexes, please take a look at [this tutorial](https://developer.couchbase.com/tutorial-pydantic-ai-couchbase-rag-with-hyperscale-or-composite-vector-index)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# How to run this tutorial\n", + "\n", + "This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively.\n", + "\n", + "You can either download the notebook file and run it on [Google Colab](https://colab.research.google.com/) or run it on your system by setting up the Python environment." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Before you start\n", + "## Create and Deploy Your Free Tier Operational Cluster on Capella\n", + "\n", + "To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. This account provides you with an environment where you can explore and learn about Capella with no time constraint.\n", + "\n", + "To know more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n", + "\n", + "### Couchbase Capella Configuration\n", + "\n", + "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met.\n", + "\n", + "* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the bucket (Read and Write) used in the application.\n", + "* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running."
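One practical note for the Capella setup above: Capella clusters are reached over TLS, so the connection string uses the `couchbases://` scheme rather than `couchbase://`. A minimal connection sketch with a placeholder endpoint (the notebook's full connection logic appears later):

```python
# Minimal Capella connection sketch; the endpoint is a placeholder. Copy the
# real "couchbases://..." connection string from the Capella UI.
from datetime import timedelta

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

cluster = Cluster(
    "couchbases://cb.example.cloud.couchbase.com",  # placeholder endpoint
    ClusterOptions(PasswordAuthenticator("your-username", "your-password")),
)
cluster.wait_until_ready(timedelta(seconds=5))
```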
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NH2o6pqa69oG" + }, + "source": [ + "# Setting the Stage: Installing Necessary Libraries\n", + "To build our semantic search engine, we need a robust set of tools. The libraries we install handle everything from connecting to databases to performing complex machine learning tasks. Each library has a specific role: Couchbase libraries manage database operations, LangChain handles AI model integrations, and OpenAI provides advanced AI models for generating embeddings and understanding natural language. By setting up these libraries, we ensure our environment is equipped to handle the data-intensive and computationally complex tasks required for semantic search." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "id": "DYhPj0Ta8l_A" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], + "source": [ + "%pip install --quiet -U datasets==3.5.0 langchain-couchbase==0.3.0 langchain-openai==0.3.13 python-dotenv==1.1.0 pydantic-ai==0.1.1 ipywidgets==8.1.6" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1pp7GtNg8mB9" + }, + "source": [ + "# Importing Necessary Libraries\n", + "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, embedding generation, and dataset loading. These libraries provide essential functions for working with data, managing database connections, and processing machine learning models." + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": { + "id": "8GzS6tfL8mFP" + }, + "outputs": [], + "source": [ + "import getpass\n", + "import json\n", + "import logging\n", + "import os\n", + "import time\n", + "from uuid import uuid4\n", + "from datetime import timedelta\n", + "\n", + "from couchbase.auth import PasswordAuthenticator\n", + "from couchbase.cluster import Cluster\n", + "from couchbase.exceptions import (InternalServerFailureException,\n", + " QueryIndexAlreadyExistsException)\n", + "from couchbase.management.buckets import CreateBucketSettings\n", + "from couchbase.management.search import SearchIndex\n", + "from couchbase.options import ClusterOptions\n", + "from datasets import load_dataset\n", + "from dotenv import load_dotenv\n", + "from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore\n", + "from langchain_openai import OpenAIEmbeddings\n", + "from tqdm import tqdm\n", + "\n", + "from dataclasses import dataclass\n", + "from pydantic_ai import Agent, RunContext" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pBnMp5vb8mIb" + }, + "source": [ + "# Setup Logging\n", + "Logging is configured to track the progress of the script and capture any errors or warnings. This is crucial for debugging and understanding the flow of execution. 
The logging output includes timestamps, log levels (e.g., INFO, ERROR), and messages that describe what is happening in the script.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": { + "id": "Yv8kWcuf8mLx" + }, + "outputs": [], + "source": [ + "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', force=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K9G5a0en8mPA" + }, + "source": [ + "# Loading Sensitive Information\n", + "In this section, we prompt the user to input essential configuration settings needed. These settings include sensitive information like API keys, database credentials, and specific configuration names. Instead of hardcoding these details into the script, we request the user to provide them at runtime, ensuring flexibility and security.\n", + "\n", + "The script also validates that all required inputs are provided, raising an error if any crucial information is missing. This approach ensures that your integration is both secure and correctly configured without hardcoding sensitive information, enhancing the overall security and maintainability of your code." + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": { + "id": "PFGyHll18mSe" + }, + "outputs": [], + "source": [ + "load_dotenv()\n", + "\n", + "OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or getpass.getpass('Enter your OpenAI API Key: ')\n", + "\n", + "CB_HOST = os.getenv('CB_HOST') or input('Enter your Couchbase host (default: couchbase://localhost): ') or 'couchbase://localhost'\n", + "CB_USERNAME = os.getenv('CB_USERNAME') or input('Enter your Couchbase username (default: Administrator): ') or 'Administrator'\n", + "CB_PASSWORD = os.getenv('CB_PASSWORD') or getpass.getpass('Enter your Couchbase password (default: password): ') or 'password'\n", + "CB_BUCKET_NAME = os.getenv('CB_BUCKET_NAME') or input('Enter your Couchbase bucket name (default: vector-search-testing): ') or 'vector-search-testing'\n", + "INDEX_NAME = os.getenv('INDEX_NAME') or input('Enter your index name (default: vector_search_pydantic_ai): ') or 'vector_search_pydantic_ai'\n", + "SCOPE_NAME = os.getenv('SCOPE_NAME') or input('Enter your scope name (default: shared): ') or 'shared'\n", + "COLLECTION_NAME = os.getenv('COLLECTION_NAME') or input('Enter your collection name (default: pydantic_ai): ') or 'pydantic_ai'\n", + "\n", + "# Check if the variables are correctly loaded\n", + "if not OPENAI_API_KEY:\n", + " raise ValueError(\"Missing OpenAI API Key\")\n", + "\n", + "if 'OPENAI_API_KEY' not in os.environ:\n", + " os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qtGrYzUY8mV3" + }, + "source": [ + "# Connecting to the Couchbase Cluster\n", + "Connecting to a Couchbase cluster is the foundation of our project. Couchbase will serve as our primary data store, handling all the storage and retrieval operations required for our semantic search engine. By establishing this connection, we enable our application to interact with the database, allowing us to perform operations such as storing embeddings, querying data, and managing collections. 
This connection is the gateway through which all data will flow, so ensuring it's set up correctly is paramount.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": { + "id": "Zb3kK-7W8mZK" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-04-11 13:54:19,537 - INFO - Successfully connected to Couchbase\n" + ] + } + ], + "source": [ + "try:\n", + " auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)\n", + " options = ClusterOptions(auth)\n", + " cluster = Cluster(CB_HOST, options)\n", + " cluster.wait_until_ready(timedelta(seconds=5))\n", + " logging.info(\"Successfully connected to Couchbase\")\n", + "except Exception as e:\n", + " raise ConnectionError(f\"Failed to connect to Couchbase: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C_Gpy32N8mcZ" + }, + "source": [ + "# Setting Up Collections in Couchbase\n", + "\n", + "The setup_collection() function handles creating and configuring the hierarchical data organization in Couchbase:\n", + "\n", + "1. Bucket Creation:\n", + " - Checks if specified bucket exists, creates it if not\n", + " - Sets bucket properties like RAM quota (1024MB) and replication (disabled)\n", + " - Note: You will not be able to create a bucket on Capella\n", + "\n", + "2. Scope Management: \n", + " - Verifies if requested scope exists within bucket\n", + " - Creates new scope if needed (unless it's the default \"_default\" scope)\n", + "\n", + "3. Collection Setup:\n", + " - Checks for collection existence within scope\n", + " - Creates collection if it doesn't exist\n", + " - Waits 2 seconds for collection to be ready\n", + "\n", + "Additional Tasks:\n", + "- Creates primary index on collection for query performance\n", + "- Clears any existing documents for clean state\n", + "- Implements comprehensive error handling and logging" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": { + "id": "ACZcwUnG8mf2" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-04-11 13:54:23,668 - INFO - Bucket 'vector-search-testing' does not exist. Creating it...\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-04-11 13:54:25,721 - INFO - Bucket 'vector-search-testing' created successfully.\n", + "2025-04-11 13:54:25,728 - INFO - Scope 'shared' does not exist. Creating it...\n", + "2025-04-11 13:54:25,777 - INFO - Scope 'shared' created successfully.\n", + "2025-04-11 13:54:25,796 - INFO - Collection 'pydantic_ai' does not exist. Creating it...\n", + "2025-04-11 13:54:27,843 - INFO - Collection 'pydantic_ai' created successfully.\n", + "2025-04-11 13:54:28,120 - INFO - Primary index present or created successfully.\n", + "2025-04-11 13:54:28,133 - INFO - All documents cleared from the collection.\n" + ] + }, + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def setup_collection(cluster, bucket_name, scope_name, collection_name):\n", + " try:\n", + " # Check if bucket exists, create if it doesn't\n", + " try:\n", + " bucket = cluster.bucket(bucket_name)\n", + " logging.info(f\"Bucket '{bucket_name}' exists.\")\n", + " except Exception as e:\n", + " logging.info(f\"Bucket '{bucket_name}' does not exist. 
Creating it...\")\n", + " bucket_settings = CreateBucketSettings(\n", + " name=bucket_name,\n", + " bucket_type='couchbase',\n", + " ram_quota_mb=1024,\n", + " flush_enabled=True,\n", + " num_replicas=0\n", + " )\n", + " cluster.buckets().create_bucket(bucket_settings)\n", + " time.sleep(2) # Wait for bucket creation to complete and become available\n", + " bucket = cluster.bucket(bucket_name)\n", + " logging.info(f\"Bucket '{bucket_name}' created successfully.\")\n", + "\n", + " bucket_manager = bucket.collections()\n", + "\n", + " # Check if scope exists, create if it doesn't\n", + " scopes = bucket_manager.get_all_scopes()\n", + " scope_exists = any(scope.name == scope_name for scope in scopes)\n", + " \n", + " if not scope_exists and scope_name != \"_default\":\n", + " logging.info(f\"Scope '{scope_name}' does not exist. Creating it...\")\n", + " bucket_manager.create_scope(scope_name)\n", + " logging.info(f\"Scope '{scope_name}' created successfully.\")\n", + "\n", + " # Check if collection exists, create if it doesn't\n", + " collections = bucket_manager.get_all_scopes()\n", + " collection_exists = any(\n", + " scope.name == scope_name and collection_name in [col.name for col in scope.collections]\n", + " for scope in collections\n", + " )\n", + "\n", + " if not collection_exists:\n", + " logging.info(f\"Collection '{collection_name}' does not exist. Creating it...\")\n", + " bucket_manager.create_collection(scope_name, collection_name)\n", + " time.sleep(2)\n", + " logging.info(f\"Collection '{collection_name}' created successfully.\")\n", + " else:\n", + " logging.info(f\"Collection '{collection_name}' already exists. Skipping creation.\")\n", + "\n", + " collection = bucket.scope(scope_name).collection(collection_name)\n", + " time.sleep(2) # Give the collection time to be ready for queries\n", + "\n", + " # Ensure primary index exists\n", + " try:\n", + " cluster.query(f\"CREATE PRIMARY INDEX IF NOT EXISTS ON `{bucket_name}`.`{scope_name}`.`{collection_name}`\").execute()\n", + " logging.info(\"Primary index present or created successfully.\")\n", + " except Exception as e:\n", + " logging.warning(f\"Error creating primary index: {str(e)}\")\n", + "\n", + " # Clear all documents in the collection\n", + " try:\n", + " query = f\"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`\"\n", + " cluster.query(query).execute()\n", + " logging.info(\"All documents cleared from the collection.\")\n", + " except Exception as e:\n", + " logging.warning(f\"Error while clearing documents: {str(e)}. The collection might be empty.\")\n", + "\n", + " return collection\n", + " except Exception as e:\n", + " raise RuntimeError(f\"Error setting up collection: {str(e)}\")\n", + "\n", + "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NMJ7RRYp8mjV" + }, + "source": [ + "# Loading Couchbase Vector Search Index\n", + "\n", + "Semantic search requires an efficient way to retrieve relevant documents based on a user's query. This is where the Couchbase **Vector Search Index** comes into play. In this step, we load the Vector Search Index definition from a JSON file, which specifies how the index should be structured. This includes the fields to be indexed, the dimensions of the vectors, and other parameters that determine how the search engine processes queries based on vector similarity.\n", + "\n", + "This vector search index configuration requires specific default settings to function properly.
This tutorial uses the bucket named `vector-search-testing` with the scope `shared` and collection `pydantic_ai`. The configuration is set up for vectors with exactly `1536 dimensions`, using dot product similarity and optimized for recall. If you want to use a different bucket, scope, or collection, you will need to modify the index configuration accordingly.\n", + "\n", + "For more information on creating a vector search index, please follow the [instructions](https://docs.couchbase.com/cloud/vector-search/create-vector-search-index-ui.html).\n" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": { + "id": "y7xiCrOc8mmj" + }, + "outputs": [], + "source": [ + "# If you are running this script locally (not in Google Colab), uncomment the following line\n", + "# and provide the path to your index definition file.\n", + "\n", + "# index_definition_path = '/path_to_your_index_file/pydantic_ai_index.json' # Local setup: specify your file path here\n", + "\n", + "# # Version for Google Colab\n", + "# def load_index_definition_colab():\n", + "# from google.colab import files\n", + "# print(\"Upload your index definition file\")\n", + "# uploaded = files.upload()\n", + "# index_definition_path = list(uploaded.keys())[0]\n", + "\n", + "# try:\n", + "# with open(index_definition_path, 'r') as file:\n", + "# index_definition = json.load(file)\n", + "# return index_definition\n", + "# except Exception as e:\n", + "# raise ValueError(f\"Error loading index definition from {index_definition_path}: {str(e)}\")\n", + "\n", + "# Version for Local Environment\n", + "def load_index_definition_local(index_definition_path):\n", + " try:\n", + " with open(index_definition_path, 'r') as file:\n", + " index_definition = json.load(file)\n", + " return index_definition\n", + " except Exception as e:\n", + " raise ValueError(f\"Error loading index definition from {index_definition_path}: {str(e)}\")\n", + "\n", + "# Usage\n", + "# Uncomment the appropriate line based on your environment\n", + "# index_definition = load_index_definition_colab()\n", + "index_definition = load_index_definition_local('pydantic_ai_index.json')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "v_ddPQ_Y8mpm" + }, + "source": [ + "# Creating or Updating Search Indexes\n", + "\n", + "With the index definition loaded, the next step is to create or update the **Vector Search Index** in Couchbase. This step is crucial because it optimizes our database for vector similarity search operations, allowing us to perform searches based on the semantic content of documents rather than just keywords. By creating or updating a Vector Search Index, we enable our search engine to handle complex queries that involve finding semantically similar documents using vector embeddings, which is essential for a robust semantic search engine." 
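, + "\n\n", + "For orientation, once loaded the index definition is just a Python dictionary. The sketch below is a heavily trimmed, illustrative outline of the parts that matter for vector search; it is not the exact contents of `pydantic_ai_index.json`, which remains the authoritative definition:\n", + "\n", + "```python\n", + "# Illustrative sketch only; use the pydantic_ai_index.json shipped with this tutorial\n", + "index_definition_sketch = {\n", + "    \"name\": \"vector_search_pydantic_ai\",\n", + "    \"type\": \"fulltext-index\",\n", + "    \"sourceName\": \"vector-search-testing\",\n", + "    \"params\": {\n", + "        \"doc_config\": {\"mode\": \"scope.collection.type_field\"},\n", + "        \"mapping\": {\n", + "            \"types\": {\n", + "                \"shared.pydantic_ai\": {\n", + "                    \"properties\": {\n", + "                        \"embedding\": {\n", + "                            \"fields\": [\n", + "                                # 1536 dimensions with dot product similarity, as described above\n", + "                                {\"type\": \"vector\", \"dims\": 1536, \"similarity\": \"dot_product\"},\n", + "                            ]\n", + "                        }\n", + "                    }\n", + "                }\n", + "            }\n", + "        }\n", + "    },\n", + "}\n", + "```\n"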
+ ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": { + "id": "bHEpUu1l8msx" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-04-11 13:54:41,157 - INFO - Creating new index 'vector-search-testing.shared.vector_search_pydantic_ai'...\n", + "2025-04-11 13:54:41,316 - INFO - Index 'vector-search-testing.shared.vector_search_pydantic_ai' successfully created/updated.\n" + ] + } + ], + "source": [ + "try:\n", + " scope_index_manager = cluster.bucket(CB_BUCKET_NAME).scope(SCOPE_NAME).search_indexes()\n", + "\n", + " # Check if index already exists\n", + " existing_indexes = scope_index_manager.get_all_indexes()\n", + " index_name = index_definition[\"name\"]\n", + "\n", + " if index_name in [index.name for index in existing_indexes]:\n", + " logging.info(f\"Index '{index_name}' found\")\n", + " else:\n", + " logging.info(f\"Creating new index '{index_name}'...\")\n", + "\n", + " # Create SearchIndex object from JSON definition\n", + " search_index = SearchIndex.from_json(index_definition)\n", + "\n", + " # Upsert the index (create if not exists, update if exists)\n", + " scope_index_manager.upsert_index(search_index)\n", + " logging.info(f\"Index '{index_name}' successfully created/updated.\")\n", + "\n", + "except QueryIndexAlreadyExistsException:\n", + " logging.info(f\"Index '{index_name}' already exists. Skipping creation/update.\")\n", + "\n", + "except InternalServerFailureException as e:\n", + " error_message = str(e)\n", + " logging.error(f\"InternalServerFailureException raised: {error_message}\")\n", + "\n", + " try:\n", + " # Accessing the response_body attribute from the context\n", + " error_context = e.context\n", + " response_body = error_context.response_body\n", + " if response_body:\n", + " error_details = json.loads(response_body)\n", + " error_message = error_details.get('error', '')\n", + "\n", + " if \"collection: 'pydantic_ai' doesn't belong to scope: 'shared'\" in error_message:\n", + " raise ValueError(\"Collection 'pydantic_ai' does not belong to scope 'shared'. Please check the collection and scope names.\")\n", + "\n", + " except ValueError as ve:\n", + " logging.error(str(ve))\n", + " raise\n", + "\n", + " except Exception as json_error:\n", + " logging.error(f\"Failed to parse the error message: {json_error}\")\n", + " raise RuntimeError(f\"Internal server error while creating/updating search index: {error_message}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7FvxRsg38m3G" + }, + "source": [ + "# Creating OpenAI Embeddings\n", + "Embeddings are at the heart of semantic search. They are numerical representations of text that capture the semantic meaning of the words and phrases. Unlike traditional keyword-based search, which looks for exact matches, embeddings allow our search engine to understand the context and nuances of language, enabling it to retrieve documents that are semantically similar to the query, even if they don't contain the exact keywords. By creating embeddings using OpenAI, we equip our search engine with the ability to understand and process natural language in a way that's much closer to how humans understand language. This step transforms our raw text data into a format that the search engine can use to find and rank relevant documents." 
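, + "\n\n", + "As a quick sanity check (a minimal sketch; run it only after the next cell has created `embeddings`), you can verify that the model's output dimensionality matches the 1536 dimensions the vector index expects:\n", + "\n", + "```python\n", + "# Minimal sanity check: the embedding length must match the index's dims setting\n", + "sample_vector = embeddings.embed_query(\"hello world\")\n", + "assert len(sample_vector) == 1536\n", + "```\n"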
+ ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": { + "id": "_75ZyCRh8m6m" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-04-11 13:55:10,426 - INFO - Successfully created OpenAIEmbeddings\n" + ] + } + ], + "source": [ + "try:\n", + " embeddings = OpenAIEmbeddings(\n", + " model=\"text-embedding-3-small\",\n", + " api_key=OPENAI_API_KEY,\n", + " )\n", + " logging.info(\"Successfully created OpenAIEmbeddings\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Error creating OpenAIEmbeddings: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8IwZMUnF8m-N" + }, + "source": [ + "# Setting Up the Couchbase Vector Store\n", + "The vector store manages the embeddings created in the previous step. It is essentially a database optimized for storing and retrieving high-dimensional vectors; in this case, it is built on top of Couchbase, allowing the script to store the embeddings in a way that can be searched efficiently." + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": { + "id": "DwIJQjYT9RV_" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-04-11 13:55:12,849 - INFO - Successfully created vector store\n" + ] + } + ], + "source": [ + "try:\n", + " vector_store = CouchbaseSearchVectorStore(\n", + " cluster=cluster,\n", + " bucket_name=CB_BUCKET_NAME,\n", + " scope_name=SCOPE_NAME,\n", + " collection_name=COLLECTION_NAME,\n", + " embedding=embeddings,\n", + " index_name=INDEX_NAME,\n", + " )\n", + " logging.info(\"Successfully created vector store\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Failed to create vector store: {str(e)}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Load the BBC News Dataset\n", + "To build a search engine, we need data to search through. We use the BBC News dataset from RealTimeData, which provides real-world news articles. This dataset contains news articles from BBC covering various topics and time periods. Loading the dataset is a crucial step because it provides the raw material that our search engine will work with. The quality and diversity of the news articles make it an excellent choice for testing and refining our search engine, ensuring it can handle real-world news content effectively.\n", + "\n", + "The BBC News dataset allows us to work with authentic news articles, enabling us to build and test a search engine that can effectively process and retrieve relevant news content. The dataset is loaded using the Hugging Face datasets library, specifically accessing the \"RealTimeData/bbc_news_alltime\" dataset with the \"2024-12\" version."
+ ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-04-11 13:55:22,967 - INFO - Successfully loaded the BBC News dataset with 2687 rows.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loaded the BBC News dataset with 2687 rows\n" + ] + } + ], + "source": [ + "try:\n", + " news_dataset = load_dataset(\n", + " \"RealTimeData/bbc_news_alltime\", \"2024-12\", split=\"train\"\n", + " )\n", + " print(f\"Loaded the BBC News dataset with {len(news_dataset)} rows\")\n", + " logging.info(f\"Successfully loaded the BBC News dataset with {len(news_dataset)} rows.\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Error loading the BBC News dataset: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Cleaning up the Data\n", + "We will use the content of the news articles for our RAG system.\n", + "\n", + "The dataset contains a few duplicate records. We are removing them to avoid duplicate results in the retrieval stage of our RAG system." + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "We have 1749 unique articles in our database.\n" + ] + } + ], + "source": [ + "news_articles = news_dataset[\"content\"]\n", + "unique_articles = set()\n", + "for article in news_articles:\n", + " if article:\n", + " unique_articles.add(article)\n", + "unique_news_articles = list(unique_articles)\n", + "print(f\"We have {len(unique_news_articles)} unique articles in our database.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Saving Data to the Vector Store\n", + "With the vector store set up, the next step is to populate it with data. We save the BBC articles dataset to the vector store, generating embeddings for each article via LangChain so they can be used for semantic search. One of the articles is larger than the maximum number of tokens our embedding model accepts. If we wanted to ingest that document, we could split it into smaller chunks and ingest it in parts. However, since it is only a single document, for simplicity we exclude it from the ingestion process." + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": {}, + "outputs": [], + "source": [ + "# Save the current logging level\n", + "current_logging_level = logging.getLogger().getEffectiveLevel()\n", + "\n", + "# Set logging level to CRITICAL to suppress lower-level logs\n", + "logging.getLogger().setLevel(logging.CRITICAL)\n", + "\n", + "articles = [article for article in unique_news_articles if article and len(article) <= 50000]\n", + "\n", + "try:\n", + " vector_store.add_texts(\n", + " texts=articles\n", + " )\n", + "except Exception as e:\n", + " raise ValueError(f\"Failed to save documents to vector store: {str(e)}\")\n", + "\n", + "# Restore the original logging level\n", + "logging.getLogger().setLevel(current_logging_level)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# PydanticAI: An Introduction\n", + "From [PydanticAI](https://ai.pydantic.dev/)'s website:\n", + "\n", + "> PydanticAI is a Python agent framework designed to make it less painful to build production grade applications with Generative AI.\n", + "\n", + "PydanticAI allows us to define agents and tools easily to create Gen-AI apps in an innovative and painless manner.
Some of its features are:\n", + "- Built by the Pydantic Team: Built by the team behind Pydantic (the validation layer of the OpenAI SDK, the Anthropic SDK, LangChain, LlamaIndex, AutoGPT, Transformers, CrewAI, Instructor and many more).\n", + "\n", + "- Model-agnostic: Supports OpenAI, Anthropic, Gemini, Deepseek, Ollama, Groq, Cohere, and Mistral, and there is a simple interface to implement support for other models.\n", + "\n", + "- Type-safe: Designed to make type checking as powerful and informative as possible for you.\n", + "\n", + "- Python-centric Design: Leverages Python's familiar control flow and agent composition to build your AI-driven projects, making it easy to apply standard Python best practices you'd use in any other (non-AI) project.\n", + "\n", + "- Structured Responses: Harnesses the power of Pydantic to validate and structure model outputs, ensuring responses are consistent across runs.\n", + "\n", + "- Dependency Injection System: Offers an optional dependency injection system to provide data and services to your agent's system prompts, tools and result validators. This is useful for testing and eval-driven iterative development.\n", + "\n", + "- Streamed Responses: Provides the ability to stream LLM outputs continuously, with immediate validation, ensuring rapid and accurate results.\n", + "\n", + "- Graph Support: Pydantic Graph provides a powerful way to define graphs using typing hints, which is useful in complex applications where standard control flow can degrade into spaghetti code.\n", + "\n", + "# Building a RAG Agent using PydanticAI\n", + "\n", + "PydanticAI makes heavy use of dependency injection to provide data and services to your agent's system prompts and tools. We define dependencies using a `dataclass`, which serves as a container for our dependencies.\n", + "\n", + "In our case, the only dependency our agent needs is the `CouchbaseSearchVectorStore` instance. However, we still use a `dataclass` as it is good practice: if we wish to add more dependencies in the future, we can simply add more fields to the `Deps` dataclass.\n", + "\n", + "We also initialize an agent backed by a GPT-4o model. PydanticAI supports many other LLM providers, including Anthropic, Google, and Cohere, any of which could be used instead. While initializing the agent, we also pass the type of the dependencies; this is used mainly for type checking and is not enforced at runtime." + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": {}, + "outputs": [], + "source": [ + "@dataclass\n", + "class Deps:\n", + " vector_store: CouchbaseSearchVectorStore\n", + "\n", + "agent = Agent(\"openai:gpt-4o\", deps_type=Deps)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Defining the Vector Store as a Tool\n", + "PydanticAI has the concept of `function tools`, which are functions that can be called by LLMs to retrieve extra information that can help form a better response.\n", + "\n", + "We can perform RAG by creating a tool that retrieves documents semantically similar to the query, and allowing the agent to call the tool when required. We can add the function as a tool using the `@agent.tool` decorator.\n", + "\n", + "Notice that we also add the `context` parameter, which contains the dependencies that are passed to the tool (in this case, the only dependency is the vector store)."
+ ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": {}, + "outputs": [], + "source": [ + "@agent.tool\n", + "async def retrieve(context: RunContext[Deps], search_query: str) -> str:\n", + " \"\"\"Retrieve news data based on a search query.\n", + "\n", + " Args:\n", + " context: The call context\n", + " search_query: The search query\n", + " \"\"\"\n", + " search_results = context.deps.vector_store.similarity_search_with_score(search_query, k=5)\n", + " return \"\\n\\n\".join(\n", + " f\"# Documents:\\n{doc.page_content}\"\n", + " for doc, score in search_results\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, we create a function that allows us to define our dependencies and run our agent." + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": {}, + "outputs": [], + "source": [ + "async def run_agent(question: str):\n", + " deps = Deps(\n", + " vector_store=vector_store,\n", + " )\n", + " answer = await agent.run(question, deps=deps)\n", + " return answer" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Running our Agent\n", + "We have now finished setting up our vector store and agent! The system is now ready to accept queries." + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-04-11 13:56:53,839 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n", + "2025-04-11 13:56:54,485 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n", + "2025-04-11 13:57:01,928 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "==================== Agent Output ====================\n", + "Pep Guardiola has expressed a mix of determination and concern regarding Manchester City's current form. He acknowledged the personal impact of the team's downturn, admitting that the situation has affected his sleep and diet due to the worst run of results he has ever faced in his managerial career. Guardiola described his state of mind as \"ugly,\" noting the team's precarious position in competitions and the need to defend better and avoid mistakes.\n", + "\n", + "Despite these challenges, Guardiola remains committed to finding solutions, emphasizing the need to improve defensive concepts and restore the team's intensity and form. He acknowledged the errors from some of the best players in the world and expressed a need for the team to stay positive and for players to have the necessary support to overcome their current struggles.\n", + "\n", + "Moreover, Guardiola expressed a pragmatic view of the situation, accepting that the team must \"survive\" the season and acknowledging a potential need for a significant rebuild to address the challenges they're facing. As a testament to his commitment, he noted his intention to continue shaping the club during his newly extended contract period. 
Throughout, he reiterated his belief in the team and emphasized the need to find a way forward.\n" + ] + } + ], + "source": [ + "query = \"What was manchester city manager pep guardiola's reaction to the team's current form?\"\n", + "output = await run_agent(query)\n", + "\n", + "print(\"=\" * 20, \"Agent Output\", \"=\" * 20)\n", + "print(output.data)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Inspecting the Agent\n", + "We can use the `all_messages()` method in the output object to observe how the agent and tools work.\n", + "\n", + "In the cell below, we see an extremely detailed list of all the model's messages and tool calls, which happen step by step:\n", + "1. The `UserPromptPart`, which consists of the query the user sends to the agent.\n", + "2. The agent calls the `retrieve` tool in the `ToolCallPart` message. This includes the `search_query` argument. Couchbase uses this `search_query` to perform semantic search over all the ingested news articles.\n", + "3. The `retrieve` tool returns a `ToolReturnPart` object with all the context required for the model to answer the user's query. The retrieved documents were truncated because a large amount of context was retrieved. \n", + "4. The final message is the LLM-generated response with the added context, which is sent back to the user." + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Step 1:\n", + "('ModelRequest(parts=[UserPromptPart(content=\"What was manchester city manager '\n", + " 'pep guardiola\\'s reaction to the team\\'s current form?\", '\n", + " 'timestamp=datetime.datetime(2025, 4, 11, 8, 26, 52, 836357, '\n", + " \"tzinfo=datetime.timezone.utc), part_kind='user-prompt')], kind='request')\")\n", + "==================================================\n", + "Step 2:\n", + "(\"ModelResponse(parts=[ToolCallPart(tool_name='retrieve', \"\n", + " 'args=\\'{\"search_query\":\"Pep Guardiola reaction to Manchester City current '\n", + " 'form\"}\\', tool_call_id=\\'call_oo4Jjn93VkRJ3q9PnAwkt3xm\\', '\n", + " \"part_kind='tool-call')], model_name='gpt-4o-2024-08-06', \"\n", + " 'timestamp=datetime.datetime(2025, 4, 11, 8, 26, 53, '\n", + " \"tzinfo=datetime.timezone.utc), kind='response')\")\n", + "==================================================\n", + "Step 3:\n", + "(\"ModelRequest(parts=[ToolReturnPart(tool_name='retrieve', content='# \"\n", + " 'Documents:\\\\nManchester City boss Pep Guardiola has won 18 trophies since he '\n", + " 'arrived at the club in 2016\\\\n\\\\nManchester City boss Pep Guardiola says he '\n", + " 'is \"fine\" despite admitting his sleep and diet are being affected by the '\n", + " 'worst run of results in his entire managerial career. In an interview with '\n", + " 'former Italy international Luca Toni for Amazon Prime Sport before '\n", + " \"Wednesday\\\\'s Champions League defeat by Juventus, Guardiola touched on the \"\n", + " \"personal impact City\\\\'s sudden downturn in form has had. Guardiola said his \"\n", + " 'state of mind was \"ugly\", that his sleep was \"worse\" and he was eating '\n", + " \"lighter as his digestion had suffered. City go into Sunday\\\\'s derby against \"\n", + "\n...
(output truncated for brevity)\n" + ] + } + ], + "source": [ + "from pprint import pprint\n", + "\n", + "for idx, message in enumerate(output.all_messages(), start=1):\n", + " print(f\"Step {idx}:\")\n", + " pprint(message.__repr__())\n", + " print(\"=\" * 50)" + ] + } + ], + "metadata": { + "colab": { + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.11" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/pydantic_ai/fts/frontmatter.md b/pydantic_ai/search_based/frontmatter.md similarity index 100% rename from pydantic_ai/fts/frontmatter.md rename to pydantic_ai/search_based/frontmatter.md diff --git a/pydantic_ai/fts/pydantic_ai_index.json b/pydantic_ai/search_based/pydantic_ai_index.json similarity index 100% rename from pydantic_ai/fts/pydantic_ai_index.json rename to pydantic_ai/search_based/pydantic_ai_index.json diff --git a/smolagents/fts/RAG_with_Couchbase_and_SmolAgents.ipynb b/smolagents/fts/RAG_with_Couchbase_and_SmolAgents.ipynb deleted file mode 100644 index f3a78c36..00000000 --- a/smolagents/fts/RAG_with_Couchbase_and_SmolAgents.ipynb +++ /dev/null @@ -1,1002 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "kNdImxzypDlm" - }, - "source": [ - "# Introduction\n", - "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database, [OpenAI](https://openai.com) as the embedding and LLM provider, and [Hugging Face smolagents](https://huggingface.co/docs/smolagents/en/index) as an agent framework. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system from scratch. Alternatively if you want to perform semantic search using the GSI index, please take a look at [this.](https://developer.couchbase.com/tutorial-smolagents-couchbase-rag-with-global-secondary-index)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# How to run this tutorial\n", - "\n", - "This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively.\n", - "\n", - "You can either download the notebook file and run it on [Google Colab](https://colab.research.google.com/) or run it on your system by setting up the Python environment." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Before you start\n", - "## Get Credentials for OpenAI\n", - "Please follow the [instructions](https://platform.openai.com/docs/quickstart) to generate the OpenAI credentials.\n", - "## Create and Deploy Your Free Tier Operational cluster on Capella\n", - "\n", - "To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. 
This account provides you with an environment where you can explore and learn about Capella with no time constraint.\n", - "\n", - "To learn more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n", - "\n", - "### Couchbase Capella Configuration\n", - "\n", - "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met.\n", - "\n", - "* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the required bucket (Read and Write) used in the application.\n", - "* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NH2o6pqa69oG" - }, - "source": [ - "# Setting the Stage: Installing Necessary Libraries\n", - "To build our semantic search engine, we need a robust set of tools. The libraries we install handle everything from connecting to databases to performing complex machine learning tasks. Each library has a specific role: Couchbase libraries manage database operations, LangChain handles AI model integrations, and OpenAI provides advanced AI models for generating embeddings and understanding natural language. By setting up these libraries, we ensure our environment is equipped to handle the data-intensive and computationally complex tasks required for semantic search." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "DYhPj0Ta8l_A" - }, - "outputs": [], - "source": [ - "%pip install --quiet -U datasets==3.5.0 langchain-couchbase==0.3.0 langchain-openai==0.3.13 python-dotenv==1.1.0 smolagents==1.13.0 ipywidgets==8.1.6" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "1pp7GtNg8mB9" - }, - "source": [ - "# Importing Necessary Libraries\n", - "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, embedding generation, and dataset loading. These libraries provide essential functions for working with data, managing database connections, and processing machine learning models." - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": { - "id": "8GzS6tfL8mFP" - }, - "outputs": [], - "source": [ - "import getpass\n", - "import json\n", - "import logging\n", - "import os\n", - "import time\n", - "from datetime import timedelta\n", - "\n", - "from couchbase.auth import PasswordAuthenticator\n", - "from couchbase.cluster import Cluster\n", - "from couchbase.exceptions import (InternalServerFailureException,\n", - " ServiceUnavailableException,\n", - " QueryIndexAlreadyExistsException)\n", - "from couchbase.management.buckets import CreateBucketSettings\n", - "from couchbase.management.search import SearchIndex\n", - "from couchbase.options import ClusterOptions\n", - "from datasets import load_dataset\n", - "from dotenv import load_dotenv\n", - "from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore\n", - "from langchain_openai import OpenAIEmbeddings\n", - "\n", - "from smolagents import Tool, OpenAIServerModel, ToolCallingAgent" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pBnMp5vb8mIb" - }, - "source": [ - "# Setup Logging\n", - "Logging is configured to track the progress of the script and capture any errors or warnings. 
This is crucial for debugging and understanding the flow of execution. The logging output includes timestamps, log levels (e.g., INFO, ERROR), and messages that describe what is happening in the script.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": { - "id": "Yv8kWcuf8mLx" - }, - "outputs": [], - "source": [ - "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', force=True)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "K9G5a0en8mPA" - }, - "source": [ - "# Loading Sensitive Information\n", - "In this section, we prompt the user to input essential configuration settings needed. These settings include sensitive information like API keys, database credentials, and specific configuration names. Instead of hardcoding these details into the script, we request the user to provide them at runtime, ensuring flexibility and security.\n", - "\n", - "The script also validates that all required inputs are provided, raising an error if any crucial information is missing. This approach ensures that your integration is both secure and correctly configured without hardcoding sensitive information, enhancing the overall security and maintainability of your code." - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": { - "id": "PFGyHll18mSe" - }, - "outputs": [], - "source": [ - "load_dotenv()\n", - "\n", - "OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or getpass.getpass('Enter your OpenAI API Key: ')\n", - "\n", - "CB_HOST = os.getenv('CB_HOST') or input('Enter your Couchbase host (default: couchbase://localhost): ') or 'couchbase://localhost'\n", - "CB_USERNAME = os.getenv('CB_USERNAME') or input('Enter your Couchbase username (default: Administrator): ') or 'Administrator'\n", - "CB_PASSWORD = os.getenv('CB_PASSWORD') or getpass.getpass('Enter your Couchbase password (default: password): ') or 'password'\n", - "CB_BUCKET_NAME = os.getenv('CB_BUCKET_NAME') or input('Enter your Couchbase bucket name (default: vector-search-testing): ') or 'vector-search-testing'\n", - "INDEX_NAME = os.getenv('INDEX_NAME') or input('Enter your index name (default: vector_search_smolagents): ') or 'vector_search_smolagents'\n", - "SCOPE_NAME = os.getenv('SCOPE_NAME') or input('Enter your scope name (default: shared): ') or 'shared'\n", - "COLLECTION_NAME = os.getenv('COLLECTION_NAME') or input('Enter your collection name (default: smolagents): ') or 'smolagents'\n", - "\n", - "# Check if the variables are correctly loaded\n", - "if not OPENAI_API_KEY:\n", - " raise ValueError(\"Missing OpenAI API Key\")\n", - "\n", - "if 'OPENAI_API_KEY' not in os.environ:\n", - " os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "qtGrYzUY8mV3" - }, - "source": [ - "# Connecting to the Couchbase Cluster\n", - "Connecting to a Couchbase cluster is the foundation of our project. Couchbase will serve as our primary data store, handling all the storage and retrieval operations required for our semantic search engine. By establishing this connection, we enable our application to interact with the database, allowing us to perform operations such as storing embeddings, querying data, and managing collections. 
This connection is the gateway through which all data will flow, so ensuring it's set up correctly is paramount.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": { - "id": "Zb3kK-7W8mZK" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-02-28 10:30:17,515 - INFO - Successfully connected to Couchbase\n" - ] - } - ], - "source": [ - "try:\n", - " auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)\n", - " options = ClusterOptions(auth)\n", - " cluster = Cluster(CB_HOST, options)\n", - " cluster.wait_until_ready(timedelta(seconds=5))\n", - " logging.info(\"Successfully connected to Couchbase\")\n", - "except Exception as e:\n", - " raise ConnectionError(f\"Failed to connect to Couchbase: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "C_Gpy32N8mcZ" - }, - "source": [ - "# Setting Up Collections in Couchbase\n", - "The setup_collection() function handles creating and configuring the hierarchical data organization in Couchbase:\n", - "\n", - "1. Bucket Creation:\n", - " - Checks if specified bucket exists, creates it if not\n", - " - Sets bucket properties like RAM quota (1024MB) and replication (disabled)\n", - " - Note: You will not be able to create a bucket on Capella\n", - "2. Scope Management:\n", - " - Verifies if requested scope exists within bucket\n", - " - Creates new scope if needed (unless it's the default \"_default\" scope)\n", - "3. Collection Setup:\n", - " - Checks for collection existence within scope\n", - " - Creates collection if it doesn't exist\n", - " - Waits 2 seconds for collection to be ready\n", - "\n", - "Additional Tasks:\n", - "\n", - "- Creates primary index on collection for query performance\n", - "- Clears any existing documents for clean state\n", - "- Implements comprehensive error handling and logging\n", - "\n", - "The function is then called to set up the main collection that stores the vector embeddings."
Creating it...\")\n", - " bucket_settings = CreateBucketSettings(\n", - " name=bucket_name,\n", - " bucket_type='couchbase',\n", - " ram_quota_mb=1024,\n", - " flush_enabled=True,\n", - " num_replicas=0\n", - " )\n", - " cluster.buckets().create_bucket(bucket_settings)\n", - " time.sleep(2) # Wait for bucket creation to complete and become available\n", - " bucket = cluster.bucket(bucket_name)\n", - " logging.info(f\"Bucket '{bucket_name}' created successfully.\")\n", - "\n", - " bucket_manager = bucket.collections()\n", - "\n", - " # Check if scope exists, create if it doesn't\n", - " scopes = bucket_manager.get_all_scopes()\n", - " scope_exists = any(scope.name == scope_name for scope in scopes)\n", - " \n", - " if not scope_exists and scope_name != \"_default\":\n", - " logging.info(f\"Scope '{scope_name}' does not exist. Creating it...\")\n", - " bucket_manager.create_scope(scope_name)\n", - " logging.info(f\"Scope '{scope_name}' created successfully.\")\n", - "\n", - " # Check if collection exists, create if it doesn't\n", - " collections = bucket_manager.get_all_scopes()\n", - " collection_exists = any(\n", - " scope.name == scope_name and collection_name in [col.name for col in scope.collections]\n", - " for scope in collections\n", - " )\n", - "\n", - " if not collection_exists:\n", - " logging.info(f\"Collection '{collection_name}' does not exist. Creating it...\")\n", - " bucket_manager.create_collection(scope_name, collection_name)\n", - " logging.info(f\"Collection '{collection_name}' created successfully.\")\n", - " else:\n", - " logging.info(f\"Collection '{collection_name}' already exists. Skipping creation.\")\n", - "\n", - " # Wait for collection to be ready\n", - " collection = bucket.scope(scope_name).collection(collection_name)\n", - " time.sleep(2) # Give the collection time to be ready for queries\n", - "\n", - " # Ensure primary index exists\n", - " try:\n", - " cluster.query(f\"CREATE PRIMARY INDEX IF NOT EXISTS ON `{bucket_name}`.`{scope_name}`.`{collection_name}`\").execute()\n", - " logging.info(\"Primary index present or created successfully.\")\n", - " except Exception as e:\n", - " logging.warning(f\"Error creating primary index: {str(e)}\")\n", - "\n", - " # Clear all documents in the collection\n", - " try:\n", - " query = f\"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`\"\n", - " cluster.query(query).execute()\n", - " logging.info(\"All documents cleared from the collection.\")\n", - " except Exception as e:\n", - " logging.warning(f\"Error while clearing documents: {str(e)}. The collection might be empty.\")\n", - "\n", - " return collection\n", - " except Exception as e:\n", - " raise RuntimeError(f\"Error setting up collection: {str(e)}\")\n", - " \n", - "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NMJ7RRYp8mjV" - }, - "source": [ - "# Loading Couchbase Vector Search Index\n", - "\n", - "Semantic search requires an efficient way to retrieve relevant documents based on a user's query. This is where the Couchbase **Vector Search Index** comes into play. In this step, we load the Vector Search Index definition from a JSON file, which specifies how the index should be structured. 
This includes the fields to be indexed, the dimensions of the vectors, and other parameters that determine how the search engine processes queries based on vector similarity.\n", - "\n", - "This vector search index configuration requires specific default settings to function properly. This tutorial uses the bucket named `vector-search-testing` with the scope `shared` and collection `smolagents`. The configuration is set up for vectors with exactly `1536 dimensions`, using dot product similarity and optimized for recall. If you want to use a different bucket, scope, or collection, you will need to modify the index configuration accordingly.\n", - "\n", - "For more information on creating a vector search index, please follow the [instructions](https://docs.couchbase.com/cloud/vector-search/create-vector-search-index-ui.html).\n" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": { - "id": "y7xiCrOc8mmj" - }, - "outputs": [], - "source": [ - "# If you are running this script locally (not in Google Colab), uncomment the following line\n", - "# and provide the path to your index definition file.\n", - "\n", - "# index_definition_path = '/path_to_your_index_file/smolagents_index.json' # Local setup: specify your file path here\n", - "\n", - "# # Version for Google Colab\n", - "# def load_index_definition_colab():\n", - "# from google.colab import files\n", - "# print(\"Upload your index definition file\")\n", - "# uploaded = files.upload()\n", - "# index_definition_path = list(uploaded.keys())[0]\n", - "\n", - "# try:\n", - "# with open(index_definition_path, 'r') as file:\n", - "# index_definition = json.load(file)\n", - "# return index_definition\n", - "# except Exception as e:\n", - "# raise ValueError(f\"Error loading index definition from {index_definition_path}: {str(e)}\")\n", - "\n", - "# Version for Local Environment\n", - "def load_index_definition_local(index_definition_path):\n", - " try:\n", - " with open(index_definition_path, 'r') as file:\n", - " index_definition = json.load(file)\n", - " return index_definition\n", - " except Exception as e:\n", - " raise ValueError(f\"Error loading index definition from {index_definition_path}: {str(e)}\")\n", - "\n", - "# Usage\n", - "# Uncomment the appropriate line based on your environment\n", - "# index_definition = load_index_definition_colab()\n", - "index_definition = load_index_definition_local('smolagents_index.json')" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "v_ddPQ_Y8mpm" - }, - "source": [ - "# Creating or Updating Search Indexes\n", - "\n", - "With the index definition loaded, the next step is to create or update the **Vector Search Index** in Couchbase. This step is crucial because it optimizes our database for vector similarity search operations, allowing us to perform searches based on the semantic content of documents rather than just keywords. By creating or updating a Vector Search Index, we enable our search engine to handle complex queries that involve finding semantically similar documents using vector embeddings, which is essential for a robust semantic search engine." 
- ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": { - "id": "bHEpUu1l8msx" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-02-28 10:30:32,890 - INFO - Creating new index 'vector-search-testing.shared.vector_search_smolagents'...\n", - "2025-02-28 10:30:33,058 - INFO - Index 'vector-search-testing.shared.vector_search_smolagents' successfully created/updated.\n" - ] - } - ], - "source": [ - "try:\n", - " scope_index_manager = cluster.bucket(CB_BUCKET_NAME).scope(SCOPE_NAME).search_indexes()\n", - "\n", - " # Check if index already exists\n", - " existing_indexes = scope_index_manager.get_all_indexes()\n", - " index_name = index_definition[\"name\"]\n", - "\n", - " if index_name in [index.name for index in existing_indexes]:\n", - " logging.info(f\"Index '{index_name}' found\")\n", - " else:\n", - " logging.info(f\"Creating new index '{index_name}'...\")\n", - "\n", - " # Create SearchIndex object from JSON definition\n", - " search_index = SearchIndex.from_json(index_definition)\n", - "\n", - " # Upsert the index (create if not exists, update if exists)\n", - " scope_index_manager.upsert_index(search_index)\n", - " logging.info(f\"Index '{index_name}' successfully created/updated.\")\n", - "\n", - "except QueryIndexAlreadyExistsException:\n", - " logging.info(f\"Index '{index_name}' already exists. Skipping creation/update.\")\n", - "except ServiceUnavailableException:\n", - " raise RuntimeError(\"Search service is not available. Please ensure the Search service is enabled in your Couchbase cluster.\")\n", - "except InternalServerFailureException as e:\n", - " logging.error(f\"Internal server error: {str(e)}\")\n", - " raise" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "7FvxRsg38m3G" - }, - "source": [ - "# Creating OpenAI Embeddings\n", - "Embeddings are at the heart of semantic search. They are numerical representations of text that capture the semantic meaning of the words and phrases. Unlike traditional keyword-based search, which looks for exact matches, embeddings allow our search engine to understand the context and nuances of language, enabling it to retrieve documents that are semantically similar to the query, even if they don't contain the exact keywords. By creating embeddings using OpenAI, we equip our search engine with the ability to understand and process natural language in a way that's much closer to how humans understand language. This step transforms our raw text data into a format that the search engine can use to find and rank relevant documents." - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": { - "id": "_75ZyCRh8m6m" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-02-28 10:30:36,983 - INFO - Successfully created OpenAIEmbeddings\n" - ] - } - ], - "source": [ - "try:\n", - " embeddings = OpenAIEmbeddings(\n", - " model=\"text-embedding-3-small\",\n", - " api_key=OPENAI_API_KEY,\n", - " )\n", - " logging.info(\"Successfully created OpenAIEmbeddings\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Error creating OpenAIEmbeddings: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "8IwZMUnF8m-N" - }, - "source": [ - "# Setting Up the Couchbase Vector Store\n", - "A vector store is where we'll keep our embeddings. Unlike the FTS index, which is used for text-based search, the vector store is specifically designed to handle embeddings and perform similarity searches. 
When a user inputs a query, the search engine converts the query into an embedding and compares it against the embeddings stored in the vector store. This allows the engine to find documents that are semantically similar to the query, even if they don't contain the exact same words. By setting up the vector store in Couchbase, we create a powerful tool that enables our search engine to understand and retrieve information based on the meaning and context of the query, rather than just the specific words used." - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": { - "id": "DwIJQjYT9RV_" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-02-28 10:30:40,503 - INFO - Successfully created vector store\n" - ] - } - ], - "source": [ - "try:\n", - " vector_store = CouchbaseSearchVectorStore(\n", - " cluster=cluster,\n", - " bucket_name=CB_BUCKET_NAME,\n", - " scope_name=SCOPE_NAME,\n", - " collection_name=COLLECTION_NAME,\n", - " embedding=embeddings,\n", - " index_name=INDEX_NAME,\n", - " )\n", - " logging.info(\"Successfully created vector store\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Failed to create vector store: {str(e)}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Load the BBC News Dataset\n", - "To build a search engine, we need data to search through. We use the BBC News dataset from RealTimeData, which provides real-world news articles. This dataset contains news articles from BBC covering various topics and time periods. Loading the dataset is a crucial step because it provides the raw material that our search engine will work with. The quality and diversity of the news articles make it an excellent choice for testing and refining our search engine, ensuring it can handle real-world news content effectively.\n", - "\n", - "The BBC News dataset allows us to work with authentic news articles, enabling us to build and test a search engine that can effectively process and retrieve relevant news content. The dataset is loaded using the Hugging Face datasets library, specifically accessing the \"RealTimeData/bbc_news_alltime\" dataset with the \"2024-12\" version." - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-02-28 10:30:51,981 - INFO - Successfully loaded the BBC News dataset with 2687 rows.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Loaded the BBC News dataset with 2687 rows\n" - ] - } - ], - "source": [ - "try:\n", - " news_dataset = load_dataset(\n", - " \"RealTimeData/bbc_news_alltime\", \"2024-12\", split=\"train\"\n", - " )\n", - " print(f\"Loaded the BBC News dataset with {len(news_dataset)} rows\")\n", - " logging.info(f\"Successfully loaded the BBC News dataset with {len(news_dataset)} rows.\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Error loading the BBC News dataset: {str(e)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Cleaning up the Data\n", - "We will use the content of the news articles for our RAG system.\n", - "\n", - "The dataset contains a few duplicate records. We are removing them to avoid duplicate results in the retrieval stage of our RAG system." 
- ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "We have 1749 unique articles in our database.\n" - ] - } - ], - "source": [ - "news_articles = news_dataset[\"content\"]\n", - "unique_articles = set()\n", - "for article in news_articles:\n", - " if article:\n", - " unique_articles.add(article)\n", - "unique_news_articles = list(unique_articles)\n", - "print(f\"We have {len(unique_news_articles)} unique articles in our database.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Saving Data to the Vector Store\n", - "To efficiently handle the large number of articles, we process them in batches. This batch processing approach helps manage memory usage and provides better control over the ingestion process.\n", - "\n", - "We first filter out any articles that exceed 50,000 characters to avoid potential issues with token limits. Then, using the vector store's add_texts method, we add the filtered articles to our vector database. The batch_size parameter controls how many articles are processed in each iteration.\n", - "\n", - "This approach offers several benefits:\n", - "\n", - "1. Memory Efficiency: Processing in smaller batches prevents memory overload\n", - "2. Error Handling: If an error occurs, only the current batch is affected\n", - "3. Progress Tracking: Easier to monitor and track the ingestion progress\n", - "4. Resource Management: Better control over CPU and network resource utilization\n", - "\n", - "We use a conservative batch size of 100 to ensure reliable operation. The optimal batch size depends on many factors including:\n", - "\n", - "- Document sizes being inserted\n", - "- Available system resources\n", - "- Network conditions\n", - "- Concurrent workload\n", - "\n", - "Consider measuring performance with your specific workload before adjusting." - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [], - "source": [ - "# Save the current logging level\n", - "current_logging_level = logging.getLogger().getEffectiveLevel()\n", - "\n", - "# Set logging level to CRITICAL to suppress lower-level logs\n", - "logging.getLogger().setLevel(logging.CRITICAL)\n", - "\n", - "articles = [article for article in unique_news_articles if article and len(article) <= 50000]\n", - "\n", - "try:\n", - " vector_store.add_texts(\n", - " texts=articles,\n", - " batch_size=100\n", - " )\n", - "except Exception as e:\n", - " raise ValueError(f\"Failed to save documents to vector store: {str(e)}\")\n", - "\n", - "# Restore the original logging level\n", - "logging.getLogger().setLevel(current_logging_level)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# smolagents: An Introduction\n", - "[smolagents](https://huggingface.co/docs/smolagents/en/index) is an agentic framework by Hugging Face for easy creation of agents in a few lines of code.\n", - "\n", - "Some of the features of smolagents are:\n", - "\n", - "- ✨ Simplicity: the logic for agents fits in ~1,000 lines of code (see agents.py). We kept abstractions to their minimal shape above raw code!\n", - "\n", - "- 🧑‍💻 First-class support for Code Agents. Our CodeAgent writes its actions in code (as opposed to \"agents being used to write code\").
To make it secure, we support executing in sandboxed environments via E2B.\n", - "\n", - "- 🤗 Hub integrations: you can share/pull tools to/from the Hub, and more is to come!\n", - "\n", - "- 🌐 Model-agnostic: smolagents supports any LLM. It can be a local transformers or ollama model, one of many providers on the Hub, or any model from OpenAI, Anthropic and many others via our LiteLLM integration.\n", - "\n", - "- 👁️ Modality-agnostic: Agents support text, vision, video, even audio inputs! Cf this tutorial for vision.\n", - "\n", - "- 🛠️ Tool-agnostic: you can use tools from LangChain, Anthropic's MCP, you can even use a Hub Space as a tool.\n", - "\n", - "# Building a RAG Agent using smolagents\n", - "\n", - "smolagents allows users to define their own tools for the agent to use. These tools can be of two types:\n", - "1. Tools defined as classes: These tools are subclassed from the `Tool` class and must override the `forward` method, which is called when the tool is used.\n", - "2. Tools defined as functions: These are simple functions that are called when the tool is used, and are decorated with the `@tool` decorator.\n", - "\n", - "In our case, we will use the first method, and we define our `RetrieverTool` below. We define a name, a description and a dictionary of inputs that the tool accepts. This helps the LLM properly identify and use the tool.\n", - "\n", - "The `RetrieverTool` is simple: it takes a query generated by the user, and uses Couchbase's performant vector search service under the hood to search for documents semantically similar to the query. The LLM can then use this context to answer the user's question." - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [], - "source": [ - "class RetrieverTool(Tool):\n", - " name = \"retriever\"\n", - " description = \"Uses semantic search to retrieve the news articles that could be most relevant to answer your query.\"\n", - " inputs = {\n", - " \"query\": {\n", - " \"type\": \"string\",\n", - " \"description\": \"The query to perform. This should be semantically close to your target documents. Use the affirmative form rather than a question.\",\n", - " }\n", - " }\n", - " output_type = \"string\"\n", - "\n", - " def __init__(self, vector_store: CouchbaseSearchVectorStore, **kwargs):\n", - " super().__init__(**kwargs)\n", - " self.vector_store = vector_store\n", - "\n", - " def forward(self, query: str) -> str:\n", - " assert isinstance(query, str), \"Query must be a string\"\n", - "\n", - " docs = self.vector_store.similarity_search_with_score(query, k=5)\n", - " return \"\\n\\n\".join(\n", - " f\"# Documents:\\n{doc.page_content}\"\n", - " for doc, score in docs\n", - " )\n", - "\n", - "retriever_tool = RetrieverTool(vector_store)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Defining Our Agent\n", - "smolagents has predefined configurations for agents that we can use. We use the `ToolCallingAgent`, which writes its tool calls in a JSON format. Alternatively, there also exists a `CodeAgent`, in which the LLM writes its actions in code.\n", - "\n", - "The `CodeAgent` offers benefits in certain challenging scenarios: it can lead to [higher performance in difficult benchmarks](https://huggingface.co/papers/2411.01747) and use [30% fewer steps to solve problems](https://huggingface.co/papers/2402.01030). However, since our use case is just a simple RAG tool, a `ToolCallingAgent` will suffice."
- ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [], - "source": [ - "agent = ToolCallingAgent(\n", - " tools=[retriever_tool],\n", - " model=OpenAIServerModel(\n", - " model_id=\"gpt-4o-2024-08-06\",\n", - " api_key=OPENAI_API_KEY,\n", - " ),\n", - " max_steps=4,\n", - " verbosity_level=2\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Running our Agent\n", - "We have now finished setting up our vector store and agent! The system is now ready to accept queries." - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
╭──────────────────────────────────────────────────── New run ────────────────────────────────────────────────────╮\n",
-              "                                                                                                                 \n",
-              " What was manchester city manager pep guardiola's reaction to the team's current form?                           \n",
-              "                                                                                                                 \n",
-              "╰─ OpenAIServerModel - gpt-4o-2024-08-06 ─────────────────────────────────────────────────────────────────────────╯\n",
-              "
\n" - ], - "text/plain": [ - "\u001b[38;2;212;183;2m╭─\u001b[0m\u001b[38;2;212;183;2m───────────────────────────────────────────────────\u001b[0m\u001b[38;2;212;183;2m \u001b[0m\u001b[1;38;2;212;183;2mNew run\u001b[0m\u001b[38;2;212;183;2m \u001b[0m\u001b[38;2;212;183;2m───────────────────────────────────────────────────\u001b[0m\u001b[38;2;212;183;2m─╮\u001b[0m\n", - "\u001b[38;2;212;183;2m│\u001b[0m \u001b[38;2;212;183;2m│\u001b[0m\n", - "\u001b[38;2;212;183;2m│\u001b[0m \u001b[1mWhat was manchester city manager pep guardiola's reaction to the team's current form?\u001b[0m \u001b[38;2;212;183;2m│\u001b[0m\n", - "\u001b[38;2;212;183;2m│\u001b[0m \u001b[38;2;212;183;2m│\u001b[0m\n", - "\u001b[38;2;212;183;2m╰─\u001b[0m\u001b[38;2;212;183;2m OpenAIServerModel - gpt-4o-2024-08-06 \u001b[0m\u001b[38;2;212;183;2m────────────────────────────────────────────────────────────────────────\u001b[0m\u001b[38;2;212;183;2m─╯\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Step 1 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\n",
-              "
\n" - ], - "text/plain": [ - "\u001b[38;2;212;183;2m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ \u001b[0m\u001b[1mStep \u001b[0m\u001b[1;36m1\u001b[0m\u001b[38;2;212;183;2m ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-02-28 10:32:28,032 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n" - ] - }, - { - "data": { - "text/html": [ - "
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮\n",
-              "│ Calling tool: 'retriever' with arguments: {'query': \"Pep Guardiola's reaction to Manchester City's current      │\n",
-              "│ form\"}                                                                                                          │\n",
-              "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
-              "
\n" - ], - "text/plain": [ - "╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮\n", - "│ Calling tool: 'retriever' with arguments: {'query': \"Pep Guardiola's reaction to Manchester City's current │\n", - "│ form\"} │\n", - "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-02-28 10:32:28,466 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n" - ] - }, - { - "data": { - "text/html": [ - "
[Step 0: Duration 2.25 seconds| Input tokens: 1,010 | Output tokens: 23]\n",
-              "
\n" - ], - "text/plain": [ - "\u001b[2m[Step 0: Duration 2.25 seconds| Input tokens: 1,010 | Output tokens: 23]\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Step 2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\n",
-              "
\n" - ], - "text/plain": [ - "\u001b[38;2;212;183;2m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ \u001b[0m\u001b[1mStep \u001b[0m\u001b[1;36m2\u001b[0m\u001b[38;2;212;183;2m ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-02-28 10:32:31,724 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n" - ] - }, - { - "data": { - "text/html": [ - "
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮\n",
-              "│ Calling tool: 'final_answer' with arguments: {'answer': 'Manchester City manager Pep Guardiola has expressed a  │\n",
-              "│ mix of concern and determination regarding the team\\'s current form. Guardiola admitted that this is the worst  │\n",
-              "│ run of results in his managerial career and that it has affected his sleep and diet. He described his state of  │\n",
-              "│ mind as \"ugly\" and acknowledged that City needs to defend better and avoid making mistakes. Despite his         │\n",
-              "│ personal challenges, Guardiola stated that he is \"fine\" and focused on finding solutions.\\n\\nGuardiola also     │\n",
-              "│ took responsibility for the team\\'s struggles, stating he is \"not good enough\" and has to find solutions. He    │\n",
-              "│ expressed self-doubt but is striving to improve the team\\'s situation step by step. Guardiola has faced         │\n",
-              "│ criticism due to the team\\'s poor form, which has seen them lose several matches and fall behind in the title   │\n",
-              "│ race.\\n\\nHe emphasized the need to restore their defensive strength and regain confidence in their play.        │\n",
-              "│ Guardiola is planning a significant rebuild of the squad to address these challenges, aiming to replace several │\n",
-              "│ regular starters and emphasize improvements in the team\\'s intensity and defensive concepts.'}                  │\n",
-              "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
-              "
\n" - ], - "text/plain": [ - "╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮\n", - "│ Calling tool: 'final_answer' with arguments: {'answer': 'Manchester City manager Pep Guardiola has expressed a │\n", - "│ mix of concern and determination regarding the team\\'s current form. Guardiola admitted that this is the worst │\n", - "│ run of results in his managerial career and that it has affected his sleep and diet. He described his state of │\n", - "│ mind as \"ugly\" and acknowledged that City needs to defend better and avoid making mistakes. Despite his │\n", - "│ personal challenges, Guardiola stated that he is \"fine\" and focused on finding solutions.\\n\\nGuardiola also │\n", - "│ took responsibility for the team\\'s struggles, stating he is \"not good enough\" and has to find solutions. He │\n", - "│ expressed self-doubt but is striving to improve the team\\'s situation step by step. Guardiola has faced │\n", - "│ criticism due to the team\\'s poor form, which has seen them lose several matches and fall behind in the title │\n", - "│ race.\\n\\nHe emphasized the need to restore their defensive strength and regain confidence in their play. │\n", - "│ Guardiola is planning a significant rebuild of the squad to address these challenges, aiming to replace several │\n", - "│ regular starters and emphasize improvements in the team\\'s intensity and defensive concepts.'} │\n", - "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
Final answer: Manchester City manager Pep Guardiola has expressed a mix of concern and determination regarding the \n",
-              "team's current form. Guardiola admitted that this is the worst run of results in his managerial career and that it \n",
-              "has affected his sleep and diet. He described his state of mind as \"ugly\" and acknowledged that City needs to \n",
-              "defend better and avoid making mistakes. Despite his personal challenges, Guardiola stated that he is \"fine\" and \n",
-              "focused on finding solutions.\n",
-              "\n",
-              "Guardiola also took responsibility for the team's struggles, stating he is \"not good enough\" and has to find \n",
-              "solutions. He expressed self-doubt but is striving to improve the team's situation step by step. Guardiola has \n",
-              "faced criticism due to the team's poor form, which has seen them lose several matches and fall behind in the title \n",
-              "race.\n",
-              "\n",
-              "He emphasized the need to restore their defensive strength and regain confidence in their play. Guardiola is \n",
-              "planning a significant rebuild of the squad to address these challenges, aiming to replace several regular starters\n",
-              "and emphasize improvements in the team's intensity and defensive concepts.\n",
-              "
\n" - ], - "text/plain": [ - "\u001b[1;38;2;212;183;2mFinal answer: Manchester City manager Pep Guardiola has expressed a mix of concern and determination regarding the \u001b[0m\n", - "\u001b[1;38;2;212;183;2mteam's current form. Guardiola admitted that this is the worst run of results in his managerial career and that it \u001b[0m\n", - "\u001b[1;38;2;212;183;2mhas affected his sleep and diet. He described his state of mind as \"ugly\" and acknowledged that City needs to \u001b[0m\n", - "\u001b[1;38;2;212;183;2mdefend better and avoid making mistakes. Despite his personal challenges, Guardiola stated that he is \"fine\" and \u001b[0m\n", - "\u001b[1;38;2;212;183;2mfocused on finding solutions.\u001b[0m\n", - "\n", - "\u001b[1;38;2;212;183;2mGuardiola also took responsibility for the team's struggles, stating he is \"not good enough\" and has to find \u001b[0m\n", - "\u001b[1;38;2;212;183;2msolutions. He expressed self-doubt but is striving to improve the team's situation step by step. Guardiola has \u001b[0m\n", - "\u001b[1;38;2;212;183;2mfaced criticism due to the team's poor form, which has seen them lose several matches and fall behind in the title \u001b[0m\n", - "\u001b[1;38;2;212;183;2mrace.\u001b[0m\n", - "\n", - "\u001b[1;38;2;212;183;2mHe emphasized the need to restore their defensive strength and regain confidence in their play. Guardiola is \u001b[0m\n", - "\u001b[1;38;2;212;183;2mplanning a significant rebuild of the squad to address these challenges, aiming to replace several regular starters\u001b[0m\n", - "\u001b[1;38;2;212;183;2mand emphasize improvements in the team's intensity and defensive concepts.\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
[Step 1: Duration 2.74 seconds| Input tokens: 7,162 | Output tokens: 241]\n",
-              "
\n" - ], - "text/plain": [ - "\u001b[2m[Step 1: Duration 2.74 seconds| Input tokens: 7,162 | Output tokens: 241]\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "query = \"What was manchester city manager pep guardiola's reaction to the team's current form?\"\n", - "\n", - "agent_output = agent.run(query)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Analyzing the Agent\n", - "When the agent runs, smolagents prints out the steps that the agent takes along with the tools called in each step. In the above tool call, two steps occur:\n", - "\n", - "**Step 1**: First, the agent determines that it requires a tool to be used, and the `retriever` tool is called. The agent also specifies the query parameter for the tool (a string). The tool returns semantically similar documents to the query from Couchbase's vector store.\n", - "\n", - "**Step 2**: Next, the agent determines that the context retrieved from the tool is sufficient to answer the question. It then calls the `final_answer` tool, which is predefined for each agent: this tool is called when the agent returns the final answer to the user. In this step, the LLM answers the user's query from the context retrieved in step 1 and passes it to the `final_answer` tool, at which point the agent's execution ends." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Conclusion\n", - "\n", - "By following these steps, you’ll have a fully functional agentic RAG system that leverages the strengths of Couchbase and smolagents, along with OpenAI. This guide is designed not just to show you how to build the system, but also to explain why each step is necessary, giving you a deeper understanding of the principles behind semantic search and how to implement it effectively. Whether you’re a newcomer to software development or an experienced developer looking to expand your skills, this guide will provide you with the knowledge and tools you need to create a powerful, RAG-driven chat system." - ] - } - ], - "metadata": { - "colab": { - "provenance": [], - "toc_visible": true - }, - "kernelspec": { - "display_name": ".venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.11" - } - }, - "nbformat": 4, - "nbformat_minor": 0 -} diff --git a/smolagents/gsi/RAG_with_Couchbase_and_SmolAgents.ipynb b/smolagents/gsi/RAG_with_Couchbase_and_SmolAgents.ipynb deleted file mode 100644 index c661d9bd..00000000 --- a/smolagents/gsi/RAG_with_Couchbase_and_SmolAgents.ipynb +++ /dev/null @@ -1,1038 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Introduction\n", - "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database, [OpenAI](https://openai.com) as the embedding and LLM provider, and [Hugging Face smolagents](https://huggingface.co/docs/smolagents/en/index) as an agent framework. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. 
This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system using GSI (Global Secondary Index) from scratch. Alternatively, if you want to perform semantic search using the FTS index, please take a look at [this tutorial](https://developer.couchbase.com/tutorial-smolagents-couchbase-rag-with-fts/).\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## How to run this tutorial\n", - "\n", - "This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/smolagents/gsi/RAG_with_Couchbase_and_SmolAgents.ipynb).\n", - "\n", - "You can either download the notebook file and run it on [Google Colab](https://colab.research.google.com/) or run it on your system by setting up the Python environment.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Before you start\n", - "### Get Credentials for OpenAI\n", - "Please follow the [instructions](https://platform.openai.com/docs/quickstart) to generate the OpenAI credentials.\n", - "\n", - "### Create and Deploy Your Free Tier Operational cluster on Capella\n", - "\n", - "To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. This account provides you with an environment where you can explore and learn about Capella with no time constraint.\n", - "\n", - "To learn more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n", - "\n", - "Note: To run this tutorial, you will need Capella with Couchbase Server version 8.0 or above, as GSI vector search is supported only from version 8.0.\n", - "\n", - "### Couchbase Capella Configuration\n", - "\n", - "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met.\n", - "\n", - "* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the required bucket (Read and Write) used in the application.\n", - "* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Setting the Stage: Installing Necessary Libraries\n", - "To build our semantic search engine, we need a robust set of tools. The libraries we install handle everything from connecting to databases to performing complex machine learning tasks. Each library has a specific role: Couchbase libraries manage database operations, LangChain handles AI model integrations, and OpenAI provides advanced AI models for generating embeddings and understanding natural language. 
By setting up these libraries, we ensure our environment is equipped to handle the data-intensive and computationally complex tasks required for semantic search.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%pip install --quiet datasets==4.1.1 langchain-couchbase==0.5.0 langchain-openai==0.3.33 python-dotenv==1.1.1 smolagents==1.21.3\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Importing Necessary Libraries\n", - "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, embedding generation, and dataset loading. These libraries provide essential functions for working with data, managing database connections, and processing machine learning models.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "import getpass\n", - "import json\n", - "import logging\n", - "import os\n", - "import time\n", - "from datetime import timedelta\n", - "\n", - "from couchbase.auth import PasswordAuthenticator\n", - "from couchbase.cluster import Cluster\n", - "from couchbase.exceptions import (CouchbaseException,\n", - " InternalServerFailureException,\n", - " QueryIndexAlreadyExistsException)\n", - "from couchbase.management.buckets import CreateBucketSettings\n", - "from couchbase.options import ClusterOptions\n", - "from datasets import load_dataset\n", - "from dotenv import load_dotenv\n", - "from langchain_couchbase.vectorstores import CouchbaseQueryVectorStore\n", - "from langchain_couchbase.vectorstores import DistanceStrategy\n", - "from langchain_couchbase.vectorstores import IndexType\n", - "from langchain_openai import OpenAIEmbeddings\n", - "\n", - "from smolagents import Tool, OpenAIServerModel, ToolCallingAgent\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Setup Logging\n", - "Logging is configured to track the progress of the script and capture any errors or warnings. This is crucial for debugging and understanding the flow of execution. The logging output includes timestamps, log levels (e.g., INFO, ERROR), and messages that describe what is happening in the script.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [], - "source": [ - "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', force=True)\n", - "\n", - "# Set the httpx logger to CRITICAL to suppress OpenAI API request logs\n", - "logging.getLogger(\"httpx\").setLevel(logging.CRITICAL)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Loading Sensitive Information\n", - "In this section, we prompt the user to input essential configuration settings. These settings include sensitive information like API keys, database credentials, and specific configuration names. Instead of hardcoding these details into the script, we request the user to provide them at runtime, ensuring flexibility and security.\n", - "\n", - "The script also validates that all required inputs are provided, raising an error if any crucial information is missing. 
This approach ensures that your integration is both secure and correctly configured without hardcoding sensitive information, enhancing the overall security and maintainability of your code.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [], - "source": [ - "load_dotenv()\n", - "\n", - "OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or getpass.getpass('Enter your OpenAI API Key: ')\n", - "\n", - "CB_HOST = os.getenv('CB_HOST') or input('Enter your Couchbase host (default: couchbase://localhost): ') or 'couchbase://localhost'\n", - "CB_USERNAME = os.getenv('CB_USERNAME') or input('Enter your Couchbase username (default: Administrator): ') or 'Administrator'\n", - "CB_PASSWORD = os.getenv('CB_PASSWORD') or getpass.getpass('Enter your Couchbase password (default: password): ') or 'password'\n", - "CB_BUCKET_NAME = os.getenv('CB_BUCKET_NAME') or input('Enter your Couchbase bucket name (default: query-vector-search-testing): ') or 'query-vector-search-testing'\n", - "SCOPE_NAME = os.getenv('SCOPE_NAME') or input('Enter your scope name (default: shared): ') or 'shared'\n", - "COLLECTION_NAME = os.getenv('COLLECTION_NAME') or input('Enter your collection name (default: smolagents): ') or 'smolagents'\n", - "\n", - "# Check if the variables are correctly loaded\n", - "if not OPENAI_API_KEY:\n", - " raise ValueError(\"Missing OpenAI API Key\")\n", - "\n", - "if 'OPENAI_API_KEY' not in os.environ:\n", - " os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Connecting to the Couchbase Cluster\n", - "Connecting to a Couchbase cluster is the foundation of our project. Couchbase will serve as our primary data store, handling all the storage and retrieval operations required for our semantic search engine. By establishing this connection, we enable our application to interact with the database, allowing us to perform operations such as storing embeddings, querying data, and managing collections. This connection is the gateway through which all data will flow, so ensuring it's set up correctly is paramount.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-11-07 16:44:51,506 - INFO - Successfully connected to Couchbase\n" - ] - } - ], - "source": [ - "try:\n", - " auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)\n", - " options = ClusterOptions(auth)\n", - " cluster = Cluster(CB_HOST, options)\n", - " cluster.wait_until_ready(timedelta(seconds=5))\n", - " logging.info(\"Successfully connected to Couchbase\")\n", - "except Exception as e:\n", - " raise ConnectionError(f\"Failed to connect to Couchbase: {str(e)}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Setting Up Collections in Couchbase\n", - "\n", - "The setup_collection() function handles creating and configuring the hierarchical data organization in Couchbase:\n", - "\n", - "1. Bucket Creation:\n", - " - Checks if specified bucket exists, creates it if not\n", - " - Sets bucket properties like RAM quota (1024MB) and replication (disabled)\n", - " - Note: You will not be able to create a bucket on Capella\n", - "\n", - "2. Scope Management: \n", - " - Verifies if requested scope exists within bucket\n", - " - Creates new scope if needed (unless it's the default \"_default\" scope)\n", - "\n", - "3. 
Collection Setup:\n", - " - Checks for collection existence within scope\n", - " - Creates collection if it doesn't exist\n", - " - Waits 2 seconds for collection to be ready\n", - "\n", - "Additional Tasks:\n", - "- Clears any existing documents for clean state\n", - "- Implements comprehensive error handling and logging\n" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-11-07 16:44:53,519 - INFO - Bucket 'travel-sample' exists.\n", - "2025-11-07 16:44:53,527 - INFO - Collection 'smolagents' does not exist. Creating it...\n", - "2025-11-07 16:44:53,575 - INFO - Collection 'smolagents' created successfully.\n", - "2025-11-07 16:44:55,731 - INFO - All documents cleared from the collection.\n" - ] - }, - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "def setup_collection(cluster, bucket_name, scope_name, collection_name):\n", - " try:\n", - " # Check if bucket exists, create if it doesn't\n", - " try:\n", - " bucket = cluster.bucket(bucket_name)\n", - " logging.info(f\"Bucket '{bucket_name}' exists.\")\n", - " except Exception as e:\n", - " logging.info(f\"Bucket '{bucket_name}' does not exist. Creating it...\")\n", - " bucket_settings = CreateBucketSettings(\n", - " name=bucket_name,\n", - " bucket_type='couchbase',\n", - " ram_quota_mb=1024,\n", - " flush_enabled=True,\n", - " num_replicas=0\n", - " )\n", - " cluster.buckets().create_bucket(bucket_settings)\n", - " time.sleep(2) # Wait for bucket creation to complete and become available\n", - " bucket = cluster.bucket(bucket_name)\n", - " logging.info(f\"Bucket '{bucket_name}' created successfully.\")\n", - "\n", - " bucket_manager = bucket.collections()\n", - "\n", - " # Check if scope exists, create if it doesn't\n", - " scopes = bucket_manager.get_all_scopes()\n", - " scope_exists = any(scope.name == scope_name for scope in scopes)\n", - " \n", - " if not scope_exists and scope_name != \"_default\":\n", - " logging.info(f\"Scope '{scope_name}' does not exist. Creating it...\")\n", - " bucket_manager.create_scope(scope_name)\n", - " logging.info(f\"Scope '{scope_name}' created successfully.\")\n", - "\n", - " # Check if collection exists, create if it doesn't\n", - " collections = bucket_manager.get_all_scopes()\n", - " collection_exists = any(\n", - " scope.name == scope_name and collection_name in [col.name for col in scope.collections]\n", - " for scope in collections\n", - " )\n", - "\n", - " if not collection_exists:\n", - " logging.info(f\"Collection '{collection_name}' does not exist. Creating it...\")\n", - " bucket_manager.create_collection(scope_name, collection_name)\n", - " logging.info(f\"Collection '{collection_name}' created successfully.\")\n", - " else:\n", - " logging.info(f\"Collection '{collection_name}' already exists. Skipping creation.\")\n", - "\n", - " # Wait for collection to be ready\n", - " collection = bucket.scope(scope_name).collection(collection_name)\n", - " time.sleep(2) # Give the collection time to be ready for queries\n", - "\n", - " # Clear all documents in the collection\n", - " try:\n", - " query = f\"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`\"\n", - " cluster.query(query).execute()\n", - " logging.info(\"All documents cleared from the collection.\")\n", - " except Exception as e:\n", - " logging.warning(f\"Error while clearing documents: {str(e)}. 
The collection might be empty.\")\n", - "\n", - " return collection\n", - " except Exception as e:\n", - " raise RuntimeError(f\"Error setting up collection: {str(e)}\")\n", - " \n", - "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Creating OpenAI Embeddings\n", - "Embeddings are at the heart of semantic search. They are numerical representations of text that capture the semantic meaning of the words and phrases. Unlike traditional keyword-based search, which looks for exact matches, embeddings allow our search engine to understand the context and nuances of language, enabling it to retrieve documents that are semantically similar to the query, even if they don't contain the exact keywords. By creating embeddings using OpenAI, we equip our search engine with the ability to understand and process natural language in a way that's much closer to how humans understand language. This step transforms our raw text data into a format that the search engine can use to find and rank relevant documents.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-11-07 16:44:58,634 - INFO - Successfully created OpenAIEmbeddings\n" - ] - } - ], - "source": [ - "try:\n", - " embeddings = OpenAIEmbeddings(\n", - " model=\"text-embedding-3-small\",\n", - " api_key=OPENAI_API_KEY,\n", - " )\n", - " logging.info(\"Successfully created OpenAIEmbeddings\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Error creating OpenAIEmbeddings: {str(e)}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Understanding GSI Vector Search\n", - "\n", - "### Optimizing Vector Search with Global Secondary Index (GSI)\n", - "\n", - "With Couchbase 8.0+, you can leverage the power of GSI-based vector search, which offers significant performance improvements over traditional Full-Text Search (FTS) approaches for vector-first workloads. 
GSI vector search provides high-performance vector similarity search with advanced filtering capabilities and is designed to scale to billions of vectors.\n", - "\n", - "#### GSI vs FTS: Choosing the Right Approach\n", - "\n", - "| Feature | GSI Vector Search | FTS Vector Search |\n", - "| --------------------- | --------------------------------------------------------------- | ----------------------------------------- |\n", - "| **Best For** | Vector-first workloads, complex filtering, high QPS performance| Hybrid search and high recall rates |\n", - "| **Couchbase Version** | 8.0.0+ | 7.6+ |\n", - "| **Filtering** | Pre-filtering with `WHERE` clauses (Composite) or post-filtering (BHIVE) | Pre-filtering with flexible ordering |\n", - "| **Scalability** | Up to billions of vectors (BHIVE) | Up to 10 million vectors |\n", - "| **Performance** | Optimized for concurrent operations with low memory footprint | Good for mixed text and vector queries |\n", - "\n", - "\n", - "#### GSI Vector Index Types\n", - "\n", - "Couchbase offers two distinct GSI vector index types, each optimized for different use cases:\n", - "\n", - "##### Hyperscale Vector Indexes (BHIVE)\n", - "\n", - "- **Best for**: Pure vector searches like content discovery, recommendations, and semantic search\n", - "- **Use when**: You primarily perform vector-only queries without complex scalar filtering\n", - "- **Features**: \n", - " - High performance with low memory footprint\n", - " - Optimized for concurrent operations\n", - " - Designed to scale to billions of vectors\n", - " - Supports post-scan filtering for basic metadata filtering\n", - "\n", - "##### Composite Vector Indexes\n", - "\n", - "- **Best for**: Filtered vector searches that combine vector similarity with scalar value filtering\n", - "- **Use when**: Your queries combine vector similarity with scalar filters that eliminate large portions of data\n", - "- **Features**: \n", - " - Efficient pre-filtering where scalar attributes reduce the vector comparison scope\n", - " - Best for well-defined workloads requiring complex filtering using GSI features\n", - " - Supports range lookups combined with vector search\n", - "\n", - "#### Index Type Selection for This Tutorial\n", - "\n", - "In this tutorial, we'll demonstrate creating a **BHIVE index** and running vector similarity queries using GSI. BHIVE is ideal for semantic search scenarios where you want:\n", - "\n", - "1. **High-performance vector search** across large datasets\n", - "2. **Low latency** for real-time applications\n", - "3. **Scalability** to handle growing vector collections\n", - "4. 
**Concurrent operations** for multi-user environments\n", - "\n", - "The BHIVE index will provide optimal performance for our OpenAI embedding-based semantic search implementation.\n", - "\n", - "#### Alternative: Composite Vector Index\n", - "\n", - "If your use case requires complex filtering with scalar attributes, you may want to consider using a **Composite Vector Index** instead:\n", - "\n", - "```python\n", - "# Alternative: Create a Composite index for filtered searches\n", - "vector_store.create_index(\n", - " index_type=IndexType.COMPOSITE,\n", - " index_description=\"IVF,SQ8\",\n", - " distance_metric=DistanceStrategy.COSINE,\n", - " index_name=\"smolagents_composite_index\",\n", - ")\n", - "```\n", - "\n", - "**Use Composite indexes when:**\n", - "- You need to filter by document metadata or attributes before vector similarity\n", - "- Your queries combine vector search with WHERE clauses\n", - "- You have well-defined filtering requirements that can reduce the search space\n", - "\n", - "**Note**: Composite indexes enable pre-filtering with scalar attributes, making them ideal for applications where you need to search within specific categories, date ranges, or user-specific data segments.\n", - "\n", - "#### Understanding GSI Index Configuration (Couchbase 8.0 Feature)\n", - "\n", - "Before creating our BHIVE index, it's important to understand the configuration parameters that optimize vector storage and search performance. The `index_description` parameter controls how Couchbase optimizes vector storage through centroids and quantization.\n", - "\n", - "##### Index Description Format: `'IVF[<centroids>],{PQ|SQ}'`\n", - "\n", - "##### Centroids (IVF - Inverted File)\n", - "\n", - "- Controls how the dataset is subdivided for faster searches\n", - "- **More centroids** = faster search, slower training time\n", - "- **Fewer centroids** = slower search, faster training time\n", - "- If omitted (like `IVF,SQ8`), Couchbase auto-selects based on dataset size\n", - "\n", - "##### Quantization Options\n", - "\n", - "**Scalar Quantization (SQ):**\n", - "- `SQ4`, `SQ6`, `SQ8` (4, 6, or 8 bits per dimension)\n", - "- Lower memory usage, faster search, slightly reduced accuracy\n", - "\n", - "**Product Quantization (PQ):**\n", - "- Format: `PQ<m>x<bits>` (e.g., `PQ32x8`)\n", - "- Better compression for very large datasets\n", - "- More complex but can maintain accuracy with smaller index size\n", - "\n", - "##### Common Configuration Examples\n", - "\n", - "- **`IVF,SQ8`** - Auto centroids, 8-bit scalar quantization (good default)\n", - "- **`IVF1000,SQ6`** - 1000 centroids, 6-bit scalar quantization\n", - "- **`IVF,PQ32x8`** - Auto centroids, 32 subquantizers with 8 bits\n", - "\n", - "For detailed configuration options, see the [Quantization & Centroid Settings](https://docs.couchbase.com/cloud/vector-index/hyperscale-vector-index.html#algo_settings).\n", - "\n", - "For more information on GSI vector indexes, see [Couchbase GSI Vector Documentation](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html).\n", - "\n", - "##### Our Configuration Choice\n", - "\n", - "In this tutorial, we use `IVF,SQ8`, which provides:\n", - "- **Auto-selected centroids** optimized for our dataset size\n", - "- **8-bit scalar quantization** for a good balance of speed, memory usage, and accuracy\n", - "- **COSINE distance metric** ideal for semantic similarity search\n", - "- **Optimal performance** for most semantic search use cases" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 
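Index Configuration Example\n",
- "\n",
- "To make the `index_description` options above concrete, here is a minimal, illustrative sketch of creating a BHIVE index with an explicit centroid count, once the vector store is set up in the next section. The index name `smolagents_bhive_ivf1000` is hypothetical, and the index this tutorial actually creates later uses the `IVF,SQ8` default instead:\n",
- "\n",
- "```python\n",
- "# Illustrative only: 1000 centroids with 6-bit scalar quantization,\n",
- "# trading longer training time for faster scans on larger datasets.\n",
- "vector_store.create_index(\n",
- "    index_type=IndexType.BHIVE,\n",
- "    index_name=\"smolagents_bhive_ivf1000\",\n",
- "    index_description=\"IVF1000,SQ6\",\n",
- ")\n",
- "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 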
Setting Up the Couchbase Query Vector Store\n", - "A vector store is where we'll keep our embeddings. The query vector store is specifically designed to handle embeddings and perform similarity searches. When a user inputs a query, the query is converted into an embedding and compared against the embeddings stored in the vector store. This allows the engine to find documents that are semantically similar to the query, even if they don't contain the exact same words. By setting up the vector store in Couchbase, we create a powerful tool that enables us to understand and retrieve information based on the meaning and context of the query, rather than just the specific words used.\n", - "\n", - "The vector store requires a distance metric to determine how similarity between vectors is calculated. This is crucial for accurate semantic search results, as different distance metrics can yield different similarity rankings. Some of the supported distance strategies are dot, l2, euclidean, cosine, l2_squared, and euclidean_squared. In our implementation, we will use cosine, which is particularly effective for text embeddings.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "try:\n", - " vector_store = CouchbaseQueryVectorStore(\n", - " cluster=cluster,\n", - " bucket_name=CB_BUCKET_NAME,\n", - " scope_name=SCOPE_NAME,\n", - " collection_name=COLLECTION_NAME,\n", - " embedding=embeddings,\n", - " distance_metric=DistanceStrategy.COSINE\n", - " )\n", - " logging.info(\"Successfully created vector store\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Failed to create vector store: {str(e)}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Load the BBC News Dataset\n", - "To build a search engine, we need data to search through. We use the BBC News dataset from RealTimeData, which provides real-world news articles. This dataset contains news articles from BBC covering various topics and time periods. Loading the dataset is a crucial step because it provides the raw material that our search engine will work with. The quality and diversity of the news articles make it an excellent choice for testing and refining our search engine, ensuring it can handle real-world news content effectively.\n", - "\n", - "The BBC News dataset allows us to work with authentic news articles, enabling us to build and test a search engine that can effectively process and retrieve relevant news content. The dataset is loaded using the Hugging Face datasets library, specifically accessing the \"RealTimeData/bbc_news_alltime\" dataset with the \"2024-12\" version.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "try:\n", - " news_dataset = load_dataset(\n", - " \"RealTimeData/bbc_news_alltime\", \"2024-12\", split=\"train\"\n", - " )\n", - " print(f\"Loaded the BBC News dataset with {len(news_dataset)} rows\")\n", - " logging.info(f\"Successfully loaded the BBC News dataset with {len(news_dataset)} rows.\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Error loading the BBC News dataset: {str(e)}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Cleaning up the Data\n", - "We will use the content of the news articles for our RAG system.\n", - "\n", - "The dataset contains a few duplicate records. 
We are removing them to avoid duplicate results in the retrieval stage of our RAG system.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "We have 1749 unique articles in our database.\n" - ] - } - ], - "source": [ - "news_articles = news_dataset[\"content\"]\n", - "unique_articles = set()\n", - "for article in news_articles:\n", - " if article:\n", - " unique_articles.add(article)\n", - "unique_news_articles = list(unique_articles)\n", - "print(f\"We have {len(unique_news_articles)} unique articles in our database.\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Saving Data to the Vector Store\n", - "To efficiently handle the large number of articles, we process them in batches. This batch processing approach helps manage memory usage and provides better control over the ingestion process.\n", - "\n", - "We first filter out any articles that exceed 50,000 characters to avoid potential issues with token limits. Then, using the vector store's add_texts method, we add the filtered articles to our vector database. The batch_size parameter controls how many articles are processed in each iteration.\n", - "\n", - "This approach offers several benefits:\n", - "1. Memory Efficiency: Processing in smaller batches prevents memory overload\n", - "2. Progress Tracking: Easier to monitor and track the ingestion progress\n", - "3. Resource Management: Better control over CPU and network resource utilization\n", - "\n", - "We use a conservative batch size of 100 to ensure reliable operation.\n", - "The optimal batch size depends on many factors, including:\n", - "- Document sizes being inserted\n", - "- Available system resources\n", - "- Network conditions\n", - "- Concurrent workload\n", - "\n", - "Consider measuring performance with your specific workload before adjusting.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-11-07 16:46:18,967 - INFO - Document ingestion completed successfully.\n" - ] - } - ], - "source": [ - "batch_size = 100\n", - "\n", - "articles = [article for article in unique_news_articles if article and len(article) <= 50000]\n", - "\n", - "try:\n", - " vector_store.add_texts(\n", - " texts=articles,\n", - " batch_size=batch_size\n", - " )\n", - " logging.info(\"Document ingestion completed successfully.\")\n", - "except Exception as e:\n", - " raise ValueError(f\"Failed to save documents to vector store: {str(e)}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Perform Semantic Search\n", - "Semantic search in Couchbase involves converting queries and documents into vector representations using an embeddings model. These vectors capture the semantic meaning of the text and are stored directly in Couchbase. When a query is made, Couchbase performs a similarity search by comparing the query vector against the stored document vectors. The similarity metric used for this comparison is configurable, allowing flexibility in how the relevance of documents is determined. Common metrics include cosine similarity, Euclidean distance, or dot product, but other metrics can be implemented based on specific use cases.\n",
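- "\n",
- "As a small illustration of that configurability, the same collection could be wrapped in a second store that ranks by dot product instead of cosine. This is a hypothetical sketch, assuming the `DistanceStrategy` enum exposes a `DOT` member for the dot strategy listed earlier; it is not used elsewhere in this tutorial.\n",
- "\n",
- "```python\n",
- "# Hypothetical alternative store: identical data, different distance metric.\n",
- "alt_vector_store = CouchbaseQueryVectorStore(\n",
- "    cluster=cluster,\n",
- "    bucket_name=CB_BUCKET_NAME,\n",
- "    scope_name=SCOPE_NAME,\n",
- "    collection_name=COLLECTION_NAME,\n",
- "    embedding=embeddings,\n",
- "    distance_metric=DistanceStrategy.DOT,\n",
- ")\n",
- "```\n",
- "\n",
- "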
Different embedding models like BERT, Word2Vec, or GloVe can also be used depending on the application's needs, with the vectors generated by these models stored and searched within Couchbase itself.\n", - "\n", - "In the provided code, the search process begins by recording the start time, followed by executing the `similarity_search_with_score` method of the `CouchbaseQueryVectorStore`. This method searches Couchbase for the most relevant documents based on the vector similarity to the query. The search results include the document content and the distance that reflects how closely each document aligns with the query in the defined semantic space. The time taken to perform this search is then calculated and logged, and the results are displayed, showing the most relevant documents along with their similarity scores. This approach leverages Couchbase as both a storage and retrieval engine for vector data, enabling efficient and scalable semantic searches. The integration of vector storage and search capabilities within Couchbase allows for sophisticated semantic search operations without relying on external services for vector storage or comparison.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Vector Search Performance Optimization\n", - "\n", - "Now let's measure and compare the performance benefits of different optimization strategies. We'll conduct a comprehensive performance analysis across two phases:\n", - "\n", - "## Performance Testing Phases\n", - "\n", - "1. **Phase 1 - Baseline Performance**: Test vector search without GSI indexes to establish baseline metrics\n", - "2. **Phase 2 - GSI-Optimized Search**: Create BHIVE index and measure performance improvements\n", - "\n", - "**Important Context:**\n", - "- GSI performance benefits scale with dataset size and concurrent load\n", - "- With our dataset (~1,700 articles), improvements may be modest\n", - "- Production environments with millions of vectors show significant GSI advantages\n", - "- The combination of GSI + LLM caching provides optimal RAG performance\n" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "================================================================================\n", - "PHASE 1: BASELINE PERFORMANCE (NO GSI INDEX)\n", - "================================================================================\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-11-07 16:46:24,561 - INFO - Semantic search completed in 1.34 seconds\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Semantic Search Results (completed in 1.34 seconds):\n", - "--------------------------------------------------------------------------------\n", - "Vector Distance: 0.2956, Text: Manchester City boss Pep Guardiola has won 18 trophies since he arrived at the club in 2016\n", - "\n", - "Manchester City boss Pep Guardiola says he is \"fine\" despite admitting his sleep and diet are being affected by the worst run of results in his entire managerial career. In an interview with former Italy international Luca Toni for Amazon Prime Sport before Wednesday's Champions League defeat by Juventus, Guardiola touched on the personal impact City's sudden downturn in form has had. Guardiola said his state of mind was \"ugly\", that his sleep was \"worse\" and he was eating lighter as his digestion had suffered. 
City go into Sunday's derby against Manchester United at Etihad Stadium having won just one of their past 10 games. The Juventus loss means there is a chance they may not even secure a play-off spot in the Champions League. Asked to elaborate on his comments to Toni, Guardiola said: \"I'm fine. \"In our jobs we always want to do our best or the best as possible. When that doesn't happen you are more uncomfortable than when the situation is going well, always that happened. \"In good moments I am happier but when I get to the next game I am still concerned about what I have to do. There is no human being that makes an activity and it doesn't matter how they do.\" Guardiola said City have to defend better and \"avoid making mistakes at both ends\". To emphasise his point, Guardiola referred back to the third game of City's current run, against a Sporting side managed by Ruben Amorim, who will be in the United dugout at the weekend. City dominated the first half in Lisbon, led thanks to Phil Foden's early effort and looked to be cruising. Instead, they conceded three times in 11 minutes either side of half-time as Sporting eventually ran out 4-1 winners. \"I would like to play the game like we played in Lisbon on Sunday, believe me,\" said Guardiola, who is facing the prospect of only having three fit defenders for the derby as Nathan Ake and Manuel Akanji try to overcome injury concerns. If there is solace for City, it comes from the knowledge United are not exactly flying. Their comeback Europa League victory against Viktoria Plzen on Thursday was their third win of Amorim's short reign so far but only one of those successes has come in the Premier League, where United have lost their past two games against Arsenal and Nottingham Forest. Nevertheless, Guardiola can see improvements already on the red side of the city. \"It's already there,\" he said. \"You see all the patterns, the movements, the runners and the pace. He will do a good job at United, I'm pretty sure of that.\"\n", - "\n", - "Guardiola says skipper Kyle Walker has been offered support by the club after the City defender highlighted the racial abuse he had received on social media in the wake of the Juventus trip. \"It's unacceptable,\" he said. \"Not because it's Kyle - for any human being. \"Unfortunately it happens many times in the real world. It is not necessary to say he has the support of the entire club. 
It is completely unacceptable and we give our support to him.\"\n", - "--------------------------------------------------------------------------------\n", - "Vector Distance: 0.3100, Text: Pep Guardiola has said Manchester City will be his final managerial job in club football before he \"maybe\" coaches a national team.\n", - "\n", - "--------------------------------------------------------------------------------\n" - ] - } - ], - "source": [ - "# Phase 1: Baseline Performance (Without GSI Index)\n", - "print(\"=\"*80)\n", - "print(\"PHASE 1: BASELINE PERFORMANCE (NO GSI INDEX)\")\n", - "print(\"=\"*80)\n", - "\n", - "query = \"What was manchester city manager pep guardiola's reaction to the team's current form?\"\n", - "\n", - "try:\n", - " # Perform the semantic search\n", - " start_time = time.time()\n", - " search_results = vector_store.similarity_search_with_score(query, k=10)\n", - " baseline_time = time.time() - start_time\n", - "\n", - " logging.info(f\"Semantic search completed in {baseline_time:.2f} seconds\")\n", - "\n", - " # Display search results\n", - " print(f\"\\nSemantic Search Results (completed in {baseline_time:.2f} seconds):\")\n", - " print(\"-\" * 80) # Add separator line\n", - " for doc, distance in search_results:\n", - " print(f\"Vector Distance: {distance:.4f}, Text: {doc.page_content}\")\n", - " print(\"-\" * 80) # Add separator between results\n", - "\n", - "except CouchbaseException as e:\n", - " raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n", - "except Exception as e:\n", - " raise RuntimeError(f\"Unexpected error: {str(e)}\")\n" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [], - "source": [ - "vector_store.create_index(index_type=IndexType.BHIVE, index_name=\"smolagents_bhive_index\", index_description=\"IVF,SQ8\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Note: To create a COMPOSITE index, the code below can be used.\n", - "Choose based on your specific use case and query patterns. For this tutorial's news search scenario, either index type would work, but BHIVE might be more efficient for pure semantic search across news articles.\n", - "\n", - "vector_store.create_index(index_type=IndexType.COMPOSITE, index_name=\"smolagents_composite_index\", index_description=\"IVF,SQ8\")" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "================================================================================\n", - "PHASE 2: GSI-OPTIMIZED PERFORMANCE (WITH BHIVE INDEX)\n", - "================================================================================\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2025-11-07 16:47:01,538 - INFO - Semantic search completed in 0.42 seconds\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Semantic Search Results (completed in 0.42 seconds):\n", - "--------------------------------------------------------------------------------\n", - "Vector Distance: 0.2956, Text: Manchester City boss Pep Guardiola has won 18 trophies since he arrived at the club in 2016\n", - "\n", - "Manchester City boss Pep Guardiola says he is \"fine\" despite admitting his sleep and diet are being affected by the worst run of results in his entire managerial career. 
In an interview with former Italy international Luca Toni for Amazon Prime Sport before Wednesday's Champions League defeat by Juventus, Guardiola touched on the personal impact City's sudden downturn in form has had. Guardiola said his state of mind was \"ugly\", that his sleep was \"worse\" and he was eating lighter as his digestion had suffered. City go into Sunday's derby against Manchester United at Etihad Stadium having won just one of their past 10 games. The Juventus loss means there is a chance they may not even secure a play-off spot in the Champions League. Asked to elaborate on his comments to Toni, Guardiola said: \"I'm fine. \"In our jobs we always want to do our best or the best as possible. When that doesn't happen you are more uncomfortable than when the situation is going well, always that happened. \"In good moments I am happier but when I get to the next game I am still concerned about what I have to do. There is no human being that makes an activity and it doesn't matter how they do.\" Guardiola said City have to defend better and \"avoid making mistakes at both ends\". To emphasise his point, Guardiola referred back to the third game of City's current run, against a Sporting side managed by Ruben Amorim, who will be in the United dugout at the weekend. City dominated the first half in Lisbon, led thanks to Phil Foden's early effort and looked to be cruising. Instead, they conceded three times in 11 minutes either side of half-time as Sporting eventually ran out 4-1 winners. \"I would like to play the game like we played in Lisbon on Sunday, believe me,\" said Guardiola, who is facing the prospect of only having three fit defenders for the derby as Nathan Ake and Manuel Akanji try to overcome injury concerns. If there is solace for City, it comes from the knowledge United are not exactly flying. Their comeback Europa League victory against Viktoria Plzen on Thursday was their third win of Amorim's short reign so far but only one of those successes has come in the Premier League, where United have lost their past two games against Arsenal and Nottingham Forest. Nevertheless, Guardiola can see improvements already on the red side of the city. \"It's already there,\" he said. \"You see all the patterns, the movements, the runners and the pace. He will do a good job at United, I'm pretty sure of that.\"\n", - "\n", - "Guardiola says skipper Kyle Walker has been offered support by the club after the City defender highlighted the racial abuse he had received on social media in the wake of the Juventus trip. \"It's unacceptable,\" he said. \"Not because it's Kyle - for any human being. \"Unfortunately it happens many times in the real world. It is not necessary to say he has the support of the entire club. 
It is completely unacceptable and we give our support to him.\"\n", - "--------------------------------------------------------------------------------\n", - "Vector Distance: 0.3100, Text: Pep Guardiola has said Manchester City will be his final managerial job in club football before he \"maybe\" coaches a national team.\n", - "--------------------------------------------------------------------------------\n" - ] - } - ], - "source": [ - "# Phase 2: GSI-Optimized Performance (With BHIVE Index)\n", - "print(\"\\n\" + \"=\"*80)\n", - "print(\"PHASE 2: GSI-OPTIMIZED PERFORMANCE (WITH BHIVE INDEX)\")\n", - "print(\"=\"*80)\n", - "\n", - "query = \"What was manchester city manager pep guardiola's reaction to the team's current form?\"\n", - "\n", - "try:\n", - " # Perform the semantic search\n", - " start_time = time.time()\n", - " search_results = vector_store.similarity_search_with_score(query, k=10)\n", - " gsi_time = time.time() - start_time\n", - "\n", - " logging.info(f\"Semantic search completed in {gsi_time:.2f} seconds\")\n", - "\n", - " # Display search results\n", - " print(f\"\\nSemantic Search Results (completed in {gsi_time:.2f} seconds):\")\n", - " print(\"-\" * 80) # Add separator line\n", - " for doc, distance in search_results:\n", - " print(f\"Vector Distance: {distance:.4f}, Text: {doc.page_content}\")\n", - " print(\"-\" * 80) # Add separator between results\n", - "\n", - "except CouchbaseException as e:\n", - " raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n", - "except Exception as e:\n", - " raise RuntimeError(f\"Unexpected error: {str(e)}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Performance Analysis Summary\n", - "\n", - "Let's analyze the performance improvements we've achieved through different optimization strategies:\n" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "================================================================================\n", - "VECTOR SEARCH PERFORMANCE OPTIMIZATION SUMMARY\n", - "================================================================================\n", - "\n", - "📊 Performance Comparison:\n", - "Optimization Level Time (seconds) Status\n", - "--------------------------------------------------------------------------------\n", - "Phase 1 - Baseline (No Index) 1.3410 ⚪ Baseline\n", - "Phase 2 - GSI-Optimized (BHIVE) 0.4157 ✅ Optimized\n", - "\n", - "✨ GSI Performance Gain: 3.23x faster (69.0% improvement)\n", - "\n", - "--------------------------------------------------------------------------------\n", - "KEY INSIGHTS:\n", - "--------------------------------------------------------------------------------\n", - "1. 🚀 GSI Optimization:\n", - " • BHIVE indexes excel with large-scale datasets (millions+ vectors)\n", - " • Performance gains increase with dataset size and concurrent queries\n", - " • Optimal for production workloads with sustained traffic patterns\n", - "\n", - "2. 📦 Dataset Size Impact:\n", - " • Current dataset: ~1,700 articles\n", - " • At this scale, performance differences may be minimal or variable\n", - " • Significant gains typically seen with 10M+ vectors\n", - "\n", - "3. 
🎯 When to Use GSI:\n", - " • Large-scale vector search applications\n", - " • High query-per-second (QPS) requirements\n", - " • Multi-user concurrent access scenarios\n", - " • Production environments requiring scalability\n", - "\n", - "================================================================================\n" - ] - } - ], - "source": [ - "print(\"\\n\" + \"=\"*80)\n", - "print(\"VECTOR SEARCH PERFORMANCE OPTIMIZATION SUMMARY\")\n", - "print(\"=\"*80)\n", - "\n", - "print(f\"\\n📊 Performance Comparison:\")\n", - "print(f\"{'Optimization Level':<35} {'Time (seconds)':<20} {'Status'}\")\n", - "print(\"-\" * 80)\n", - "print(f\"{'Phase 1 - Baseline (No Index)':<35} {baseline_time:.4f}{'':16} ⚪ Baseline\")\n", - "print(f\"{'Phase 2 - GSI-Optimized (BHIVE)':<35} {gsi_time:.4f}{'':16} ✅ Optimized\")\n", - "\n", - "# Calculate improvement\n", - "if baseline_time > gsi_time:\n", - " speedup = baseline_time / gsi_time\n", - " improvement = ((baseline_time - gsi_time) / baseline_time) * 100\n", - " print(f\"\\n✨ GSI Performance Gain: {speedup:.2f}x faster ({improvement:.1f}% improvement)\")\n", - "elif gsi_time > baseline_time:\n", - " slowdown_pct = ((gsi_time - baseline_time) / baseline_time) * 100\n", - " print(f\"\\n⚠️ Note: GSI was {slowdown_pct:.1f}% slower than baseline in this run\")\n", - " print(f\" This can happen with small datasets. GSI benefits emerge with scale.\")\n", - "else:\n", - " print(f\"\\n⚖️ Performance: Comparable to baseline\")\n", - "\n", - "print(\"\\n\" + \"-\"*80)\n", - "print(\"KEY INSIGHTS:\")\n", - "print(\"-\"*80)\n", - "print(\"1. 🚀 GSI Optimization:\")\n", - "print(\" • BHIVE indexes excel with large-scale datasets (millions+ vectors)\")\n", - "print(\" • Performance gains increase with dataset size and concurrent queries\")\n", - "print(\" • Optimal for production workloads with sustained traffic patterns\")\n", - "\n", - "print(\"\\n2. 📦 Dataset Size Impact:\")\n", - "print(f\" • Current dataset: ~1,700 articles\")\n", - "print(\" • At this scale, performance differences may be minimal or variable\")\n", - "print(\" • Significant gains typically seen with 10M+ vectors\")\n", - "\n", - "print(\"\\n3. 🎯 When to Use GSI:\")\n", - "print(\" • Large-scale vector search applications\")\n", - "print(\" • High query-per-second (QPS) requirements\")\n", - "print(\" • Multi-user concurrent access scenarios\")\n", - "print(\" • Production environments requiring scalability\")\n", - "\n", - "print(\"\\n\" + \"=\"*80)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## smolagents: An Introduction\n", - "[smolagents](https://huggingface.co/docs/smolagents/en/index) is a agentic framework by Hugging Face for easy creation of agents in a few lines of code.\n", - "\n", - "Some of the features of smolagents are:\n", - "\n", - "- ✨ Simplicity: the logic for agents fits in ~1,000 lines of code (see agents.py). We kept abstractions to their minimal shape above raw code!\n", - "\n", - "- 🧑‍💻 First-class support for Code Agents. Our CodeAgent writes its actions in code (as opposed to \"agents being used to write code\"). To make it secure, we support executing in sandboxed environments via E2B.\n", - "\n", - "- 🤗 Hub integrations: you can share/pull tools to/from the Hub, and more is to come!\n", - "\n", - "- 🌐 Model-agnostic: smolagents supports any LLM. 
It can be a local transformers or ollama model, one of many providers on the Hub, or any model from OpenAI, Anthropic and many others via our LiteLLM integration.\n", - "\n", - "- 👁️ Modality-agnostic: Agents support text, vision, video, even audio inputs! Cf this tutorial for vision.\n", - "\n", - "- 🛠️ Tool-agnostic: you can use tools from LangChain, Anthropic's MCP, you can even use a Hub Space as a tool.\n", - "\n", - "# Building a RAG Agent using smolagents\n", - "\n", - "smolagents allows users to define their own tools for the agent to use. These tools can be of two types:\n", - "1. Tools defined as classes: These tools are subclassed from the `Tool` class and must override the `forward` method, which is called when the tool is used.\n", - "2. Tools defined as functions: These are simple functions that are called when the tool is used, and are decorated with the `@tool` decorator.\n", - "\n", - "In our case, we will use the first method, and we define our `RetrieverTool` below. We define a name, a description and a dictionary of inputs that the tool accepts. This helps the LLM properly identify and use the tool.\n", - "\n", - "The `RetrieverTool` is simple: it takes a query generated by the user, and uses Couchbase's performant vector search service under the hood to search for semantically similar documents to the query. The LLM can then use this context to answer the user's question.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [], - "source": [ - "class RetrieverTool(Tool):\n", - " name = \"retriever\"\n", - " description = \"Uses semantic search to retrieve the parts of news documentation that could be most relevant to answer your query.\"\n", - " inputs = {\n", - " \"query\": {\n", - " \"type\": \"string\",\n", - " \"description\": \"The query to perform. This should be semantically close to your target documents. Use the affirmative form rather than a question.\",\n", - " }\n", - " }\n", - " output_type = \"string\"\n", - "\n", - " def __init__(self, vector_store: CouchbaseQueryVectorStore, **kwargs):\n", - " super().__init__(**kwargs)\n", - " self.vector_store = vector_store\n", - "\n", - " def forward(self, query: str) -> str:\n", - " assert isinstance(query, str), \"Query must be a string\"\n", - "\n", - " docs = self.vector_store.similarity_search_with_score(query, k=5)\n", - " return \"\\n\\n\".join(\n", - " f\"# Documents:\\n{doc.page_content}\"\n", - " for doc, distance in docs\n", - " )\n", - "\n", - "retriever_tool = RetrieverTool(vector_store)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Defining Our Agent\n", - "smolagents have predefined configurations for agents that we can use. We use the `ToolCallingAgent`, which writes its tool calls in a JSON format. Alternatively, there also exists a `CodeAgent`, in which the LLM defines it's functions in code.\n", - "\n", - "The `CodeAgent` is offers benefits in certain challenging scenarios: it can lead to [higher performance in difficult benchmarks](https://huggingface.co/papers/2411.01747) and use [30% fewer steps to solve problems](https://huggingface.co/papers/2402.01030). 
However, since our use case is just a simple RAG tool, a `ToolCallingAgent` will suffice.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": {}, - "outputs": [], - "source": [ - "agent = ToolCallingAgent(\n", - " tools=[retriever_tool],\n", - " model=OpenAIServerModel(\n", - " model_id=\"gpt-4o-2024-08-06\",\n", - " api_key=OPENAI_API_KEY,\n", - " ),\n", - " max_steps=4,\n", - " verbosity_level=2\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Running our Agent\n", - "We have now finished setting up our vector store and agent! The system is now ready to accept queries.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "query = \"What was manchester city manager pep guardiola's reaction to the team's current form?\"\n", - "\n", - "agent_output = agent.run(query)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Analyzing the Agent\n", - "When the agent runs, smolagents prints out the steps that the agent takes along with the tools called in each step. In the above tool call, two steps occur:\n", - "\n", - "**Step 1**: First, the agent determines that it requires a tool to be used, and the `retriever` tool is called. The agent also specifies the query parameter for the tool (a string). The tool returns semantically similar documents to the query from Couchbase's vector store.\n", - "\n", - "**Step 2**: Next, the agent determines that the context retrieved from the tool is sufficient to answer the question. It then calls the `final_answer` tool, which is predefined for each agent: this tool is called when the agent returns the final answer to the user. In this step, the LLM answers the user's query from the context retrieved in step 1 and passes it to the `final_answer` tool, at which point the agent's execution ends.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Conclusion\n", - "\n", - "By following these steps, you'll have a fully functional agentic RAG system that leverages the strengths of Couchbase and smolagents, along with OpenAI. This guide is designed not just to show you how to build the system, but also to explain why each step is necessary, giving you a deeper understanding of the principles behind semantic search and how to implement it effectively using GSI which can significantly improve your RAG performance. 
Whether you're a newcomer to software development or an experienced developer looking to expand your skills, this guide will provide you with the knowledge and tools you need to create a powerful, RAG-driven chat system using smolagents' agent framework.\n"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": ".venv",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.13.3"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
diff --git a/smolagents/gsi/.env.sample b/smolagents/query_based/.env.sample
similarity index 100%
rename from smolagents/gsi/.env.sample
rename to smolagents/query_based/.env.sample
diff --git a/smolagents/query_based/RAG_with_Couchbase_and_SmolAgents.ipynb b/smolagents/query_based/RAG_with_Couchbase_and_SmolAgents.ipynb
new file mode 100644
index 00000000..b0aea707
--- /dev/null
+++ b/smolagents/query_based/RAG_with_Couchbase_and_SmolAgents.ipynb
@@ -0,0 +1,1038 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Introduction\n",
+ "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database, [OpenAI](https://openai.com) as the embedding and LLM provider, and [Hugging Face smolagents](https://huggingface.co/docs/smolagents/en/index) as an agent framework. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system using Couchbase Hyperscale and Composite Vector Indexes from scratch. For guidance on choosing the right vector index for your use case, see the [Couchbase documentation](https://docs.couchbase.com/server/current/vector-search/choose-the-right-vector-index.html). Alternatively, if you want to perform semantic search using Couchbase Search Vector Index, please take a look at [this tutorial](https://developer.couchbase.com/tutorial-smolagents-couchbase-rag-with-search-vector-index/).\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## How to run this tutorial\n",
+ "\n",
+ "This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/smolagents/query_based/RAG_with_Couchbase_and_SmolAgents.ipynb).\n",
+ "\n",
+ "You can either download the notebook file and run it on [Google Colab](https://colab.research.google.com/) or run it on your system by setting up the Python environment.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Before you start\n",
+ "### Get Credentials for OpenAI\n",
+ "Please follow the [instructions](https://platform.openai.com/docs/quickstart) to generate the OpenAI credentials.\n",
+ "\n",
+ "### Create and Deploy Your Free Tier Operational Cluster on Capella\n",
+ "\n",
+ "To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. This account provides you with an environment where you can explore and learn about Capella with no time constraint.\n",
+ "\n",
+ "To learn more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n",
+ "\n",
+ "Note: To run this tutorial, you will need Capella with Couchbase Server version 8.0 or above, as GSI vector search is supported only from version 8.0.\n",
+ "\n",
+ "### Couchbase Capella Configuration\n",
+ "\n",
+ "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met.\n",
+ "\n",
+ "* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the required bucket (Read and Write) used in the application.\n",
+ "* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Setting the Stage: Installing Necessary Libraries\n",
+ "To build our semantic search engine, we need a robust set of tools. The libraries we install handle everything from connecting to databases to performing complex machine learning tasks. Each library has a specific role: Couchbase libraries manage database operations, LangChain handles AI model integrations, and OpenAI provides advanced AI models for generating embeddings and understanding natural language. By setting up these libraries, we ensure our environment is equipped to handle the data-intensive and computationally complex tasks required for semantic search.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%pip install --quiet datasets==4.1.1 langchain-couchbase==0.5.0 langchain-openai==0.3.33 python-dotenv==1.1.1 smolagents==1.21.3\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Importing Necessary Libraries\n",
+ "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, embedding generation, and dataset loading.
These libraries provide essential functions for working with data, managing database connections, and processing machine learning models.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "import getpass\n", + "import json\n", + "import logging\n", + "import os\n", + "import time\n", + "from datetime import timedelta\n", + "\n", + "from couchbase.auth import PasswordAuthenticator\n", + "from couchbase.cluster import Cluster\n", + "from couchbase.exceptions import (CouchbaseException,\n", + " InternalServerFailureException,\n", + " QueryIndexAlreadyExistsException)\n", + "from couchbase.management.buckets import CreateBucketSettings\n", + "from couchbase.options import ClusterOptions\n", + "from datasets import load_dataset\n", + "from dotenv import load_dotenv\n", + "from langchain_couchbase.vectorstores import CouchbaseQueryVectorStore\n", + "from langchain_couchbase.vectorstores import DistanceStrategy\n", + "from langchain_couchbase.vectorstores import IndexType\n", + "from langchain_openai import OpenAIEmbeddings\n", + "\n", + "from smolagents import Tool, OpenAIServerModel, ToolCallingAgent\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup Logging\n", + "Logging is configured to track the progress of the script and capture any errors or warnings. This is crucial for debugging and understanding the flow of execution. The logging output includes timestamps, log levels (e.g., INFO, ERROR), and messages that describe what is happening in the script.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', force=True)\n", + "\n", + "# Disable all logging except critical to prevent OpenAI API request logs\n", + "logging.getLogger(\"httpx\").setLevel(logging.CRITICAL)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Loading Sensitive Information\n", + "In this section, we prompt the user to input essential configuration settings needed. These settings include sensitive information like API keys, database credentials, and specific configuration names. Instead of hardcoding these details into the script, we request the user to provide them at runtime, ensuring flexibility and security.\n", + "\n", + "The script also validates that all required inputs are provided, raising an error if any crucial information is missing. 
This approach ensures that your integration is both secure and correctly configured without hardcoding sensitive information, enhancing the overall security and maintainability of your code.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "load_dotenv()\n", + "\n", + "OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or getpass.getpass('Enter your OpenAI API Key: ')\n", + "\n", + "CB_HOST = os.getenv('CB_HOST') or input('Enter your Couchbase host (default: couchbase://localhost): ') or 'couchbase://localhost'\n", + "CB_USERNAME = os.getenv('CB_USERNAME') or input('Enter your Couchbase username (default: Administrator): ') or 'Administrator'\n", + "CB_PASSWORD = os.getenv('CB_PASSWORD') or getpass.getpass('Enter your Couchbase password (default: password): ') or 'password'\n", + "CB_BUCKET_NAME = os.getenv('CB_BUCKET_NAME') or input('Enter your Couchbase bucket name (default: query-vector-search-testing): ') or 'query-vector-search-testing'\n", + "SCOPE_NAME = os.getenv('SCOPE_NAME') or input('Enter your scope name (default: shared): ') or 'shared'\n", + "COLLECTION_NAME = os.getenv('COLLECTION_NAME') or input('Enter your collection name (default: smolagents): ') or 'smolagents'\n", + "\n", + "# Check if the variables are correctly loaded\n", + "if not OPENAI_API_KEY:\n", + " raise ValueError(\"Missing OpenAI API Key\")\n", + "\n", + "if 'OPENAI_API_KEY' not in os.environ:\n", + " os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Connecting to the Couchbase Cluster\n", + "Connecting to a Couchbase cluster is the foundation of our project. Couchbase will serve as our primary data store, handling all the storage and retrieval operations required for our semantic search engine. By establishing this connection, we enable our application to interact with the database, allowing us to perform operations such as storing embeddings, querying data, and managing collections. This connection is the gateway through which all data will flow, so ensuring it's set up correctly is paramount.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-11-07 16:44:51,506 - INFO - Successfully connected to Couchbase\n" + ] + } + ], + "source": [ + "try:\n", + " auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)\n", + " options = ClusterOptions(auth)\n", + " cluster = Cluster(CB_HOST, options)\n", + " cluster.wait_until_ready(timedelta(seconds=5))\n", + " logging.info(\"Successfully connected to Couchbase\")\n", + "except Exception as e:\n", + " raise ConnectionError(f\"Failed to connect to Couchbase: {str(e)}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Setting Up Collections in Couchbase\n", + "\n", + "The setup_collection() function handles creating and configuring the hierarchical data organization in Couchbase:\n", + "\n", + "1. Bucket Creation:\n", + " - Checks if specified bucket exists, creates it if not\n", + " - Sets bucket properties like RAM quota (1024MB) and replication (disabled)\n", + " - Note: You will not be able to create a bucket on Capella\n", + "\n", + "2. Scope Management: \n", + " - Verifies if requested scope exists within bucket\n", + " - Creates new scope if needed (unless it's the default \"_default\" scope)\n", + "\n", + "3. 
Collection Setup:\n", + " - Checks for collection existence within scope\n", + " - Creates collection if it doesn't exist\n", + " - Waits 2 seconds for collection to be ready\n", + "\n", + "Additional Tasks:\n", + "- Clears any existing documents for clean state\n", + "- Implements comprehensive error handling and logging\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-11-07 16:44:53,519 - INFO - Bucket 'travel-sample' exists.\n", + "2025-11-07 16:44:53,527 - INFO - Collection 'smolagents' does not exist. Creating it...\n", + "2025-11-07 16:44:53,575 - INFO - Collection 'smolagents' created successfully.\n", + "2025-11-07 16:44:55,731 - INFO - All documents cleared from the collection.\n" + ] + }, + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def setup_collection(cluster, bucket_name, scope_name, collection_name):\n", + " try:\n", + " # Check if bucket exists, create if it doesn't\n", + " try:\n", + " bucket = cluster.bucket(bucket_name)\n", + " logging.info(f\"Bucket '{bucket_name}' exists.\")\n", + " except Exception as e:\n", + " logging.info(f\"Bucket '{bucket_name}' does not exist. Creating it...\")\n", + " bucket_settings = CreateBucketSettings(\n", + " name=bucket_name,\n", + " bucket_type='couchbase',\n", + " ram_quota_mb=1024,\n", + " flush_enabled=True,\n", + " num_replicas=0\n", + " )\n", + " cluster.buckets().create_bucket(bucket_settings)\n", + " time.sleep(2) # Wait for bucket creation to complete and become available\n", + " bucket = cluster.bucket(bucket_name)\n", + " logging.info(f\"Bucket '{bucket_name}' created successfully.\")\n", + "\n", + " bucket_manager = bucket.collections()\n", + "\n", + " # Check if scope exists, create if it doesn't\n", + " scopes = bucket_manager.get_all_scopes()\n", + " scope_exists = any(scope.name == scope_name for scope in scopes)\n", + " \n", + " if not scope_exists and scope_name != \"_default\":\n", + " logging.info(f\"Scope '{scope_name}' does not exist. Creating it...\")\n", + " bucket_manager.create_scope(scope_name)\n", + " logging.info(f\"Scope '{scope_name}' created successfully.\")\n", + "\n", + " # Check if collection exists, create if it doesn't\n", + " collections = bucket_manager.get_all_scopes()\n", + " collection_exists = any(\n", + " scope.name == scope_name and collection_name in [col.name for col in scope.collections]\n", + " for scope in collections\n", + " )\n", + "\n", + " if not collection_exists:\n", + " logging.info(f\"Collection '{collection_name}' does not exist. Creating it...\")\n", + " bucket_manager.create_collection(scope_name, collection_name)\n", + " logging.info(f\"Collection '{collection_name}' created successfully.\")\n", + " else:\n", + " logging.info(f\"Collection '{collection_name}' already exists. Skipping creation.\")\n", + "\n", + " # Wait for collection to be ready\n", + " collection = bucket.scope(scope_name).collection(collection_name)\n", + " time.sleep(2) # Give the collection time to be ready for queries\n", + "\n", + " # Clear all documents in the collection\n", + " try:\n", + " query = f\"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`\"\n", + " cluster.query(query).execute()\n", + " logging.info(\"All documents cleared from the collection.\")\n", + " except Exception as e:\n", + " logging.warning(f\"Error while clearing documents: {str(e)}. 
The collection might be empty.\")\n", + "\n", + " return collection\n", + " except Exception as e:\n", + " raise RuntimeError(f\"Error setting up collection: {str(e)}\")\n", + " \n", + "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Creating OpenAI Embeddings\n", + "Embeddings are at the heart of semantic search. They are numerical representations of text that capture the semantic meaning of the words and phrases. Unlike traditional keyword-based search, which looks for exact matches, embeddings allow our search engine to understand the context and nuances of language, enabling it to retrieve documents that are semantically similar to the query, even if they don't contain the exact keywords. By creating embeddings using OpenAI, we equip our search engine with the ability to understand and process natural language in a way that's much closer to how humans understand language. This step transforms our raw text data into a format that the search engine can use to find and rank relevant documents.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-11-07 16:44:58,634 - INFO - Successfully created OpenAIEmbeddings\n" + ] + } + ], + "source": [ + "try:\n", + " embeddings = OpenAIEmbeddings(\n", + " model=\"text-embedding-3-small\",\n", + " api_key=OPENAI_API_KEY,\n", + " )\n", + " logging.info(\"Successfully created OpenAIEmbeddings\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Error creating OpenAIEmbeddings: {str(e)}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Understanding GSI Vector Search\n", + "\n", + "### Optimizing Vector Search with Hyperscale and Composite Vector Indexes\n", + "\n", + "With Couchbase 8.0+, you can leverage the power of GSI-based vector search, which offers significant performance improvements over traditional Full-Text Search (FTS) approaches for vector-first workloads. 
GSI vector search provides high-performance vector similarity search with advanced filtering capabilities and is designed to scale to billions of vectors.\n",
+ "\n",
+ "#### GSI vs FTS: Choosing the Right Approach\n",
+ "\n",
+ "| Feature | GSI Vector Search | FTS Vector Search |\n",
+ "| --------------------- | --------------------------------------------------------------- | ----------------------------------------- |\n",
+ "| **Best For** | Vector-first workloads, complex filtering, high QPS performance | Hybrid search and high recall rates |\n",
+ "| **Couchbase Version** | 8.0.0+ | 7.6+ |\n",
+ "| **Filtering** | Pre-filtering with `WHERE` clauses (Composite) or post-filtering (BHIVE) | Pre-filtering with flexible ordering |\n",
+ "| **Scalability** | Up to billions of vectors (BHIVE) | Up to 10 million vectors |\n",
+ "| **Performance** | Optimized for concurrent operations with low memory footprint | Good for mixed text and vector queries |\n",
+ "\n",
+ "\n",
+ "#### GSI Vector Index Types\n",
+ "\n",
+ "Couchbase offers two distinct GSI vector index types, each optimized for different use cases:\n",
+ "\n",
+ "##### Hyperscale Vector Indexes (BHIVE)\n",
+ "\n",
+ "- **Best for**: Pure vector searches like content discovery, recommendations, and semantic search\n",
+ "- **Use when**: You primarily perform vector-only queries without complex scalar filtering\n",
+ "- **Features**: \n",
+ "  - High performance with low memory footprint\n",
+ "  - Optimized for concurrent operations\n",
+ "  - Designed to scale to billions of vectors\n",
+ "  - Supports post-scan filtering for basic metadata filtering\n",
+ "\n",
+ "##### Composite Vector Indexes\n",
+ "\n",
+ "- **Best for**: Filtered vector searches that combine vector similarity with scalar value filtering\n",
+ "- **Use when**: Your queries combine vector similarity with scalar filters that eliminate large portions of data\n",
+ "- **Features**: \n",
+ "  - Efficient pre-filtering where scalar attributes reduce the vector comparison scope\n",
+ "  - Best for well-defined workloads that require complex pre-filtering\n",
+ "  - Supports range lookups combined with vector search\n",
+ "\n",
+ "#### Index Type Selection for This Tutorial\n",
+ "\n",
+ "In this tutorial, we'll demonstrate creating a **BHIVE index** and running vector similarity queries against it. BHIVE is ideal for semantic search scenarios where you want:\n",
+ "\n",
+ "1. **High-performance vector search** across large datasets\n",
+ "2. **Low latency** for real-time applications\n",
+ "3. **Scalability** to handle growing vector collections\n",
+ "4. **Concurrent operations** for multi-user environments\n",
+ "\n",
+ "The BHIVE index will provide optimal performance for our OpenAI embedding-based semantic search implementation.\n",
+ "\n",
+ "#### Alternative: Composite Vector Index\n",
+ "\n",
+ "If your use case requires complex filtering with scalar attributes, you may want to consider using a **Composite Vector Index** instead:\n",
+ "\n",
+ "```python\n",
+ "# Alternative: Create a Composite index for filtered searches\n",
+ "vector_store.create_index(\n",
+ "    index_type=IndexType.COMPOSITE,\n",
+ "    index_description=\"IVF,SQ8\",\n",
+ "    distance_metric=DistanceStrategy.COSINE,\n",
+ "    index_name=\"smolagents_composite_index\",\n",
+ ")\n",
+ "```\n",
+ "\n",
+ "**Use Composite indexes when:**\n",
+ "- You need to filter by document metadata or attributes before vector similarity\n",
+ "- Your queries combine vector search with WHERE clauses\n",
+ "- You have well-defined filtering requirements that can reduce the search space\n",
+ "\n",
+ "**Note**: Composite indexes enable pre-filtering with scalar attributes, making them ideal for applications where you need to search within specific categories, date ranges, or user-specific data segments.\n",
+ "\n",
+ "#### Understanding GSI Index Configuration (Couchbase 8.0 Feature)\n",
+ "\n",
+ "Before creating our BHIVE index, it's important to understand the configuration parameters that optimize vector storage and search performance. The `index_description` parameter controls how Couchbase optimizes vector storage through centroids and quantization.\n",
+ "\n",
+ "##### Index Description Format: `'IVF[<centroids>],{PQ<m>x<bits>|SQ<bits>}'`\n",
+ "\n",
+ "##### Centroids (IVF - Inverted File)\n",
+ "\n",
+ "- Controls how the dataset is subdivided for faster searches\n",
+ "- **More centroids** = faster search, slower training time\n",
+ "- **Fewer centroids** = slower search, faster training time\n",
+ "- If omitted (like `IVF,SQ8`), Couchbase auto-selects based on dataset size\n",
+ "\n",
+ "##### Quantization Options\n",
+ "\n",
+ "**Scalar Quantization (SQ):**\n",
+ "- `SQ4`, `SQ6`, `SQ8` (4, 6, or 8 bits per dimension)\n",
+ "- Lower memory usage, faster search, slightly reduced accuracy\n",
+ "\n",
+ "**Product Quantization (PQ):**\n",
+ "- Format: `PQ<m>x<bits>` (e.g., `PQ32x8`)\n",
+ "- Better compression for very large datasets\n",
+ "- More complex but can maintain accuracy with smaller index size\n",
+ "\n",
+ "##### Common Configuration Examples\n",
+ "\n",
+ "- **`IVF,SQ8`** - Auto centroids, 8-bit scalar quantization (good default)\n",
+ "- **`IVF1000,SQ6`** - 1000 centroids, 6-bit scalar quantization\n",
+ "- **`IVF,PQ32x8`** - Auto centroids, 32 subquantizers with 8 bits\n",
+ "\n",
+ "For detailed configuration options, see the [Quantization & Centroid Settings](https://docs.couchbase.com/cloud/vector-index/hyperscale-vector-index.html#algo_settings).\n",
+ "\n",
+ "For more information on GSI vector indexes, see [Couchbase GSI Vector Documentation](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html).\n",
+ "\n",
+ "##### Our Configuration Choice\n",
+ "\n",
+ "In this tutorial, we use `IVF,SQ8`, which provides:\n",
+ "- **Auto-selected centroids** optimized for our dataset size\n",
+ "- **8-bit scalar quantization** for good balance of speed, memory usage, and accuracy\n",
+ "- **COSINE distance metric** ideal for semantic similarity search\n",
+ "- **Optimal performance** for most semantic search use cases"
+ ]
+ },
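+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "##### Example: Tuning Centroids and Quantization\n",
+ "\n",
+ "As a hedged illustration of the settings above (this mirrors the `create_index` call used later in this tutorial; the index name is hypothetical, and `vector_store` is only created in the next section), a BHIVE index with an explicit centroid count and 6-bit scalar quantization could look like this:\n",
+ "\n",
+ "```python\n",
+ "# Illustrative only: explicit centroid count (IVF1000) and 6-bit scalar\n",
+ "# quantization, instead of the auto-selected IVF,SQ8 default used below.\n",
+ "vector_store.create_index(\n",
+ "    index_type=IndexType.BHIVE,\n",
+ "    index_name=\"smolagents_bhive_tuned_index\",  # hypothetical name\n",
+ "    index_description=\"IVF1000,SQ6\",\n",
+ "    distance_metric=DistanceStrategy.COSINE,\n",
+ ")\n",
+ "```\n",
+ "\n",
+ "More centroids generally mean faster searches but longer index training, so it is worth measuring against the `IVF,SQ8` default on your own data.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [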
+ "## Setting Up the Couchbase Query Vector Store\n",
+ "A vector store is where we'll keep our embeddings. The query vector store is specifically designed to handle embeddings and perform similarity searches. When a user inputs a query, the query is converted into an embedding using the same embedding model and compared against the embeddings stored in the vector store. This allows the engine to find documents that are semantically similar to the query, even if they don't contain the exact same words. By setting up the vector store in Couchbase, we create a powerful tool that enables us to understand and retrieve information based on the meaning and context of the query, rather than just the specific words used.\n",
+ "\n",
+ "The vector store requires a distance metric to determine how similarity between vectors is calculated. This is crucial for accurate semantic search results, as different distance metrics can yield different similarity rankings. Some of the supported distance strategies are `dot`, `l2`, `euclidean`, `cosine`, `l2_squared`, and `euclidean_squared`. In our implementation we will use cosine, which is particularly effective for text embeddings.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "try:\n",
+ "    vector_store = CouchbaseQueryVectorStore(\n",
+ "        cluster=cluster,\n",
+ "        bucket_name=CB_BUCKET_NAME,\n",
+ "        scope_name=SCOPE_NAME,\n",
+ "        collection_name=COLLECTION_NAME,\n",
+ "        embedding=embeddings,\n",
+ "        distance_metric=DistanceStrategy.COSINE\n",
+ "    )\n",
+ "    logging.info(\"Successfully created vector store\")\n",
+ "except Exception as e:\n",
+ "    raise ValueError(f\"Failed to create vector store: {str(e)}\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Load the BBC News Dataset\n",
+ "To build a search engine, we need data to search through. We use the BBC News dataset from RealTimeData, which provides real-world news articles. This dataset contains news articles from BBC covering various topics and time periods. Loading the dataset is a crucial step because it provides the raw material that our search engine will work with. The quality and diversity of the news articles make it an excellent choice for testing and refining our search engine, ensuring it can handle real-world news content effectively.\n",
+ "\n",
+ "The BBC News dataset allows us to work with authentic news articles, enabling us to build and test a search engine that can effectively process and retrieve relevant news content. The dataset is loaded using the Hugging Face datasets library, specifically accessing the \"RealTimeData/bbc_news_alltime\" dataset with the \"2024-12\" version.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "try:\n",
+ "    news_dataset = load_dataset(\n",
+ "        \"RealTimeData/bbc_news_alltime\", \"2024-12\", split=\"train\"\n",
+ "    )\n",
+ "    print(f\"Loaded the BBC News dataset with {len(news_dataset)} rows\")\n",
+ "    logging.info(f\"Successfully loaded the BBC News dataset with {len(news_dataset)} rows.\")\n",
+ "except Exception as e:\n",
+ "    raise ValueError(f\"Error loading the BBC News dataset: {str(e)}\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Cleaning up the Data\n",
+ "We will use the content of the news articles for our RAG system.\n",
+ "\n",
+ "The dataset contains a few duplicate records. We are removing them to avoid duplicate results in the retrieval stage of our RAG system.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "We have 1749 unique articles in our database.\n"
+ ]
+ }
+ ],
+ "source": [
+ "news_articles = news_dataset[\"content\"]\n",
+ "unique_articles = set()\n",
+ "for article in news_articles:\n",
+ "    if article:\n",
+ "        unique_articles.add(article)\n",
+ "unique_news_articles = list(unique_articles)\n",
+ "print(f\"We have {len(unique_news_articles)} unique articles in our database.\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Saving Data to the Vector Store\n",
+ "To efficiently handle the large number of articles, we process them in batches of 100 articles at a time. This batch processing approach helps manage memory usage and provides better control over the ingestion process.\n",
+ "\n",
+ "We first filter out any articles that exceed 50,000 characters to avoid potential issues with token limits. Then, using the vector store's `add_texts` method, we add the filtered articles to our vector database. The `batch_size` parameter controls how many articles are processed in each iteration.\n",
+ "\n",
+ "This approach offers several benefits:\n",
+ "1. Memory Efficiency: Processing in smaller batches prevents memory overload\n",
+ "2. Progress Tracking: Easier to monitor and track the ingestion progress\n",
+ "3. Resource Management: Better control over CPU and network resource utilization\n",
+ "\n",
+ "We use a conservative batch size of 100 to ensure reliable operation.\n",
+ "The optimal batch size depends on many factors including:\n",
+ "- Document sizes being inserted\n",
+ "- Available system resources\n",
+ "- Network conditions\n",
+ "- Concurrent workload\n",
+ "\n",
+ "Consider measuring performance with your specific workload before adjusting.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "2025-11-07 16:46:18,967 - INFO - Document ingestion completed successfully.\n"
+ ]
+ }
+ ],
+ "source": [
+ "batch_size = 100\n",
+ "\n",
+ "articles = [article for article in unique_news_articles if article and len(article) <= 50000]\n",
+ "\n",
+ "try:\n",
+ "    vector_store.add_texts(\n",
+ "        texts=articles,\n",
+ "        batch_size=batch_size\n",
+ "    )\n",
+ "    logging.info(\"Document ingestion completed successfully.\")\n",
+ "except Exception as e:\n",
+ "    raise ValueError(f\"Failed to save documents to vector store: {str(e)}\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Perform Semantic Search\n",
+ "Semantic search in Couchbase involves converting queries and documents into vector representations using an embeddings model. These vectors capture the semantic meaning of the text and are stored directly in Couchbase. When a query is made, Couchbase performs a similarity search by comparing the query vector against the stored document vectors. The similarity metric used for this comparison is configurable, allowing flexibility in how the relevance of documents is determined. Common metrics include cosine similarity, Euclidean distance, or dot product, but other metrics can be implemented based on specific use cases.\n",
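+ "\n",
+ "As a quick, hedged sketch (this assumes the `DistanceStrategy` enum exposes members matching the strategies listed earlier, such as `DOT`), switching the metric is a one-line change when constructing the store:\n",
+ "\n",
+ "```python\n",
+ "# Illustrative only: same store, different distance metric.\n",
+ "# Assumes DistanceStrategy exposes a DOT member (see the strategy list above).\n",
+ "dot_store = CouchbaseQueryVectorStore(\n",
+ "    cluster=cluster,\n",
+ "    bucket_name=CB_BUCKET_NAME,\n",
+ "    scope_name=SCOPE_NAME,\n",
+ "    collection_name=COLLECTION_NAME,\n",
+ "    embedding=embeddings,\n",
+ "    distance_metric=DistanceStrategy.DOT,  # instead of COSINE\n",
+ ")\n",
+ "```\n",
+ "\n",
+ "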
Different embedding models like BERT, Word2Vec, or GloVe can also be used depending on the application's needs, with the vectors generated by these models stored and searched within Couchbase itself.\n", + "\n", + "In the provided code, the search process begins by recording the start time, followed by executing the `similarity_search_with_score` method of the `CouchbaseQueryVectorStore`. This method searches Couchbase for the most relevant documents based on the vector similarity to the query. The search results include the document content and the distance that reflects how closely each document aligns with the query in the defined semantic space. The time taken to perform this search is then calculated and logged, and the results are displayed, showing the most relevant documents along with their similarity scores. This approach leverages Couchbase as both a storage and retrieval engine for vector data, enabling efficient and scalable semantic searches. The integration of vector storage and search capabilities within Couchbase allows for sophisticated semantic search operations without relying on external services for vector storage or comparison.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Vector Search Performance Optimization\n", + "\n", + "Now let's measure and compare the performance benefits of different optimization strategies. We'll conduct a comprehensive performance analysis across two phases:\n", + "\n", + "## Performance Testing Phases\n", + "\n", + "1. **Phase 1 - Baseline Performance**: Test vector search without Hyperscale or Composite Vector Indexes to establish baseline metrics\n", + "2. **Phase 2 - Vector Index-Optimized Search**: Create BHIVE index and measure performance improvements\n", + "\n", + "**Important Context:**\n", + "- GSI performance benefits scale with dataset size and concurrent load\n", + "- With our dataset (~1,700 articles), improvements may be modest\n", + "- Production environments with millions of vectors show significant GSI advantages\n", + "- The combination of GSI + LLM caching provides optimal RAG performance\n" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "================================================================================\n", + "PHASE 1: BASELINE PERFORMANCE (NO GSI INDEX)\n", + "================================================================================\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-11-07 16:46:24,561 - INFO - Semantic search completed in 1.34 seconds\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Semantic Search Results (completed in 1.34 seconds):\n", + "--------------------------------------------------------------------------------\n", + "Vector Distance: 0.2956, Text: Manchester City boss Pep Guardiola has won 18 trophies since he arrived at the club in 2016\n", + "\n", + "Manchester City boss Pep Guardiola says he is \"fine\" despite admitting his sleep and diet are being affected by the worst run of results in his entire managerial career. In an interview with former Italy international Luca Toni for Amazon Prime Sport before Wednesday's Champions League defeat by Juventus, Guardiola touched on the personal impact City's sudden downturn in form has had. 
Guardiola said his state of mind was \"ugly\", that his sleep was \"worse\" and he was eating lighter as his digestion had suffered. City go into Sunday's derby against Manchester United at Etihad Stadium having won just one of their past 10 games. The Juventus loss means there is a chance they may not even secure a play-off spot in the Champions League. Asked to elaborate on his comments to Toni, Guardiola said: \"I'm fine. \"In our jobs we always want to do our best or the best as possible. When that doesn't happen you are more uncomfortable than when the situation is going well, always that happened. \"In good moments I am happier but when I get to the next game I am still concerned about what I have to do. There is no human being that makes an activity and it doesn't matter how they do.\" Guardiola said City have to defend better and \"avoid making mistakes at both ends\". To emphasise his point, Guardiola referred back to the third game of City's current run, against a Sporting side managed by Ruben Amorim, who will be in the United dugout at the weekend. City dominated the first half in Lisbon, led thanks to Phil Foden's early effort and looked to be cruising. Instead, they conceded three times in 11 minutes either side of half-time as Sporting eventually ran out 4-1 winners. \"I would like to play the game like we played in Lisbon on Sunday, believe me,\" said Guardiola, who is facing the prospect of only having three fit defenders for the derby as Nathan Ake and Manuel Akanji try to overcome injury concerns. If there is solace for City, it comes from the knowledge United are not exactly flying. Their comeback Europa League victory against Viktoria Plzen on Thursday was their third win of Amorim's short reign so far but only one of those successes has come in the Premier League, where United have lost their past two games against Arsenal and Nottingham Forest. Nevertheless, Guardiola can see improvements already on the red side of the city. \"It's already there,\" he said. \"You see all the patterns, the movements, the runners and the pace. He will do a good job at United, I'm pretty sure of that.\"\n", + "\n", + "Guardiola says skipper Kyle Walker has been offered support by the club after the City defender highlighted the racial abuse he had received on social media in the wake of the Juventus trip. \"It's unacceptable,\" he said. \"Not because it's Kyle - for any human being. \"Unfortunately it happens many times in the real world. It is not necessary to say he has the support of the entire club. 
It is completely unacceptable and we give our support to him.\"\n",
+ "--------------------------------------------------------------------------------\n",
+ "Vector Distance: 0.3100, Text: Pep Guardiola has said Manchester City will be his final managerial job in club football before he \"maybe\" coaches a national team.\n",
+ "\n",
+ "--------------------------------------------------------------------------------\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Phase 1: Baseline Performance (Without GSI Index)\n",
+ "print(\"=\"*80)\n",
+ "print(\"PHASE 1: BASELINE PERFORMANCE (NO GSI INDEX)\")\n",
+ "print(\"=\"*80)\n",
+ "\n",
+ "query = \"What was manchester city manager pep guardiola's reaction to the team's current form?\"\n",
+ "\n",
+ "try:\n",
+ "    # Perform the semantic search\n",
+ "    start_time = time.time()\n",
+ "    search_results = vector_store.similarity_search_with_score(query, k=10)\n",
+ "    baseline_time = time.time() - start_time\n",
+ "\n",
+ "    logging.info(f\"Semantic search completed in {baseline_time:.2f} seconds\")\n",
+ "\n",
+ "    # Display search results\n",
+ "    print(f\"\\nSemantic Search Results (completed in {baseline_time:.2f} seconds):\")\n",
+ "    print(\"-\" * 80) # Add separator line\n",
+ "    for doc, distance in search_results:\n",
+ "        print(f\"Vector Distance: {distance:.4f}, Text: {doc.page_content}\")\n",
+ "        print(\"-\" * 80) # Add separator between results\n",
+ "\n",
+ "except CouchbaseException as e:\n",
+ "    raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n",
+ "except Exception as e:\n",
+ "    raise RuntimeError(f\"Unexpected error: {str(e)}\")\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "vector_store.create_index(index_type=IndexType.BHIVE, index_name=\"smolagents_bhive_index\", index_description=\"IVF,SQ8\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Note: To create a COMPOSITE index, the code below can be used.\n",
+ "Choose based on your specific use case and query patterns. For this tutorial's news search scenario, either index type would work, but BHIVE might be more efficient for pure semantic search across news articles.\n",
+ "\n",
+ "```python\n",
+ "vector_store.create_index(index_type=IndexType.COMPOSITE, index_name=\"smolagents_composite_index\", index_description=\"IVF,SQ8\")\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "================================================================================\n",
+ "PHASE 2: GSI-OPTIMIZED PERFORMANCE (WITH BHIVE INDEX)\n",
+ "================================================================================\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "2025-11-07 16:47:01,538 - INFO - Semantic search completed in 0.42 seconds\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "Semantic Search Results (completed in 0.42 seconds):\n",
+ "--------------------------------------------------------------------------------\n",
+ "Vector Distance: 0.2956, Text: Manchester City boss Pep Guardiola has won 18 trophies since he arrived at the club in 2016\n",
+ "\n",
+ "Manchester City boss Pep Guardiola says he is \"fine\" despite admitting his sleep and diet are being affected by the worst run of results in his entire managerial career.
In an interview with former Italy international Luca Toni for Amazon Prime Sport before Wednesday's Champions League defeat by Juventus, Guardiola touched on the personal impact City's sudden downturn in form has had. Guardiola said his state of mind was \"ugly\", that his sleep was \"worse\" and he was eating lighter as his digestion had suffered. City go into Sunday's derby against Manchester United at Etihad Stadium having won just one of their past 10 games. The Juventus loss means there is a chance they may not even secure a play-off spot in the Champions League. Asked to elaborate on his comments to Toni, Guardiola said: \"I'm fine. \"In our jobs we always want to do our best or the best as possible. When that doesn't happen you are more uncomfortable than when the situation is going well, always that happened. \"In good moments I am happier but when I get to the next game I am still concerned about what I have to do. There is no human being that makes an activity and it doesn't matter how they do.\" Guardiola said City have to defend better and \"avoid making mistakes at both ends\". To emphasise his point, Guardiola referred back to the third game of City's current run, against a Sporting side managed by Ruben Amorim, who will be in the United dugout at the weekend. City dominated the first half in Lisbon, led thanks to Phil Foden's early effort and looked to be cruising. Instead, they conceded three times in 11 minutes either side of half-time as Sporting eventually ran out 4-1 winners. \"I would like to play the game like we played in Lisbon on Sunday, believe me,\" said Guardiola, who is facing the prospect of only having three fit defenders for the derby as Nathan Ake and Manuel Akanji try to overcome injury concerns. If there is solace for City, it comes from the knowledge United are not exactly flying. Their comeback Europa League victory against Viktoria Plzen on Thursday was their third win of Amorim's short reign so far but only one of those successes has come in the Premier League, where United have lost their past two games against Arsenal and Nottingham Forest. Nevertheless, Guardiola can see improvements already on the red side of the city. \"It's already there,\" he said. \"You see all the patterns, the movements, the runners and the pace. He will do a good job at United, I'm pretty sure of that.\"\n", + "\n", + "Guardiola says skipper Kyle Walker has been offered support by the club after the City defender highlighted the racial abuse he had received on social media in the wake of the Juventus trip. \"It's unacceptable,\" he said. \"Not because it's Kyle - for any human being. \"Unfortunately it happens many times in the real world. It is not necessary to say he has the support of the entire club. 
It is completely unacceptable and we give our support to him.\"\n", + "--------------------------------------------------------------------------------\n", + "Vector Distance: 0.3100, Text: Pep Guardiola has said Manchester City will be his final managerial job in club football before he \"maybe\" coaches a national team.\n", + "--------------------------------------------------------------------------------\n" + ] + } + ], + "source": [ + "# Phase 2: Vector Index-Optimized Performance (With BHIVE Index)\n", + "print(\"\\n\" + \"=\"*80)\n", + "print(\"PHASE 2: GSI-OPTIMIZED PERFORMANCE (WITH BHIVE INDEX)\")\n", + "print(\"=\"*80)\n", + "\n", + "query = \"What was manchester city manager pep guardiola's reaction to the team's current form?\"\n", + "\n", + "try:\n", + " # Perform the semantic search\n", + " start_time = time.time()\n", + " search_results = vector_store.similarity_search_with_score(query, k=10)\n", + " gsi_time = time.time() - start_time\n", + "\n", + " logging.info(f\"Semantic search completed in {gsi_time:.2f} seconds\")\n", + "\n", + " # Display search results\n", + " print(f\"\\nSemantic Search Results (completed in {gsi_time:.2f} seconds):\")\n", + " print(\"-\" * 80) # Add separator line\n", + " for doc, distance in search_results:\n", + " print(f\"Vector Distance: {distance:.4f}, Text: {doc.page_content}\")\n", + " print(\"-\" * 80) # Add separator between results\n", + "\n", + "except CouchbaseException as e:\n", + " raise RuntimeError(f\"Error performing semantic search: {str(e)}\")\n", + "except Exception as e:\n", + " raise RuntimeError(f\"Unexpected error: {str(e)}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Performance Analysis Summary\n", + "\n", + "Let's analyze the performance improvements we've achieved through different optimization strategies:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "================================================================================\n", + "VECTOR SEARCH PERFORMANCE OPTIMIZATION SUMMARY\n", + "================================================================================\n", + "\n", + "\ud83d\udcca Performance Comparison:\n", + "Optimization Level Time (seconds) Status\n", + "--------------------------------------------------------------------------------\n", + "Phase 1 - Baseline (No Index) 1.3410 \u26aa Baseline\n", + "Phase 2 - Vector Index-Optimized (BHIVE) 0.4157 \u2705 Optimized\n", + "\n", + "\u2728 GSI Performance Gain: 3.23x faster (69.0% improvement)\n", + "\n", + "--------------------------------------------------------------------------------\n", + "KEY INSIGHTS:\n", + "--------------------------------------------------------------------------------\n", + "1. \ud83d\ude80 GSI Optimization:\n", + " \u2022 BHIVE indexes excel with large-scale datasets (millions+ vectors)\n", + " \u2022 Performance gains increase with dataset size and concurrent queries\n", + " \u2022 Optimal for production workloads with sustained traffic patterns\n", + "\n", + "2. \ud83d\udce6 Dataset Size Impact:\n", + " \u2022 Current dataset: ~1,700 articles\n", + " \u2022 At this scale, performance differences may be minimal or variable\n", + " \u2022 Significant gains typically seen with 10M+ vectors\n", + "\n", + "3. 
\ud83c\udfaf When to Use GSI:\n",
+ " \u2022 Large-scale vector search applications\n",
+ " \u2022 High query-per-second (QPS) requirements\n",
+ " \u2022 Multi-user concurrent access scenarios\n",
+ " \u2022 Production environments requiring scalability\n",
+ "\n",
+ "================================================================================\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(\"\\n\" + \"=\"*80)\n",
+ "print(\"VECTOR SEARCH PERFORMANCE OPTIMIZATION SUMMARY\")\n",
+ "print(\"=\"*80)\n",
+ "\n",
+ "print(f\"\\n\ud83d\udcca Performance Comparison:\")\n",
+ "print(f\"{'Optimization Level':<35} {'Time (seconds)':<20} {'Status'}\")\n",
+ "print(\"-\" * 80)\n",
+ "print(f\"{'Phase 1 - Baseline (No Index)':<35} {baseline_time:.4f}{'':16} \u26aa Baseline\")\n",
+ "print(f\"{'Phase 2 - Vector Index-Optimized (BHIVE)':<35} {gsi_time:.4f}{'':16} \u2705 Optimized\")\n",
+ "\n",
+ "# Calculate improvement\n",
+ "if baseline_time > gsi_time:\n",
+ "    speedup = baseline_time / gsi_time\n",
+ "    improvement = ((baseline_time - gsi_time) / baseline_time) * 100\n",
+ "    print(f\"\\n\u2728 GSI Performance Gain: {speedup:.2f}x faster ({improvement:.1f}% improvement)\")\n",
+ "elif gsi_time > baseline_time:\n",
+ "    slowdown_pct = ((gsi_time - baseline_time) / baseline_time) * 100\n",
+ "    print(f\"\\n\u26a0\ufe0f Note: GSI was {slowdown_pct:.1f}% slower than baseline in this run\")\n",
+ "    print(f\"    This can happen with small datasets. GSI benefits emerge with scale.\")\n",
+ "else:\n",
+ "    print(f\"\\n\u2696\ufe0f Performance: Comparable to baseline\")\n",
+ "\n",
+ "print(\"\\n\" + \"-\"*80)\n",
+ "print(\"KEY INSIGHTS:\")\n",
+ "print(\"-\"*80)\n",
+ "print(\"1. \ud83d\ude80 GSI Optimization:\")\n",
+ "print(\" \u2022 BHIVE indexes excel with large-scale datasets (millions+ vectors)\")\n",
+ "print(\" \u2022 Performance gains increase with dataset size and concurrent queries\")\n",
+ "print(\" \u2022 Optimal for production workloads with sustained traffic patterns\")\n",
+ "\n",
+ "print(\"\\n2. \ud83d\udce6 Dataset Size Impact:\")\n",
+ "print(f\" \u2022 Current dataset: ~1,700 articles\")\n",
+ "print(\" \u2022 At this scale, performance differences may be minimal or variable\")\n",
+ "print(\" \u2022 Significant gains typically seen with 10M+ vectors\")\n",
+ "\n",
+ "print(\"\\n3. \ud83c\udfaf When to Use GSI:\")\n",
+ "print(\" \u2022 Large-scale vector search applications\")\n",
+ "print(\" \u2022 High query-per-second (QPS) requirements\")\n",
+ "print(\" \u2022 Multi-user concurrent access scenarios\")\n",
+ "print(\" \u2022 Production environments requiring scalability\")\n",
+ "\n",
+ "print(\"\\n\" + \"=\"*80)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## smolagents: An Introduction\n",
+ "[smolagents](https://huggingface.co/docs/smolagents/en/index) is an agentic framework from Hugging Face for creating agents in a few lines of code.\n",
+ "\n",
+ "Some of the features of smolagents are:\n",
+ "\n",
+ "- \u2728 Simplicity: the logic for agents fits in ~1,000 lines of code (see agents.py). We kept abstractions to their minimal shape above raw code!\n",
+ "\n",
+ "- \ud83e\uddd1\u200d\ud83d\udcbb First-class support for Code Agents. Our CodeAgent writes its actions in code (as opposed to \"agents being used to write code\"). To make it secure, we support executing in sandboxed environments via E2B.\n",
To make it secure, we support executing in sandboxed environments via E2B.\n", + "\n", + "- \ud83e\udd17 Hub integrations: you can share/pull tools to/from the Hub, and more to come!\n", + "\n", + "- \ud83c\udf10 Model-agnostic: smolagents supports any LLM. It can be a local transformers or ollama model, one of many providers on the Hub, or any model from OpenAI, Anthropic and many others via our LiteLLM integration.\n", + "\n", + "- \ud83d\udc41\ufe0f Modality-agnostic: Agents support text, vision, video, even audio inputs! See this tutorial for vision.\n", + "\n", + "- \ud83d\udee0\ufe0f Tool-agnostic: you can use tools from LangChain, Anthropic's MCP, you can even use a Hub Space as a tool.\n", + "\n", + "# Building a RAG Agent using smolagents\n", + "\n", + "smolagents allows users to define their own tools for the agent to use. These tools can be of two types:\n", + "1. Tools defined as classes: These tools are subclassed from the `Tool` class and must override the `forward` method, which is called when the tool is used.\n", + "2. Tools defined as functions: These are simple functions that are called when the tool is used, and are decorated with the `@tool` decorator. (A sketch of this second style follows the class-based version below.)\n", + "\n", + "In our case, we will use the first method, and we define our `RetrieverTool` below. We define a name, a description, and a dictionary of inputs that the tool accepts. This helps the LLM properly identify and use the tool.\n", + "\n", + "The `RetrieverTool` is simple: it takes a query generated by the user and uses Couchbase's performant vector search under the hood to find documents that are semantically similar to the query. The LLM can then use this context to answer the user's question.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [ + "class RetrieverTool(Tool):\n", + " name = \"retriever\"\n", + " description = \"Uses semantic search to retrieve the parts of news documentation that could be most relevant to answer your query.\"\n", + " inputs = {\n", + " \"query\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"The query to perform. This should be semantically close to your target documents. Use the affirmative form rather than a question.\",\n", + " }\n", + " }\n", + " output_type = \"string\"\n", + "\n", + " def __init__(self, vector_store: CouchbaseQueryVectorStore, **kwargs):\n", + " super().__init__(**kwargs)\n", + " self.vector_store = vector_store\n", + "\n", + " def forward(self, query: str) -> str:\n", + " assert isinstance(query, str), \"Query must be a string\"\n", + "\n", + " docs = self.vector_store.similarity_search_with_score(query, k=5)\n", + " return \"\\n\\n\".join(\n", + " f\"# Documents:\\n{doc.page_content}\"\n", + " for doc, distance in docs\n", + " )\n", + "\n", + "retriever_tool = RetrieverTool(vector_store)\n" + ] + },
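+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For comparison, the same retriever could be written in the second, function-based style. The cell below is only an illustrative sketch and is not used in the rest of this tutorial: the function name `retrieve_news` is our own choice, and it assumes the `vector_store` created earlier is in scope.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from smolagents import tool\n", + "\n", + "@tool\n", + "def retrieve_news(query: str) -> str:\n", + " \"\"\"Uses semantic search to retrieve news articles relevant to the query.\n", + "\n", + " Args:\n", + " query: The query to perform. This should be semantically close to the target documents.\n", + " \"\"\"\n", + " # Hypothetical function-based equivalent of RetrieverTool.forward above\n", + " docs = vector_store.similarity_search_with_score(query, k=5)\n", + " return \"\\n\\n\".join(f\"# Documents:\\n{doc.page_content}\" for doc, distance in docs)\n" + ] + },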
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Defining Our Agent\n", + "smolagents has predefined configurations for agents that we can use. We use the `ToolCallingAgent`, which writes its tool calls in JSON format. Alternatively, there is also a `CodeAgent`, in which the LLM writes its actions in code.\n", + "\n", + "The `CodeAgent` offers benefits in certain challenging scenarios: it can lead to [higher performance in difficult benchmarks](https://huggingface.co/papers/2411.01747) and use [30% fewer steps to solve problems](https://huggingface.co/papers/2402.01030). However, since our use case is just a simple RAG tool, a `ToolCallingAgent` will suffice.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "agent = ToolCallingAgent(\n", + " tools=[retriever_tool],\n", + " model=OpenAIServerModel(\n", + " model_id=\"gpt-4o-2024-08-06\",\n", + " api_key=OPENAI_API_KEY,\n", + " ),\n", + " max_steps=4,\n", + " verbosity_level=2\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Running our Agent\n", + "We have now finished setting up our vector store and agent! The system is now ready to accept queries.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "query = \"What was manchester city manager pep guardiola's reaction to the team's current form?\"\n", + "\n", + "agent_output = agent.run(query)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Analyzing the Agent\n", + "When the agent runs, smolagents prints out the steps that the agent takes along with the tools called in each step. In the above tool call, two steps occur:\n", + "\n", + "**Step 1**: First, the agent determines that it needs to use a tool, and the `retriever` tool is called. The agent also specifies the query parameter for the tool (a string). The tool returns semantically similar documents to the query from Couchbase's vector store.\n", + "\n", + "**Step 2**: Next, the agent determines that the context retrieved from the tool is sufficient to answer the question. It then calls the `final_answer` tool, which is predefined for each agent: this tool is called when the agent returns the final answer to the user. In this step, the LLM answers the user's query from the context retrieved in step 1 and passes it to the `final_answer` tool, at which point the agent's execution ends.\n" + ] + },
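+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The value returned by `agent.run` is the final answer that the agent passed to its `final_answer` tool, so it can also be used directly, for example to print it or hand it to downstream code:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# The agent's final answer is available as a plain value\n", + "print(agent_output)\n" + ] + },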
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "By following these steps, you'll have a fully functional agentic RAG system that leverages the strengths of Couchbase and smolagents, along with OpenAI. This guide is designed not just to show you how to build the system, but also to explain why each step is necessary, giving you a deeper understanding of the principles behind semantic search and how to implement it effectively using Hyperscale and Composite Vector Indexes, which can significantly improve your RAG performance. Whether you're a newcomer to software development or an experienced developer looking to expand your skills, this guide will provide you with the knowledge and tools you need to create a powerful, RAG-driven chat system using smolagents' agent framework.\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} \ No newline at end of file diff --git a/smolagents/gsi/frontmatter.md b/smolagents/query_based/frontmatter.md similarity index 100% rename from smolagents/gsi/frontmatter.md rename to smolagents/query_based/frontmatter.md diff --git a/smolagents/fts/.env.sample b/smolagents/search_based/.env.sample similarity index 100% rename from smolagents/fts/.env.sample rename to smolagents/search_based/.env.sample diff --git a/smolagents/search_based/RAG_with_Couchbase_and_SmolAgents.ipynb b/smolagents/search_based/RAG_with_Couchbase_and_SmolAgents.ipynb new file mode 100644 index 00000000..952416ad --- /dev/null +++ b/smolagents/search_based/RAG_with_Couchbase_and_SmolAgents.ipynb @@ -0,0 +1,1002 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "kNdImxzypDlm" + }, + "source": [ + "# Introduction\n", + "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database, [OpenAI](https://openai.com) as the embedding and LLM provider, and [Hugging Face smolagents](https://huggingface.co/docs/smolagents/en/index) as an agent framework. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system from scratch. For guidance on choosing the right vector index for your use case, see the [Couchbase documentation](https://docs.couchbase.com/server/current/vector-search/choose-the-right-vector-index.html). Alternatively, if you want to perform semantic search using Couchbase Hyperscale or Composite Vector Indexes, please take a look at [this tutorial](https://developer.couchbase.com/tutorial-smolagents-couchbase-rag-with-hyperscale-or-composite-vector-index)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# How to run this tutorial\n", + "\n", + "This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively.\n", + "\n", + "You can either download the notebook file and run it on [Google Colab](https://colab.research.google.com/) or run it on your system by setting up the Python environment." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Before you start\n", + "## Get Credentials for OpenAI\n", + "Please follow the [instructions](https://platform.openai.com/docs/quickstart) to generate the OpenAI credentials.\n", + "## Create and Deploy Your Free Tier Operational cluster on Capella\n", + "\n", + "To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. 
This account provides you with an environment where you can explore and learn about Capella with no time constraint.\n", + "\n", + "To learn more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n", + "\n", + "### Couchbase Capella Configuration\n", + "\n", + "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met.\n", + "\n", + "* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the required bucket (Read and Write) used in the application.\n", + "* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NH2o6pqa69oG" + }, + "source": [ + "# Setting the Stage: Installing Necessary Libraries\n", + "To build our semantic search engine, we need a robust set of tools. The libraries we install handle everything from connecting to databases to performing complex machine learning tasks. Each library has a specific role: Couchbase libraries manage database operations, LangChain handles AI model integrations, and OpenAI provides advanced AI models for generating embeddings and understanding natural language. By setting up these libraries, we ensure our environment is equipped to handle the data-intensive and computationally complex tasks required for semantic search." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "DYhPj0Ta8l_A" + }, + "outputs": [], + "source": [ + "%pip install --quiet -U datasets==3.5.0 langchain-couchbase==0.3.0 langchain-openai==0.3.13 python-dotenv==1.1.0 smolagents==1.13.0 ipywidgets==8.1.6" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1pp7GtNg8mB9" + }, + "source": [ + "# Importing Necessary Libraries\n", + "The script starts by importing a series of libraries required for various tasks, including handling JSON, logging, time tracking, Couchbase connections, embedding generation, and dataset loading. These libraries provide essential functions for working with data, managing database connections, and processing machine learning models." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "8GzS6tfL8mFP" + }, + "outputs": [], + "source": [ + "import getpass\n", + "import json\n", + "import logging\n", + "import os\n", + "import time\n", + "from datetime import timedelta\n", + "\n", + "from couchbase.auth import PasswordAuthenticator\n", + "from couchbase.cluster import Cluster\n", + "from couchbase.exceptions import (InternalServerFailureException,\n", + " ServiceUnavailableException,\n", + " QueryIndexAlreadyExistsException)\n", + "from couchbase.management.buckets import CreateBucketSettings\n", + "from couchbase.management.search import SearchIndex\n", + "from couchbase.options import ClusterOptions\n", + "from datasets import load_dataset\n", + "from dotenv import load_dotenv\n", + "from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore\n", + "from langchain_openai import OpenAIEmbeddings\n", + "\n", + "from smolagents import Tool, OpenAIServerModel, ToolCallingAgent" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pBnMp5vb8mIb" + }, + "source": [ + "# Setup Logging\n", + "Logging is configured to track the progress of the script and capture any errors or warnings. 
This is crucial for debugging and understanding the flow of execution. The logging output includes timestamps, log levels (e.g., INFO, ERROR), and messages that describe what is happening in the script.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "id": "Yv8kWcuf8mLx" + }, + "outputs": [], + "source": [ + "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', force=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K9G5a0en8mPA" + }, + "source": [ + "# Loading Sensitive Information\n", + "In this section, we prompt the user to input essential configuration settings needed. These settings include sensitive information like API keys, database credentials, and specific configuration names. Instead of hardcoding these details into the script, we request the user to provide them at runtime, ensuring flexibility and security.\n", + "\n", + "The script also validates that all required inputs are provided, raising an error if any crucial information is missing. This approach ensures that your integration is both secure and correctly configured without hardcoding sensitive information, enhancing the overall security and maintainability of your code." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "id": "PFGyHll18mSe" + }, + "outputs": [], + "source": [ + "load_dotenv()\n", + "\n", + "OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or getpass.getpass('Enter your OpenAI API Key: ')\n", + "\n", + "CB_HOST = os.getenv('CB_HOST') or input('Enter your Couchbase host (default: couchbase://localhost): ') or 'couchbase://localhost'\n", + "CB_USERNAME = os.getenv('CB_USERNAME') or input('Enter your Couchbase username (default: Administrator): ') or 'Administrator'\n", + "CB_PASSWORD = os.getenv('CB_PASSWORD') or getpass.getpass('Enter your Couchbase password (default: password): ') or 'password'\n", + "CB_BUCKET_NAME = os.getenv('CB_BUCKET_NAME') or input('Enter your Couchbase bucket name (default: vector-search-testing): ') or 'vector-search-testing'\n", + "INDEX_NAME = os.getenv('INDEX_NAME') or input('Enter your index name (default: vector_search_smolagents): ') or 'vector_search_smolagents'\n", + "SCOPE_NAME = os.getenv('SCOPE_NAME') or input('Enter your scope name (default: shared): ') or 'shared'\n", + "COLLECTION_NAME = os.getenv('COLLECTION_NAME') or input('Enter your collection name (default: smolagents): ') or 'smolagents'\n", + "\n", + "# Check if the variables are correctly loaded\n", + "if not OPENAI_API_KEY:\n", + " raise ValueError(\"Missing OpenAI API Key\")\n", + "\n", + "if 'OPENAI_API_KEY' not in os.environ:\n", + " os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qtGrYzUY8mV3" + }, + "source": [ + "# Connecting to the Couchbase Cluster\n", + "Connecting to a Couchbase cluster is the foundation of our project. Couchbase will serve as our primary data store, handling all the storage and retrieval operations required for our semantic search engine. By establishing this connection, we enable our application to interact with the database, allowing us to perform operations such as storing embeddings, querying data, and managing collections. 
This connection is the gateway through which all data will flow, so ensuring it's set up correctly is paramount.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "id": "Zb3kK-7W8mZK" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-02-28 10:30:17,515 - INFO - Successfully connected to Couchbase\n" + ] + } + ], + "source": [ + "try:\n", + " auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)\n", + " options = ClusterOptions(auth)\n", + " cluster = Cluster(CB_HOST, options)\n", + " cluster.wait_until_ready(timedelta(seconds=5))\n", + " logging.info(\"Successfully connected to Couchbase\")\n", + "except Exception as e:\n", + " raise ConnectionError(f\"Failed to connect to Couchbase: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C_Gpy32N8mcZ" + }, + "source": [ + "# Setting Up Collections in Couchbase\n", + "The setup_collection() function handles creating and configuring the hierarchical data organization in Couchbase:\n", + "\n", + "1. Bucket Creation:\n", + " - Checks if specified bucket exists, creates it if not\n", + " - Sets bucket properties like RAM quota (1024MB) and replication (disabled)\n", + " - Note: You will not be able to create a bucket on Capella\n", + "2. Scope Management:\n", + " - Verifies if requested scope exists within bucket\n", + " - Creates new scope if needed (unless it's the default \"_default\" scope)\n", + "3. Collection Setup:\n", + " - Checks for collection existence within scope\n", + " - Creates collection if it doesn't exist\n", + " - Waits 2 seconds for collection to be ready\n", + "\n", + "Additional Tasks:\n", + "\n", + "- Creates primary index on collection for query performance\n", + "- Clears any existing documents for clean state\n", + "- Implements comprehensive error handling and logging\n", + "\n", + "The function is then called once to set up the main collection, which stores the vector embeddings." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "id": "ACZcwUnG8mf2" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-02-28 10:30:20,855 - INFO - Bucket 'vector-search-testing' exists.\n", + "2025-02-28 10:30:21,350 - INFO - Collection 'smolagents' does not exist. Creating it...\n", + "2025-02-28 10:30:21,619 - INFO - Collection 'smolagents' created successfully.\n", + "2025-02-28 10:30:26,886 - INFO - Primary index present or created successfully.\n", + "2025-02-28 10:30:26,938 - INFO - All documents cleared from the collection.\n" + ] + }, + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def setup_collection(cluster, bucket_name, scope_name, collection_name):\n", + " try:\n", + " # Check if bucket exists, create if it doesn't\n", + " try:\n", + " bucket = cluster.bucket(bucket_name)\n", + " logging.info(f\"Bucket '{bucket_name}' exists.\")\n", + " except Exception as e:\n", + " logging.info(f\"Bucket '{bucket_name}' does not exist. 
Creating it...\")\n", + " bucket_settings = CreateBucketSettings(\n", + " name=bucket_name,\n", + " bucket_type='couchbase',\n", + " ram_quota_mb=1024,\n", + " flush_enabled=True,\n", + " num_replicas=0\n", + " )\n", + " cluster.buckets().create_bucket(bucket_settings)\n", + " time.sleep(2) # Wait for bucket creation to complete and become available\n", + " bucket = cluster.bucket(bucket_name)\n", + " logging.info(f\"Bucket '{bucket_name}' created successfully.\")\n", + "\n", + " bucket_manager = bucket.collections()\n", + "\n", + " # Check if scope exists, create if it doesn't\n", + " scopes = bucket_manager.get_all_scopes()\n", + " scope_exists = any(scope.name == scope_name for scope in scopes)\n", + " \n", + " if not scope_exists and scope_name != \"_default\":\n", + " logging.info(f\"Scope '{scope_name}' does not exist. Creating it...\")\n", + " bucket_manager.create_scope(scope_name)\n", + " logging.info(f\"Scope '{scope_name}' created successfully.\")\n", + "\n", + " # Check if collection exists, create if it doesn't\n", + " collections = bucket_manager.get_all_scopes()\n", + " collection_exists = any(\n", + " scope.name == scope_name and collection_name in [col.name for col in scope.collections]\n", + " for scope in collections\n", + " )\n", + "\n", + " if not collection_exists:\n", + " logging.info(f\"Collection '{collection_name}' does not exist. Creating it...\")\n", + " bucket_manager.create_collection(scope_name, collection_name)\n", + " logging.info(f\"Collection '{collection_name}' created successfully.\")\n", + " else:\n", + " logging.info(f\"Collection '{collection_name}' already exists. Skipping creation.\")\n", + "\n", + " # Wait for collection to be ready\n", + " collection = bucket.scope(scope_name).collection(collection_name)\n", + " time.sleep(2) # Give the collection time to be ready for queries\n", + "\n", + " # Ensure primary index exists\n", + " try:\n", + " cluster.query(f\"CREATE PRIMARY INDEX IF NOT EXISTS ON `{bucket_name}`.`{scope_name}`.`{collection_name}`\").execute()\n", + " logging.info(\"Primary index present or created successfully.\")\n", + " except Exception as e:\n", + " logging.warning(f\"Error creating primary index: {str(e)}\")\n", + "\n", + " # Clear all documents in the collection\n", + " try:\n", + " query = f\"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`\"\n", + " cluster.query(query).execute()\n", + " logging.info(\"All documents cleared from the collection.\")\n", + " except Exception as e:\n", + " logging.warning(f\"Error while clearing documents: {str(e)}. The collection might be empty.\")\n", + "\n", + " return collection\n", + " except Exception as e:\n", + " raise RuntimeError(f\"Error setting up collection: {str(e)}\")\n", + " \n", + "setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NMJ7RRYp8mjV" + }, + "source": [ + "# Loading Couchbase Vector Search Index\n", + "\n", + "Semantic search requires an efficient way to retrieve relevant documents based on a user's query. This is where the Couchbase **Vector Search Index** comes into play. In this step, we load the Vector Search Index definition from a JSON file, which specifies how the index should be structured. 
This includes the fields to be indexed, the dimensions of the vectors, and other parameters that determine how the search engine processes queries based on vector similarity.\n", + "\n", + "This vector search index configuration requires specific default settings to function properly. This tutorial uses the bucket named `vector-search-testing` with the scope `shared` and collection `smolagents`. The configuration is set up for vectors with exactly `1536 dimensions`, using dot product similarity and optimized for recall. If you want to use a different bucket, scope, or collection, you will need to modify the index configuration accordingly.\n", + "\n", + "For more information on creating a vector search index, please follow the [instructions](https://docs.couchbase.com/cloud/vector-search/create-vector-search-index-ui.html).\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "id": "y7xiCrOc8mmj" + }, + "outputs": [], + "source": [ + "# If you are running this script locally (not in Google Colab), uncomment the following line\n", + "# and provide the path to your index definition file.\n", + "\n", + "# index_definition_path = '/path_to_your_index_file/smolagents_index.json' # Local setup: specify your file path here\n", + "\n", + "# # Version for Google Colab\n", + "# def load_index_definition_colab():\n", + "# from google.colab import files\n", + "# print(\"Upload your index definition file\")\n", + "# uploaded = files.upload()\n", + "# index_definition_path = list(uploaded.keys())[0]\n", + "\n", + "# try:\n", + "# with open(index_definition_path, 'r') as file:\n", + "# index_definition = json.load(file)\n", + "# return index_definition\n", + "# except Exception as e:\n", + "# raise ValueError(f\"Error loading index definition from {index_definition_path}: {str(e)}\")\n", + "\n", + "# Version for Local Environment\n", + "def load_index_definition_local(index_definition_path):\n", + " try:\n", + " with open(index_definition_path, 'r') as file:\n", + " index_definition = json.load(file)\n", + " return index_definition\n", + " except Exception as e:\n", + " raise ValueError(f\"Error loading index definition from {index_definition_path}: {str(e)}\")\n", + "\n", + "# Usage\n", + "# Uncomment the appropriate line based on your environment\n", + "# index_definition = load_index_definition_colab()\n", + "index_definition = load_index_definition_local('smolagents_index.json')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "v_ddPQ_Y8mpm" + }, + "source": [ + "# Creating or Updating Search Indexes\n", + "\n", + "With the index definition loaded, the next step is to create or update the **Vector Search Index** in Couchbase. This step is crucial because it optimizes our database for vector similarity search operations, allowing us to perform searches based on the semantic content of documents rather than just keywords. By creating or updating a Vector Search Index, we enable our search engine to handle complex queries that involve finding semantically similar documents using vector embeddings, which is essential for a robust semantic search engine." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "id": "bHEpUu1l8msx" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-02-28 10:30:32,890 - INFO - Creating new index 'vector-search-testing.shared.vector_search_smolagents'...\n", + "2025-02-28 10:30:33,058 - INFO - Index 'vector-search-testing.shared.vector_search_smolagents' successfully created/updated.\n" + ] + } + ], + "source": [ + "try:\n", + " scope_index_manager = cluster.bucket(CB_BUCKET_NAME).scope(SCOPE_NAME).search_indexes()\n", + "\n", + " # Check if index already exists\n", + " existing_indexes = scope_index_manager.get_all_indexes()\n", + " index_name = index_definition[\"name\"]\n", + "\n", + " if index_name in [index.name for index in existing_indexes]:\n", + " logging.info(f\"Index '{index_name}' found\")\n", + " else:\n", + " logging.info(f\"Creating new index '{index_name}'...\")\n", + "\n", + " # Create SearchIndex object from JSON definition\n", + " search_index = SearchIndex.from_json(index_definition)\n", + "\n", + " # Upsert the index (create if not exists, update if exists)\n", + " scope_index_manager.upsert_index(search_index)\n", + " logging.info(f\"Index '{index_name}' successfully created/updated.\")\n", + "\n", + "except QueryIndexAlreadyExistsException:\n", + " logging.info(f\"Index '{index_name}' already exists. Skipping creation/update.\")\n", + "except ServiceUnavailableException:\n", + " raise RuntimeError(\"Search service is not available. Please ensure the Search service is enabled in your Couchbase cluster.\")\n", + "except InternalServerFailureException as e:\n", + " logging.error(f\"Internal server error: {str(e)}\")\n", + " raise" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7FvxRsg38m3G" + }, + "source": [ + "# Creating OpenAI Embeddings\n", + "Embeddings are at the heart of semantic search. They are numerical representations of text that capture the semantic meaning of the words and phrases. Unlike traditional keyword-based search, which looks for exact matches, embeddings allow our search engine to understand the context and nuances of language, enabling it to retrieve documents that are semantically similar to the query, even if they don't contain the exact keywords. By creating embeddings using OpenAI, we equip our search engine with the ability to understand and process natural language in a way that's much closer to how humans understand language. This step transforms our raw text data into a format that the search engine can use to find and rank relevant documents." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "id": "_75ZyCRh8m6m" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-02-28 10:30:36,983 - INFO - Successfully created OpenAIEmbeddings\n" + ] + } + ], + "source": [ + "try:\n", + " embeddings = OpenAIEmbeddings(\n", + " model=\"text-embedding-3-small\",\n", + " api_key=OPENAI_API_KEY,\n", + " )\n", + " logging.info(\"Successfully created OpenAIEmbeddings\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Error creating OpenAIEmbeddings: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8IwZMUnF8m-N" + }, + "source": [ + "# Setting Up the Couchbase Vector Store\n", + "A vector store is where we'll keep our embeddings. Unlike the Search (FTS) index, which is used for text-based search, the vector store is specifically designed to handle embeddings and perform similarity searches. 
When a user inputs a query, the search engine converts the query into an embedding and compares it against the embeddings stored in the vector store. This allows the engine to find documents that are semantically similar to the query, even if they don't contain the exact same words. By setting up the vector store in Couchbase, we create a powerful tool that enables our search engine to understand and retrieve information based on the meaning and context of the query, rather than just the specific words used." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "id": "DwIJQjYT9RV_" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-02-28 10:30:40,503 - INFO - Successfully created vector store\n" + ] + } + ], + "source": [ + "try:\n", + " vector_store = CouchbaseSearchVectorStore(\n", + " cluster=cluster,\n", + " bucket_name=CB_BUCKET_NAME,\n", + " scope_name=SCOPE_NAME,\n", + " collection_name=COLLECTION_NAME,\n", + " embedding=embeddings,\n", + " index_name=INDEX_NAME,\n", + " )\n", + " logging.info(\"Successfully created vector store\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Failed to create vector store: {str(e)}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Load the BBC News Dataset\n", + "To build a search engine, we need data to search through. We use the BBC News dataset from RealTimeData, which provides real-world news articles. This dataset contains news articles from BBC covering various topics and time periods. Loading the dataset is a crucial step because it provides the raw material that our search engine will work with. The quality and diversity of the news articles make it an excellent choice for testing and refining our search engine, ensuring it can handle real-world news content effectively.\n", + "\n", + "The BBC News dataset allows us to work with authentic news articles, enabling us to build and test a search engine that can effectively process and retrieve relevant news content. The dataset is loaded using the Hugging Face datasets library, specifically accessing the \"RealTimeData/bbc_news_alltime\" dataset with the \"2024-12\" version." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-02-28 10:30:51,981 - INFO - Successfully loaded the BBC News dataset with 2687 rows.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loaded the BBC News dataset with 2687 rows\n" + ] + } + ], + "source": [ + "try:\n", + " news_dataset = load_dataset(\n", + " \"RealTimeData/bbc_news_alltime\", \"2024-12\", split=\"train\"\n", + " )\n", + " print(f\"Loaded the BBC News dataset with {len(news_dataset)} rows\")\n", + " logging.info(f\"Successfully loaded the BBC News dataset with {len(news_dataset)} rows.\")\n", + "except Exception as e:\n", + " raise ValueError(f\"Error loading the BBC News dataset: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Cleaning up the Data\n", + "We will use the content of the news articles for our RAG system.\n", + "\n", + "The dataset contains a few duplicate records. We are removing them to avoid duplicate results in the retrieval stage of our RAG system." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "We have 1749 unique articles in our database.\n" + ] + } + ], + "source": [ + "news_articles = news_dataset[\"content\"]\n", + "unique_articles = set()\n", + "for article in news_articles:\n", + " if article:\n", + " unique_articles.add(article)\n", + "unique_news_articles = list(unique_articles)\n", + "print(f\"We have {len(unique_news_articles)} unique articles in our database.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Saving Data to the Vector Store\n", + "To efficiently handle the large number of articles, we process them in batches. This batch processing approach helps manage memory usage and provides better control over the ingestion process.\n", + "\n", + "We first filter out any articles that exceed 50,000 characters to avoid potential issues with token limits. Then, using the vector store's add_texts method, we add the filtered articles to our vector database. The batch_size parameter controls how many articles are processed in each iteration.\n", + "\n", + "This approach offers several benefits:\n", + "\n", + "1. Memory Efficiency: Processing in smaller batches prevents memory overload\n", + "2. Error Handling: If an error occurs, only the current batch is affected\n", + "3. Progress Tracking: Easier to monitor and track the ingestion progress\n", + "4. Resource Management: Better control over CPU and network resource utilization\n", + "\n", + "We use a conservative batch size of 100 to ensure reliable operation. The optimal batch size depends on many factors including:\n", + "\n", + "- Document sizes being inserted\n", + "- Available system resources\n", + "- Network conditions\n", + "- Concurrent workload\n", + "\n", + "Consider measuring performance with your specific workload before adjusting." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "# Save the current logging level\n", + "current_logging_level = logging.getLogger().getEffectiveLevel()\n", + "\n", + "# Set logging level to CRITICAL to suppress lower-level logs\n", + "logging.getLogger().setLevel(logging.CRITICAL)\n", + "\n", + "articles = [article for article in unique_news_articles if article and len(article) <= 50000]\n", + "\n", + "try:\n", + " vector_store.add_texts(\n", + " texts=articles,\n", + " batch_size=100\n", + " )\n", + "except Exception as e:\n", + " raise ValueError(f\"Failed to save documents to vector store: {str(e)}\")\n", + "\n", + "# Restore the original logging level\n", + "logging.getLogger().setLevel(current_logging_level)" + ] + },
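+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Before wiring the vector store into an agent, we can sanity-check the ingested data with a direct similarity search. This optional cell is a minimal sketch; the query string is arbitrary.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Optional sanity check: retrieve the top 3 semantically similar articles\n", + "results = vector_store.similarity_search_with_score(\"Manchester City's recent form\", k=3)\n", + "for doc, score in results:\n", + " print(f\"Score: {score:.4f} | {doc.page_content[:120]}...\")\n" + ] + },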
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# smolagents: An Introduction\n", + "[smolagents](https://huggingface.co/docs/smolagents/en/index) is an agentic framework by Hugging Face for creating agents in a few lines of code.\n", + "\n", + "Some of the features of smolagents are:\n", + "\n", + "- \u2728 Simplicity: the logic for agents fits in ~1,000 lines of code (see agents.py). We kept abstractions to their minimal shape above raw code!\n", + "\n", + "- \ud83e\uddd1\u200d\ud83d\udcbb First-class support for Code Agents. Our CodeAgent writes its actions in code (as opposed to \"agents being used to write code\"). To make it secure, we support executing in sandboxed environments via E2B.\n", + "\n", + "- \ud83e\udd17 Hub integrations: you can share/pull tools to/from the Hub, and more to come!\n", + "\n", + "- \ud83c\udf10 Model-agnostic: smolagents supports any LLM. It can be a local transformers or ollama model, one of many providers on the Hub, or any model from OpenAI, Anthropic and many others via our LiteLLM integration.\n", + "\n", + "- \ud83d\udc41\ufe0f Modality-agnostic: Agents support text, vision, video, even audio inputs! See this tutorial for vision.\n", + "\n", + "- \ud83d\udee0\ufe0f Tool-agnostic: you can use tools from LangChain, Anthropic's MCP, you can even use a Hub Space as a tool.\n", + "\n", + "# Building a RAG Agent using smolagents\n", + "\n", + "smolagents allows users to define their own tools for the agent to use. These tools can be of two types:\n", + "1. Tools defined as classes: These tools are subclassed from the `Tool` class and must override the `forward` method, which is called when the tool is used.\n", + "2. Tools defined as functions: These are simple functions that are called when the tool is used, and are decorated with the `@tool` decorator.\n", + "\n", + "In our case, we will use the first method, and we define our `RetrieverTool` below. We define a name, a description, and a dictionary of inputs that the tool accepts. This helps the LLM properly identify and use the tool.\n", + "\n", + "The `RetrieverTool` is simple: it takes a query generated by the user and uses Couchbase's performant vector search service under the hood to find documents that are semantically similar to the query. The LLM can then use this context to answer the user's question." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "class RetrieverTool(Tool):\n", + " name = \"retriever\"\n", + " description = \"Uses semantic search to retrieve the parts of news documentation that could be most relevant to answer your query.\"\n", + " inputs = {\n", + " \"query\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"The query to perform. This should be semantically close to your target documents. Use the affirmative form rather than a question.\",\n", + " }\n", + " }\n", + " output_type = \"string\"\n", + "\n", + " def __init__(self, vector_store: CouchbaseSearchVectorStore, **kwargs):\n", + " super().__init__(**kwargs)\n", + " self.vector_store = vector_store\n", + "\n", + " def forward(self, query: str) -> str:\n", + " assert isinstance(query, str), \"Query must be a string\"\n", + "\n", + " docs = self.vector_store.similarity_search_with_score(query, k=5)\n", + " return \"\\n\\n\".join(\n", + " f\"# Documents:\\n{doc.page_content}\"\n", + " for doc, score in docs\n", + " )\n", + "\n", + "retriever_tool = RetrieverTool(vector_store)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Defining Our Agent\n", + "smolagents has predefined configurations for agents that we can use. We use the `ToolCallingAgent`, which writes its tool calls in JSON format. Alternatively, there is also a `CodeAgent`, in which the LLM writes its actions in code.\n", + "\n", + "The `CodeAgent` offers benefits in certain challenging scenarios: it can lead to [higher performance in difficult benchmarks](https://huggingface.co/papers/2411.01747) and use [30% fewer steps to solve problems](https://huggingface.co/papers/2402.01030). 
However, since our use case is just a simple RAG tool, a `ToolCallingAgent` will suffice." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "agent = ToolCallingAgent(\n", + " tools=[retriever_tool],\n", + " model=OpenAIServerModel(\n", + " model_id=\"gpt-4o-2024-08-06\",\n", + " api_key=OPENAI_API_KEY,\n", + " ),\n", + " max_steps=4,\n", + " verbosity_level=2\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Running our Agent\n", + "We have now finished setting up our vector store and agent! The system is now ready to accept queries." + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 New run \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2502 What was manchester city manager pep guardiola's reaction to the team's current form?                           \u2502\n",
+       "\u2502                                                                                                                 \u2502\n",
+       "\u2570\u2500 OpenAIServerModel - gpt-4o-2024-08-06 \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[38;2;212;183;2m\u256d\u2500\u001b[0m\u001b[38;2;212;183;2m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[38;2;212;183;2m \u001b[0m\u001b[1;38;2;212;183;2mNew run\u001b[0m\u001b[38;2;212;183;2m \u001b[0m\u001b[38;2;212;183;2m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[38;2;212;183;2m\u2500\u256e\u001b[0m\n", + "\u001b[38;2;212;183;2m\u2502\u001b[0m \u001b[38;2;212;183;2m\u2502\u001b[0m\n", + "\u001b[38;2;212;183;2m\u2502\u001b[0m \u001b[1mWhat was manchester city manager pep guardiola's reaction to the team's current form?\u001b[0m \u001b[38;2;212;183;2m\u2502\u001b[0m\n", + "\u001b[38;2;212;183;2m\u2502\u001b[0m \u001b[38;2;212;183;2m\u2502\u001b[0m\n", + "\u001b[38;2;212;183;2m\u2570\u2500\u001b[0m\u001b[38;2;212;183;2m OpenAIServerModel - gpt-4o-2024-08-06 \u001b[0m\u001b[38;2;212;183;2m\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u001b[0m\u001b[38;2;212;183;2m\u2500\u256f\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501 Step 1 \u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[38;2;212;183;2m\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501 \u001b[0m\u001b[1mStep \u001b[0m\u001b[1;36m1\u001b[0m\u001b[38;2;212;183;2m \u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-02-28 10:32:28,032 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n" + ] + }, + { + "data": { + "text/html": [ + "
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n",
+       "\u2502 Calling tool: 'retriever' with arguments: {'query': \"Pep Guardiola's reaction to Manchester City's current      \u2502\n",
+       "\u2502 form\"}                                                                                                          \u2502\n",
+       "\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n",
+       "
\n" + ], + "text/plain": [ + "\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n", + "\u2502 Calling tool: 'retriever' with arguments: {'query': \"Pep Guardiola's reaction to Manchester City's current \u2502\n", + "\u2502 form\"} \u2502\n", + "\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-02-28 10:32:28,466 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n" + ] + }, + { + "data": { + "text/html": [ + "
[Step 0: Duration 2.25 seconds| Input tokens: 1,010 | Output tokens: 23]\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2m[Step 0: Duration 2.25 seconds| Input tokens: 1,010 | Output tokens: 23]\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501 Step 2 \u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[38;2;212;183;2m\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501 \u001b[0m\u001b[1mStep \u001b[0m\u001b[1;36m2\u001b[0m\u001b[38;2;212;183;2m \u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-02-28 10:32:31,724 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n" + ] + }, + { + "data": { + "text/html": [ + "
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n",
+       "\u2502 Calling tool: 'final_answer' with arguments: {'answer': 'Manchester City manager Pep Guardiola has expressed a  \u2502\n",
+       "\u2502 mix of concern and determination regarding the team\\'s current form. Guardiola admitted that this is the worst  \u2502\n",
+       "\u2502 run of results in his managerial career and that it has affected his sleep and diet. He described his state of  \u2502\n",
+       "\u2502 mind as \"ugly\" and acknowledged that City needs to defend better and avoid making mistakes. Despite his         \u2502\n",
+       "\u2502 personal challenges, Guardiola stated that he is \"fine\" and focused on finding solutions.\\n\\nGuardiola also     \u2502\n",
+       "\u2502 took responsibility for the team\\'s struggles, stating he is \"not good enough\" and has to find solutions. He    \u2502\n",
+       "\u2502 expressed self-doubt but is striving to improve the team\\'s situation step by step. Guardiola has faced         \u2502\n",
+       "\u2502 criticism due to the team\\'s poor form, which has seen them lose several matches and fall behind in the title   \u2502\n",
+       "\u2502 race.\\n\\nHe emphasized the need to restore their defensive strength and regain confidence in their play.        \u2502\n",
+       "\u2502 Guardiola is planning a significant rebuild of the squad to address these challenges, aiming to replace several \u2502\n",
+       "\u2502 regular starters and emphasize improvements in the team\\'s intensity and defensive concepts.'}                  \u2502\n",
+       "\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n",
+       "
\n" + ], + "text/plain": [ + "\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n", + "\u2502 Calling tool: 'final_answer' with arguments: {'answer': 'Manchester City manager Pep Guardiola has expressed a \u2502\n", + "\u2502 mix of concern and determination regarding the team\\'s current form. Guardiola admitted that this is the worst \u2502\n", + "\u2502 run of results in his managerial career and that it has affected his sleep and diet. He described his state of \u2502\n", + "\u2502 mind as \"ugly\" and acknowledged that City needs to defend better and avoid making mistakes. Despite his \u2502\n", + "\u2502 personal challenges, Guardiola stated that he is \"fine\" and focused on finding solutions.\\n\\nGuardiola also \u2502\n", + "\u2502 took responsibility for the team\\'s struggles, stating he is \"not good enough\" and has to find solutions. He \u2502\n", + "\u2502 expressed self-doubt but is striving to improve the team\\'s situation step by step. Guardiola has faced \u2502\n", + "\u2502 criticism due to the team\\'s poor form, which has seen them lose several matches and fall behind in the title \u2502\n", + "\u2502 race.\\n\\nHe emphasized the need to restore their defensive strength and regain confidence in their play. \u2502\n", + "\u2502 Guardiola is planning a significant rebuild of the squad to address these challenges, aiming to replace several \u2502\n", + "\u2502 regular starters and emphasize improvements in the team\\'s intensity and defensive concepts.'} \u2502\n", + "\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
Final answer: Manchester City manager Pep Guardiola has expressed a mix of concern and determination regarding the \n",
+       "team's current form. Guardiola admitted that this is the worst run of results in his managerial career and that it \n",
+       "has affected his sleep and diet. He described his state of mind as \"ugly\" and acknowledged that City needs to \n",
+       "defend better and avoid making mistakes. Despite his personal challenges, Guardiola stated that he is \"fine\" and \n",
+       "focused on finding solutions.\n",
+       "\n",
+       "Guardiola also took responsibility for the team's struggles, stating he is \"not good enough\" and has to find \n",
+       "solutions. He expressed self-doubt but is striving to improve the team's situation step by step. Guardiola has \n",
+       "faced criticism due to the team's poor form, which has seen them lose several matches and fall behind in the title \n",
+       "race.\n",
+       "\n",
+       "He emphasized the need to restore their defensive strength and regain confidence in their play. Guardiola is \n",
+       "planning a significant rebuild of the squad to address these challenges, aiming to replace several regular starters\n",
+       "and emphasize improvements in the team's intensity and defensive concepts.\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[1;38;2;212;183;2mFinal answer: Manchester City manager Pep Guardiola has expressed a mix of concern and determination regarding the \u001b[0m\n", + "\u001b[1;38;2;212;183;2mteam's current form. Guardiola admitted that this is the worst run of results in his managerial career and that it \u001b[0m\n", + "\u001b[1;38;2;212;183;2mhas affected his sleep and diet. He described his state of mind as \"ugly\" and acknowledged that City needs to \u001b[0m\n", + "\u001b[1;38;2;212;183;2mdefend better and avoid making mistakes. Despite his personal challenges, Guardiola stated that he is \"fine\" and \u001b[0m\n", + "\u001b[1;38;2;212;183;2mfocused on finding solutions.\u001b[0m\n", + "\n", + "\u001b[1;38;2;212;183;2mGuardiola also took responsibility for the team's struggles, stating he is \"not good enough\" and has to find \u001b[0m\n", + "\u001b[1;38;2;212;183;2msolutions. He expressed self-doubt but is striving to improve the team's situation step by step. Guardiola has \u001b[0m\n", + "\u001b[1;38;2;212;183;2mfaced criticism due to the team's poor form, which has seen them lose several matches and fall behind in the title \u001b[0m\n", + "\u001b[1;38;2;212;183;2mrace.\u001b[0m\n", + "\n", + "\u001b[1;38;2;212;183;2mHe emphasized the need to restore their defensive strength and regain confidence in their play. Guardiola is \u001b[0m\n", + "\u001b[1;38;2;212;183;2mplanning a significant rebuild of the squad to address these challenges, aiming to replace several regular starters\u001b[0m\n", + "\u001b[1;38;2;212;183;2mand emphasize improvements in the team's intensity and defensive concepts.\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
[Step 1: Duration 2.74 seconds| Input tokens: 7,162 | Output tokens: 241]\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[2m[Step 1: Duration 2.74 seconds| Input tokens: 7,162 | Output tokens: 241]\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "query = \"What was manchester city manager pep guardiola's reaction to the team's current form?\"\n", + "\n", + "agent_output = agent.run(query)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Analyzing the Agent\n", + "When the agent runs, smolagents prints out the steps that the agent takes along with the tools called in each step. In the above tool call, two steps occur:\n", + "\n", + "**Step 1**: First, the agent determines that it requires a tool to be used, and the `retriever` tool is called. The agent also specifies the query parameter for the tool (a string). The tool returns semantically similar documents to the query from Couchbase's vector store.\n", + "\n", + "**Step 2**: Next, the agent determines that the context retrieved from the tool is sufficient to answer the question. It then calls the `final_answer` tool, which is predefined for each agent: this tool is called when the agent returns the final answer to the user. In this step, the LLM answers the user's query from the context retrieved in step 1 and passes it to the `final_answer` tool, at which point the agent's execution ends." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Conclusion\n", + "\n", + "By following these steps, you\u2019ll have a fully functional agentic RAG system that leverages the strengths of Couchbase and smolagents, along with OpenAI. This guide is designed not just to show you how to build the system, but also to explain why each step is necessary, giving you a deeper understanding of the principles behind semantic search and how to implement it effectively. Whether you\u2019re a newcomer to software development or an experienced developer looking to expand your skills, this guide will provide you with the knowledge and tools you need to create a powerful, RAG-driven chat system." + ] + } + ], + "metadata": { + "colab": { + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.11" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/smolagents/fts/frontmatter.md b/smolagents/search_based/frontmatter.md similarity index 100% rename from smolagents/fts/frontmatter.md rename to smolagents/search_based/frontmatter.md diff --git a/smolagents/fts/smolagents_index.json b/smolagents/search_based/smolagents_index.json similarity index 100% rename from smolagents/fts/smolagents_index.json rename to smolagents/search_based/smolagents_index.json