|
144 | 144 | "source": [ |
145 | 145 | "# 5. Vector Search\n", |
146 | 146 | "\n", |
147 | | - "The following code cells implement semantic vector search against the embeddings generated by the AutoVectorization workflow. These searches are powered by **Couchbase's Search service**.\n", |
|     | 147 | + "The following code cells implement semantic vector search against the embeddings generated by the AutoVectorization workflow. These searches are powered by **Couchbase's Query service using Hyperscale Vector Indexes**.\n",
148 | 148 | "\n", |
149 | 149 | "Before you proceed, make sure the following packages are installed by running:" |
150 | 150 | ] |
|
178 | 178 | }, |
179 | 179 | { |
180 | 180 | "cell_type": "code", |
181 | | - "execution_count": 6, |
| 181 | + "execution_count": 20, |
182 | 182 | "id": "30955126-0053-4cec-9dec-e4c05a8de7c3", |
183 | 183 | "metadata": {}, |
184 | 184 | "outputs": [], |
|
188 | 188 | "from couchbase.options import ClusterOptions\n", |
189 | 189 | "\n", |
190 | 190 | "from langchain_openai import OpenAIEmbeddings\n", |
191 | | - "from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore\n", |
| 191 | + "from langchain_couchbase.vectorstores import CouchbaseQueryVectorStore\n", |
| 192 | + "from langchain_couchbase.vectorstores import DistanceStrategy\n", |
192 | 193 | "\n", |
193 | 194 | "from datetime import timedelta" |
194 | 195 | ] |
|
204 | 205 | }, |
205 | 206 | { |
206 | 207 | "cell_type": "code", |
207 | | - "execution_count": 7, |
| 208 | + "execution_count": 3, |
208 | 209 | "id": "7e4c9e8d", |
209 | 210 | "metadata": {}, |
210 | 211 | "outputs": [], |
211 | 212 | "source": [ |
212 | | - "endpoint = \"couchbases://cb.f-znsfdbilcp-ja4.sandbox.nonprod-project-avengers.com\" # Replace this with Connection String\n", |
213 | | - "username = \"testing\" # Replace this with your username\n", |
214 | | - "password = \"Testing@1\" # Replace this with your password\n", |
|     | 213 | + "endpoint = \"CLUSTER_CONNECT_STRING\" # Replace this with your connection string\n",
|     | 214 | + "username = \"CLUSTER_USERNAME\" # Replace this with your username\n",
|     | 215 | + "password = \"CLUSTER_PASSWORD\" # Replace this with your password\n",
215 | 216 | "auth = PasswordAuthenticator(username, password)\n", |
216 | | - "\n", |
217 | | - "options = ClusterOptions(auth, tls_verify='none')\n", |
| 217 | + "options = ClusterOptions(auth)\n", |
218 | 218 | "cluster = Cluster(endpoint, options)\n", |
219 | | - "\n", |
220 | 219 | "cluster.wait_until_ready(timedelta(seconds=5))" |
221 | 220 | ] |
222 | 221 | }, |
|
227 | 226 | "source": [ |
228 | 227 | "# Selection of Buckets / Scope / Collection / Index / Embedder\n", |
229 | 228 | " - Sets the bucket, scope, and collection where the documents (with vector fields) live.\n", |
230 | | - " - `index_name` specifies the **Capella Search index name**. This is the Search index created automatically during the workflow setup (step 4.5) or manually as described in the same step. You can find this index name in the **Search** tab of your Capella cluster.\n", |
231 | 229 | " - `embedder` instantiates the NVIDIA embedding model that will transform the user's natural language query into a vector at search time.\n", |
232 | 230 | " - `openai_api_key` is the API key created in step 3.2-3.\n",
233 | 231 | " - `openai_api_base` is the Capella Model Services endpoint found in the Models section.\n",
|
238 | 236 | }, |
239 | 237 | { |
240 | 238 | "cell_type": "code", |
241 | | - "execution_count": null, |
| 239 | + "execution_count": 21, |
242 | 240 | "id": "799b2efc", |
243 | 241 | "metadata": {}, |
244 | 242 | "outputs": [], |
245 | 243 | "source": [ |
246 | 244 | "bucket_name = \"travel-sample\"\n", |
247 | 245 | "scope_name = \"inventory\"\n", |
248 | 246 | "collection_name = \"hotel\"\n", |
249 | | - "index_name = \"hyperscale_autovec_workflow_vec_addr_descr_id\" # This is the name of the search index that was created in step 4.5 and can also be seen in the search tab of the cluster.\n", |
250 | | - " # It should be noted that hyperscale_workflow_name_index_fieldname is the naming convention for the index created by AutoVectorization workflow where\n", |
251 | | - " # fieldname is the name of the field being indexed.\n", |
252 | 247 | "\n", |
253 | 248 | "# Capella Model Services expose an OpenAI-compatible API, so LangChain's OpenAIEmbeddings class works against them\n",
254 | 249 | "embedder = OpenAIEmbeddings(\n", |
255 | | - " model=\"nvidia/llama-3.2-nv-embedqa-1b-v2\", # This is the model that will be used to create the embedding of the query.\n", |
| 250 | + " model=\"nvidia/llama-3.2-nv-embedqa-1b-v2\", # This is the model that will be used to create the embedding of the query.\n", |
256 | 251 | " openai_api_key=\"CAPELLA_MODEL_KEY\",\n", |
257 | 252 | " openai_api_base=\"CAPELLA_MODEL_ENDPOINT/v1\",\n", |
258 | 253 | " check_embedding_ctx_length=False,\n", |
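The `OpenAIEmbeddings` cell above works because Capella Model Services expose an OpenAI-compatible API: at search time the client POSTs to the `/embeddings` route under the configured base URL. A minimal sketch of that request shape (the endpoint value is the placeholder from the cell above, not a real URL, and no network call is made):

```python
import json

def build_embeddings_request(base_url: str, model: str, texts: list[str]):
    """Sketch of the JSON request an OpenAI-compatible client sends for embeddings."""
    url = f"{base_url.rstrip('/')}/embeddings"  # base_url already ends in /v1
    body = json.dumps({"model": model, "input": texts})
    return url, body

url, body = build_embeddings_request(
    "CAPELLA_MODEL_ENDPOINT/v1",  # placeholder, as in the cell above
    "nvidia/llama-3.2-nv-embedqa-1b-v2",
    ["Which hotels have the best infrastructure?"],
)
```

The response contains one embedding vector per input string; LangChain handles this round-trip for you when the vector store embeds a query.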
|
266 | 261 | "metadata": {}, |
267 | 262 | "source": [ |
268 | 263 | "# VectorStore Construction\n", |
269 | | - " - Creates a [CouchbaseSearchVectorStore](https://couchbase-ecosystem.github.io/langchain-couchbase/langchain_couchbase.html#couchbase-search-vector-store) instance that interfaces with **Couchbase's Search service** to perform vector similarity searches.\n", |
| 264 | + " - Creates a [CouchbaseQueryVectorStore](https://couchbase-ecosystem.github.io/langchain-couchbase/langchain_couchbase.html#couchbase-query-vector-store) instance that interfaces with **Couchbase's Query service** to perform vector similarity searches using [Hyperscale/Composite](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html) indexes. \n", |
270 | 265 | " - The vector store:\n", |
271 | 266 | " * Knows where to read documents (`bucket/scope/collection`).\n", |
272 | | - " * References the Search index (`index_name`) that contains vector field mappings.\n", |
273 | 267 | " * Knows the embedding field (the vector produced by the Auto-Vectorization workflow).\n", |
274 | 268 | " * Uses the provided embedder to embed queries on-demand for similarity search.\n", |
275 | | - " - If your AutoVectorization workflow produced a different vector field name, update `embedding_key` accordingly.\n", |
276 | | - " - If you mapped multiple fields into a single vector, you can choose any representative field for `text_key`, or modify the VectorStore wrapper to concatenate fields." |
|     | 269 | + " - Set `text_key` to whichever field should be returned as each result's text; here we use the document's `title`."
277 | 270 | ] |
278 | 271 | }, |
279 | 272 | { |
280 | 273 | "cell_type": "code", |
281 | | - "execution_count": 21, |
| 274 | + "execution_count": 17, |
282 | 275 | "id": "50b85f78", |
283 | 276 | "metadata": {}, |
284 | | - "outputs": [ |
285 | | - { |
286 | | - "ename": "ValueError", |
287 | | - "evalue": "Index hyperscale_autovec_workflow_vec_addr_descr_id does not exist. Please create the index before searching.", |
288 | | - "output_type": "error", |
289 | | - "traceback": [ |
290 | | - "\u001b[31m---------------------------------------------------------------------------\u001b[39m", |
291 | | - "\u001b[31mValueError\u001b[39m Traceback (most recent call last)", |
292 | | - "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[21]\u001b[39m\u001b[32m, line 1\u001b[39m\n\u001b[32m----> \u001b[39m\u001b[32m1\u001b[39m vector_store = \u001b[43mCouchbaseSearchVectorStore\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 2\u001b[39m \u001b[43m \u001b[49m\u001b[43mcluster\u001b[49m\u001b[43m=\u001b[49m\u001b[43mcluster\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 3\u001b[39m \u001b[43m \u001b[49m\u001b[43mbucket_name\u001b[49m\u001b[43m=\u001b[49m\u001b[43mbucket_name\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 4\u001b[39m \u001b[43m \u001b[49m\u001b[43mscope_name\u001b[49m\u001b[43m=\u001b[49m\u001b[43mscope_name\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 5\u001b[39m \u001b[43m \u001b[49m\u001b[43mcollection_name\u001b[49m\u001b[43m=\u001b[49m\u001b[43mcollection_name\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 6\u001b[39m \u001b[43m \u001b[49m\u001b[43membedding\u001b[49m\u001b[43m=\u001b[49m\u001b[43membedder\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 7\u001b[39m \u001b[43m \u001b[49m\u001b[43mindex_name\u001b[49m\u001b[43m=\u001b[49m\u001b[43mindex_name\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 8\u001b[39m \u001b[43m \u001b[49m\u001b[43mtext_key\u001b[49m\u001b[43m=\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43maddress\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;66;43;03m# Your document's text field\u001b[39;49;00m\n\u001b[32m 9\u001b[39m \u001b[43m \u001b[49m\u001b[43membedding_key\u001b[49m\u001b[43m=\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mvec_addr_descr_id\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m \u001b[49m\u001b[38;5;66;43;03m# This is the field in which your vector (embedding) is stored in the cluster.\u001b[39;49;00m\n\u001b[32m 10\u001b[39m \u001b[43m)\u001b[49m\n", |
293 | | - "\u001b[36mFile \u001b[39m\u001b[32m~/vector-search-cookbook/.venv/lib/python3.14/site-packages/langchain_couchbase/vectorstores/search_vector_store.py:267\u001b[39m, in \u001b[36mCouchbaseSearchVectorStore.__init__\u001b[39m\u001b[34m(self, cluster, bucket_name, scope_name, collection_name, embedding, index_name, text_key, embedding_key, scoped_index)\u001b[39m\n\u001b[32m 265\u001b[39m \u001b[38;5;28mself\u001b[39m._check_index_exists()\n\u001b[32m 266\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mException\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m e:\n\u001b[32m--> \u001b[39m\u001b[32m267\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m e\n", |
294 | | - "\u001b[36mFile \u001b[39m\u001b[32m~/vector-search-cookbook/.venv/lib/python3.14/site-packages/langchain_couchbase/vectorstores/search_vector_store.py:265\u001b[39m, in \u001b[36mCouchbaseSearchVectorStore.__init__\u001b[39m\u001b[34m(self, cluster, bucket_name, scope_name, collection_name, embedding, index_name, text_key, embedding_key, scoped_index)\u001b[39m\n\u001b[32m 263\u001b[39m \u001b[38;5;66;03m# Check if the index exists. Throws ValueError if it doesn't\u001b[39;00m\n\u001b[32m 264\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m--> \u001b[39m\u001b[32m265\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_check_index_exists\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 266\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mException\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m e:\n\u001b[32m 267\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m e\n", |
295 | | - "\u001b[36mFile \u001b[39m\u001b[32m~/vector-search-cookbook/.venv/lib/python3.14/site-packages/langchain_couchbase/vectorstores/search_vector_store.py:192\u001b[39m, in \u001b[36mCouchbaseSearchVectorStore._check_index_exists\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 188\u001b[39m all_indexes = [\n\u001b[32m 189\u001b[39m index.name \u001b[38;5;28;01mfor\u001b[39;00m index \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m._scope.search_indexes().get_all_indexes()\n\u001b[32m 190\u001b[39m ]\n\u001b[32m 191\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m._index_name \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m all_indexes:\n\u001b[32m--> \u001b[39m\u001b[32m192\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[32m 193\u001b[39m \u001b[33mf\u001b[39m\u001b[33m\"\u001b[39m\u001b[33mIndex \u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mself\u001b[39m._index_name\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m does not exist. \u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 194\u001b[39m \u001b[33m\"\u001b[39m\u001b[33m Please create the index before searching.\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 195\u001b[39m )\n\u001b[32m 196\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m 197\u001b[39m all_indexes = [\n\u001b[32m 198\u001b[39m index.name \u001b[38;5;28;01mfor\u001b[39;00m index \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m._cluster.search_indexes().get_all_indexes()\n\u001b[32m 199\u001b[39m ]\n", |
296 | | - "\u001b[31mValueError\u001b[39m: Index hyperscale_autovec_workflow_vec_addr_descr_id does not exist. Please create the index before searching." |
297 | | - ] |
298 | | - } |
299 | | - ], |
| 277 | + "outputs": [], |
300 | 278 | "source": [ |
301 | | - "vector_store = CouchbaseSearchVectorStore(\n", |
| 279 | + "vector_store = CouchbaseQueryVectorStore(\n", |
302 | 280 | " cluster=cluster,\n", |
303 | 281 | " bucket_name=bucket_name,\n", |
304 | 282 | " scope_name=scope_name,\n", |
305 | 283 | " collection_name=collection_name,\n", |
306 | 284 | " embedding=embedder,\n", |
307 | | - " index_name=index_name,\n", |
308 | | - " text_key=\"address\", # Your document's text field\n", |
309 | | - " embedding_key=\"vec_addr_descr_id\" # This is the field in which your vector (embedding) is stored in the cluster.\n", |
|     | 285 | + "    text_key=\"title\",\n",
| 286 | + " distance_metric=DistanceStrategy.DOT\n", |
310 | 287 | ")" |
311 | 288 | ] |
312 | 289 | }, |
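`CouchbaseQueryVectorStore.similarity_search` needs a live cluster, so as an aside, here is a toy in-memory stand-in (not the real class, and `toy_embed` is a made-up embedder) that mimics the same interface with dot-product scoring, matching the `DistanceStrategy.DOT` choice above:

```python
class ToyVectorStore:
    """In-memory stand-in for illustration only; not CouchbaseQueryVectorStore."""

    def __init__(self, embed_fn, docs):
        # docs: list of (text, vector) pairs standing in for collection documents
        self.embed_fn = embed_fn
        self.docs = docs

    def similarity_search(self, query, k=3):
        qv = self.embed_fn(query)
        # With DOT distance, a larger dot product means a closer match
        scored = sorted(self.docs, key=lambda d: -sum(a * b for a, b in zip(qv, d[1])))
        return [text for text, _ in scored[:k]]

def toy_embed(text):
    # Fake 2-d "embedding" based on word presence, purely illustrative
    t = text.lower()
    return [float("hotel" in t), float("airport" in t)]

store = ToyVectorStore(toy_embed, [
    ("Airport hotel near NYC", [1.0, 1.0]),
    ("Beach resort", [0.2, 0.0]),
    ("Downtown hotel", [1.0, 0.0]),
])
results = store.similarity_search("airport hotel", k=2)
```

The real vector store does the same thing at scale: embed the query once, score it against the stored vectors in the Hyperscale index, and return the top `k` documents.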
|
318 | 295 | "# Performing a Similarity Search\n", |
319 | 296 | " - Defines a natural language query (e.g., \"Woodhead Road\").\n", |
320 | 297 | " - Calls `similarity_search(k=3)` to retrieve the top 3 most semantically similar documents.\n", |
321 | | - " - Prints ranked results, extracting a `title` (if present) and the chosen `text_key` (here `address`).\n", |
322 | | - " - Change `query` to any descriptive phrase (e.g., \"beach resort\", \"airport hotel near NYC\").\n", |
|     | 298 | + " - Prints each result; its `page_content` holds the value of the chosen `text_key` (here `title`).\n",
| 299 | + " - Change `query` to any descriptive phrase (e.g., \"airport hotel near NYC\").\n", |
323 | 300 | " - Adjust `k` for more or fewer results." |
324 | 301 | ] |
325 | 302 | }, |
326 | 303 | { |
327 | 304 | "cell_type": "code", |
328 | | - "execution_count": null, |
| 305 | + "execution_count": 18, |
329 | 306 | "id": "177fd6d5", |
330 | 307 | "metadata": {}, |
331 | | - "outputs": [], |
| 308 | + "outputs": [ |
| 309 | + { |
| 310 | + "name": "stdout", |
| 311 | + "output_type": "stream", |
| 312 | + "text": [ |
| 313 | + "page_content='Gillingham (Kent)'\n", |
| 314 | + "page_content='Gillingham (Kent)'\n", |
| 315 | + "page_content='Giverny'\n" |
| 316 | + ] |
| 317 | + } |
| 318 | + ], |
332 | 319 | "source": [ |
333 | | - "query = \"What hotels are there in USA?\"\n", |
|     | 320 | + "query = \"Which hotels have the best infrastructure and very good service?\"\n",
334 | 321 | "results = vector_store.similarity_search(query, k=3)\n", |
335 | | - "\n", |
336 | | - "# Print out the top-k results\n", |
337 | | - "for rank, doc in enumerate(results, start=1):\n", |
338 | | - " title = doc.metadata.get(\"title\", \"<no title>\")\n", |
339 | | - " address_text = doc.page_content\n", |
340 | | - " print(f\"{rank}. {title} — Address: {address_text}\")" |
| 322 | + "for doc in results:\n", |
| 323 | + " print(doc)" |
341 | 324 | ] |
342 | 325 | }, |
343 | 326 | { |
|
348 | 331 | "## 6. Results and Interpretation\n", |
349 | 332 | "\n", |
350 | 333 | "As we can see, the top `k` (here 3) ranked results are printed in the output.\n",
351 | | - "\n", |
352 | | - "### What Each Part Means\n", |
353 | | - "- Leading number (1, 2, 3): The result rank (1 = most similar to your query).\n", |
354 | | - "- Title: Pulled from `doc.metadata.get(\"title\", \"<no title>\")`. If your documents don't contain a `title` field, you will see `<no title>`.\n", |
355 | | - "- Address text: This is the value of the field you configured as `text_key` (in this tutorial: `address`). It represents the human-readable content we chose to display.\n", |
|     | 334 | + "Each result's `page_content` is the value of the field you configured as `text_key` (in this tutorial: `title`). It represents the human-readable content we chose to display.\n",
356 | 335 | "\n", |
357 | 336 | "### How the Ranking Works\n", |
358 | | - "1. Your natural language query (e.g., `\"Woodhead Road\"`) is embedded using the NVIDIA model (`nvidia/llama-3.2-nv-embedqa-1b-v2`).\n", |
359 | | - "2. The vector store compares the query embedding to stored document embeddings in the field you configured (`embedding_key = \"vec_addr_descr_id\"`).\n", |
|     | 337 | + "1. Your natural language query (e.g., `\"Which hotels have the best infrastructure and very good service?\"`) is embedded using the NVIDIA model (`nvidia/llama-3.2-nv-embedqa-1b-v2`).\n",
360 |     | - "3. Results are sorted by vector similarity. Higher similarity = closer semantic meaning.\n",
|     | 338 | + "2. Results are sorted by vector similarity. Higher similarity = closer semantic meaning.\n",
361 | 339 | "\n", |
362 | 340 | "\n", |
|