Commit 8cb8715

Updated tutorial to use couchbase hyperscale vector index
1 parent 97392d9 commit 8cb8715

File tree

2 files changed: +39 −61 lines


autovec-tutorial/__frontmatter__.md

Lines changed: 2 additions & 2 deletions
Original file line number · Diff line number · Diff line change
@@ -1,10 +1,10 @@
11
---
22
# frontmatter
33
path: "/tutorial-couchbase-autovectorization-langchain"
4-
title: Auto-Vectorization with Couchbase Capella AI Services and LangChain
4+
title: Auto-Vectorization of Structured Data with Couchbase Capella AI Services
55
short_title: Auto-Vectorization with Couchbase and LangChain
66
description:
7-
- Learn how to use Couchbase Capella's AI Services auto-vectorization feature to automatically convert your data into vector embeddings.
7+
- Learn how to use Couchbase Capella's AI Services auto-vectorization feature to automatically convert your structured data into vector embeddings. To learn about the auto-vectorization of unstructured data, read the following tutorial: ["/tutorial"].
88
- This tutorial demonstrates how to set up automated embedding generation workflows and perform semantic search using LangChain.
99
content_type: tutorial
1010
filter: sdk

autovec-tutorial/autovec_langchain.ipynb

Lines changed: 37 additions & 59 deletions
Original file line number · Diff line number · Diff line change
@@ -144,7 +144,7 @@
144144
"source": [
145145
"# 5. Vector Search\n",
146146
"\n",
147-
"The following code cells implement semantic vector search against the embeddings generated by the AutoVectorization workflow. These searches are powered by **Couchbase's Search service**.\n",
147+
"The following code cells implement semantic vector search against the embeddings generated by the AutoVectorization workflow. These searches are powered by **Couchbase's Query service using Hyperscale Vector Indexes**.\n",
148148
"\n",
149149
"Before you proceed, make sure the following packages are installed by running:"
150150
]
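The install command itself is not shown in this hunk. Based on the imports that appear later in the diff (`couchbase`, `langchain_openai`, `langchain_couchbase`), it is presumably something along these lines (the package names are inferred, not taken from the commit):

```shell
# Assumed package names, inferred from the notebook's imports;
# pin versions to match your environment as needed.
pip install couchbase langchain-openai langchain-couchbase
```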
@@ -178,7 +178,7 @@
178178
},
179179
{
180180
"cell_type": "code",
181-
"execution_count": 6,
181+
"execution_count": 20,
182182
"id": "30955126-0053-4cec-9dec-e4c05a8de7c3",
183183
"metadata": {},
184184
"outputs": [],
@@ -188,7 +188,8 @@
188188
"from couchbase.options import ClusterOptions\n",
189189
"\n",
190190
"from langchain_openai import OpenAIEmbeddings\n",
191-
"from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore\n",
191+
"from langchain_couchbase.vectorstores import CouchbaseQueryVectorStore\n",
192+
"from langchain_couchbase.vectorstores import DistanceStrategy\n",
192193
"\n",
193194
"from datetime import timedelta"
194195
]
@@ -204,19 +205,17 @@
204205
},
205206
{
206207
"cell_type": "code",
207-
"execution_count": 7,
208+
"execution_count": 3,
208209
"id": "7e4c9e8d",
209210
"metadata": {},
210211
"outputs": [],
211212
"source": [
212-
"endpoint = \"couchbases://cb.f-znsfdbilcp-ja4.sandbox.nonprod-project-avengers.com\" # Replace this with Connection String\n",
213-
"username = \"testing\" # Replace this with your username\n",
214-
"password = \"Testing@1\" # Replace this with your password\n",
213+
"endpoint = \"CLUSTER_CONNECT_STRING\" # Replace this with Connection String\n",
214+
"username = \"CLUSTER_USERNAME\"\n",
215+
"password = \"CLUSTER_PASSWORD\"\n",
215216
"auth = PasswordAuthenticator(username, password)\n",
216-
"\n",
217-
"options = ClusterOptions(auth, tls_verify='none')\n",
217+
"options = ClusterOptions(auth)\n",
218218
"cluster = Cluster(endpoint, options)\n",
219-
"\n",
220219
"cluster.wait_until_ready(timedelta(seconds=5))"
221220
]
222221
},
@@ -227,7 +226,6 @@
227226
"source": [
228227
"# Selection of Buckets / Scope / Collection / Index / Embedder\n",
229228
" - Sets the bucket, scope, and collection where the documents (with vector fields) live.\n",
230-
" - `index_name` specifies the **Capella Search index name**. This is the Search index created automatically during the workflow setup (step 4.5) or manually as described in the same step. You can find this index name in the **Search** tab of your Capella cluster.\n",
231229
" - `embedder` instantiates the NVIDIA embedding model that will transform the user's natural language query into a vector at search time.\n",
232230
" - `openai_api_key` is the API key token created in `step 3.2 -3`.\n",
233231
" - `openai_api_base` is the Capella model services endpoint found in the models section.\n",
@@ -238,21 +236,18 @@
238236
},
239237
{
240238
"cell_type": "code",
241-
"execution_count": null,
239+
"execution_count": 21,
242240
"id": "799b2efc",
243241
"metadata": {},
244242
"outputs": [],
245243
"source": [
246244
"bucket_name = \"travel-sample\"\n",
247245
"scope_name = \"inventory\"\n",
248246
"collection_name = \"hotel\"\n",
249-
"index_name = \"hyperscale_autovec_workflow_vec_addr_descr_id\" # This is the name of the search index that was created in step 4.5 and can also be seen in the search tab of the cluster.\n",
250-
" # It should be noted that hyperscale_workflow_name_index_fieldname is the naming convention for the index created by AutoVectorization workflow where\n",
251-
" # fieldname is the name of the field being indexed.\n",
252247
"\n",
253248
"# Using the OpenAI SDK for the embeddings with the capella model services and they are compatible with the OpenAIEmbeddings class in Langchain\n",
254249
"embedder = OpenAIEmbeddings(\n",
255-
" model=\"nvidia/llama-3.2-nv-embedqa-1b-v2\", # This is the model that will be used to create the embedding of the query.\n",
250+
" model=\"nvidia/llama-3.2-nv-embedqa-1b-v2\", # This is the model that will be used to create the embedding of the query.\n",
256251
" openai_api_key=\"CAPELLA_MODEL_KEY\",\n",
257252
" openai_api_base=\"CAPELLA_MODEL_ENDPOINT/v1\",\n",
258253
" check_embedding_ctx_length=False,\n",
@@ -266,47 +261,29 @@
266261
"metadata": {},
267262
"source": [
268263
"# VectorStore Construction\n",
269-
" - Creates a [CouchbaseSearchVectorStore](https://couchbase-ecosystem.github.io/langchain-couchbase/langchain_couchbase.html#couchbase-search-vector-store) instance that interfaces with **Couchbase's Search service** to perform vector similarity searches.\n",
264+
" - Creates a [CouchbaseQueryVectorStore](https://couchbase-ecosystem.github.io/langchain-couchbase/langchain_couchbase.html#couchbase-query-vector-store) instance that interfaces with **Couchbase's Query service** to perform vector similarity searches using [Hyperscale/Composite](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html) indexes. \n",
270265
" - The vector store:\n",
271266
" * Knows where to read documents (`bucket/scope/collection`).\n",
272-
" * References the Search index (`index_name`) that contains vector field mappings.\n",
273267
" * Knows the embedding field (the vector produced by the Auto-Vectorization workflow).\n",
274268
" * Uses the provided embedder to embed queries on-demand for similarity search.\n",
275-
" - If your AutoVectorization workflow produced a different vector field name, update `embedding_key` accordingly.\n",
276-
" - If you mapped multiple fields into a single vector, you can choose any representative field for `text_key`, or modify the VectorStore wrapper to concatenate fields."
269+
" - You can choose any representative field for `text_key`; it determines which field is returned as each result's text. Here we use the document's `title`."
277270
]
278271
},
279272
{
280273
"cell_type": "code",
281-
"execution_count": 21,
274+
"execution_count": 17,
282275
"id": "50b85f78",
283276
"metadata": {},
284-
"outputs": [
285-
{
286-
"ename": "ValueError",
287-
"evalue": "Index hyperscale_autovec_workflow_vec_addr_descr_id does not exist. Please create the index before searching.",
288-
"output_type": "error",
289-
"traceback": [
290-
"\u001b[31m---------------------------------------------------------------------------\u001b[39m",
291-
"\u001b[31mValueError\u001b[39m Traceback (most recent call last)",
292-
"\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[21]\u001b[39m\u001b[32m, line 1\u001b[39m\n\u001b[32m----> \u001b[39m\u001b[32m1\u001b[39m vector_store = \u001b[43mCouchbaseSearchVectorStore\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 2\u001b[39m \u001b[43m \u001b[49m\u001b[43mcluster\u001b[49m\u001b[43m=\u001b[49m\u001b[43mcluster\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 3\u001b[39m \u001b[43m \u001b[49m\u001b[43mbucket_name\u001b[49m\u001b[43m=\u001b[49m\u001b[43mbucket_name\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 4\u001b[39m \u001b[43m \u001b[49m\u001b[43mscope_name\u001b[49m\u001b[43m=\u001b[49m\u001b[43mscope_name\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 5\u001b[39m \u001b[43m \u001b[49m\u001b[43mcollection_name\u001b[49m\u001b[43m=\u001b[49m\u001b[43mcollection_name\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 6\u001b[39m \u001b[43m \u001b[49m\u001b[43membedding\u001b[49m\u001b[43m=\u001b[49m\u001b[43membedder\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 7\u001b[39m \u001b[43m \u001b[49m\u001b[43mindex_name\u001b[49m\u001b[43m=\u001b[49m\u001b[43mindex_name\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 8\u001b[39m \u001b[43m \u001b[49m\u001b[43mtext_key\u001b[49m\u001b[43m=\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43maddress\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;66;43;03m# Your document's text field\u001b[39;49;00m\n\u001b[32m 9\u001b[39m \u001b[43m \u001b[49m\u001b[43membedding_key\u001b[49m\u001b[43m=\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mvec_addr_descr_id\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m \u001b[49m\u001b[38;5;66;43;03m# This is the field in which your vector (embedding) is stored in the cluster.\u001b[39;49;00m\n\u001b[32m 10\u001b[39m \u001b[43m)\u001b[49m\n",
293-
"\u001b[36mFile \u001b[39m\u001b[32m~/vector-search-cookbook/.venv/lib/python3.14/site-packages/langchain_couchbase/vectorstores/search_vector_store.py:267\u001b[39m, in \u001b[36mCouchbaseSearchVectorStore.__init__\u001b[39m\u001b[34m(self, cluster, bucket_name, scope_name, collection_name, embedding, index_name, text_key, embedding_key, scoped_index)\u001b[39m\n\u001b[32m 265\u001b[39m \u001b[38;5;28mself\u001b[39m._check_index_exists()\n\u001b[32m 266\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mException\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m e:\n\u001b[32m--> \u001b[39m\u001b[32m267\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m e\n",
294-
"\u001b[36mFile \u001b[39m\u001b[32m~/vector-search-cookbook/.venv/lib/python3.14/site-packages/langchain_couchbase/vectorstores/search_vector_store.py:265\u001b[39m, in \u001b[36mCouchbaseSearchVectorStore.__init__\u001b[39m\u001b[34m(self, cluster, bucket_name, scope_name, collection_name, embedding, index_name, text_key, embedding_key, scoped_index)\u001b[39m\n\u001b[32m 263\u001b[39m \u001b[38;5;66;03m# Check if the index exists. Throws ValueError if it doesn't\u001b[39;00m\n\u001b[32m 264\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m--> \u001b[39m\u001b[32m265\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_check_index_exists\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 266\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mException\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m e:\n\u001b[32m 267\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m e\n",
295-
"\u001b[36mFile \u001b[39m\u001b[32m~/vector-search-cookbook/.venv/lib/python3.14/site-packages/langchain_couchbase/vectorstores/search_vector_store.py:192\u001b[39m, in \u001b[36mCouchbaseSearchVectorStore._check_index_exists\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 188\u001b[39m all_indexes = [\n\u001b[32m 189\u001b[39m index.name \u001b[38;5;28;01mfor\u001b[39;00m index \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m._scope.search_indexes().get_all_indexes()\n\u001b[32m 190\u001b[39m ]\n\u001b[32m 191\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m._index_name \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m all_indexes:\n\u001b[32m--> \u001b[39m\u001b[32m192\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[32m 193\u001b[39m \u001b[33mf\u001b[39m\u001b[33m\"\u001b[39m\u001b[33mIndex \u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mself\u001b[39m._index_name\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m does not exist. \u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 194\u001b[39m \u001b[33m\"\u001b[39m\u001b[33m Please create the index before searching.\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 195\u001b[39m )\n\u001b[32m 196\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m 197\u001b[39m all_indexes = [\n\u001b[32m 198\u001b[39m index.name \u001b[38;5;28;01mfor\u001b[39;00m index \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m._cluster.search_indexes().get_all_indexes()\n\u001b[32m 199\u001b[39m ]\n",
296-
"\u001b[31mValueError\u001b[39m: Index hyperscale_autovec_workflow_vec_addr_descr_id does not exist. Please create the index before searching."
297-
]
298-
}
299-
],
277+
"outputs": [],
300278
"source": [
301-
"vector_store = CouchbaseSearchVectorStore(\n",
279+
"vector_store = CouchbaseQueryVectorStore(\n",
302280
" cluster=cluster,\n",
303281
" bucket_name=bucket_name,\n",
304282
" scope_name=scope_name,\n",
305283
" collection_name=collection_name,\n",
306284
" embedding=embedder,\n",
307-
" index_name=index_name,\n",
308-
" text_key=\"address\", # Your document's text field\n",
309-
" embedding_key=\"vec_addr_descr_id\" # This is the field in which your vector (embedding) is stored in the cluster.\n",
285+
" text_key=\"title\",\n",
286+
" distance_metric=DistanceStrategy.DOT\n",
310287
")"
311288
]
312289
},
@@ -318,26 +295,32 @@
318295
"# Performing a Similarity Search\n",
319296
" - Defines a natural language query (e.g., \"Woodhead Road\").\n",
320297
" - Calls `similarity_search(k=3)` to retrieve the top 3 most semantically similar documents.\n",
321-
" - Prints ranked results, extracting a `title` (if present) and the chosen `text_key` (here `address`).\n",
322-
" - Change `query` to any descriptive phrase (e.g., \"beach resort\", \"airport hotel near NYC\").\n",
298+
" - Prints ranked results, extracting a `title` (if present) and the chosen `text_key` (here `title`).\n",
299+
" - Change `query` to any descriptive phrase (e.g., \"airport hotel near NYC\").\n",
323300
" - Adjust `k` for more or fewer results."
324301
]
325302
},
326303
{
327304
"cell_type": "code",
328-
"execution_count": null,
305+
"execution_count": 18,
329306
"id": "177fd6d5",
330307
"metadata": {},
331-
"outputs": [],
308+
"outputs": [
309+
{
310+
"name": "stdout",
311+
"output_type": "stream",
312+
"text": [
313+
"page_content='Gillingham (Kent)'\n",
314+
"page_content='Gillingham (Kent)'\n",
315+
"page_content='Giverny'\n"
316+
]
317+
}
318+
],
332319
"source": [
333-
"query = \"What hotels are there in USA?\"\n",
320+
"query = \"Which hotels have the best infrastructure and very good services?\"\n",
334321
"results = vector_store.similarity_search(query, k=3)\n",
335-
"\n",
336-
"# Print out the top-k results\n",
337-
"for rank, doc in enumerate(results, start=1):\n",
338-
" title = doc.metadata.get(\"title\", \"<no title>\")\n",
339-
" address_text = doc.page_content\n",
340-
" print(f\"{rank}. {title} — Address: {address_text}\")"
322+
"for doc in results:\n",
323+
" print(doc)"
341324
]
342325
},
343326
{
@@ -348,15 +331,10 @@
348331
"## 6. Results and Interpretation\n",
349332
"\n",
350333
"As we can see, 3 (or `k`) ranked results are printed in the output.\n",
351-
"\n",
352-
"### What Each Part Means\n",
353-
"- Leading number (1, 2, 3): The result rank (1 = most similar to your query).\n",
354-
"- Title: Pulled from `doc.metadata.get(\"title\", \"<no title>\")`. If your documents don't contain a `title` field, you will see `<no title>`.\n",
355-
"- Address text: This is the value of the field you configured as `text_key` (in this tutorial: `address`). It represents the human-readable content we chose to display.\n",
334+
"- `page_content`: This is the value of the field you configured as `text_key` (in this tutorial: `title`). It represents the human-readable content we chose to display.\n",
356335
"\n",
357336
"### How the Ranking Works\n",
358-
"1. Your natural language query (e.g., `\"Woodhead Road\"`) is embedded using the NVIDIA model (`nvidia/llama-3.2-nv-embedqa-1b-v2`).\n",
359-
"2. The vector store compares the query embedding to stored document embeddings in the field you configured (`embedding_key = \"vec_addr_descr_id\"`).\n",
337+
"1. Your natural language query (e.g., `\"Which hotels have the best Infrastucture and the services of the hotel is very good?\"`) is embedded using the NVIDIA model (`nvidia/llama-3.2-nv-embedqa-1b-v2`).\n",
360338
"2. Results are sorted by vector similarity. Higher similarity = closer semantic meaning.\n",
361339
"\n",
362340
"\n",
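The ranking step can be illustrated with a small self-contained sketch (plain Python, no Couchbase connection needed). It orders toy document vectors by dot-product similarity, the same idea selected earlier via `distance_metric=DistanceStrategy.DOT`; the vectors and document names here are invented for illustration, and the library may internally treat the metric as a distance rather than a raw score, so treat this as conceptual only:

```python
# Minimal sketch of dot-product ranking, illustrating how a vector store
# orders results. The 3-d vectors are toy values, not real model embeddings.

def rank_by_dot(query_vec, docs, k=3):
    """Return the top-k (name, score) pairs by dot-product similarity."""
    scored = []
    for name, vec in docs.items():
        score = sum(q * d for q, d in zip(query_vec, vec))
        scored.append((name, score))
    # Higher dot product = more similar (for comparably scaled vectors).
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

docs = {
    "Gillingham (Kent)": [0.9, 0.1, 0.0],
    "Giverny":           [0.7, 0.3, 0.1],
    "Airport Hotel":     [0.1, 0.9, 0.2],
}
query = [1.0, 0.2, 0.0]

# Prints the documents in descending similarity order.
for rank, (name, score) in enumerate(rank_by_dot(query, docs), start=1):
    print(f"{rank}. {name} (score={score:.2f})")
```

With comparably scaled embeddings, a larger dot product means closer semantic meaning, which is why the top-ranked documents are the ones that best match the query.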
