|
144 | 144 | "source": [ |
145 | 145 | "# 5. Vector Search\n", |
146 | 146 | "\n", |
147 | | - "The following code cells implement semantic vector search against the embeddings generated by the AutoVectorization workflow. These searches are powered by **Couchbase's Search service**.\n", |
|     | 147 | + "The following code cells implement semantic vector search against the embeddings generated by the AutoVectorization workflow. These searches are powered by **Couchbase's Query service using Hyperscale Vector Indexes**.\n",
148 | 148 | "\n", |
149 | 149 | "Before you proceed, make sure the following packages are installed by running:" |
150 | 150 | ] |
|
178 | 178 | }, |
179 | 179 | { |
180 | 180 | "cell_type": "code", |
181 | | - "execution_count": 6, |
| 181 | + "execution_count": 20, |
182 | 182 | "id": "30955126-0053-4cec-9dec-e4c05a8de7c3", |
183 | 183 | "metadata": {}, |
184 | 184 | "outputs": [], |
|
188 | 188 | "from couchbase.options import ClusterOptions\n", |
189 | 189 | "\n", |
190 | 190 | "from langchain_openai import OpenAIEmbeddings\n", |
191 | | - "from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore\n", |
| 191 | + "from langchain_couchbase.vectorstores import CouchbaseQueryVectorStore\n", |
| 192 | + "from langchain_couchbase.vectorstores import DistanceStrategy\n", |
192 | 193 | "\n", |
193 | 194 | "from datetime import timedelta" |
194 | 195 | ] |
|
204 | 205 | }, |
205 | 206 | { |
206 | 207 | "cell_type": "code", |
207 | | - "execution_count": 7, |
| 208 | + "execution_count": 3, |
208 | 209 | "id": "7e4c9e8d", |
209 | 210 | "metadata": {}, |
210 | 211 | "outputs": [], |
211 | 212 | "source": [ |
212 | | - "endpoint = \"couchbases://cb.f-znsfdbilcp-ja4.sandbox.nonprod-project-avengers.com\" # Replace this with Connection String\n", |
213 | | - "username = \"testing\" # Replace this with your username\n", |
214 | | - "password = \"Testing@1\" # Replace this with your password\n", |
|     | 213 | + "endpoint = \"CLUSTER_CONNECT_STRING\" # Replace this with your connection string\n",
|     | 214 | + "username = \"CLUSTER_USERNAME\" # Replace this with your username\n",
|     | 215 | + "password = \"CLUSTER_PASSWORD\" # Replace this with your password\n",
215 | 216 | "auth = PasswordAuthenticator(username, password)\n", |
216 | | - "\n", |
217 | | - "options = ClusterOptions(auth, tls_verify='none')\n", |
| 217 | + "options = ClusterOptions(auth)\n", |
218 | 218 | "cluster = Cluster(endpoint, options)\n", |
219 | | - "\n", |
220 | 219 | "cluster.wait_until_ready(timedelta(seconds=5))" |
221 | 220 | ] |
222 | 221 | }, |
|
227 | 226 | "source": [ |
228 | 227 | "# Selection of Buckets / Scope / Collection / Index / Embedder\n", |
229 | 228 | " - Sets the bucket, scope, and collection where the documents (with vector fields) live.\n", |
230 | | - " - `index_name` specifies the **Capella Search index name**. This is the Search index created automatically during the workflow setup (step 4.5) or manually as described in the same step. You can find this index name in the **Search** tab of your Capella cluster.\n", |
231 | 229 | " - `embedder` instantiates the NVIDIA embedding model that will transform the user's natural language query into a vector at search time.\n", |
232 | 230 | " - `openai_api_key` is the API key created in step 3.2-3.\n",
233 | 231 | " - `openai_api_base` is the Capella Model Services endpoint found in the Models section.\n",
|
238 | 236 | }, |
239 | 237 | { |
240 | 238 | "cell_type": "code", |
241 | | - "execution_count": null, |
| 239 | + "execution_count": 21, |
242 | 240 | "id": "799b2efc", |
243 | 241 | "metadata": {}, |
244 | 242 | "outputs": [], |
245 | 243 | "source": [ |
246 | 244 | "bucket_name = \"travel-sample\"\n", |
247 | 245 | "scope_name = \"inventory\"\n", |
248 | 246 | "collection_name = \"hotel\"\n", |
249 | | - "index_name = \"hyperscale_autovec_workflow_vec_addr_descr_id\" # This is the name of the search index that was created in step 4.5 and can also be seen in the search tab of the cluster.\n", |
250 | | - " # It should be noted that hyperscale_workflow_name_index_fieldname is the naming convention for the index created by AutoVectorization workflow where\n", |
251 | | - " # fieldname is the name of the field being indexed.\n", |
252 | 247 | "\n", |
253 | 248 | "# Capella Model Services expose an OpenAI-compatible API, so LangChain's OpenAIEmbeddings class works against them\n",
254 | 249 | "embedder = OpenAIEmbeddings(\n", |
255 | | - " model=\"nvidia/llama-3.2-nv-embedqa-1b-v2\", # This is the model that will be used to create the embedding of the query.\n", |
| 250 | + " model=\"nvidia/llama-3.2-nv-embedqa-1b-v2\", # This is the model that will be used to create the embedding of the query.\n", |
256 | 251 | " openai_api_key=\"CAPELLA_MODEL_KEY\",\n", |
257 | 252 | " openai_api_base=\"CAPELLA_MODEL_ENDPOINT/v1\",\n", |
258 | 253 | " check_embedding_ctx_length=False,\n", |
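The `OpenAIEmbeddings` cell above works because Capella Model Services expose an OpenAI-compatible API: at search time the client POSTs to the `/embeddings` route under the configured base URL. A minimal sketch of that request shape (the endpoint value is the placeholder from the cell above, not a real URL, and no network call is made):

```python
import json

def build_embeddings_request(base_url: str, model: str, texts: list[str]):
    """Sketch of the JSON request an OpenAI-compatible client sends for embeddings."""
    url = f"{base_url.rstrip('/')}/embeddings"  # base_url already ends in /v1
    body = json.dumps({"model": model, "input": texts})
    return url, body

url, body = build_embeddings_request(
    "CAPELLA_MODEL_ENDPOINT/v1",  # placeholder, as in the cell above
    "nvidia/llama-3.2-nv-embedqa-1b-v2",
    ["Which hotels have the best infrastructure?"],
)
```

The response contains one embedding vector per input string; LangChain handles this round-trip for you when the vector store embeds a query.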
|
266 | 261 | "metadata": {}, |
267 | 262 | "source": [ |
268 | 263 | "# VectorStore Construction\n", |
269 | | - " - Creates a [CouchbaseSearchVectorStore](https://couchbase-ecosystem.github.io/langchain-couchbase/langchain_couchbase.html#couchbase-search-vector-store) instance that interfaces with **Couchbase's Search service** to perform vector similarity searches.\n", |
| 264 | + " - Creates a [CouchbaseQueryVectorStore](https://couchbase-ecosystem.github.io/langchain-couchbase/langchain_couchbase.html#couchbase-query-vector-store) instance that interfaces with **Couchbase's Query service** to perform vector similarity searches using [Hyperscale/Composite](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html) indexes. \n", |
270 | 265 | " - The vector store:\n", |
271 | 266 | " * Knows where to read documents (`bucket/scope/collection`).\n", |
272 | | - " * References the Search index (`index_name`) that contains vector field mappings.\n", |
273 | 267 | " * Knows the embedding field (the vector produced by the Auto-Vectorization workflow).\n", |
274 | 268 | " * Uses the provided embedder to embed queries on-demand for similarity search.\n", |
275 | | - " - If your AutoVectorization workflow produced a different vector field name, update `embedding_key` accordingly.\n", |
276 | | - " - If you mapped multiple fields into a single vector, you can choose any representative field for `text_key`, or modify the VectorStore wrapper to concatenate fields." |
|     | 269 | + " - Set `text_key` to whichever field should be returned as each result's text; here we use the document's `title`."
277 | 270 | ] |
278 | 271 | }, |
279 | 272 | { |
280 | 273 | "cell_type": "code", |
281 | | - "execution_count": 21, |
| 274 | + "execution_count": 17, |
282 | 275 | "id": "50b85f78", |
283 | 276 | "metadata": {}, |
284 | | - "outputs": [ |
285 | | - { |
286 | | - "ename": "ValueError", |
287 | | - "evalue": "Index hyperscale_autovec_workflow_vec_addr_descr_id does not exist. Please create the index before searching.", |
288 | | - "output_type": "error", |
289 | | - "traceback": [ |
290 | | - "\u001b[31m---------------------------------------------------------------------------\u001b[39m", |
291 | | - "\u001b[31mValueError\u001b[39m Traceback (most recent call last)", |
292 | | - "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[21]\u001b[39m\u001b[32m, line 1\u001b[39m\n\u001b[32m----> \u001b[39m\u001b[32m1\u001b[39m vector_store = \u001b[43mCouchbaseSearchVectorStore\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 2\u001b[39m \u001b[43m \u001b[49m\u001b[43mcluster\u001b[49m\u001b[43m=\u001b[49m\u001b[43mcluster\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 3\u001b[39m \u001b[43m \u001b[49m\u001b[43mbucket_name\u001b[49m\u001b[43m=\u001b[49m\u001b[43mbucket_name\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 4\u001b[39m \u001b[43m \u001b[49m\u001b[43mscope_name\u001b[49m\u001b[43m=\u001b[49m\u001b[43mscope_name\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 5\u001b[39m \u001b[43m \u001b[49m\u001b[43mcollection_name\u001b[49m\u001b[43m=\u001b[49m\u001b[43mcollection_name\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 6\u001b[39m \u001b[43m \u001b[49m\u001b[43membedding\u001b[49m\u001b[43m=\u001b[49m\u001b[43membedder\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 7\u001b[39m \u001b[43m \u001b[49m\u001b[43mindex_name\u001b[49m\u001b[43m=\u001b[49m\u001b[43mindex_name\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 8\u001b[39m \u001b[43m \u001b[49m\u001b[43mtext_key\u001b[49m\u001b[43m=\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43maddress\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;66;43;03m# Your document's text field\u001b[39;49;00m\n\u001b[32m 9\u001b[39m \u001b[43m \u001b[49m\u001b[43membedding_key\u001b[49m\u001b[43m=\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mvec_addr_descr_id\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m \u001b[49m\u001b[38;5;66;43;03m# This is the field in which your vector (embedding) is stored in the cluster.\u001b[39;49;00m\n\u001b[32m 10\u001b[39m \u001b[43m)\u001b[49m\n", |
293 | | - "\u001b[36mFile \u001b[39m\u001b[32m~/vector-search-cookbook/.venv/lib/python3.14/site-packages/langchain_couchbase/vectorstores/search_vector_store.py:267\u001b[39m, in \u001b[36mCouchbaseSearchVectorStore.__init__\u001b[39m\u001b[34m(self, cluster, bucket_name, scope_name, collection_name, embedding, index_name, text_key, embedding_key, scoped_index)\u001b[39m\n\u001b[32m 265\u001b[39m \u001b[38;5;28mself\u001b[39m._check_index_exists()\n\u001b[32m 266\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mException\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m e:\n\u001b[32m--> \u001b[39m\u001b[32m267\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m e\n", |
294 | | - "\u001b[36mFile \u001b[39m\u001b[32m~/vector-search-cookbook/.venv/lib/python3.14/site-packages/langchain_couchbase/vectorstores/search_vector_store.py:265\u001b[39m, in \u001b[36mCouchbaseSearchVectorStore.__init__\u001b[39m\u001b[34m(self, cluster, bucket_name, scope_name, collection_name, embedding, index_name, text_key, embedding_key, scoped_index)\u001b[39m\n\u001b[32m 263\u001b[39m \u001b[38;5;66;03m# Check if the index exists. Throws ValueError if it doesn't\u001b[39;00m\n\u001b[32m 264\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m--> \u001b[39m\u001b[32m265\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_check_index_exists\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 266\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mException\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m e:\n\u001b[32m 267\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m e\n", |
295 | | - "\u001b[36mFile \u001b[39m\u001b[32m~/vector-search-cookbook/.venv/lib/python3.14/site-packages/langchain_couchbase/vectorstores/search_vector_store.py:192\u001b[39m, in \u001b[36mCouchbaseSearchVectorStore._check_index_exists\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 188\u001b[39m all_indexes = [\n\u001b[32m 189\u001b[39m index.name \u001b[38;5;28;01mfor\u001b[39;00m index \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m._scope.search_indexes().get_all_indexes()\n\u001b[32m 190\u001b[39m ]\n\u001b[32m 191\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m._index_name \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m all_indexes:\n\u001b[32m--> \u001b[39m\u001b[32m192\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[32m 193\u001b[39m \u001b[33mf\u001b[39m\u001b[33m\"\u001b[39m\u001b[33mIndex \u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mself\u001b[39m._index_name\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m does not exist. \u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 194\u001b[39m \u001b[33m\"\u001b[39m\u001b[33m Please create the index before searching.\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 195\u001b[39m )\n\u001b[32m 196\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m 197\u001b[39m all_indexes = [\n\u001b[32m 198\u001b[39m index.name \u001b[38;5;28;01mfor\u001b[39;00m index \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m._cluster.search_indexes().get_all_indexes()\n\u001b[32m 199\u001b[39m ]\n", |
296 | | - "\u001b[31mValueError\u001b[39m: Index hyperscale_autovec_workflow_vec_addr_descr_id does not exist. Please create the index before searching." |
297 | | - ] |
298 | | - } |
299 | | - ], |
| 277 | + "outputs": [], |
300 | 278 | "source": [ |
301 | | - "vector_store = CouchbaseSearchVectorStore(\n", |
| 279 | + "vector_store = CouchbaseQueryVectorStore(\n", |
302 | 280 | " cluster=cluster,\n", |
303 | 281 | " bucket_name=bucket_name,\n", |
304 | 282 | " scope_name=scope_name,\n", |
305 | 283 | " collection_name=collection_name,\n", |
306 | 284 | " embedding=embedder,\n", |
307 | | - " index_name=index_name,\n", |
308 | | - " text_key=\"address\", # Your document's text field\n", |
309 | | - " embedding_key=\"vec_addr_descr_id\" # This is the field in which your vector (embedding) is stored in the cluster.\n", |
|     | 285 | + "    text_key=\"title\",\n",
| 286 | + " distance_metric=DistanceStrategy.DOT\n", |
310 | 287 | ")" |
311 | 288 | ] |
312 | 289 | }, |
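`CouchbaseQueryVectorStore.similarity_search` needs a live cluster, so as an aside, here is a toy in-memory stand-in (not the real class, and `toy_embed` is a made-up embedder) that mimics the same interface with dot-product scoring, matching the `DistanceStrategy.DOT` choice above:

```python
class ToyVectorStore:
    """In-memory stand-in for illustration only; not CouchbaseQueryVectorStore."""

    def __init__(self, embed_fn, docs):
        # docs: list of (text, vector) pairs standing in for collection documents
        self.embed_fn = embed_fn
        self.docs = docs

    def similarity_search(self, query, k=3):
        qv = self.embed_fn(query)
        # With DOT distance, a larger dot product means a closer match
        scored = sorted(self.docs, key=lambda d: -sum(a * b for a, b in zip(qv, d[1])))
        return [text for text, _ in scored[:k]]

def toy_embed(text):
    # Fake 2-d "embedding" based on word presence, purely illustrative
    t = text.lower()
    return [float("hotel" in t), float("airport" in t)]

store = ToyVectorStore(toy_embed, [
    ("Airport hotel near NYC", [1.0, 1.0]),
    ("Beach resort", [0.2, 0.0]),
    ("Downtown hotel", [1.0, 0.0]),
])
results = store.similarity_search("airport hotel", k=2)
```

The real vector store does the same thing at scale: embed the query once, score it against the stored vectors in the Hyperscale index, and return the top `k` documents.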
|
318 | 295 | "# Performing a Similarity Search\n", |
319 | 296 | " - Defines a natural language query (e.g., \"Woodhead Road\").\n", |
320 | 297 | " - Calls `similarity_search(k=3)` to retrieve the top 3 most semantically similar documents.\n", |
321 | | - " - Prints ranked results, extracting a `title` (if present) and the chosen `text_key` (here `address`).\n", |
322 | | - " - Change `query` to any descriptive phrase (e.g., \"beach resort\", \"airport hotel near NYC\").\n", |
|     | 298 | + " - Prints each result; its `page_content` holds the value of the chosen `text_key` (here `title`).\n",
| 299 | + " - Change `query` to any descriptive phrase (e.g., \"airport hotel near NYC\").\n", |
323 | 300 | " - Adjust `k` for more or fewer results." |
324 | 301 | ] |
325 | 302 | }, |
326 | 303 | { |
327 | 304 | "cell_type": "code", |
328 | | - "execution_count": null, |
| 305 | + "execution_count": 18, |
329 | 306 | "id": "177fd6d5", |
330 | 307 | "metadata": {}, |
331 | | - "outputs": [], |
| 308 | + "outputs": [ |
| 309 | + { |
| 310 | + "name": "stdout", |
| 311 | + "output_type": "stream", |
| 312 | + "text": [ |
| 313 | + "page_content='Gillingham (Kent)'\n", |
| 314 | + "page_content='Gillingham (Kent)'\n", |
| 315 | + "page_content='Giverny'\n" |
| 316 | + ] |
| 317 | + } |
| 318 | + ], |
332 | 319 | "source": [ |
333 | | - "query = \"What hotels are there in USA?\"\n", |
|     | 320 | + "query = \"Which hotels have the best infrastructure and very good service?\"\n",
334 | 321 | "results = vector_store.similarity_search(query, k=3)\n", |
335 | | - "\n", |
336 | | - "# Print out the top-k results\n", |
337 | | - "for rank, doc in enumerate(results, start=1):\n", |
338 | | - " title = doc.metadata.get(\"title\", \"<no title>\")\n", |
339 | | - " address_text = doc.page_content\n", |
340 | | - " print(f\"{rank}. {title} — Address: {address_text}\")" |
| 322 | + "for doc in results:\n", |
| 323 | + " print(doc)" |
341 | 324 | ] |
342 | 325 | }, |
343 | 326 | { |
|
348 | 331 | "## 6. Results and Interpretation\n", |
349 | 332 | "\n", |
350 | 333 | "As we can see, the top `k` (here 3) ranked results are printed in the output.\n",
351 | | - "\n", |
352 | | - "### What Each Part Means\n", |
353 | | - "- Leading number (1, 2, 3): The result rank (1 = most similar to your query).\n", |
354 | | - "- Title: Pulled from `doc.metadata.get(\"title\", \"<no title>\")`. If your documents don't contain a `title` field, you will see `<no title>`.\n", |
355 | | - "- Address text: This is the value of the field you configured as `text_key` (in this tutorial: `address`). It represents the human-readable content we chose to display.\n", |
|     | 334 | + "Each result's `page_content` is the value of the field you configured as `text_key` (in this tutorial: `title`). It represents the human-readable content we chose to display.\n",
356 | 335 | "\n", |
357 | 336 | "### How the Ranking Works\n", |
358 | | - "1. Your natural language query (e.g., `\"Woodhead Road\"`) is embedded using the NVIDIA model (`nvidia/llama-3.2-nv-embedqa-1b-v2`).\n", |
359 | | - "2. The vector store compares the query embedding to stored document embeddings in the field you configured (`embedding_key = \"vec_addr_descr_id\"`).\n", |
|     | 337 | + "1. Your natural language query (e.g., `\"Which hotels have the best infrastructure and very good service?\"`) is embedded using the NVIDIA model (`nvidia/llama-3.2-nv-embedqa-1b-v2`).\n",
360 |     | - "3. Results are sorted by vector similarity. Higher similarity = closer semantic meaning.\n",
|     | 338 | + "2. Results are sorted by vector similarity. Higher similarity = closer semantic meaning.\n",
361 | 339 | "\n", |
362 | 340 | "\n", |
|