docs: Qdrant RM example

Anush008 · Anush008 · commit b0b03084a1dc · 2024-05-18T22:43:01.000+05:30
diff --git a/dspy/retrieve/qdrant_rm.py b/dspy/retrieve/qdrant_rm.py
@@ -108,7 +108,7 @@ def forward(self, query_or_queries: Union[str, list[str]], k: Optional[int] = No
         # Wrap each sorted passage in a dotdict with 'long_text'
         return [dotdict({"long_text": passage}) for passage, _ in sorted_passages]
 
-    def _get_first_vector_name(self) -> str | None:
+    def _get_first_vector_name(self) -> Optional[str]:
         vectors = self._client.get_collection(self._collection_name).config.params.vectors
 
         if not isinstance(vectors, dict):
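
Note on the annotation change above: on Python versions before 3.10, `str | None` in a signature is evaluated when the function is defined and raises `TypeError` unless annotation evaluation is deferred, whereas `Optional[str]` works everywhere. A minimal sketch of the portable spelling follows; `first_vector_name` is a hypothetical stand-in written for illustration, not the library's actual method body.

```python
from typing import Optional, Union, get_type_hints


def first_vector_name(vectors: object) -> Optional[str]:
    """Hypothetical stand-in: first key of a named-vector mapping, else None."""
    if not isinstance(vectors, dict):
        # A single unnamed vector configuration has no name to return.
        return None
    return next(iter(vectors), None)


# Optional[str] is shorthand for Union[str, None].
assert Optional[str] == Union[str, None]
assert get_type_hints(first_vector_name)["return"] == Optional[str]

print(first_vector_name({"dense": object(), "sparse": object()}))
print(first_vector_name(None))
```
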
diff --git a/examples/integrations/qdrant/qdrant_retriever_example.ipynb b/examples/integrations/qdrant/qdrant_retriever_example.ipynb
@@ -0,0 +1,302 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "# DSPy Retriever using Qdrant\n",
+        "\n",
+        "This notebook walks you through using Qdrant as a retriever in DSPy. We'll load a dataset into Qdrant and then retrieve relevant context from it through the DSPy retriever."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "#### Setup"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "%pip install dspy-ai[qdrant]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "#### Configure Constants"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "This notebook assumes that you have a Qdrant instance running at http://localhost:6333/. To learn more about setting up Qdrant, refer to the [quickstart guide](https://qdrant.tech/documentation/quick-start/)."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 1,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "COLLECTION_NAME = \"DBPEDIA-DSPY\"\n",
+        "QDRANT_URL = \"http://localhost:6333\""
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "### Ingesting data\n",
+        "\n",
+        "We'll load the [Qdrant/dbpedia-entities-openai3-text-embedding-3-small-1536-100K](https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-small-1536-100K) dataset, which contains text from DBpedia along with embeddings pre-computed using OpenAI's `text-embedding-3-small` model."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "%pip install datasets"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "from datasets import load_dataset\n",
+        "\n",
+        "# We will use a small subset of the dataset\n",
+        "dataset = (\n",
+        "    load_dataset(\n",
+        "        \"Qdrant/dbpedia-entities-openai3-text-embedding-3-small-1536-100K\",\n",
+        "        streaming=True,\n",
+        "        split=\"train\",\n",
+        "    )\n",
+        "    .take(1000)\n",
+        "    .remove_columns([\"openai\", \"combined_text\"])\n",
+        ")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Set up a client that points to a Qdrant instance at http://localhost:6333/."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "from qdrant_client import QdrantClient\n",
+        "\n",
+        "client = QdrantClient(url=QDRANT_URL)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "We [create a collection](https://qdrant.tech/documentation/concepts/collections/#create-a-collection) with the appropriate dimensions and distance metric to load our dataset into."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 4,
+      "metadata": {},
+      "outputs": [
+        {
+          "data": {
+            "text/plain": [
+              "True"
+            ]
+          },
+          "execution_count": 4,
+          "metadata": {},
+          "output_type": "execute_result"
+        }
+      ],
+      "source": [
+        "from qdrant_client import models\n",
+        "\n",
+        "client.create_collection(\n",
+        "    collection_name=COLLECTION_NAME,\n",
+        "    vectors_config=models.VectorParams(\n",
+        "        size=1536,\n",
+        "        distance=models.Distance.COSINE,\n",
+        "    ),\n",
+        ")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "We can now load the dataset to be indexed in Qdrant. The `upload_collection` method accepts arguments to configure the batch size and parallelism. We'll go with the defaults."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 5,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "entries = list(dataset)  # materialize the stream once so the pops below persist\n",
+        "vectors = [entry.pop(\"text-embedding-3-small-1536-embedding\") for entry in entries]\n",
+        "client.upload_collection(collection_name=COLLECTION_NAME, vectors=vectors, payload=entries)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "The loading is now complete. You can browse through the entries at http://localhost:6333/dashboard."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "#### Initialize Qdrant retriever and OpenAI vectorizer"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "The Qdrant retriever allows us to configure the vectorizer to use. We'll use the `OpenAIVectorizer` with the `text-embedding-3-small` model, matching the model that produced the embeddings in our dataset.\n",
+        "\n",
+        "We can also specify the field in our Qdrant payload that holds the document content. In our case, it's `\"text\"`, based on the dataset we loaded."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "os.environ[\"OPENAI_API_KEY\"] = \"<YOUR_OPENAI_API_KEY>\""
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 3,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "from dsp.modules.sentence_vectorizer import OpenAIVectorizer\n",
+        "\n",
+        "vectorizer = OpenAIVectorizer(model=\"text-embedding-3-small\")"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 4,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "from dspy.retrieve.qdrant_rm import QdrantRM\n",
+        "\n",
+        "qdrant_retriever = QdrantRM(\n",
+        "    qdrant_client=client,\n",
+        "    qdrant_collection_name=COLLECTION_NAME,\n",
+        "    vectorizer=vectorizer,\n",
+        "    document_field=\"text\",\n",
+        ")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "With the `qdrant_retriever` instantiated, we can configure `dspy` to use it."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 5,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "import dspy\n",
+        "\n",
+        "dspy.settings.configure(rm=qdrant_retriever)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "### Trying out the retriever\n",
+        "\n",
+        "We can use the `dspy.Retrieve` class to query our retriever, similar to how it's done in DSPy RAG pipelines."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 6,
+      "metadata": {},
+      "outputs": [
+        {
+          "data": {
+            "text/plain": [
+              "Prediction(\n",
+              "    passages=['CounterSpy is a proprietary spyware removal program for Microsoft Windows software developed by Sunbelt Software.', 'In computing, the diff utility is a data comparison tool that calculates and displays the differences between two files. Unlike edit distance notions used for other purposes, diff is line-oriented rather than character-oriented, but it is like Levenshtein distance in that it tries to determine the smallest set of deletions and insertions to create one file from the other.', \"AudioDesk is an audio workstation application by Mark of the Unicorn (MOTU) for the Mac OS. It is a multi-track recording, editing, and mixing application, with both offline file-based processing and realtime effects. It is a more basic version of MOTU's Digital Performer  DAW software. Much of the graphical user interface (GUI) and its operation are similar to Digital Performer, although it lacks some of Digital Performer's features.\"]\n",
+              ")"
+            ]
+          },
+          "execution_count": 6,
+          "metadata": {},
+          "output_type": "execute_result"
+        }
+      ],
+      "source": [
+        "retrieve = dspy.Retrieve()\n",
+        "\n",
+        "retrieve(\"Some computer programs.\")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "We are able to successfully retrieve results relevant to the query from our Qdrant collection."
+      ]
+    }
+  ],
+  "metadata": {
+    "kernelspec": {
+      "display_name": "Python 3",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.10.13"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 2
+}
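
Note on ranking: the collection in the notebook is created with `models.Distance.COSINE`, so results are scored by the cosine of the angle between the query embedding and each stored vector. The sketch below illustrates that scoring in plain Python, with toy 3-dimensional vectors standing in for the 1536-dimensional embeddings. The names `cosine_similarity` and `rank`, and the sample data, are illustrative assumptions and not part of Qdrant or DSPy.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query: list[float], docs: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the ids of the k documents most similar to the query."""
    ordered = sorted(
        docs,
        key=lambda doc_id: cosine_similarity(query, docs[doc_id]),
        reverse=True,  # highest similarity first
    )
    return ordered[:k]


# Toy vectors; real entries would be 1536-dimensional embeddings.
docs = {
    "diff": [0.9, 0.1, 0.0],
    "audio": [0.1, 0.9, 0.0],
    "other": [0.0, 0.0, 1.0],
}
print(rank([1.0, 0.0, 0.0], docs, k=2))
```

In practice Qdrant performs this search server-side over an index, so the loop above is only meant to show what the distance setting means.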