add content to batch example (jaybaird#29)

FelipeAdachi · web-flow · commit da4cce12e990 · 2023-05-17T06:10:41.000-07:00
diff --git a/langkit/examples/Batch_to_Whylabs.ipynb b/langkit/examples/Batch_to_Whylabs.ipynb
@@ -9,25 +9,52 @@
     ">*Did you know you can store, visualize, and monitor language model profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=github&utm_medium=referral&utm_campaign=langkit)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=github&utm_medium=referral&utm_campaign=langkit) to leverage the power of LangKit and WhyLabs together!*"
    ]
   },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Logging and Monitoring Text Metrics for LLMs with LangKit and WhyLabs"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/LanguageToolkit/blob/main/langkit/examples/Batch_to_Whylabs.ipynb)"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In this example, we'll show how you can generate out-of-the-box text metrics using LangKit and whylogs, and then log and monitor them in the WhyLabs Observability Platform.\n",
+    "\n",
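+    "With LangKit, you'll be able to extract relevant signals from unstructured text data, such as text statistics, sentiment scores, regex pattern matches, and similarity to known jailbreak and refusal themes.\n",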
+    "\n"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Loading the Dataset - Chatbot prompts\n",
+    "\n",
+    "Let's first download a Hugging Face dataset containing prompts and responses from a chatbot. We'll generate text metrics for the prompts and responses, and then log them to WhyLabs."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "from datasets import load_dataset\n",
-    "from whylogs.experimental.core.metrics.udf_metric import generate_udf_schema\n",
-    "from whylogs.core.schema import DeclarativeSchema\n",
-    "\n",
-    "from langkit.sentiment import *\n",
-    "from langkit.textstat import *\n",
-    "from langkit.regexes import *\n",
-    "from langkit.themes import *\n",
-    "\n",
-    "print(\"downloading models and initialized metrics...\")\n",
-    "text_schema = DeclarativeSchema(generate_udf_schema())\n",
     "print(\"initialize hugging face archived chat prompt/response dataset...\")\n",
-    "archived_chats = load_dataset('alespalla/chatbot_instruction_prompts', split=\"test\", streaming=True)\n"
+    "archived_chats = load_dataset('alespalla/chatbot_instruction_prompts', split=\"test\", streaming=True)"
    ]
   },
   {
@@ -72,6 +99,51 @@
     "print(\"Using API Key ID: \", os.environ[\"WHYLABS_API_KEY\"][0:10])"
    ]
   },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Initializing Metrics from LangKit\n",
+    "\n",
+    "In order to calculate the text metrics, we simply need to import the relevant modules from `LangKit`. In this case, we will calculate metrics using the following modules:\n",
+    "\n",
+    "- textstat: text statistics such as scores for readability, complexity, and grade\n",
+    "- sentiment: sentiment scores\n",
+    "- regexes: label text according to user-defined regex pattern groups\n",
+    "- themes: compute sentence similarity scores against two groups of examples: a) known jailbreak attempts and b) LLM refusal-of-service responses\n",
+    "\n",
+    "After importing the modules, we can generate a schema that will inform whylogs of the metrics we want to calculate. We can then use this schema to log our data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from whylogs.experimental.core.metrics.udf_metric import generate_udf_schema\n",
+    "from whylogs.core.schema import DeclarativeSchema\n",
+    "\n",
+    "from langkit.sentiment import *\n",
+    "from langkit.textstat import *\n",
+    "from langkit.regexes import *\n",
+    "from langkit.themes import *\n",
+    "\n",
+    "print(\"downloading models and initializing metrics...\")\n",
+    "text_schema = DeclarativeSchema(generate_udf_schema())\n"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Profiling and Writing to WhyLabs - Single Example\n",
+    "\n",
+    "The following code block will log a single prompt/response pair. The resulting profile will then be sent over to your dashboard at WhyLabs."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -100,6 +172,16 @@
     "print()\n"
    ]
   },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Profiling and Writing to WhyLabs - Multiple Batches\n",
+    "\n",
+    "Let's move closer to a real-world scenario. If you have an LLM-powered system, you'll want to monitor your text inputs and outputs in a streaming fashion. Here, we'll simulate a streaming scenario by iterating through the examples and logging them in daily batches. Let's say we have 7 days' worth of data, with 10 examples per day."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -133,6 +215,16 @@
     "  print()\n",
     "print(\"Done. Go see your metrics on the WhyLabs dashboard!\")"
    ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "And that's it! You can now go to your WhyLabs dashboard and explore the profiles for the past 7 days.\n",
+    "\n",
+    "Feel free to play around with the code and the metrics. You can inject anomalies manually to see how the metrics change, or you can set up monitors and alerts in the WhyLabs dashboard."
+   ]
   }
  ],
  "metadata": {