|
10 | 10 | "\n", |
11 | 11 | "# <a id=\"top\">Monitoring quickstart</a>\n", |
12 | 12 | "\n", |
13 | | - "This notebook illustrates a typical monitoring flow using Openlayer.\n", |
| 13 | + "This notebook illustrates a typical monitoring flow using Openlayer. For more details, refer to the [How to set up monitoring guide](https://docs.openlayer.com/docs/set-up-monitoring) from the documentation.\n", |
14 | 14 | "\n", |
15 | 15 | "\n", |
16 | 16 | "## <a id=\"toc\">Table of contents</a>\n", |
17 | 17 | "\n", |
18 | | - "1. [**Creating a project and an inference pipeline**](#inference-pipeline) \n", |
| 18 | + "1. [**Creating a project and an inference pipeline**](#inference-pipeline) \n", |
19 | 19 | "\n", |
20 | | - "2. [**Uploading a reference dataset**](#reference-dataset)\n", |
| 20 | + "2. [**Publishing batches of production data**](#publish-batches)\n", |
21 | 21 | "\n", |
22 | | - "3. [**Publishing batches of production data**](#publish-batches)\n", |
| 22 | + "3. [(Optional) **Uploading a reference dataset**](#reference-dataset)\n", |
23 | 23 | "\n", |
| 24 | + "4. [(Optional) **Publishing ground truths**](#ground-truths)\n", |
24 | 25 | "\n", |
25 | | - "4. [**Publishing ground truths**](#ground-truths)" |
26 | | - ] |
27 | | - }, |
28 | | - { |
29 | | - "cell_type": "markdown", |
30 | | - "id": "c4ea849d", |
31 | | - "metadata": {}, |
32 | | - "source": [ |
33 | | - "## <a id=\"inference-pipeline\"> 1. Creating a project and an inference pipeline </a>\n", |
34 | | - "\n", |
35 | | - "[Back to top](#top)" |
| 26 | + "Before we start, let's download the sample data and import pandas." |
36 | 27 | ] |
37 | 28 | }, |
38 | 29 | { |
39 | 30 | "cell_type": "code", |
40 | 31 | "execution_count": null, |
41 | | - "id": "05f27b6c", |
| 32 | + "id": "3d193436", |
42 | 33 | "metadata": {}, |
43 | 34 | "outputs": [], |
44 | 35 | "source": [ |
45 | | - "!pip install openlayer" |
| 36 | + "%%bash\n", |
| 37 | + "\n", |
| 38 | + "if [ ! -e \"churn_train.csv\" ]; then\n", |
| 39 | + " curl \"https://openlayer-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/monitoring/churn_train.csv\" --output \"churn_train.csv\"\n", |
| 40 | + "fi\n", |
| 41 | + "\n", |
| 42 | + "if [ ! -e \"prod_data_no_ground_truths.csv\" ]; then\n", |
| 43 | + " curl \"https://openlayer-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/monitoring/prod_data_no_ground_truths.csv\" --output \"prod_data_no_ground_truths.csv\"\n", |
| 44 | + "fi\n", |
| 45 | + "\n", |
| 46 | + "if [ ! -e \"prod_ground_truths.csv\" ]; then\n", |
| 47 | + " curl \"https://openlayer-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/monitoring/prod_ground_truths.csv\" --output \"prod_ground_truths.csv\"\n", |
| 48 | + "fi" |
46 | 49 | ] |
47 | 50 | }, |
48 | 51 | { |
49 | 52 | "cell_type": "code", |
50 | 53 | "execution_count": null, |
51 | | - "id": "8504e063", |
| 54 | + "id": "9dce8f60", |
52 | 55 | "metadata": {}, |
53 | 56 | "outputs": [], |
54 | 57 | "source": [ |
55 | | - "from openlayer.tasks import TaskType\n", |
56 | | - "import openlayer\n", |
57 | | - "\n", |
58 | | - "client = openlayer.OpenlayerClient(\"YOUR_API_KEY_HERE\")\n", |
59 | | - "project = client.create_or_load_project(\n", |
60 | | - " name=\"Churn Prediction \",\n", |
61 | | - " task_type=TaskType.TabularClassification,\n", |
62 | | - ")" |
| 58 | + "import pandas as pd" |
63 | 59 | ] |
64 | 60 | }, |
65 | 61 | { |
66 | 62 | "cell_type": "markdown", |
67 | | - "id": "ed0c9bf6", |
68 | | - "metadata": {}, |
69 | | - "source": [ |
70 | | - "Now that you are authenticated and have a project on the platform, it's time to create an inference pipeline. Creating an inference pipeline is what enables the monitoring capabilities in a project." |
71 | | - ] |
72 | | - }, |
73 | | - { |
74 | | - "cell_type": "code", |
75 | | - "execution_count": null, |
76 | | - "id": "147b5294", |
| 63 | + "id": "c4ea849d", |
77 | 64 | "metadata": {}, |
78 | | - "outputs": [], |
79 | 65 | "source": [ |
80 | | - "inference_pipeline = project.create_inference_pipeline()\n", |
| 66 | + "## <a id=\"inference-pipeline\"> 1. Creating a project and an inference pipeline </a>\n", |
81 | 67 | "\n", |
82 | | - "# Or \n", |
83 | | - "# inference_pipeline = project.load_inference_pipeline(name=\"Production\")" |
| 68 | + "[Back to top](#top)" |
84 | 69 | ] |
85 | 70 | }, |
86 | 71 | { |
87 | 72 | "cell_type": "code", |
88 | 73 | "execution_count": null, |
89 | | - "id": "61e916c2", |
| 74 | + "id": "05f27b6c", |
90 | 75 | "metadata": {}, |
91 | 76 | "outputs": [], |
92 | 77 | "source": [ |
93 | | - "inference_pipeline" |
94 | | - ] |
95 | | - }, |
96 | | - { |
97 | | - "cell_type": "markdown", |
98 | | - "id": "39592b32", |
99 | | - "metadata": {}, |
100 | | - "source": [ |
101 | | - "## <a id=\"reference-dataset\"> 2. Uploading a reference dataset </a>\n", |
102 | | - "\n", |
103 | | - "[Back to top](#top)\n", |
104 | | - "\n", |
105 | | - "A reference dataset is optional, but it enables drift monitoring. Ideally, the reference dataset is a representative sample of the training set used to train the deployed model. In this section, we first load the dataset and then we upload it to Openlayer using the `upload_reference_dataframe` method.\n", |
106 | | - "\n", |
107 | | - "### <a id=\"download-reference\"> Downloading the data </a>" |
| 78 | + "!pip install openlayer" |
108 | 79 | ] |
109 | 80 | }, |
110 | 81 | { |
111 | 82 | "cell_type": "code", |
112 | 83 | "execution_count": null, |
113 | | - "id": "4b5be714", |
| 84 | + "id": "8504e063", |
114 | 85 | "metadata": {}, |
115 | 86 | "outputs": [], |
116 | 87 | "source": [ |
117 | | - "%%bash\n", |
| 88 | + "import openlayer\n", |
118 | 89 | "\n", |
119 | | - "if [ ! -e \"churn_train.csv\" ]; then\n", |
120 | | - " curl \"https://openlayer-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/monitoring/churn_train.csv\" --output \"churn_train.csv\"\n", |
121 | | - "fi" |
| 90 | + "client = openlayer.OpenlayerClient(\"YOUR_API_KEY_HERE\")" |
122 | 91 | ] |
123 | 92 | }, |
124 | 93 | { |
125 | 94 | "cell_type": "code", |
126 | 95 | "execution_count": null, |
127 | | - "id": "31809ca9", |
| 96 | + "id": "5377494b", |
128 | 97 | "metadata": {}, |
129 | 98 | "outputs": [], |
130 | 99 | "source": [ |
131 | | - "import pandas as pd\n", |
| 100 | + "from openlayer.tasks import TaskType\n", |
132 | 101 | "\n", |
133 | | - "training_set = pd.read_csv(\"./churn_train.csv\")" |
| 102 | + "project = client.create_project(\n", |
| 103 | + " name=\"Churn Prediction\",\n", |
| 104 | + " task_type=TaskType.TabularClassification,\n", |
| 105 | + ")" |
134 | 106 | ] |
135 | 107 | }, |
136 | 108 | { |
137 | 109 | "cell_type": "markdown", |
138 | | - "id": "a6336802", |
139 | | - "metadata": {}, |
140 | | - "source": [ |
141 | | - "### <a id=\"upload-reference\"> Uploading the dataset to Openlayer </a>" |
142 | | - ] |
143 | | - }, |
144 | | - { |
145 | | - "cell_type": "code", |
146 | | - "execution_count": null, |
147 | | - "id": "0f8e23e3", |
| 110 | + "id": "ed0c9bf6", |
148 | 111 | "metadata": {}, |
149 | | - "outputs": [], |
150 | 112 | "source": [ |
151 | | - "dataset_config = {\n", |
152 | | - " \"categoricalFeatureNames\": [\"Gender\", \"Geography\"],\n", |
153 | | - " \"classNames\": [\"Retained\", \"Exited\"],\n", |
154 | | - " \"featureNames\": [\n", |
155 | | - " \"CreditScore\", \n", |
156 | | - " \"Geography\",\n", |
157 | | - " \"Gender\",\n", |
158 | | - " \"Age\", \n", |
159 | | - " \"Tenure\",\n", |
160 | | - " \"Balance\",\n", |
161 | | - " \"NumOfProducts\",\n", |
162 | | - " \"HasCrCard\",\n", |
163 | | - " \"IsActiveMember\",\n", |
164 | | - " \"EstimatedSalary\",\n", |
165 | | - " \"AggregateRate\",\n", |
166 | | - " \"Year\"\n", |
167 | | - " ],\n", |
168 | | - " \"labelColumnName\": \"Exited\",\n", |
169 | | - " \"label\": \"training\"\n", |
170 | | - "}" |
| 113 | + "Now that you are authenticated and have a project on the platform, it's time to create an inference pipeline. Creating an inference pipeline is what enables the monitoring capabilities in a project." |
171 | 114 | ] |
172 | 115 | }, |
173 | 116 | { |
174 | 117 | "cell_type": "code", |
175 | 118 | "execution_count": null, |
176 | | - "id": "f6cf719f", |
| 119 | + "id": "147b5294", |
177 | 120 | "metadata": {}, |
178 | 121 | "outputs": [], |
179 | 122 | "source": [ |
180 | | - "inference_pipeline.upload_reference_dataframe(\n", |
181 | | - " dataset_df=training_set,\n", |
182 | | - " dataset_config=dataset_config\n", |
183 | | - ")" |
| 123 | + "inference_pipeline = project.create_inference_pipeline()" |
184 | 124 | ] |
185 | 125 | }, |
186 | 126 | { |
187 | 127 | "cell_type": "markdown", |
188 | 128 | "id": "3c8608ea", |
189 | 129 | "metadata": {}, |
190 | 130 | "source": [ |
191 | | - "## <a id=\"publish-batches\"> 3. Publishing batches of data </a>\n", |
| 131 | + "## <a id=\"publish-batches\"> 2. Publishing batches of production data </a>\n", |
192 | 132 | "\n", |
193 | 133 | "[Back to top](#top)\n", |
194 | 134 | "\n", |
195 | 135 | "In production, as the model makes predictions, the data can be published to Openlayer. This is done with the `publish_batch_data` method. \n", |
196 | 136 | "\n", |
197 | | - "The data published to Openlayer can have a column with **inference ids** and another with **timestamps** (UNIX ms format). These are both optional and, if not provided, will receive default values. The inference id is particularly important if you wish to publish ground truths at a later time. " |
198 | | - ] |
199 | | - }, |
200 | | - { |
201 | | - "cell_type": "markdown", |
202 | | - "id": "e83afd5b", |
203 | | - "metadata": {}, |
204 | | - "source": [ |
205 | | - "### <a id=\"download-batches\"> Download the data </a>" |
206 | | - ] |
207 | | - }, |
208 | | - { |
209 | | - "cell_type": "code", |
210 | | - "execution_count": null, |
211 | | - "id": "80a5462e", |
212 | | - "metadata": {}, |
213 | | - "outputs": [], |
214 | | - "source": [ |
215 | | - "%%bash\n", |
216 | | - "\n", |
217 | | - "if [ ! -e \"prod_data_no_ground_truths.csv\" ]; then\n", |
218 | | - " curl \"https://openlayer-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/monitoring/prod_data_no_ground_truths.csv\" --output \"prod_data_no_ground_truths.csv\"\n", |
219 | | - "fi" |
| 137 | + "The data published to Openlayer can have a column with **inference ids** and another with **timestamps** (UNIX sec format). These are both optional and, if not provided, will receive default values. The inference id is particularly important if you wish to publish ground truths at a later time. " |
220 | 138 | ] |
221 | 139 | }, |
222 | 140 | { |
|
244 | 162 | { |
245 | 163 | "cell_type": "code", |
246 | 164 | "execution_count": null, |
247 | | - "id": "cc126446", |
| 165 | + "id": "25b66229", |
248 | 166 | "metadata": {}, |
249 | 167 | "outputs": [], |
250 | 168 | "source": [ |
|
332 | 250 | }, |
333 | 251 | { |
334 | 252 | "cell_type": "markdown", |
335 | | - "id": "fbc1fca3", |
| 253 | + "id": "d00f6e8e", |
336 | 254 | "metadata": {}, |
337 | 255 | "source": [ |
338 | | - "## <a id=\"ground-truths\"> 4. Publishing ground truths for past batches </a>\n", |
| 256 | + "**That's it!** You're now able to set up goals and alerts for your production data. The next sections are optional and enable some features on the platform." |
| 257 | + ] |
| 258 | + }, |
| 259 | + { |
| 260 | + "cell_type": "markdown", |
| 261 | + "id": "39592b32", |
| 262 | + "metadata": {}, |
| 263 | + "source": [ |
| 264 | + "## <a id=\"reference-dataset\"> 3. Uploading a reference dataset </a>\n", |
339 | 265 | "\n", |
340 | 266 | "[Back to top](#top)\n", |
341 | 267 | "\n", |
342 | | - "The `publish_ground_truths` method can be used to update the ground truths for batches of data already published to the Openlayer platform. The inference id is what gets used to merge the ground truths with the corresponding rows." |
| 268 | + "A reference dataset is optional, but it enables drift monitoring. Ideally, the reference dataset is a representative sample of the training set used to train the deployed model. In this section, we first load the dataset and then upload it to Openlayer using the `upload_reference_dataframe` method." |
| 269 | + ] |
| 270 | + }, |
| 271 | + { |
| 272 | + "cell_type": "code", |
| 273 | + "execution_count": null, |
| 274 | + "id": "31809ca9", |
| 275 | + "metadata": {}, |
| 276 | + "outputs": [], |
| 277 | + "source": [ |
| 278 | + "training_set = pd.read_csv(\"./churn_train.csv\")" |
343 | 279 | ] |
344 | 280 | }, |
345 | 281 | { |
346 | 282 | "cell_type": "markdown", |
347 | | - "id": "95bc9594", |
| 283 | + "id": "a6336802", |
348 | 284 | "metadata": {}, |
349 | 285 | "source": [ |
350 | | - "### <a id=\"download-truth\"> Download the data </a>" |
| 286 | + "### <a id=\"upload-reference\"> Uploading the dataset to Openlayer </a>" |
351 | 287 | ] |
352 | 288 | }, |
353 | 289 | { |
354 | 290 | "cell_type": "code", |
355 | 291 | "execution_count": null, |
356 | | - "id": "9ab9790c", |
| 292 | + "id": "0f8e23e3", |
357 | 293 | "metadata": {}, |
358 | 294 | "outputs": [], |
359 | 295 | "source": [ |
360 | | - "%%bash\n", |
| 296 | + "dataset_config = {\n", |
| 297 | + " \"categoricalFeatureNames\": [\"Gender\", \"Geography\"],\n", |
| 298 | + " \"classNames\": [\"Retained\", \"Exited\"],\n", |
| 299 | + " \"featureNames\": [\n", |
| 300 | + " \"CreditScore\", \n", |
| 301 | + " \"Geography\",\n", |
| 302 | + " \"Gender\",\n", |
| 303 | + " \"Age\", \n", |
| 304 | + " \"Tenure\",\n", |
| 305 | + " \"Balance\",\n", |
| 306 | + " \"NumOfProducts\",\n", |
| 307 | + " \"HasCrCard\",\n", |
| 308 | + " \"IsActiveMember\",\n", |
| 309 | + " \"EstimatedSalary\",\n", |
| 310 | + " \"AggregateRate\",\n", |
| 311 | + " \"Year\"\n", |
| 312 | + " ],\n", |
| 313 | + " \"labelColumnName\": \"Exited\",\n", |
| 314 | + " \"label\": \"training\"\n", |
| 315 | + "}" |
| 316 | + ] |
| 317 | + }, |
| 318 | + { |
| 319 | + "cell_type": "code", |
| 320 | + "execution_count": null, |
| 321 | + "id": "f6cf719f", |
| 322 | + "metadata": {}, |
| 323 | + "outputs": [], |
| 324 | + "source": [ |
| 325 | + "inference_pipeline.upload_reference_dataframe(\n", |
| 326 | + " dataset_df=training_set,\n", |
| 327 | + " dataset_config=dataset_config\n", |
| 328 | + ")" |
| 329 | + ] |
| 330 | + }, |
| 331 | + { |
| 332 | + "cell_type": "markdown", |
| 333 | + "id": "fbc1fca3", |
| 334 | + "metadata": {}, |
| 335 | + "source": [ |
| 336 | + "## <a id=\"ground-truths\"> 4. Publishing ground truths for past batches </a>\n", |
361 | 337 | "\n", |
362 | | - "if [ ! -e \"prod_ground_truths.csv\" ]; then\n", |
363 | | - " curl \"https://openlayer-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/monitoring/prod_ground_truths.csv\" --output \"prod_ground_truths.csv\"\n", |
364 | | - "fi" |
| 338 | + "[Back to top](#top)\n", |
| 339 | + "\n", |
| 340 | + "Ground truths are needed to create Performance goals. The `publish_ground_truths` method can be used to update the ground truths for batches of data already published to the Openlayer platform. The inference id is used to merge the ground truths with the corresponding rows." |
365 | 341 | ] |
366 | 342 | }, |
367 | 343 | { |
|