add new batch creation method, clean up notebook (#1231)

ezekielemerson · zeke-emerson · web-flow · commit 6a1489649732 · 2023-09-08T22:03:07.000+02:00
Co-authored-by: ezekielemerson <eemerson2325@gmail.com>
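
Reviewer note: the notebook text added in this change says that `project.create_batches()` splits its input into groups of at most 100k data rows and names each batch with `name_prefix` plus a 4-digit suffix starting at `0000`. The sketch below only illustrates that described behavior in plain Python. It does not call the Labelbox SDK, and `plan_batches` and `MAX_BATCH_SIZE` are hypothetical names introduced here, not part of the library.

```python
from typing import List, Sequence, Tuple

# Maximum batch size as stated in the notebook text (assumption: 100k).
MAX_BATCH_SIZE = 100_000


def plan_batches(
    global_keys: Sequence[str],
    name_prefix: str,
    max_batch_size: int = MAX_BATCH_SIZE,
) -> List[Tuple[str, List[str]]]:
    """Hypothetical helper: chunk keys and name each chunk.

    Mirrors the behavior the notebook describes: chunks of at most
    `max_batch_size` keys, each named prefix + 4-digit suffix from 0000.
    It does not contact the Labelbox API.
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be positive")
    plan = []
    for index, start in enumerate(range(0, len(global_keys), max_batch_size)):
        chunk = list(global_keys[start:start + max_batch_size])
        plan.append((f"{name_prefix}{index:04d}", chunk))
    return plan


# 250k made-up keys produce three planned batches.
keys = [f"key-{i}" for i in range(250_000)]
plan = plan_batches(keys, "demo-create-batches-")
for name, chunk in plan:
    print(name, len(chunk))
```

With 250k keys this yields three planned batches, matching the three-name example in the notebook. The real method may differ in details such as error handling for name collisions.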
diff --git a/examples/basics/batches.ipynb b/examples/basics/batches.ipynb
@@ -30,29 +30,28 @@
     {
       "metadata": {},
       "source": [
-        "## Batches\n",
+        "# Batches\n",
         "https://docs.labelbox.com/docs/batches"
       ],
       "cell_type": "markdown"
     },
     {
       "metadata": {},
       "source": [
-        "* A Batch is collection of datarows picked out of a Data Set.\n",
-        "* A Datarow cannot be part of more than one batch in a project.\n",
-        "* Batches work for all data types, but there should only be one data type per batch.\n",
-        "* Batches may not be shared between projects.\n",
-        "* Batches may have Datarows from multiple Datasets.\n",
-        "* Datarows can only be attached to a Project as part of a single Batch.\n",
-        "* Currently only benchmarks quality settings is supported in batch projects\n",
-        "* You can set priority for each Batch."
+        "* A batch is a collection of data rows.\n",
+        "* A data row cannot be part of more than one batch in a given project.\n",
+        "* Batches work for all data types, but there can only be one data type per project.\n",
+        "* Batches cannot be shared between projects.\n",
+        "* Batches may have data rows from multiple datasets.\n",
+        "* Currently, only the benchmarks quality setting is supported in batch projects.\n",
+        "* You can set the priority for each batch."
       ],
       "cell_type": "markdown"
     },
     {
       "metadata": {},
       "source": [
-        "!pip install \"labelbox[data]\""
+        "!pip install \"labelbox[data]\" -q"
       ],
       "cell_type": "code",
       "outputs": [],
@@ -72,22 +71,29 @@
     {
       "metadata": {},
       "source": [
-        "# API Key and Client\n",
-        "Provide a valid api key below in order to properly connect to the Labelbox Client."
+        "## API key and client\n",
+        "Provide a valid API key below in order to properly connect to the Labelbox Client."
       ],
       "cell_type": "markdown"
     },
     {
       "metadata": {},
       "source": [
-        "# Add your api key\n",
-        "API_KEY = None\n",
+        "# Add your API key\n",
+        "API_KEY = \"\"\n",
         "client = lb.Client(api_key=API_KEY)"
       ],
       "cell_type": "code",
       "outputs": [],
       "execution_count": null
     },
+    {
+      "metadata": {},
+      "source": [
+        "## Create a dataset and data rows"
+      ],
+      "cell_type": "markdown"
+    },
     {
       "metadata": {},
       "source": [
@@ -108,90 +114,65 @@
         "print(\"RESULT URL: \", data_rows.result_url)"
       ],
       "cell_type": "code",
-      "outputs": [
-        {
-          "name": "stderr",
-          "output_type": "stream",
-          "text": [
-            "WARNING:labelbox.schema.task:There are errors present. Please look at `task.errors` for more details\n"
-          ]
-        },
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "ERRORS:  []\n",
-            "RESULT URL:  https://storage.labelbox.com/cl3ahv73w1891087qbwzs3edd%2Fdata-row-imports-results%2Fcl94vbi4g4ijw07y07shadc7k_cl94vbjcv1dh707y2f2g4cwh4.json?Expires=1665619363366&KeyName=labelbox-assets-key-3&Signature=VJOqZZUjnnT4s45on3zzYdcagOs\n"
-          ]
-        }
-      ],
+      "outputs": [],
       "execution_count": null
     },
     {
       "metadata": {},
       "source": [
-        "# Setup batch project"
+        "## Setup batch project"
       ],
       "cell_type": "markdown"
     },
     {
       "metadata": {},
       "source": [
-        "project = client.create_project( name=\"Demo-Batches-Project\",                                 \n",
-        "                                  media_type=lb.MediaType.Image\n",
-        "                                )\n",
-        "print(\"Project Name:\", project.name ,\n",
-        "      \" Project Id:\", project.uid  )"
+        "project = client.create_project(\n",
+        "  name=\"Demo-Batches-Project\",\n",
+        "  media_type=lb.MediaType.Image\n",
+        ")\n",
+        "print(\"Project Name: \", project.name, \"Project ID: \", project.uid)"
       ],
       "cell_type": "code",
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "Project Name: Demo-Batches-Project  Project Id: cl94vbpr849gg08ytd6rd423x\n"
-          ]
-        }
-      ],
+      "outputs": [],
       "execution_count": null
     },
     {
       "metadata": {},
       "source": [
-        "### Select all data rows from the dataset created earlier that will be added to the batch.\n"
+        "## Create batches"
       ],
       "cell_type": "markdown"
     },
     {
       "metadata": {},
       "source": [
-        "data_row_ids = [dr.uid for dr in dataset.export_data_rows()]\n",
-        "print(\"Number of data row ids:\", len(data_row_ids))"
+        "### Select all data rows from the dataset\n"
       ],
-      "cell_type": "code",
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "Number of data row ids: 8\n"
-          ]
-        }
+      "cell_type": "markdown"
+    },
+    {
+      "metadata": {},
+      "source": [
+        "global_keys = [data_row.global_key for data_row in dataset.export_data_rows()]\n",
+        "print(\"Number of global keys:\", len(global_keys))"
       ],
+      "cell_type": "code",
+      "outputs": [],
       "execution_count": null
     },
     {
       "metadata": {},
       "source": [
-        "## Select a random sample\n",
+        "### Select a random sample\n",
         "This method is useful if you have large datasets and only want to work with a handful of data rows"
       ],
       "cell_type": "markdown"
     },
     {
       "metadata": {},
       "source": [
-        "sample = random.sample(data_row_ids, 4)"
+        "sample = random.sample(global_keys, 4)"
       ],
       "cell_type": "code",
       "outputs": [],
@@ -200,83 +181,140 @@
     {
       "metadata": {},
       "source": [
-        "# Batch Manipulation"
+        "### Create a batch\n",
+        "This method accepts either a list of data row IDs or `DataRow` objects through the `data_rows` argument, or a list of global keys through the `global_keys` argument. The two arguments cannot be combined in the same call."
       ],
       "cell_type": "markdown"
     },
     {
       "metadata": {},
       "source": [
-        "### Create a Batch:\n"
+        "batch = project.create_batch(\n",
+        "  name=\"Demo-First-Batch\", # Each batch in a project must have a unique name\n",
+        "  global_keys=sample, # A list of global keys (use data_rows for data row IDs or DataRow objects)\n",
+        "  priority=5 # Priority between 1 (highest) and 5 (lowest)\n",
+        ")\n",
+        "# number of data rows in the batch\n",
+        "print(\"Number of data rows in batch: \", batch.size)"
+      ],
+      "cell_type": "code",
+      "outputs": [],
+      "execution_count": null
+    },
+    {
+      "metadata": {},
+      "source": [
+        "### Create multiple batches\n",
+        "The `project.create_batches()` method accepts up to 1 million data rows. If necessary, the data rows are chunked into batches of up to 100k, which is the maximum batch size.\n",
+        "\n",
+        "This method accepts either a list of data row IDs or `DataRow` objects through the `data_rows` argument, or a list of global keys through the `global_keys` argument; the two cannot be combined in the same call. Batches will be created with the specified `name_prefix` argument and a unique suffix to ensure unique batch names. The suffix will be a 4-digit number starting at `0000`.\n",
+        "\n",
+        "For example, if the name prefix is `demo-create-batches-` and three batches are created, the names will be `demo-create-batches-0000`, `demo-create-batches-0001`, and `demo-create-batches-0002`. This method will throw an error if a batch with the same name already exists.\n",
+        "\n",
+        "In the code below, only one batch will be created, since we are only using the few data rows we created above. Creating over 100k data rows for this demonstration is not sensible, but this method is the preferred approach for batch creation as it will gracefully handle massive sets of data rows."
       ],
       "cell_type": "markdown"
     },
     {
       "metadata": {},
       "source": [
-        "batch = project.create_batch(\n",
-        "  \"Demo-First-Batch\", # Each batch in a project must have a unique name\n",
-        "  sample, # A list of data rows or data row ids\n",
-        "  5 # priority between 1(Highest) - 5(lowest)\n",
+        "# First, we must create a second project so that we can re-use the data rows we already created.\n",
+        "second_project = client.create_project(\n",
+        "  name=\"Second-Demo-Batches-Project\",\n",
+        "  media_type=lb.MediaType.Image\n",
         ")\n",
-        "# number of data rows in the batch\n",
-        "print(\"Number of data rows in batch: \", batch.size)"
+        "print(\"Project Name: \", second_project.name, \"Project ID: \", second_project.uid)\n",
+        "\n",
+        "# Then, use the method that will create multiple batches if necessary.\n",
+        "task = second_project.create_batches(\n",
+        "  name_prefix=\"demo-create-batches-\",\n",
+        "  global_keys=global_keys,\n",
+        "  priority=5\n",
+        ")\n",
+        "\n",
+        "print(\"Errors: \", task.errors())\n",
+        "print(\"Result: \", task.result())"
       ],
       "cell_type": "code",
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "Number of data rows in batch:  4\n"
-          ]
-        }
+      "outputs": [],
+      "execution_count": null
+    },
+    {
+      "metadata": {},
+      "source": [
+        "### Create batches from a dataset\n",
+        "\n",
+        "If you wish to create batches in a project using all the data rows of a dataset, instead of gathering global keys or IDs and using subsets of data rows, you can use the `project.create_batches_from_dataset()` method. This method takes in a dataset ID and creates a batch (or batches if there are more than 100k data rows) comprising all data rows not already in the project.\n",
+        "\n",
+        "The same logic applies to the `name_prefix` argument and the naming of batches as described in the section immediately above."
+      ],
+      "cell_type": "markdown"
+    },
+    {
+      "metadata": {},
+      "source": [
+        "# First, we must create a third project so that we can re-use the data rows we already created.\n",
+        "third_project = client.create_project(\n",
+        "  name=\"Third-Demo-Batches-Project\",\n",
+        "  media_type=lb.MediaType.Image\n",
+        ")\n",
+        "print(\"Project Name: \", third_project.name, \"Project ID: \", third_project.uid)\n",
+        "\n",
+        "# Then, use the method to create batches from a dataset.\n",
+        "task = third_project.create_batches_from_dataset(\n",
+        "    name_prefix=\"demo-batches-from-dataset-\",\n",
+        "    dataset_id=dataset.uid,\n",
+        "    priority=5\n",
+        ")\n",
+        "\n",
+        "print(\"Errors: \", task.errors())\n",
+        "print(\"Result: \", task.result())"
       ],
+      "cell_type": "code",
+      "outputs": [],
       "execution_count": null
     },
     {
       "metadata": {},
       "source": [
-        "### Manage Batches\n",
-        "Note: You can view your batch data through the *Data Rows tab*"
+        "## Manage batches\n",
+        "Note: You can view your batch data through the **Data Rows** tab."
       ],
       "cell_type": "markdown"
     },
     {
       "metadata": {},
       "source": [
-        "## Export the data row ids\n",
+        "### View batches"
+      ],
+      "cell_type": "markdown"
+    },
+    {
+      "metadata": {},
+      "source": [
+        "## Export the data row IDs\n",
         "data_rows = [dr for dr in batch.export_data_rows()]\n",
-        "print(\"Data Rows in Batch: \", data_rows)\n",
+        "print(\"Data rows in batch: \", data_rows)\n",
         "\n",
         "## List the batches in your project\n",
         "for batch in project.batches():\n",
-        "    print(\"Batch Name: \", batch.name , \"  Batch ID:\", batch.uid)\n"
+        "    print(\"Batch name: \", batch.name , \"  Batch ID:\", batch.uid)\n"
       ],
       "cell_type": "code",
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "Data Rows in Batch:  [<DataRow ID: cl94vbjjn0wb8075i74pcb54v>, <DataRow ID: cl94vbjjn0wb0075i9i542qtp>, <DataRow ID: cl94vbjjn0waw075i11rser6b>, <DataRow ID: cl94vbjjn0was075igz3789ff>]\n",
-            "Batch Name:  Demo-First-Batch   Batch ID: 39f3fb00-49c1-11ed-ad8c-4b0085ccfe8b\n"
-          ]
-        }
-      ],
+      "outputs": [],
       "execution_count": null
     },
     {
       "metadata": {},
       "source": [
-        "# Archive Batch"
+        "### Archive a batch"
       ],
       "cell_type": "markdown"
     },
     {
       "metadata": {},
       "source": [
-        "# archiving batch removes all queued data rows from the project\n",
+        "# Archiving a batch removes all queued data rows in the batch from the project\n",
         "batch.remove_queued_data_rows()"
       ],
       "cell_type": "code",
@@ -287,7 +325,7 @@
       "metadata": {},
       "source": [
         "## Clean up \n",
-        "Uncomment and run the cell below to delete the batch, dataset and/or project created in this demo"
+        "Uncomment and run the cell below to optionally delete the batch, dataset, and/or project created in this demo."
       ],
       "cell_type": "markdown"
     },
@@ -308,4 +346,4 @@
       "execution_count": null
     }
   ]
-}
+}