|
7 | 7 | # MAGIC %md |
8 | 8 | # MAGIC #### Pre-requisites |
9 | 9 | # MAGIC 1. This tutorial notebook requires a Labelbox API key. Please log in to your [Labelbox Account](app.labelbox.com) and generate an [API Key](https://app.labelbox.com/account/api-keys). |
10 | | -# MAGIC 2. A few cells below will install the Labelbox SDK and Connector Library. This install is notebook-scoped and will not affect the rest of your cluster. |
11 | | -# MAGIC 3. Please make sure you are running at least the latest LTS version of Databricks. |
12 | | -# MAGIC |
| 10 | +# MAGIC 2. A few cells below will install the Labelbox SDK and Connector Library. This install is notebook-scoped and will not affect the rest of your cluster. |
| 11 | +# MAGIC 3. Please make sure you are running at least the latest LTS version of Databricks. |
| 12 | +# MAGIC |
13 | 13 | # MAGIC #### Notebook Preview |
14 | | -# MAGIC This notebook will guide you through these steps: |
15 | | -# MAGIC 1. Connect to Labelbox via the SDK |
| 14 | +# MAGIC This notebook will guide you through these steps: |
| 15 | +# MAGIC 1. Connect to Labelbox via the SDK |
16 | 16 | # MAGIC 2. Create a labeling dataset from a table of unstructured data in Databricks |
17 | 17 | # MAGIC 3. Programmatically set up an ontology and labeling project in Labelbox |
18 | | -# MAGIC 4. Load Bronze and Silver annotation tables from an example labeled project |
19 | | -# MAGIC 5. Additional cells describe how to handle video annotations and use Labelbox Diagnostics and Catalog |
20 | | -# MAGIC |
| 18 | +# MAGIC 4. Load Bronze and Silver annotation tables from an example labeled project |
| 19 | +# MAGIC 5. Additional cells describe how to handle video annotations and use Labelbox Diagnostics and Catalog |
| 20 | +# MAGIC |
21 | 21 | # MAGIC Additional documentation links are provided at the end of the notebook. |
22 | 22 |
|
23 | 23 | # COMMAND ---------- |
24 | 24 |
|
25 | 25 | # MAGIC %md |
26 | | -# MAGIC Thanks for trying out the Databricks and Labelbox Connector! You or someone from your organization signed up for a Labelbox trial through Databricks Partner Connect. This notebook was loaded into your Shared directory to help illustrate how Labelbox and Databricks can be used together to power unstructured data workflows. |
27 | | -# MAGIC |
28 | | -# MAGIC Labelbox can be used to rapidly annotate a variety of unstructured data from your Data Lake ([images](https://labelbox.com/product/image), [video](https://labelbox.com/product/video), [text](https://labelbox.com/product/text), and [geospatial tiled imagery](https://docs.labelbox.com/docs/tiled-imagery-editor)) and the Labelbox Connector for Databricks makes it easy to bring the annotations back into your Lakehouse environment for AI/ML and analytical workflows. |
29 | | -# MAGIC |
30 | | -# MAGIC If you would like to watch a video of the workflow, check out our [Data & AI Summit Demo](https://databricks.com/session_na21/productionizing-unstructured-data-for-ai-and-analytics). |
31 | | -# MAGIC |
32 | | -# MAGIC |
| 26 | +# MAGIC Thanks for trying out the Databricks and Labelbox Connector! You or someone from your organization signed up for a Labelbox trial through Databricks Partner Connect. This notebook was loaded into your Shared directory to help illustrate how Labelbox and Databricks can be used together to power unstructured data workflows. |
| 27 | +# MAGIC |
| 28 | +# MAGIC Labelbox can be used to rapidly annotate a variety of unstructured data from your Data Lake ([images](https://labelbox.com/product/image), [video](https://labelbox.com/product/video), [text](https://labelbox.com/product/text), and [geospatial tiled imagery](https://docs.labelbox.com/docs/tiled-imagery-editor)) and the Labelbox Connector for Databricks makes it easy to bring the annotations back into your Lakehouse environment for AI/ML and analytical workflows. |
| 29 | +# MAGIC |
| 30 | +# MAGIC If you would like to watch a video of the workflow, check out our [Data & AI Summit Demo](https://databricks.com/session_na21/productionizing-unstructured-data-for-ai-and-analytics). |
| 31 | +# MAGIC |
| 32 | +# MAGIC |
33 | 33 | # MAGIC <img src="https://labelbox.com/static/images/partnerships/collab-chart.svg" alt="example-workflow" width="800"/> |
34 | | -# MAGIC |
| 34 | +# MAGIC |
35 | 35 | # MAGIC <h5>Questions or comments? Reach out to us at [ecosystem+databricks@labelbox.com](mailto:ecosystem+databricks@labelbox.com) |
36 | 36 |
|
37 | 37 | # COMMAND ---------- |
|
41 | 41 |
|
42 | 42 | # COMMAND ---------- |
43 | 43 |
|
44 | | -#This will import Koalas or Pandas-on-Spark based on your DBR version. |
| 44 | +#This will import Koalas or Pandas-on-Spark based on your DBR version. |
45 | 45 | from pyspark import SparkContext |
46 | 46 | from packaging import version |
| 47 | + |
47 | 48 | sc = SparkContext.getOrCreate() |
48 | 49 | if version.parse(sc.version) < version.parse("3.2.0"): |
49 | | - import databricks.koalas as pd |
50 | | - needs_koalas = True |
| 50 | + import databricks.koalas as pd |
| 51 | + needs_koalas = True |
51 | 52 | else: |
52 | | - import pyspark.pandas as pd |
53 | | - needs_koalas = False |
| 53 | + import pyspark.pandas as pd |
| 54 | + needs_koalas = False |
54 | 55 |
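The cell above parses the Spark version before comparing it, rather than comparing the raw strings. The sketch below is a dependency-free illustration of why that matters; `parse_version` and `choose_pandas_api` are hypothetical helper names, and the parsing assumes plain numeric versions such as `3.2.0`.

```python
def parse_version(text: str) -> tuple:
    """Turn '3.2.0' into (3, 2, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split(".")[:3])


def choose_pandas_api(spark_version: str) -> str:
    """Return the module name the cell above would import for this Spark version."""
    if parse_version(spark_version) < parse_version("3.2.0"):
        return "databricks.koalas"
    return "pyspark.pandas"


print(choose_pandas_api("3.1.2"))   # databricks.koalas
print(choose_pandas_api("3.10.0"))  # pyspark.pandas
# A plain string comparison would get the second case wrong:
print("3.10.0" < "3.2.0")           # True
```

The notebook itself relies on `packaging.version.parse`, which also handles pre-release and suffix tags that this simplified parser does not.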
|
55 | 56 | # COMMAND ---------- |
56 | 57 |
|
57 | 58 | # MAGIC %md |
58 | 59 | # MAGIC ## Configure the SDK |
59 | | -# MAGIC |
| 60 | +# MAGIC |
60 | 61 | # MAGIC Now that the Labelbox and Databricks libraries have been installed, you will need to configure the SDK. You will need an API key, which you can create through the app [here](https://app.labelbox.com/account/api-keys). You can also store the key using the Databricks Secrets API. The SDK will attempt to use the environment variable `LABELBOX_API_KEY`. |
61 | 62 |
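One way to avoid pasting the key into the notebook is to look it up at run time. The following is a minimal sketch, assuming a hypothetical helper `resolve_api_key` that prefers an explicitly supplied key and falls back to the `LABELBOX_API_KEY` environment variable. On Databricks, the explicit key could come from `dbutils.secrets.get(scope=..., key=...)`; the scope and key names would be your own and are not defined by this notebook.

```python
import os


def resolve_api_key(explicit_key: str = "", env=None) -> str:
    """Return the first non-empty key: explicit argument, then LABELBOX_API_KEY."""
    # `env` is injectable so the lookup can be exercised without touching os.environ.
    env = os.environ if env is None else env
    key = explicit_key or env.get("LABELBOX_API_KEY", "")
    if not key:
        raise ValueError("Go to Labelbox to get an API key")
    return key


# Explicit key wins over the environment.
print(resolve_api_key("key-from-secrets", env={"LABELBOX_API_KEY": "key-from-env"}))
# With no explicit key, the environment variable is used.
print(resolve_api_key("", env={"LABELBOX_API_KEY": "key-from-env"}))
```

The resolved value would then be passed to `Client(...)` in place of the hard-coded `API_KEY` string.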
|
62 | 63 | # COMMAND ---------- |
|
65 | 66 | from labelbox.schema.ontology import OntologyBuilder, Tool, Classification, Option |
66 | 67 | import labelspark |
67 | 68 |
|
68 | | -API_KEY = "" |
| 69 | +API_KEY = "" |
| 70 | + |
| 71 | +if not (API_KEY): |
| 72 | + raise ValueError("Go to Labelbox to get an API key") |
69 | 73 |
|
70 | | -if not(API_KEY): |
71 | | - raise ValueError("Go to Labelbox to get an API key") |
72 | | - |
73 | 74 | client = Client(API_KEY) |
74 | 75 |
|
75 | 76 | # COMMAND ---------- |
76 | 77 |
|
77 | 78 | # MAGIC %md |
78 | 79 | # MAGIC ## Fetch seed data |
79 | | -# MAGIC |
| 80 | +# MAGIC |
80 | 81 | # MAGIC Next we'll load a demo dataset into a Spark table so you can see how to easily load assets into Labelbox via URL. For simplicity, you can get a Dataset ID from Labelbox and we'll load those URLs into a Spark table for you (so you don't need to worry about finding data to get this demo notebook to run). Below we'll grab the "Example Nature Dataset" included in Labelbox trials. |
81 | | -# MAGIC |
| 82 | +# MAGIC |
82 | 83 | # MAGIC Also, Labelbox has native support for AWS, Azure, and GCP cloud storage. You can connect Labelbox to your storage via [Delegated Access](https://docs.labelbox.com/docs/iam-delegated-access) and easily load those assets for annotation. For more information, you can watch this [video](https://youtu.be/wlWo6EmPDV4). |
83 | 84 |
|
84 | 85 | # COMMAND ---------- |
85 | 86 |
|
86 | | -sample_dataset = next(client.get_datasets(where=(Dataset.name == "Example Nature Dataset"))) |
| 87 | +sample_dataset = next( |
| 88 | + client.get_datasets(where=(Dataset.name == "Example Nature Dataset"))) |
87 | 89 | sample_dataset.uid |
88 | 90 |
|
89 | 91 | # COMMAND ---------- |
|
94 | 96 | tblList = spark.catalog.listTables() |
95 | 97 |
|
96 | 98 | if not any([table.name == SAMPLE_TABLE for table in tblList]): |
97 | | - |
98 | | - df = pd.DataFrame([ |
99 | | - { |
100 | | - "external_id": dr.external_id, |
101 | | - "row_data": dr.row_data |
102 | | - } for dr in sample_dataset.data_rows() |
103 | | - ]).to_spark() |
104 | | - df.registerTempTable(SAMPLE_TABLE) |
105 | | - print(f"Registered table: {SAMPLE_TABLE}") |
| 99 | + |
| 100 | + df = pd.DataFrame([{ |
| 101 | + "external_id": dr.external_id, |
| 102 | + "row_data": dr.row_data |
| 103 | + } for dr in sample_dataset.data_rows()]).to_spark() |
| 104 | + df.registerTempTable(SAMPLE_TABLE) |
| 105 | + print(f"Registered table: {SAMPLE_TABLE}") |
106 | 106 |
|
107 | 107 | # COMMAND ---------- |
108 | 108 |
|
|
117 | 117 |
|
118 | 118 | # MAGIC %md |
119 | 119 | # MAGIC ## Create a Labeling Project |
120 | | -# MAGIC |
| 120 | +# MAGIC |
121 | 121 | # MAGIC Projects are where teams create labels. A project requires a dataset of assets to be labeled and an ontology to configure the labeling interface. |
122 | | -# MAGIC |
| 122 | +# MAGIC |
123 | 123 | # MAGIC ### Step 1: Create a dataset |
124 | | -# MAGIC |
| 124 | +# MAGIC |
125 | 125 | # MAGIC The [Labelbox Connector for Databricks](https://pypi.org/project/labelspark/) expects a Spark table with two columns: the first named "external_id" and the second named "row_data". |
126 | | -# MAGIC |
| 126 | +# MAGIC |
127 | 127 | # MAGIC external_id is a filename, like "birds.jpg" or "my_video.mp4" |
128 | | -# MAGIC |
129 | | -# MAGIC row_data is the URL path to the file. Labelbox renders assets locally on your users' machines when they label, so your labeler will need permission to access that asset. |
130 | | -# MAGIC |
131 | | -# MAGIC Example: |
132 | | -# MAGIC |
| 128 | +# MAGIC |
| 129 | +# MAGIC row_data is the URL path to the file. Labelbox renders assets locally on your users' machines when they label, so your labeler will need permission to access that asset. |
| 130 | +# MAGIC |
| 131 | +# MAGIC Example: |
| 132 | +# MAGIC |
133 | 133 | # MAGIC | external_id | row_data | |
134 | 134 | # MAGIC |-------------|--------------------------------------| |
135 | 135 | # MAGIC | image1.jpg | https://url_to_your_asset/image1.jpg | |
|
140 | 140 |
|
141 | 141 | unstructured_data = spark.table(SAMPLE_TABLE) |
142 | 142 |
|
143 | | -demo_dataset = labelspark.create_dataset(client, unstructured_data, name = "Databricks Demo Dataset") |
| 143 | +demo_dataset = labelspark.create_dataset(client, |
| 144 | + unstructured_data, |
| 145 | + name="Databricks Demo Dataset") |
144 | 146 |
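If you are starting from a list of asset URLs rather than an existing table, the two-column layout described above can be assembled in plain Python first. This is a sketch only: `to_data_row` is a hypothetical helper, and using the last path segment of the URL as `external_id` is an assumption that fits the filename examples given earlier.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def to_data_row(url: str) -> dict:
    """Build one row with the two columns the connector expects."""
    # Use the last path segment of the URL as the filename-style external_id.
    filename = PurePosixPath(urlparse(url).path).name
    return {"external_id": filename, "row_data": url}


urls = [
    "https://url_to_your_asset/image1.jpg",
    "https://url_to_your_asset/videos/my_video.mp4",
]
rows = [to_data_row(u) for u in urls]
for row in rows:
    print(row)
```

A list of dictionaries in this shape is what the seed-data cell passes to `pd.DataFrame(...).to_spark()` before registering the table.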
|
145 | 147 | # COMMAND ---------- |
146 | 148 |
|
|
151 | 153 |
|
152 | 154 | # MAGIC %md |
153 | 155 | # MAGIC ### Step 2: Create a project |
154 | | -# MAGIC |
| 156 | +# MAGIC |
155 | 157 | # MAGIC You can use the Labelbox SDK to build your ontology (we'll do that next). You can also set up your project entirely through our website at app.labelbox.com. |
156 | | -# MAGIC |
| 158 | +# MAGIC |
157 | 159 | # MAGIC Check out our [ontology creation documentation.](https://docs.labelbox.com/docs/configure-ontology) |
158 | 160 |
|
159 | 161 | # COMMAND ---------- |
|
165 | 167 | ontology = OntologyBuilder() |
166 | 168 |
|
167 | 169 | tools = [ |
168 | | - Tool(tool=Tool.Type.BBOX, name="Frog"), |
169 | | - Tool(tool=Tool.Type.BBOX, name="Flower"), |
170 | | - Tool(tool=Tool.Type.BBOX, name="Fruit"), |
171 | | - Tool(tool=Tool.Type.BBOX, name="Plant"), |
172 | | - Tool(tool=Tool.Type.SEGMENTATION, name="Bird"), |
173 | | - Tool(tool=Tool.Type.SEGMENTATION, name="Person"), |
174 | | - Tool(tool=Tool.Type.SEGMENTATION, name="Sleep"), |
175 | | - Tool(tool=Tool.Type.SEGMENTATION, name="Yak"), |
176 | | - Tool(tool=Tool.Type.SEGMENTATION, name="Gemstone"), |
| 170 | + Tool(tool=Tool.Type.BBOX, name="Frog"), |
| 171 | + Tool(tool=Tool.Type.BBOX, name="Flower"), |
| 172 | + Tool(tool=Tool.Type.BBOX, name="Fruit"), |
| 173 | + Tool(tool=Tool.Type.BBOX, name="Plant"), |
| 174 | + Tool(tool=Tool.Type.SEGMENTATION, name="Bird"), |
| 175 | + Tool(tool=Tool.Type.SEGMENTATION, name="Person"), |
| 176 | + Tool(tool=Tool.Type.SEGMENTATION, name="Sleep"), |
| 177 | + Tool(tool=Tool.Type.SEGMENTATION, name="Yak"), |
| 178 | + Tool(tool=Tool.Type.SEGMENTATION, name="Gemstone"), |
177 | 179 | ] |
178 | | -for tool in tools: |
179 | | - ontology.add_tool(tool) |
| 180 | +for tool in tools: |
| 181 | + ontology.add_tool(tool) |
180 | 182 |
|
181 | 183 | conditions = ["clear", "overcast", "rain", "other"] |
182 | 184 |
|
183 | 185 | weather_classification = Classification( |
184 | 186 | class_type=Classification.Type.RADIO, |
185 | | - instructions="what is the weather?", |
186 | | - options=[Option(value=c) for c in conditions] |
187 | | -) |
| 187 | + instructions="what is the weather?", |
| 188 | + options=[Option(value=c) for c in conditions]) |
188 | 189 | ontology.add_classification(weather_classification) |
189 | 190 |
|
190 | | - |
191 | 191 | # Setup editor |
192 | 192 | for editor in client.get_labeling_frontends(): |
193 | 193 | if editor.name == 'Editor': |
194 | | - project_demo.setup(editor, ontology.asdict()) |
| 194 | + project_demo.setup(editor, ontology.asdict()) |
195 | 195 |
|
196 | 196 | print("Project Setup is complete.") |
197 | 197 |
|
|
213 | 213 |
|
214 | 214 | # MAGIC %md |
215 | 215 | # MAGIC ## Exporting labels/annotations |
216 | | -# MAGIC |
| 216 | +# MAGIC |
217 | 217 | # MAGIC After creating labels in Labelbox you can export them to use in Databricks for model training and analysis. |
218 | 218 |
|
219 | 219 | # COMMAND ---------- |
|
230 | 230 |
|
231 | 231 | # MAGIC %md |
232 | 232 | # MAGIC ## Other features of Labelbox |
233 | | -# MAGIC |
| 233 | +# MAGIC |
234 | 234 | # MAGIC <h3> [Model Assisted Labeling](https://docs.labelbox.com/docs/model-assisted-labeling) </h3> |
235 | 235 | # MAGIC Once you train a model on your initial set of unstructured data, you can plug that model into Labelbox to support a Model Assisted Labeling workflow. Review the outputs of your model, make corrections, and retrain with ease! You can reduce future labeling costs by >50% by leveraging model assisted labeling. |
236 | | -# MAGIC |
| 236 | +# MAGIC |
237 | 237 | # MAGIC <img src="https://files.readme.io/4c65e12-model-assisted-labeling.png" alt="MAL" width="800"/> |
238 | | -# MAGIC |
| 238 | +# MAGIC |
239 | 239 | # MAGIC <h3> [Catalog](https://docs.labelbox.com/docs/catalog) </h3> |
240 | | -# MAGIC Once you've created datasets and annotations in Labelbox, you can easily browse your datasets and curate new ones in Catalog. Use your model embeddings to find images by similarity search. |
241 | | -# MAGIC |
| 240 | +# MAGIC Once you've created datasets and annotations in Labelbox, you can easily browse your datasets and curate new ones in Catalog. Use your model embeddings to find images by similarity search. |
| 241 | +# MAGIC |
242 | 242 | # MAGIC <img src="https://files.readme.io/14f82d4-catalog-marketing.jpg" alt="Catalog" width="800"/> |
243 | | -# MAGIC |
| 243 | +# MAGIC |
244 | 244 | # MAGIC <h3> [Model Diagnostics](https://labelbox.com/product/model-diagnostics) </h3> |
245 | | -# MAGIC Labelbox complements your MLflow experiment tracking with the ability to easily visualize experiment predictions at scale. Model Diagnostics helps you quickly identify areas where your model is weak so you can collect the right data and refine the next model iteration. |
246 | | -# MAGIC |
| 245 | +# MAGIC Labelbox complements your MLflow experiment tracking with the ability to easily visualize experiment predictions at scale. Model Diagnostics helps you quickly identify areas where your model is weak so you can collect the right data and refine the next model iteration. |
| 246 | +# MAGIC |
247 | 247 | # MAGIC <img src="https://images.ctfassets.net/j20krz61k3rk/4LfIELIjpN6cou4uoFptka/20cbdc38cc075b82f126c2c733fb7082/identify-patterns-in-your-model-behavior.png" alt="Diagnostics" width="800"/> |
248 | 248 |
|
249 | 249 | # COMMAND ---------- |
250 | 250 |
|
251 | 251 | # DBTITLE 1,More Info |
252 | 252 | # MAGIC %md |
253 | | -# MAGIC While using the Labelbox Connector for Databricks, you will likely use the Labelbox SDK (e.g. for programmatic ontology creation). These resources will help familiarize you with the Labelbox Python SDK: |
| 253 | +# MAGIC While using the Labelbox Connector for Databricks, you will likely use the Labelbox SDK (e.g. for programmatic ontology creation). These resources will help familiarize you with the Labelbox Python SDK: |
254 | 254 | # MAGIC * [Visit our docs](https://labelbox.com/docs/python-api) to learn how the SDK works |
255 | 255 | # MAGIC * Check out our [notebook examples](https://github.com/Labelbox/labelspark/tree/master/notebooks) to follow along with interactive tutorials |
256 | 256 | # MAGIC * View our [API reference](https://labelbox.com/docs/python-api/api-reference). |
257 | | -# MAGIC |
| 257 | +# MAGIC |
258 | 258 | # MAGIC <h4>Questions or comments? Reach out to us at [ecosystem+databricks@labelbox.com](mailto:ecosystem+databricks@labelbox.com) |
259 | 259 |
|
260 | 260 | # COMMAND ---------- |
261 | 261 |
|
262 | 262 | # MAGIC %md |
263 | 263 | # MAGIC Copyright Labelbox, Inc. 2021. The source in this notebook is provided subject to the [Labelbox Terms of Service](https://docs.labelbox.com/page/terms-of-service). All included or referenced third party libraries are subject to the licenses set forth below. |
264 | | -# MAGIC |
| 264 | +# MAGIC |
265 | 265 | # MAGIC |Library Name|Library license | Library License URL | Library Source URL | |
266 | 266 | # MAGIC |---|---|---|---| |
267 | 267 | # MAGIC |Labelbox Python SDK|Apache-2.0 License |https://github.com/Labelbox/labelbox-python/blob/develop/LICENSE|https://github.com/Labelbox/labelbox-python |
|