
Commit a3af14d

Nick Lee authored and committed
Patched notebooks and added DBC for Databricks consumption, also ran YAPF on Python file
1 parent 6b147ff commit a3af14d

File tree: 4 files changed, +111 -97 lines changed

Binary file not shown.

examples/integrations/databricks/labelbox_databricks_example.html

Lines changed: 7 additions & 7 deletions
Large diffs are not rendered by default.

examples/integrations/databricks/labelbox_databricks_example.ipynb

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

examples/integrations/databricks/labelbox_databricks_example.py

Lines changed: 103 additions & 89 deletions
@@ -7,31 +7,31 @@
 # MAGIC %md
 # MAGIC #### Pre-requisites
 # MAGIC 1. This tutorial notebook requires a Labelbox API Key. Please login to your [Labelbox Account](app.labelbox.com) and generate an [API Key](https://app.labelbox.com/account/api-keys)
-# MAGIC 2. A few cells below will install the Labelbox SDK and Connector Library. This install is notebook-scoped and will not affect the rest of your cluster.
-# MAGIC 3. Please make sure you are running at least the latest LTS version of Databricks.
-# MAGIC
+# MAGIC 2. A few cells below will install the Labelbox SDK and Connector Library. This install is notebook-scoped and will not affect the rest of your cluster.
+# MAGIC 3. Please make sure you are running at least the latest LTS version of Databricks.
+# MAGIC
 # MAGIC #### Notebook Preview
-# MAGIC This notebook will guide you through these steps:
-# MAGIC 1. Connect to Labelbox via the SDK
+# MAGIC This notebook will guide you through these steps:
+# MAGIC 1. Connect to Labelbox via the SDK
 # MAGIC 2. Create a labeling dataset from a table of unstructured data in Databricks
 # MAGIC 3. Programmatically set up an ontology and labeling project in Labelbox
-# MAGIC 4. Load Bronze and Silver annotation tables from an example labeled project
-# MAGIC 5. Additional cells describe how to handle video annotations and use Labelbox Diagnostics and Catalog
-# MAGIC
+# MAGIC 4. Load Bronze and Silver annotation tables from an example labeled project
+# MAGIC 5. Additional cells describe how to handle video annotations and use Labelbox Diagnostics and Catalog
+# MAGIC
 # MAGIC Additional documentation links are provided at the end of the notebook.
 
 # COMMAND ----------
 
 # MAGIC %md
-# MAGIC Thanks for trying out the Databricks and Labelbox Connector! You or someone from your organization signed up for a Labelbox trial through Databricks Partner Connect. This notebook was loaded into your Shared directory to help illustrate how Labelbox and Databricks can be used together to power unstructured data workflows.
-# MAGIC
-# MAGIC Labelbox can be used to rapidly annotate a variety of unstructured data from your Data Lake ([images](https://labelbox.com/product/image), [video](https://labelbox.com/product/video), [text](https://labelbox.com/product/text), and [geospatial tiled imagery](https://docs.labelbox.com/docs/tiled-imagery-editor)) and the Labelbox Connector for Databricks makes it easy to bring the annotations back into your Lakehouse environment for AI/ML and analytical workflows.
-# MAGIC
-# MAGIC If you would like to watch a video of the workflow, check out our [Data & AI Summit Demo](https://databricks.com/session_na21/productionizing-unstructured-data-for-ai-and-analytics).
-# MAGIC
-# MAGIC
+# MAGIC Thanks for trying out the Databricks and Labelbox Connector! You or someone from your organization signed up for a Labelbox trial through Databricks Partner Connect. This notebook was loaded into your Shared directory to help illustrate how Labelbox and Databricks can be used together to power unstructured data workflows.
+# MAGIC
+# MAGIC Labelbox can be used to rapidly annotate a variety of unstructured data from your Data Lake ([images](https://labelbox.com/product/image), [video](https://labelbox.com/product/video), [text](https://labelbox.com/product/text), and [geospatial tiled imagery](https://docs.labelbox.com/docs/tiled-imagery-editor)) and the Labelbox Connector for Databricks makes it easy to bring the annotations back into your Lakehouse environment for AI/ML and analytical workflows.
+# MAGIC
+# MAGIC If you would like to watch a video of the workflow, check out our [Data & AI Summit Demo](https://databricks.com/session_na21/productionizing-unstructured-data-for-ai-and-analytics).
+# MAGIC
+# MAGIC
 # MAGIC <img src="https://labelbox.com/static/images/partnerships/collab-chart.svg" alt="example-workflow" width="800"/>
-# MAGIC
+# MAGIC
 # MAGIC <h5>Questions or comments? Reach out to us at [ecosystem+databricks@labelbox.com](mailto:ecosystem+databricks@labelbox.com)
 
 # COMMAND ----------
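The install cell referenced by the pre-requisites above is not part of this hunk; a minimal sketch of such a notebook-scoped install, assuming the Labelbox SDK (`labelbox`) and the Labelbox Connector for Databricks (`labelspark`) packages published on PyPI, would be a single cell like:

```python
# Databricks notebook cell: %pip installs are notebook-scoped, so they do not
# affect the rest of the cluster. Package names assume the PyPI distributions
# "labelbox" (SDK) and "labelspark" (Connector) linked elsewhere in this notebook.
%pip install labelbox labelspark
```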
@@ -41,23 +41,22 @@
 
 # COMMAND ----------
 
-#This will import Koalas or Pandas-on-Spark based on your DBR version.
+#This will import Koalas or Pandas-on-Spark based on your DBR version.
 from pyspark import SparkContext
 from packaging import version
-
 sc = SparkContext.getOrCreate()
 if version.parse(sc.version) < version.parse("3.2.0"):
-    import databricks.koalas as pd
-    needs_koalas = True
+    import databricks.koalas as pd
+    needs_koalas = True
 else:
-    import pyspark.pandas as pd
-    needs_koalas = False
+    import pyspark.pandas as pd
+    needs_koalas = False
 
 # COMMAND ----------
 
 # MAGIC %md
 # MAGIC ## Configure the SDK
-# MAGIC
+# MAGIC
 # MAGIC Now that Labelbox and the Databricks libraries have been installed, you will need to configure the SDK. You will need an API key that you can create through the app [here](https://app.labelbox.com/account/api-keys). You can also store the key using Databricks Secrets API. The SDK will attempt to use the env var `LABELBOX_API_KEY`
 
 # COMMAND ----------
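As the "Configure the SDK" cell above notes, the key can come from the Databricks Secrets API or the `LABELBOX_API_KEY` environment variable instead of being pasted into the notebook. A hedged sketch of the secrets approach, where the scope name "labelbox" and key name "api_key" are hypothetical placeholders:

```python
# Hypothetical sketch: read the Labelbox API key from a Databricks secret scope
# rather than hard-coding it. Adjust scope/key to whatever your workspace uses.
from labelbox import Client

API_KEY = dbutils.secrets.get(scope="labelbox", key="api_key")  # dbutils is available in Databricks notebooks
client = Client(API_KEY)  # if no key is passed, the SDK attempts the LABELBOX_API_KEY env var
```

Using a secret scope keeps the key out of notebook revisions and exported files, which matters for a notebook meant to be shared across a workspace.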
@@ -66,27 +65,48 @@
 from labelbox.schema.ontology import OntologyBuilder, Tool, Classification, Option
 import labelspark
 
-API_KEY = ""
-
-if not (API_KEY):
-    raise ValueError("Go to Labelbox to get an API key")
+API_KEY = ""
 
+if not(API_KEY):
+    raise ValueError("Go to Labelbox to get an API key")
+
 client = Client(API_KEY)
 
 # COMMAND ----------
 
 # MAGIC %md
-# MAGIC ## Fetch seed data
-# MAGIC
-# MAGIC Next we'll load a demo dataset into a Spark table so you can see how to easily load assets into Labelbox via URL. For simplicity, you can get a Dataset ID from Labelbox and we'll load those URLs into a Spark table for you (so you don't need to worry about finding data to get this demo notebook to run). Below we'll grab the "Example Nature Dataset" included in Labelbox trials.
-# MAGIC
+# MAGIC ## Create seed data
+# MAGIC
+# MAGIC Next we'll load a demo dataset into a Spark table so you can see how to easily load assets into Labelbox via URLs with the Labelbox Connector for Databricks.
+# MAGIC
 # MAGIC Also, Labelbox has native support for AWS, Azure, and GCP cloud storage. You can connect Labelbox to your storage via [Delegated Access](https://docs.labelbox.com/docs/iam-delegated-access) and easily load those assets for annotation. For more information, you can watch this [video](https://youtu.be/wlWo6EmPDV4).
+# MAGIC
+# MAGIC You can also add data to Labelbox [using the Labelbox SDK directly](https://docs.labelbox.com/docs/datasets-datarows). We recommend using the SDK if you have complicated dataset creation requirements (e.g. including metadata with your dataset) which aren't handled by the Labelbox Connector for Databricks.
 
 # COMMAND ----------
 
-sample_dataset = next(
-    client.get_datasets(where=(Dataset.name == "Example Nature Dataset")))
-sample_dataset.uid
+sample_dataset_dict = {"external_id":["sample1.jpg",
+                                      "sample2.jpg",
+                                      "sample3.jpg",
+                                      "sample4.jpg",
+                                      "sample5.jpg",
+                                      "sample6.jpg",
+                                      "sample7.jpg",
+                                      "sample8.jpg",
+                                      "sample9.jpg",
+                                      "sample10.jpg"],
+                       "row_data":["https://storage.googleapis.com/diagnostics-demo-data/coco/COCO_train2014_000000247422.jpg",
+                                   "https://storage.googleapis.com/diagnostics-demo-data/coco/COCO_train2014_000000484849.jpg",
+                                   "https://storage.googleapis.com/diagnostics-demo-data/coco/COCO_train2014_000000215782.jpg",
+                                   "https://storage.googleapis.com/diagnostics-demo-data/coco/COCO_val2014_000000312024.jpg",
+                                   "https://storage.googleapis.com/diagnostics-demo-data/coco/COCO_train2014_000000486139.jpg",
+                                   "https://storage.googleapis.com/diagnostics-demo-data/coco/COCO_train2014_000000302713.jpg",
+                                   "https://storage.googleapis.com/diagnostics-demo-data/coco/COCO_train2014_000000523272.jpg",
+                                   "https://storage.googleapis.com/diagnostics-demo-data/coco/COCO_train2014_000000094514.jpg",
+                                   "https://storage.googleapis.com/diagnostics-demo-data/coco/COCO_val2014_000000050578.jpg",
+                                   "https://storage.googleapis.com/diagnostics-demo-data/coco/COCO_train2014_000000073727.jpg"]}
+
+df = pd.DataFrame.from_dict(sample_dataset_dict).to_spark() #produces our demo Spark table of datarows for Labelbox
 
 # COMMAND ----------
 
@@ -96,18 +116,13 @@
 tblList = spark.catalog.listTables()
 
 if not any([table.name == SAMPLE_TABLE for table in tblList]):
-
-    df = pd.DataFrame([{
-        "external_id": dr.external_id,
-        "row_data": dr.row_data
-    } for dr in sample_dataset.data_rows()]).to_spark()
-    df.registerTempTable(SAMPLE_TABLE)
-    print(f"Registered table: {SAMPLE_TABLE}")
+    df.createOrReplaceTempView(SAMPLE_TABLE)
+    print(f"Registered table: {SAMPLE_TABLE}")
 
 # COMMAND ----------
 
 # MAGIC %md
-# MAGIC You should now have a temporary table "sample_unstructured_data" which includes the file names and URLs for some demo images. We're going to share this table with Labelbox using the Labelbox Connector for Databricks!
+# MAGIC You should now have a temporary table "sample_unstructured_data" which includes the file names and URLs for some demo images. We're going to use this table with Labelbox using the Labelbox Connector for Databricks!
 
 # COMMAND ----------
 
@@ -117,19 +132,19 @@
 
 # MAGIC %md
 # MAGIC ## Create a Labeling Project
-# MAGIC
+# MAGIC
 # MAGIC Projects are where teams create labels. A project requires a dataset of assets to be labeled and an ontology to configure the labeling interface.
-# MAGIC
+# MAGIC
 # MAGIC ### Step 1: Create a dataset
-# MAGIC
+# MAGIC
 # MAGIC The [Labelbox Connector for Databricks](https://pypi.org/project/labelspark/) expects a Spark table with two columns: the first column "external_id" and second column "row_data"
-# MAGIC
+# MAGIC
 # MAGIC external_id is a filename, like "birds.jpg" or "my_video.mp4"
-# MAGIC
-# MAGIC row_data is the URL path to the file. Labelbox renders assets locally on your users' machines when they label, so your labeler will need permission to access that asset.
-# MAGIC
-# MAGIC Example:
-# MAGIC
+# MAGIC
+# MAGIC row_data is the URL path to the file. Labelbox renders assets locally on your users' machines when they label, so your labeler will need permission to access that asset.
+# MAGIC
+# MAGIC Example:
+# MAGIC
 # MAGIC | external_id | row_data |
 # MAGIC |-------------|--------------------------------------|
 # MAGIC | image1.jpg | https://url_to_your_asset/image1.jpg |
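The two-column contract described above can be sanity-checked against the temp view registered earlier, before it is handed to `create_dataset` in the hunk below. A small sketch, assuming the `sample_unstructured_data` view created in the previous cells:

```python
# Confirm the registered view matches the connector's expected schema:
# first column "external_id", second column "row_data".
df = spark.table("sample_unstructured_data")
assert df.columns[:2] == ["external_id", "row_data"], f"unexpected columns: {df.columns}"
display(df.limit(5))  # preview a few rows before passing the table to labelspark
```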
@@ -140,9 +155,7 @@
 
 unstructured_data = spark.table(SAMPLE_TABLE)
 
-demo_dataset = labelspark.create_dataset(client,
-                                         unstructured_data,
-                                         name="Databricks Demo Dataset")
+demo_dataset = labelspark.create_dataset(client, unstructured_data, name = "Databricks Demo Dataset")
 
 # COMMAND ----------
 
@@ -153,9 +166,9 @@
 
 # MAGIC %md
 # MAGIC ### Step 2: Create a project
-# MAGIC
+# MAGIC
 # MAGIC You can use the labelbox SDK to build your ontology (we'll do that next). You can also set your project up entirely through our website at app.labelbox.com.
-# MAGIC
+# MAGIC
 # MAGIC Check out our [ontology creation documentation.](https://docs.labelbox.com/docs/configure-ontology)
 
 # COMMAND ----------
@@ -167,31 +180,32 @@
 ontology = OntologyBuilder()
 
 tools = [
-    Tool(tool=Tool.Type.BBOX, name="Frog"),
-    Tool(tool=Tool.Type.BBOX, name="Flower"),
-    Tool(tool=Tool.Type.BBOX, name="Fruit"),
-    Tool(tool=Tool.Type.BBOX, name="Plant"),
-    Tool(tool=Tool.Type.SEGMENTATION, name="Bird"),
-    Tool(tool=Tool.Type.SEGMENTATION, name="Person"),
-    Tool(tool=Tool.Type.SEGMENTATION, name="Sleep"),
-    Tool(tool=Tool.Type.SEGMENTATION, name="Yak"),
-    Tool(tool=Tool.Type.SEGMENTATION, name="Gemstone"),
+    Tool(tool=Tool.Type.BBOX, name="Car"),
+    Tool(tool=Tool.Type.BBOX, name="Flower"),
+    Tool(tool=Tool.Type.BBOX, name="Fruit"),
+    Tool(tool=Tool.Type.BBOX, name="Plant"),
+    Tool(tool=Tool.Type.SEGMENTATION, name="Bird"),
+    Tool(tool=Tool.Type.SEGMENTATION, name="Person"),
+    Tool(tool=Tool.Type.SEGMENTATION, name="Dog"),
+    Tool(tool=Tool.Type.SEGMENTATION, name="Gemstone"),
 ]
-for tool in tools:
-    ontology.add_tool(tool)
+for tool in tools:
+    ontology.add_tool(tool)
 
 conditions = ["clear", "overcast", "rain", "other"]
 
 weather_classification = Classification(
     class_type=Classification.Type.RADIO,
-    instructions="what is the weather?",
-    options=[Option(value=c) for c in conditions])
+    instructions="what is the weather?",
+    options=[Option(value=c) for c in conditions]
+)
 ontology.add_classification(weather_classification)
 
+
 # Setup editor
 for editor in client.get_labeling_frontends():
     if editor.name == 'Editor':
-        project_demo.setup(editor, ontology.asdict())
+        project_demo.setup(editor, ontology.asdict())
 
 print("Project Setup is complete.")
 
@@ -213,7 +227,7 @@
 
 # MAGIC %md
 # MAGIC ##Exporting labels/annotations
-# MAGIC
+# MAGIC
 # MAGIC After creating labels in Labelbox you can export them to use in Databricks for model training and analysis.
 
 # COMMAND ----------
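The hunk below updates the export cell to register the annotations as a temp view. A hedged follow-on sketch for persisting that export to the Lakehouse as a Delta table (the table name here is hypothetical, echoing the Bronze/Silver tables mentioned in the notebook preview):

```python
# Hypothetical follow-on: persist the exported annotations so downstream training
# and analytics jobs can read them without re-exporting from Labelbox.
labels_table = labelspark.get_annotations(client, project_demo.uid, spark, sc)
labels_table.write.format("delta").mode("overwrite").saveAsTable("labelbox_silver_annotations")
```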
@@ -223,45 +237,45 @@
 # COMMAND ----------
 
 labels_table = labelspark.get_annotations(client, project_demo.uid, spark, sc)
-labels_table.registerTempTable(LABEL_TABLE)
+labels_table.createOrReplaceTempView(LABEL_TABLE)
 display(labels_table)
 
 # COMMAND ----------
 
 # MAGIC %md
 # MAGIC ## Other features of Labelbox
-# MAGIC
-# MAGIC <h3> [Model Assisted Labeling](https://docs.labelbox.com/docs/model-assisted-labeling) </h3>
-# MAGIC Once you train a model on your initial set of unstructured data, you can plug that model into Labelbox to support a Model Assisted Labeling workflow. Review the outputs of your model, make corrections, and retrain with ease! You can reduce future labeling costs by >50% by leveraging model assisted labeling.
-# MAGIC
+# MAGIC
+# MAGIC [Model Assisted Labeling](https://docs.labelbox.com/docs/model-assisted-labeling)
+# MAGIC <br>Once you train a model on your initial set of unstructured data, you can plug that model into Labelbox to support a Model Assisted Labeling workflow. Review the outputs of your model, make corrections, and retrain with ease! You can reduce future labeling costs by >50% by leveraging model assisted labeling.
+# MAGIC
 # MAGIC <img src="https://files.readme.io/4c65e12-model-assisted-labeling.png" alt="MAL" width="800"/>
-# MAGIC
-# MAGIC <h3> [Catalog](https://docs.labelbox.com/docs/catalog) </h3>
-# MAGIC Once you've created datasets and annotations in Labelbox, you can easily browse your datasets and curate new ones in Catalog. Use your model embeddings to find images by similarity search.
-# MAGIC
+# MAGIC
+# MAGIC [Catalog](https://docs.labelbox.com/docs/catalog)
+# MAGIC <br>Once you've created datasets and annotations in Labelbox, you can easily browse your datasets and curate new ones in Catalog. Use your model embeddings to find images by similarity search.
+# MAGIC
 # MAGIC <img src="https://files.readme.io/14f82d4-catalog-marketing.jpg" alt="Catalog" width="800"/>
-# MAGIC
-# MAGIC <h3> [Model Diagnostics](https://labelbox.com/product/model-diagnostics) </h3>
-# MAGIC Labelbox complements your MLFlow experiment tracking with the ability to easily visualize experiment predictions at scale. Model Diagnostics helps you quickly identify areas where your model is weak so you can collect the right data and refine the next model iteration.
-# MAGIC
+# MAGIC
+# MAGIC [Model Diagnostics](https://labelbox.com/product/model-diagnostics)
+# MAGIC <br>Labelbox complements your MLFlow experiment tracking with the ability to easily visualize experiment predictions at scale. Model Diagnostics helps you quickly identify areas where your model is weak so you can collect the right data and refine the next model iteration.
+# MAGIC
 # MAGIC <img src="https://images.ctfassets.net/j20krz61k3rk/4LfIELIjpN6cou4uoFptka/20cbdc38cc075b82f126c2c733fb7082/identify-patterns-in-your-model-behavior.png" alt="Diagnostics" width="800"/>
 
 # COMMAND ----------
 
 # DBTITLE 1,More Info
 # MAGIC %md
-# MAGIC While using the Labelbox Connector for Databricks, you will likely use the Labelbox SDK (e.g. for programmatic ontology creation). These resources will help familiarize you with the Labelbox Python SDK:
+# MAGIC While using the Labelbox Connector for Databricks, you will likely use the Labelbox SDK (e.g. for programmatic ontology creation). These resources will help familiarize you with the Labelbox Python SDK:
 # MAGIC * [Visit our docs](https://labelbox.com/docs/python-api) to learn how the SDK works
 # MAGIC * Check out our [notebook examples](https://github.com/Labelbox/labelspark/tree/master/notebooks) to follow along with interactive tutorials
 # MAGIC * View our [API reference](https://labelbox.com/docs/python-api/api-reference).
-# MAGIC
-# MAGIC <h4>Questions or comments? Reach out to us at [ecosystem+databricks@labelbox.com](mailto:ecosystem+databricks@labelbox.com)
+# MAGIC
+# MAGIC <b>Questions or comments? Reach out to us at [ecosystem+databricks@labelbox.com](mailto:ecosystem+databricks@labelbox.com)
 
 # COMMAND ----------
 
 # MAGIC %md
-# MAGIC Copyright Labelbox, Inc. 2021. The source in this notebook is provided subject to the [Labelbox Terms of Service](https://docs.labelbox.com/page/terms-of-service). All included or referenced third party libraries are subject to the licenses set forth below.
-# MAGIC
+# MAGIC Copyright Labelbox, Inc. 2022. The source in this notebook is provided subject to the [Labelbox Terms of Service](https://docs.labelbox.com/page/terms-of-service). All included or referenced third party libraries are subject to the licenses set forth below.
+# MAGIC
 # MAGIC |Library Name|Library license | Library License URL | Library Source URL |
 # MAGIC |---|---|---|---|
 # MAGIC |Labelbox Python SDK|Apache-2.0 License |https://github.com/Labelbox/labelbox-python/blob/develop/LICENSE|https://github.com/Labelbox/labelbox-python
