"### Select all data rows from the dataset created earlier that will be added to the batch.\n"
143
+
"## Create batches"
162
144
],
163
145
"cell_type": "markdown"
164
146
},
165
147
{
166
148
"metadata": {},
167
149
"source": [
168
-
"data_row_ids = [dr.uid for dr in dataset.export_data_rows()]\n",
169
-
"print(\"Number of data row ids:\", len(data_row_ids))"
150
+
"### Select all data rows from the dataset\n"
170
151
],
171
-
"cell_type": "code",
172
-
"outputs": [
173
-
{
174
-
"name": "stdout",
175
-
"output_type": "stream",
176
-
"text": [
177
-
"Number of data row ids: 8\n"
178
-
]
179
-
}
152
+
"cell_type": "markdown"
153
+
},
154
+
{
155
+
"metadata": {},
156
+
"source": [
157
+
"global_keys = [data_row.global_key for data_row in dataset.export_data_rows()]\n",
158
+
"print(\"Number of global keys:\", len(global_keys))"
180
159
],
160
+
"cell_type": "code",
161
+
"outputs": [],
181
162
"execution_count": null
182
163
},
183
164
{
184
165
"metadata": {},
185
166
"source": [
186
-
"## Select a random sample\n",
167
+
"### Select a random sample\n",
187
168
"This method is useful if you have large datasets and only want to work with a handful of data rows"
188
169
],
189
170
"cell_type": "markdown"
190
171
},
191
172
{
192
173
"metadata": {},
193
174
"source": [
194
-
"sample = random.sample(data_row_ids, 4)"
175
+
"sample = random.sample(global_keys, 4)"
195
176
],
196
177
"cell_type": "code",
197
178
"outputs": [],
@@ -200,83 +181,140 @@
200
181
{
201
182
"metadata": {},
202
183
"source": [
203
-
"# Batch Manipulation"
184
+
"### Create a batch\n",
185
+
"This method takes in a list of either data row IDs or `DataRow` objects into a `data_rows` argument or global keys into a `global_keys` argument, but both approaches cannot be used in the same method."
204
186
],
205
187
"cell_type": "markdown"
206
188
},
207
189
{
208
190
"metadata": {},
209
191
"source": [
210
-
"### Create a Batch:\n"
192
+
"batch = project.create_batch(\n",
193
+
" name=\"Demo-First-Batch\", # Each batch in a project must have a unique name\n",
194
+
" global_keys=sample, # A list of data rows or data row ids\n",
195
+
" priority=5 # priority between 1(Highest) - 5(lowest)\n",
196
+
")\n",
197
+
"# number of data rows in the batch\n",
198
+
"print(\"Number of data rows in batch: \", batch.size)"
199
+
],
200
+
"cell_type": "code",
201
+
"outputs": [],
202
+
"execution_count": null
203
+
},
204
+
{
205
+
"metadata": {},
206
+
"source": [
207
+
"### Create multiple batches\n",
208
+
"The `project.create_batches()` method accepts up to 1 million data rows. Batches are chunked into groups of 100k if necessary, which is the maximum batch size. This method takes in a list of either data row IDs or `DataRow` objects into a `data_rows` argument or global keys into a `global_keys` argument, but both approaches cannot be used in the same method.\n",
209
+
"\n",
210
+
"This method takes in a list of either data row IDs or `DataRow` objects into a `data_rows` argument or global keys into a `global_keys` argument, but both approaches cannot be used in the same method. Batches will be created with the specified `name_prefix` argument and a unique suffix to ensure unique batch names. The suffix will be a 4-digit number starting at `0000`.\n",
211
+
"\n",
212
+
"For example, if the name prefix is `demo-create-batches-` and three batches are created, the names will be `demo-create-batches-0000`, `demo-create-batches-0001`, and `demo-create-batches-0002`. This method will throw an error if a batch with the same name already exists.\n",
213
+
"\n",
214
+
"In the code below, only one batch will be created, since we are only using the few data rows we created above. Creating over 100k data rows for this demonstration is not sensible, but this method is the preferred approach for batch creation as it will gracefully handle massive sets of data rows."
211
215
],
212
216
"cell_type": "markdown"
213
217
},
214
218
{
215
219
"metadata": {},
216
220
"source": [
217
-
"batch = project.create_batch(\n",
218
-
"\"Demo-First-Batch\", # Each batch in a project must have a unique name\n",
219
-
"sample, # A list of data rows or data row ids\n",
220
-
"5 # priority between 1(Highest) - 5(lowest)\n",
221
+
"# First, we must create a second project so that we can re-use the data rows we already created.\n",
222
+
"second_project = client.create_project(\n",
223
+
"name=\"Second-Demo-Batches-Project\", \n",
224
+
"media_type=lb.MediaType.Image\n",
221
225
")\n",
222
-
"# number of data rows in the batch\n",
223
-
"print(\"Number of data rows in batch: \", batch.size)"
"# Then, use the method that will create multiple batches if necessary.\n",
229
+
"task = second_project.create_batches(\n",
230
+
" name_prefix=\"demo-create-batches-\",\n",
231
+
" global_keys=global_keys,\n",
232
+
" priority=5\n",
233
+
")\n",
234
+
"\n",
235
+
"print(\"Errors: \", task.errors())\n",
236
+
"print(\"Result: \", task.result())"
224
237
],
225
238
"cell_type": "code",
226
-
"outputs": [
227
-
{
228
-
"name": "stdout",
229
-
"output_type": "stream",
230
-
"text": [
231
-
"Number of data rows in batch: 4\n"
232
-
]
233
-
}
239
+
"outputs": [],
240
+
"execution_count": null
241
+
},
242
+
{
243
+
"metadata": {},
244
+
"source": [
245
+
"### Create batches from a dataset\n",
246
+
"\n",
247
+
"If you wish to create batches in a project using all the data rows of a dataset, instead of having to gather global keys or ID and using subsets of data rows, you can use the `project.create_batches_from_dataset()` method. This method takes in a dataset ID and creates a batch (or batches if there are more than 100k data rows) comprised of all data rows not already in the project.\n",
248
+
"\n",
249
+
"The same logic applies to the `name_prefix` argument and the naming of batches as described in the section immediately above."
250
+
],
251
+
"cell_type": "markdown"
252
+
},
253
+
{
254
+
"metadata": {},
255
+
"source": [
256
+
"# First, we must create a third project so that we can re-use the data rows we already created.\n",
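The chunking and naming rule that the new markdown describes for `create_batches()` can be sketched in plain Python. This is only an illustration of the documented behavior (groups of at most 100k data rows, a `name_prefix` plus a 4-digit suffix starting at `0000`), not the SDK's implementation. The helper `plan_batches` and the constant `MAX_BATCH_SIZE` are hypothetical names introduced here, and no Labelbox call is made.

```python
from typing import Dict, List

# Maximum batch size as stated in the notebook text (100k data rows).
MAX_BATCH_SIZE = 100_000


def plan_batches(
    global_keys: List[str],
    name_prefix: str,
    max_batch_size: int = MAX_BATCH_SIZE,
) -> Dict[str, List[str]]:
    """Hypothetical helper: map each batch name to its chunk of global keys."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be positive")
    plan: Dict[str, List[str]] = {}
    starts = range(0, len(global_keys), max_batch_size)
    for index, start in enumerate(starts):
        # The suffix is a zero-padded 4-digit counter starting at 0000.
        name = f"{name_prefix}{index:04d}"
        plan[name] = global_keys[start:start + max_batch_size]
    return plan


# 250k made-up keys are split into three batches.
keys = [f"key-{i}" for i in range(250_000)]
plan = plan_batches(keys, "demo-create-batches-")
print(list(plan))
print([len(chunk) for chunk in plan.values()])
```

With the eight data rows used in the notebook, the same rule yields a single batch named `demo-create-batches-0000`, which matches the note that only one batch is created in the demonstration.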