|
111 | 111 | "\n",
112 | 112 | "### Audio Classification Annotations\n",
113 | 113 | "\n",
114 | | - "Use `AudioClassificationAnnotation` for classifications tied to specific time ranges.\n"
| 114 | + "Use `AudioClassificationAnnotation` for classifications tied to specific time ranges. The interface now accepts milliseconds directly for precise timing control.\n"
115 | 115 | ]
116 | 116 | },
117 | 117 | {
|
122 | 122 | "source": [ |
123 | 123 | "# Speaker identification for a time range\n", |
124 | 124 | "speaker_annotation = lb_types.AudioClassificationAnnotation.from_time_range(\n", |
125 | | - " start_sec=2.5, # Start at 2.5 seconds\n", |
126 | | - " end_sec=4.1, # End at 4.1 seconds\n", |
| 125 | + " start_ms=2500, # Start at 2500 milliseconds (2.5 seconds)\n", |
| 126 | + " end_ms=4100, # End at 4100 milliseconds (4.1 seconds)\n", |
127 | 127 | " name=\"speaker_id\",\n", |
128 | 128 | " value=lb_types.Radio(answer=lb_types.ClassificationAnswer(name=\"john\"))\n", |
129 | 129 | ")\n", |
|
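If existing call sites still hold second-based values from the previous `start_sec`/`end_sec` interface, the conversion is mechanical. A minimal sketch; `sec_to_ms` is an illustrative helper, not part of the SDK:

```python
def sec_to_ms(seconds: float) -> int:
    """Convert a second-based timestamp to integer milliseconds."""
    return int(round(seconds * 1000))

# An old call site such as start_sec=2.5 becomes start_ms=sec_to_ms(2.5) -> 2500
```
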
140 | 140 | "source": [ |
141 | 141 | "# Audio quality assessment for a segment\n", |
142 | 142 | "quality_annotation = lb_types.AudioClassificationAnnotation.from_time_range(\n", |
143 | | - " start_sec=0.0,\n", |
144 | | - " end_sec=10.0,\n", |
| 143 | + " start_ms=0,\n", |
| 144 | + " end_ms=10000,\n", |
145 | 145 | " name=\"audio_quality\",\n", |
146 | 146 | " value=lb_types.Checklist(answer=[\n", |
147 | 147 | " lb_types.ClassificationAnswer(name=\"clear_audio\"),\n", |
|
151 | 151 | "\n", |
152 | 152 | "# Emotion detection for a segment\n", |
153 | 153 | "emotion_annotation = lb_types.AudioClassificationAnnotation.from_time_range(\n", |
154 | | - " start_sec=5.2,\n", |
155 | | - " end_sec=8.7,\n", |
| 154 | + " start_ms=5200,\n", |
| 155 | + " end_ms=8700,\n", |
156 | 156 | " name=\"emotion\",\n", |
157 | 157 | " value=lb_types.Radio(answer=lb_types.ClassificationAnswer(name=\"happy\"))\n", |
158 | 158 | ")\n" |
|
164 | 164 | "source": [ |
165 | 165 | "### Audio Object Annotations\n", |
166 | 166 | "\n", |
167 | | - "Use `AudioObjectAnnotation` for text entities like transcriptions tied to specific time ranges.\n" |
| 167 | + "Use `AudioObjectAnnotation` for text entities like transcriptions tied to specific time ranges. The interface now accepts milliseconds directly for precise timing control.\n" |
168 | 168 | ] |
169 | 169 | }, |
170 | 170 | { |
|
175 | 175 | "source": [ |
176 | 176 | "# Transcription with precise timestamps\n", |
177 | 177 | "transcription_annotation = lb_types.AudioObjectAnnotation.from_time_range(\n", |
178 | | - " start_sec=2.5,\n", |
179 | | - " end_sec=4.1,\n", |
| 178 | + " start_ms=2500,\n", |
| 179 | + " end_ms=4100,\n", |
180 | 180 | " name=\"transcription\",\n", |
181 | 181 | " value=lb_types.TextEntity(text=\"Hello, how are you doing today?\")\n", |
182 | 182 | ")\n", |
|
193 | 193 | "source": [ |
194 | 194 | "# Sound event detection\n", |
195 | 195 | "sound_event_annotation = lb_types.AudioObjectAnnotation.from_time_range(\n", |
196 | | - " start_sec=10.0,\n", |
197 | | - " end_sec=12.5,\n", |
| 196 | + " start_ms=10000,\n", |
| 197 | + " end_ms=12500,\n", |
198 | 198 | " name=\"sound_event\",\n", |
199 | 199 | " value=lb_types.TextEntity(text=\"Dog barking in background\")\n", |
200 | 200 | ")\n", |
201 | 201 | "\n", |
202 | 202 | "# Multiple transcription segments\n", |
203 | 203 | "transcription_segments = [\n", |
204 | 204 | " lb_types.AudioObjectAnnotation.from_time_range(\n", |
205 | | - " start_sec=0.0, end_sec=2.3,\n", |
| 205 | + " start_ms=0, end_ms=2300,\n", |
206 | 206 | " name=\"transcription\",\n", |
207 | 207 | " value=lb_types.TextEntity(text=\"Welcome to our podcast.\")\n", |
208 | 208 | " ),\n", |
209 | 209 | " lb_types.AudioObjectAnnotation.from_time_range(\n", |
210 | | - " start_sec=2.5, end_sec=5.8,\n", |
| 210 | + " start_ms=2500, end_ms=5800,\n", |
211 | 211 | " name=\"transcription\", \n", |
212 | 212 | " value=lb_types.TextEntity(text=\"Today we're discussing AI advancements.\")\n", |
213 | 213 | " ),\n", |
214 | 214 | " lb_types.AudioObjectAnnotation.from_time_range(\n", |
215 | | - " start_sec=6.0, end_sec=9.2,\n", |
| 215 | + " start_ms=6000, end_ms=9200,\n", |
216 | 216 | " name=\"transcription\",\n", |
217 | 217 | " value=lb_types.TextEntity(text=\"Let's start with machine learning basics.\")\n", |
218 | 218 | " )\n", |
|
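Segment lists like the one above are usually generated from data rather than written out by hand. A hedged sketch, assuming the notebook's `lb_types` alias (`import labelbox.types as lb_types`) and an illustrative `(start_ms, end_ms, text)` tuple list:

```python
import labelbox.types as lb_types

# Illustrative ASR-style output: (start_ms, end_ms, text) per segment
segments = [
    (0, 2300, "Welcome to our podcast."),
    (2500, 5800, "Today we're discussing AI advancements."),
    (6000, 9200, "Let's start with machine learning basics."),
]

transcription_segments = [
    lb_types.AudioObjectAnnotation.from_time_range(
        start_ms=start, end_ms=end,
        name="transcription",
        value=lb_types.TextEntity(text=text),
    )
    for start, end, text in segments
]
```
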
238 | 238 | "podcast_annotations = [\n", |
239 | 239 | " # Host introduction\n", |
240 | 240 | " lb_types.AudioClassificationAnnotation.from_time_range(\n", |
241 | | - " start_sec=0.0, end_sec=5.0,\n", |
| 241 | + " start_ms=0, end_ms=5000,\n", |
242 | 242 | " name=\"speaker_id\",\n", |
243 | 243 | " value=lb_types.Radio(answer=lb_types.ClassificationAnswer(name=\"host\"))\n", |
244 | 244 | " ),\n", |
245 | 245 | " lb_types.AudioObjectAnnotation.from_time_range(\n", |
246 | | - " start_sec=0.0, end_sec=5.0,\n", |
| 246 | + " start_ms=0, end_ms=5000,\n", |
247 | 247 | " name=\"transcription\",\n", |
248 | 248 | " value=lb_types.TextEntity(text=\"Welcome to Tech Talk, I'm your host Sarah.\")\n", |
249 | 249 | " ),\n", |
250 | 250 | " \n", |
251 | 251 | " # Guest response\n", |
252 | 252 | " lb_types.AudioClassificationAnnotation.from_time_range(\n", |
253 | | - " start_sec=5.2, end_sec=8.5,\n", |
| 253 | + " start_ms=5200, end_ms=8500,\n", |
254 | 254 | " name=\"speaker_id\",\n", |
255 | 255 | " value=lb_types.Radio(answer=lb_types.ClassificationAnswer(name=\"guest\"))\n", |
256 | 256 | " ),\n", |
257 | 257 | " lb_types.AudioObjectAnnotation.from_time_range(\n", |
258 | | - " start_sec=5.2, end_sec=8.5,\n", |
| 258 | + " start_ms=5200, end_ms=8500,\n", |
259 | 259 | " name=\"transcription\",\n", |
260 | 260 | " value=lb_types.TextEntity(text=\"Thanks for having me, Sarah!\")\n", |
261 | 261 | " ),\n", |
262 | 262 | " \n", |
263 | 263 | " # Audio quality assessment\n", |
264 | 264 | " lb_types.AudioClassificationAnnotation.from_time_range(\n", |
265 | | - " start_sec=0.0, end_sec=10.0,\n", |
| 265 | + " start_ms=0, end_ms=10000,\n", |
266 | 266 | " name=\"audio_quality\",\n", |
267 | 267 | " value=lb_types.Radio(answer=lb_types.ClassificationAnswer(name=\"excellent\"))\n", |
268 | 268 | " )\n", |
|
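Before upload, annotation lists like `podcast_annotations` are attached to a `Label` for their data row. A hedged sketch; the global key is hypothetical, and the exact shape of the `data` payload can vary by SDK version:

```python
import labelbox.types as lb_types

label = lb_types.Label(
    data={"global_key": "podcast-episode-001"},  # hypothetical global key
    annotations=podcast_annotations,
)
```
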
288 | 288 | "call_center_annotations = [\n", |
289 | 289 | " # Customer sentiment analysis\n", |
290 | 290 | " lb_types.AudioClassificationAnnotation.from_time_range(\n", |
291 | | - " start_sec=0.0, end_sec=30.0,\n", |
| 291 | + " start_ms=0, end_ms=30000,\n", |
292 | 292 | " name=\"customer_sentiment\",\n", |
293 | 293 | " value=lb_types.Radio(answer=lb_types.ClassificationAnswer(name=\"frustrated\"))\n", |
294 | 294 | " ),\n", |
295 | 295 | " \n", |
296 | 296 | " # Agent performance\n", |
297 | 297 | " lb_types.AudioClassificationAnnotation.from_time_range(\n", |
298 | | - " start_sec=30.0, end_sec=60.0,\n", |
| 298 | + " start_ms=30000, end_ms=60000,\n", |
299 | 299 | " name=\"agent_performance\",\n", |
300 | 300 | " value=lb_types.Checklist(answer=[\n", |
301 | 301 | " lb_types.ClassificationAnswer(name=\"professional_tone\"),\n", |
|
306 | 306 | " \n", |
307 | 307 | " # Key phrases extraction\n", |
308 | 308 | " lb_types.AudioObjectAnnotation.from_time_range(\n", |
309 | | - " start_sec=15.0, end_sec=18.0,\n", |
| 309 | + " start_ms=15000, end_ms=18000,\n", |
310 | 310 | " name=\"key_phrase\",\n", |
311 | 311 | " value=lb_types.TextEntity(text=\"I want to speak to your manager\")\n", |
312 | 312 | " ),\n", |
313 | 313 | " \n", |
314 | 314 | " lb_types.AudioObjectAnnotation.from_time_range(\n", |
315 | | - " start_sec=45.0, end_sec=48.0,\n", |
| 315 | + " start_ms=45000, end_ms=48000,\n", |
316 | 316 | " name=\"key_phrase\",\n", |
317 | 317 | " value=lb_types.TextEntity(text=\"Thank you for your patience\")\n", |
318 | 318 | " )\n", |
|
338 | 338 | "music_annotations = [\n", |
339 | 339 | " # Musical instruments\n", |
340 | 340 | " lb_types.AudioClassificationAnnotation.from_time_range(\n", |
341 | | - " start_sec=0.0, end_sec=30.0,\n", |
| 341 | + " start_ms=0, end_ms=30000,\n", |
342 | 342 | " name=\"instruments\",\n", |
343 | 343 | " value=lb_types.Checklist(answer=[\n", |
344 | 344 | " lb_types.ClassificationAnswer(name=\"piano\"),\n", |
|
349 | 349 | " \n", |
350 | 350 | " # Genre classification\n", |
351 | 351 | " lb_types.AudioClassificationAnnotation.from_time_range(\n", |
352 | | - " start_sec=0.0, end_sec=60.0,\n", |
| 352 | + " start_ms=0, end_ms=60000,\n", |
353 | 353 | " name=\"genre\",\n", |
354 | 354 | " value=lb_types.Radio(answer=lb_types.ClassificationAnswer(name=\"classical\"))\n", |
355 | 355 | " ),\n", |
356 | 356 | " \n", |
357 | 357 | " # Sound events\n", |
358 | 358 | " lb_types.AudioObjectAnnotation.from_time_range(\n", |
359 | | - " start_sec=25.0, end_sec=27.0,\n", |
| 359 | + " start_ms=25000, end_ms=27000,\n", |
360 | 360 | " name=\"sound_event\",\n", |
361 | 361 | " value=lb_types.TextEntity(text=\"Applause from audience\")\n", |
362 | 362 | " ),\n", |
363 | 363 | " \n", |
364 | 364 | " lb_types.AudioObjectAnnotation.from_time_range(\n", |
365 | | - " start_sec=45.0, end_sec=46.5,\n", |
| 365 | + " start_ms=45000, end_ms=46500,\n", |
366 | 366 | " name=\"sound_event\",\n", |
367 | 367 | " value=lb_types.TextEntity(text=\"Door closing in background\")\n", |
368 | 368 | " )\n", |
|
681 | 681 | "\n", |
682 | 682 | "# Audio: 1 frame = 1 millisecond\n", |
683 | 683 | "audio_annotation = lb_types.AudioClassificationAnnotation.from_time_range(\n", |
684 | | - " start_sec=2.5, end_sec=4.1,\n", |
| 684 | + " start_ms=2500, end_ms=4100,\n", |
685 | 685 | " name=\"test\", value=lb_types.Text(answer=\"test\")\n", |
686 | 686 | ")\n", |
687 | 687 | "\n", |
688 | 688 | "print(f\"Audio Annotation:\")\n", |
689 | | - "print(f\" Time: 2.5s → Frame: {audio_annotation.frame} (milliseconds)\")\n", |
| 689 | + "print(f\" Time: 2500ms → Frame: {audio_annotation.frame} (milliseconds)\")\n", |
690 | 690 | "print(f\" Frame rate: 1000 frames/second (1 frame = 1ms)\")\n", |
691 | 691 | "\n", |
692 | 692 | "print(f\"\\nVideo Annotation (for comparison):\")\n", |
|
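Given the 1 frame = 1 ms convention above, the stored frame should equal the millisecond input verbatim. A quick sanity check under that assumption:

```python
# from_time_range(start_ms=2500, ...) should store frame == 2500
assert audio_annotation.frame == 2500, audio_annotation.frame
```
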
704 | 704 | "\n", |
705 | 705 | "### 1. Time Precision\n", |
706 | 706 | "- Audio temporal annotations use millisecond precision (1 frame = 1ms)\n", |
707 | | - "- Always use the `from_time_range()` method for user-friendly second-based input\n", |
708 | | - "- Frame values are automatically calculated: `frame = int(start_sec * 1000)`\n", |
| 707 | + "- Use the `from_time_range()` method with millisecond-based input for precise timing control\n", |
| 708 | + "- Frame values are set directly: `frame = start_ms`\n", |
709 | 709 | "\n", |
710 | 710 | "### 2. Ontology Alignment\n", |
711 | 711 | "- Ensure annotation `name` fields match your ontology tool/classification names\n", |
|
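For the name alignment point above, the ontology's tools and classifications must carry the same `name` values the annotations use. A hedged sketch of a matching radio classification (option values are illustrative):

```python
import labelbox as lb

ontology_builder = lb.OntologyBuilder(
    classifications=[
        lb.Classification(
            class_type=lb.Classification.Type.RADIO,
            name="speaker_id",  # must match the annotation's `name`
            options=[lb.Option(value="host"), lb.Option(value="guest")],
        ),
    ],
)
```
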
751 | 751 | "This notebook demonstrated:\n", |
752 | 752 | "\n", |
753 | 753 | "1. **Creating temporal audio annotations** using `AudioClassificationAnnotation` and `AudioObjectAnnotation`\n", |
754 | | - "2. **Time-based API** with `from_time_range()` for user-friendly input\n", |
| 754 | + "2. **Millisecond-based API** with `from_time_range()` for precise timing control\n", |
755 | 755 | "3. **Multiple use cases**: podcasts, call centers, music analysis\n", |
756 | 756 | "4. **MAL import pipeline** for uploading temporal prelabels\n", |
757 | 757 | "5. **NDJSON serialization** compatible with existing video infrastructure\n", |
|
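The MAL upload in point 4 follows the standard prediction-import flow. A hedged sketch, assuming an authenticated client and an existing project (the API key, project ID, and `label` list are placeholders):

```python
import labelbox as lb

client = lb.Client(api_key="YOUR_API_KEY")

upload_job = lb.MALPredictionImport.create_from_objects(
    client=client,
    project_id="YOUR_PROJECT_ID",
    name="audio-temporal-mal-import",
    predictions=[label],  # Label objects built above
)
upload_job.wait_until_done()
print(upload_job.errors)  # [] on success
```
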
762 | 762 | "- **Frame-based precision** - 1ms accuracy for audio timing\n", |
763 | 763 | "- **Seamless integration** - works with existing MAL and Label Import pipelines\n", |
764 | 764 | "- **Flexible annotation types** - supports classifications and text entities with timestamps\n", |
| 765 | + "- **Direct millisecond input** - precise timing control without conversion overhead\n", |
765 | 766 | "\n", |
766 | 767 | "### Next Steps:\n", |
767 | 768 | "1. Upload your temporal audio annotations using this notebook as a template\n", |
|