feat: add threaded I/O pipeline for video processing #1997
Implements a pipeline with bounded queues to overlap decode, compute, and encode, reducing I/O stalls.
Description
Process a video using a threaded pipeline that asynchronously
reads frames, applies a callback to each, and writes the results
to an output file.
This function implements a three-stage pipeline designed to maximize
frame throughput.
Reader thread: reads frames from disk into a bounded queue ('read_q')
until full, then blocks. This ensures we never load more than 'prefetch'
frames into memory at once.
Main thread: dequeues frames, applies the 'callback(frame, idx)',
and enqueues the processed result into 'write_q'.
This is the compute stage. It runs only in the main thread (not in a worker thread),
so you can safely use any detectors, trackers, or other stateful objects
without synchronization issues.
Writer thread: dequeues frames and writes them to disk.
Both queues are bounded to enforce back-pressure: if the reader outpaces the compute stage, 'read_q' fills and the reader blocks; if the writer falls behind, 'write_q' fills and the main thread blocks. Memory use therefore stays bounded by the queue sizes.
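A minimal sketch of the structure described above (illustrative only: the sentinel handling, queue sizes, and OpenCV reader/writer setup here are assumptions, not the exact code in this PR):

```python
import queue
import threading

import cv2


def threaded_pipeline_sketch(source_path, target_path, callback, prefetch=8):
    # Illustrative three-stage pipeline: reader thread -> main thread -> writer thread.
    cap = cv2.VideoCapture(source_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    size = (
        int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
        int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)),
    )
    out = cv2.VideoWriter(target_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, size)

    read_q = queue.Queue(maxsize=prefetch)   # reader blocks when full (back-pressure)
    write_q = queue.Queue(maxsize=prefetch)  # main thread blocks when full
    SENTINEL = object()                      # marks end of stream

    def reader():
        # Decode frames from disk; put() blocks once read_q holds 'prefetch' frames.
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            read_q.put(frame)
        read_q.put(SENTINEL)

    def writer():
        # Encode processed frames to disk as they arrive.
        while True:
            frame = write_q.get()
            if frame is SENTINEL:
                break
            out.write(frame)

    t_read = threading.Thread(target=reader, daemon=True)
    t_write = threading.Thread(target=writer, daemon=True)
    t_read.start()
    t_write.start()

    # Compute stage: runs only in the main thread, so stateful callbacks are safe.
    idx = 0
    while True:
        frame = read_q.get()
        if frame is SENTINEL:
            break
        write_q.put(callback(frame, idx))
        idx += 1

    write_q.put(SENTINEL)
    t_read.join()
    t_write.join()
    cap.release()
    out.release()
```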
Summary:
It's thread-safe: because the callback runs only in the main thread,
using a single stateful detector/tracker inside the callback does not require
synchronization with the reader/writer threads.
While the main thread processes frame N, the reader is already decoding frame N+1,
and the writer is encoding frame N-1. The three stages run concurrently, so disk I/O
overlaps with computation instead of stalling it.
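For example, a single stateful object can be used inside the callback with no locking at all; the toy "tracker" below (a stand-in for a real detector/tracker, with illustrative file names) just keeps a running count across frames:

```python
import cv2


class FrameStamper:
    """Toy stateful 'tracker': keeps a running count across frames."""

    def __init__(self):
        self.seen = 0

    def __call__(self, frame, idx):
        # Mutates internal state; safe because this only runs in the main thread.
        self.seen += 1
        cv2.putText(frame, f"frames seen: {self.seen}", (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
        return frame


threaded_pipeline_sketch("input.mp4", "output.mp4", FrameStamper(), prefetch=8)
```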
How has this change been tested? Please provide a testcase or example of how you tested the change.
I created a benchmark script to measure the performance impact of these changes
(benchmark_process_video.py; full results in full_results.txt).
I benchmarked 3 callbacks: opencv (short), opencv (long), and tracker, comparing the current process_video against the new process_video_threads.
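The rough shape of the comparison is sketched below (the real harness is benchmark_process_video.py attached to this PR; the timing loop, argument names, and callback are assumptions):

```python
import time


def bench(fn, label, repeats=5):
    # Time `fn` `repeats` times and print each run in wall-clock seconds.
    for run in range(1, repeats + 1):
        start = time.perf_counter()
        fn()
        print(f"{label} run {run}: {time.perf_counter() - start:.2f}s")


# Comparing the current and new implementations (imports of process_video /
# process_video_threads from the library under test omitted; `callback` is any
# of the three benchmarked callbacks; argument names/order assumed):
# bench(lambda: process_video("input.mp4", "out_seq.mp4", callback), "process_video")
# bench(lambda: process_video_threads("input.mp4", "out_thr.mp4", callback), "process_video_threads")
```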
Results below, 5 executions for each case:
Initially, I explored using threads and processes to parallelize process_video itself (I can push some of those prototypes if needed), but that approach was not thread-safe for stateful callbacks (e.g., trackers) and showed little improvement in profiling: most of the total time was spent on disk I/O rather than computation.
This change instead focuses on improving the I/O path, which yields a more generic and safer performance gain.