Conversation

@AnonymDevOSS

Implements a pipeline with bounded queues to overlap decode, compute, and encode. Reduces I/O stalls.

Description

Process a video using a threaded pipeline that asynchronously
reads frames, applies a callback to each, and writes the results
to an output file.

This function implements a three-stage pipeline designed to maximize
frame throughput.

  • Reader thread: reads frames from disk into a bounded queue ('read_q')
    until full, then blocks. This ensures we never load more than 'prefetch'
    frames into memory at once.

  • Main thread: dequeues frames, applies 'callback(frame, idx)',
    and enqueues the processed result into 'write_q'.
    This is the compute stage. Note that it is not threaded,
    so you can safely use any detectors, trackers, or other stateful objects
    without synchronization issues.

  • Writer thread: dequeues frames and writes them to disk.

Both queues are bounded to enforce back-pressure:

  • The reader cannot outpace processing
  • The processor cannot outpace writing

Summary:

  • It's thread-safe: because the callback runs only in the main thread,
    using a single stateful detector/tracker inside callback does not require
    synchronization with the reader/writer threads.

  • While the main thread processes frame N, the reader is already decoding frame N+1,
    and the writer is encoding frame N-1. They operate concurrently without blocking
    each other.
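The three stages above can be sketched in plain Python. This is a minimal illustration, not the PR's actual implementation: the frame source is a generic iterable and the writer appends to a list instead of using OpenCV's VideoCapture/VideoWriter, so the sketch runs standalone; the names `process_video_threads` and `prefetch` follow the PR, everything else is a stand-in.

```python
import queue
import threading

_SENTINEL = object()  # signals end-of-stream between stages


def process_video_threads(frames, callback, sink, prefetch=8):
    # Both queues are bounded, so each stage blocks when the next one lags:
    # the reader cannot outpace processing, processing cannot outpace writing.
    read_q = queue.Queue(maxsize=prefetch)   # reader -> main thread
    write_q = queue.Queue(maxsize=prefetch)  # main thread -> writer

    def reader():
        for idx, frame in enumerate(frames):
            read_q.put((idx, frame))  # blocks once 'prefetch' frames are queued
        read_q.put(_SENTINEL)

    def writer():
        while True:
            item = write_q.get()
            if item is _SENTINEL:
                break
            sink.append(item)  # stand-in for VideoWriter.write

    r = threading.Thread(target=reader, daemon=True)
    w = threading.Thread(target=writer, daemon=True)
    r.start()
    w.start()

    # Compute stage runs here, in the caller's thread, so a stateful
    # callback (e.g. a tracker) needs no locking.
    while True:
        item = read_q.get()
        if item is _SENTINEL:
            break
        idx, frame = item
        write_q.put(callback(frame, idx))

    write_q.put(_SENTINEL)
    r.join()
    w.join()
```

Because only the sentinel and the queues coordinate the threads, frame order is preserved end to end and no explicit locks are needed.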

Type of change


  • [x] New feature (non-breaking change which adds functionality)
  • [x] This change requires a documentation update

How has this change been tested? Please provide a testcase or example of how you tested the change.

I created a benchmark script to measure the performance impact of these changes (benchmark_process_video.py;
full_results.txt).

I benchmarked three scenarios: OpenCV (short), OpenCV (long), and Tracker, comparing the current process_video against the new process_video_threads.

Results below, 5 executions for each case:

OpenCV (short)
process_video_threads       avg=  2.338s  stdev= 0.200s  min= 2.137s  max= 2.620s
process_video               avg=  3.560s  stdev= 0.404s  min= 3.219s  max= 4.249s

OpenCV (long)
process_video_threads       avg= 18.449s  stdev= 1.426s  min=17.067s  max=20.863s
process_video               avg= 28.067s  stdev= 1.345s  min=26.261s  max=29.373s

Tracker
process_video_threads       avg= 21.481s  stdev= 0.593s  min=20.825s  max=22.205s
process_video               avg= 24.929s  stdev= 0.368s  min=24.464s  max=25.307s

Initially, I explored using threads and processes to parallelize the compute stage of process_video (I can push some of those prototypes if needed), but that design wasn't thread-safe for stateful callbacks (e.g. trackers) and showed little improvement in profiling: most of the total time was spent on disk I/O rather than computation.

This optimization instead focuses on improving the I/O path, yielding a more generic and safe performance gain.
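The benchmark script itself is not shown in this PR text, so as a hedged illustration, here is a minimal sketch of how avg/stdev/min/max figures like those reported above could be gathered; the `bench` helper is hypothetical, not taken from benchmark_process_video.py.

```python
import statistics
import time


def bench(fn, runs=5):
    """Run fn() several times and report wall-clock stats in seconds."""
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return {
        "avg": statistics.mean(times),
        "stdev": statistics.stdev(times),  # needs runs >= 2
        "min": min(times),
        "max": max(times),
    }
```

Each reported row (e.g. "avg= 2.338s stdev= 0.200s ...") corresponds to one such 5-run measurement of either process_video or process_video_threads on a fixed input video.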

@CLAassistant

CLAassistant commented Oct 26, 2025

CLA assistant check
All committers have signed the CLA.

@AnonymDevOSS force-pushed the feat/threads-queues-video-process branch from 4c2a8d0 to 90d47de on October 28, 2025 12:56
@AnonymDevOSS force-pushed the feat/threads-queues-video-process branch from 90d47de to 03f6239 on October 28, 2025 13:06
@Ashp116
Contributor

Ashp116 commented Nov 5, 2025

Hey @AnonymDevOSS,

This PR is pretty good. I think there are plans to create a new Video API (#1924). What are your thoughts on adding threading for the new Video API?

@AnonymDevOSS
Author

Sure, I'll take a look at it this week.
