NicolasHug · NicolasHug · commit 5d194e53f9db · 2025-10-01T18:07:24.000+01:00
diff --git a/src/torchcodec/_core/BetaCudaDeviceInterface.cpp b/src/torchcodec/_core/BetaCudaDeviceInterface.cpp
@@ -533,6 +533,11 @@ void BetaCudaDeviceInterface::FrameBuffer::markSlotReadyAndSetInfo(
       slotId,
       ". This should never happen.");
 
+  TORCH_CHECK(
+      it->second.state == SlotState::BEING_DECODED,
+      "Slot ",
+      slotId,
+      " is not in BEING_DECODED state. This should never happen.");
   it->second.state = SlotState::READY_FOR_OUTPUT;
   it->second.dispInfo = *dispInfo;
 }
diff --git a/src/torchcodec/_core/BetaCudaDeviceInterface.h b/src/torchcodec/_core/BetaCudaDeviceInterface.h
@@ -94,6 +94,7 @@ class BetaCudaDeviceInterface : public DeviceInterface {
     }
 
    private:
+    // Map of slotId to Slot
     std::unordered_map<int, Slot> map_;
   };
 
@@ -125,3 +126,74 @@ class BetaCudaDeviceInterface : public DeviceInterface {
 };
 
 } // namespace facebook::torchcodec
+
+// Note: [sendPacket, receiveFrame, frame ordering and NVCUVID callbacks]
+//
+// At a high level, this decoding interface mimics the FFmpeg send/receive
+// architecture:
+// - sendPacket(AVPacket) sends an AVPacket from the FFmpeg demuxer to the
+//   NVCUVID parser.
+// - receiveFrame(AVFrame) is a non-blocking call:
+//   - if a frame is ready **in display order**, it must return it. By display
+//   order, we mean that receiveFrame() must return frames with increasing pts
+//   values when called successively.
+//   - if no frame is ready, it must return AVERROR(EAGAIN) to indicate the
+//   caller should send more packets.
+//
+// The rest of this note assumes you have a reasonable level of familiarity with
+// the sendPacket/receiveFrame calling pattern. If you don't, look up the core
+// decoding loop in SingleVideoDecoder.
+//
+// The frame re-ordering problem:
+// Depending on the codec and on the encoding parameters, a packet from a video
+// stream may contain exactly one frame, more than one frame, or a fraction of a
+// frame. There may also be non-linear frame dependencies because of B-frames,
+// which need both past *and* future frames to be decoded. Consider the
+// following stream, with frames presented in display order: I0 B1 P2 B3 P4 ...
+// - I0 is an I-frame (also key frame, can be decoded independently)
+// - B1 is a B-frame (bi-directional) which needs both I0 and P2 to be decoded
+// - P2 is a P-frame (predicted frame) which only needs I0 to be decoded.
+//
+// Because B1 needs both I0 and P2 to be properly decoded, the decode order must
+// be: I0 P2 B1 P4 B3 ... which is different from the display order.
+//
+// We don't have to worry about the decode order: it's up to the parser to
+// figure that out. But we have to make sure that receiveFrame() returns frames
+// in display order.
+//
+// sendPacket(AVPacket)'s job is just to send the packet to the NVCUVID parser
+// by calling cuvidParseVideoData(packet). When cuvidParseVideoData(packet) is
+// called, it may trigger callbacks, particularly:
+// - frameReadyForDecoding(picParams): triggered **in decode order** when the
+//   parser has accumulated enough data to decode a frame. We send that frame to
+//   the NVDEC hardware for **async** decoding. While that frame is being
+//   decoded, we store a light reference (a Slot) to that frame in the
+//   frameBuffer_, and mark that slot as BEING_DECODED. The value that uniquely
+//   identifies that frame in the frameBuffer_ is its "slotId", which is given
+//   to us by NVCUVID in the callback parameter: picParams->CurrPicIdx.
+// - frameReadyInDisplayOrder(dispInfo): triggered **in display order** when a
+//   frame is ready to be "displayed" (returned). When it is triggered, we look
+//   up the corresponding frame/slot in the frameBuffer_, using
+//   dispInfo->picture_index to match it against a given BEING_DECODED slotId.
+//   We mark that frame/slot as READY_FOR_OUTPUT.
+//   Crucially, this callback also tells us the pts of that frame. We store
+//   the pts and other relevant info in the slot.
+//
+// Said differently, from the perspective of the frameBuffer_, at any point in
+// time a slot/frame in the frameBuffer_ can be in one of 3 states:
+// - empty: no slot for that slotId exists in the frameBuffer_
+// - BEING_DECODED: frameReadyForDecoding was triggered for that frame, and the
+//   frame was sent to NVDEC for async decoding. We don't know its pts because
+//   the parser hasn't triggered frameReadyInDisplayOrder() for that frame yet.
+// - READY_FOR_OUTPUT: frameReadyInDisplayOrder was triggered for that frame, it
+//   is decoded and ready to be mapped and returned. We know its pts.
+//
+// Because frameReadyInDisplayOrder is triggered in display order, we know that
+// if a slot is READY_FOR_OUTPUT, then all frames with a lower pts are also
+// READY_FOR_OUTPUT, or already returned. So when receiveFrame() is called, we
+// just need to look for the READY_FOR_OUTPUT slot with the lowest pts, and
+// return that frame. This guarantees that receiveFrame() returns frames in
+// display order. If no slot is READY_FOR_OUTPUT, then we return EAGAIN to
+// indicate the caller should send more packets.
+//
+// Simple, innit?
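
The slot lifecycle described in the note can be illustrated with a small standalone sketch. This is not the code from the commit: `FrameBufferSketch`, `markAsBeingDecoded`, `markSlotReady`, and the `std::optional` return value (standing in for `AVERROR(EAGAIN)`) are hypothetical names chosen for illustration, and the real `Slot` stores NVCUVID display info rather than a bare pts. It only shows the three states and the "lowest pts among READY_FOR_OUTPUT slots" rule, assuming C++17.

```cpp
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical, simplified stand-ins for the types in the diff above.
enum class SlotState { BEING_DECODED, READY_FOR_OUTPUT };

struct Slot {
  SlotState state = SlotState::BEING_DECODED;
  int64_t pts = -1; // unknown until the display-order callback fires
};

class FrameBufferSketch {
 public:
  // Would be called from the "ready for decoding" callback (decode order).
  void markAsBeingDecoded(int slotId) {
    map_[slotId] = Slot{};
  }

  // Would be called from the "ready in display order" callback.
  // Mirrors the new check in the diff: the slot must be BEING_DECODED.
  void markSlotReady(int slotId, int64_t pts) {
    auto it = map_.find(slotId);
    if (it == map_.end() || it->second.state != SlotState::BEING_DECODED) {
      throw std::logic_error(
          "Slot " + std::to_string(slotId) + " is not in BEING_DECODED state.");
    }
    it->second.state = SlotState::READY_FOR_OUTPUT;
    it->second.pts = pts;
  }

  // Returns the pts of the READY_FOR_OUTPUT slot with the lowest pts and
  // frees that slot. std::nullopt plays the role of EAGAIN.
  std::optional<int64_t> receiveFrame() {
    auto best = map_.end();
    for (auto it = map_.begin(); it != map_.end(); ++it) {
      if (it->second.state != SlotState::READY_FOR_OUTPUT) {
        continue;
      }
      if (best == map_.end() || it->second.pts < best->second.pts) {
        best = it;
      }
    }
    if (best == map_.end()) {
      return std::nullopt;
    }
    int64_t pts = best->second.pts;
    map_.erase(best); // the slot goes back to the "empty" state
    return pts;
  }

 private:
  // Map of slotId to Slot
  std::unordered_map<int, Slot> map_;
};
```

With the I0 B1 P2 stream from the note, the slots are created in decode order (I0, P2, B1) but marked ready in display order (I0, B1, P2), so successive `receiveFrame()` calls in this sketch hand back increasing pts values and return `std::nullopt` whenever nothing is ready yet.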