@@ -94,6 +94,7 @@ class BetaCudaDeviceInterface : public DeviceInterface {
9494 }
9595
9696 private:
97+ // Map of slotId to Slot
9798 std::unordered_map<int , Slot> map_;
9899 };
99100
@@ -125,3 +126,74 @@ class BetaCudaDeviceInterface : public DeviceInterface {
125126};
126127
127128} // namespace facebook::torchcodec
129+
130+ // Note: [sendPacket, receiveFrame, frame ordering and NVCUVID callbacks]
131+ //
132+ // At a high level, this decoding interface mimics the FFmpeg send/receive
133+ // architecture:
134+ // - sendPacket(AVPacket) sends an AVPacket from the FFmpeg demuxer to the
135+ // NVCUVID parser.
136+ // - receiveFrame(AVFrame) is a non-blocking call:
137+ // - if a frame is ready **in display order**, it must return it. By display
138+ // order, we mean that receiveFrame() must return frames with increasing pts
139+ // values when called successively.
140+ // - if no frame is ready, it must return AVERROR(EAGAIN) to indicate the
141+ // caller should send more packets.
142+ //
143+ // The rest of this note assumes you have a reasonable level of familiarity with
144+ // the sendPacket/receiveFrame calling pattern. If you don't, look up the core
145+ // decoding loop in SingleVideoDecoder.
146+ //
147+ // The frame re-ordering problem:
148+ // Depending on the codec and on the encoding parameters, a packet from a video
149+ // stream may contain exactly one frame, more than one frame, or a fraction of a
150+ // frame. And, there may be non-linear frame dependencies because of B-frames,
151+ // which need both past *and* future frames to be decoded. Consider the
152+ // following stream, with frames presented in display order: I0 B1 P2 B3 P4 ...
153+ // - I0 is an I-frame (also called a key frame; it can be decoded independently)
154+ // - B1 is a B-frame (bi-directional) which needs both I0 and P2 to be decoded
155+ // - P2 is a P-frame (predicted frame) which only needs I0 to be decoded.
156+ //
157+ // Because B1 needs both I0 and P2 to be properly decoded, the decode order must
158+ // be: I0 P2 B1 P4 B3 ... which is different from the display order.
159+ //
160+ // We don't have to worry about the decode order: it's up to the parser to
161+ // figure that out. But we have to make sure that receiveFrame() returns frames
162+ // in display order.
163+ //
164+ // sendPacket(AVPacket)'s job is just to send the packet to the NVCUVID parser
165+ // by calling cuvidParseVideoData(packet). When cuvidParseVideoData(packet) is
166+ // called, it may trigger callbacks, particularly:
167+ // - frameReadyForDecoding(picParams): triggered **in decode order** when the
168+ // parser has accumulated enough data to decode a frame. We send that frame to
169+ // the NVDEC hardware for **async** decoding. While that frame is being
170+ // decoded, we store a light reference (a Slot) to that frame in the
171+ // frameBuffer_, and mark that slot as BEING_DECODED. The value that uniquely
172+ // identifies that frame in the frameBuffer_ is its "slotId", which is given
173+ // to us by NVCUVID in the callback parameter: picParams->CurrPicIdx.
174+ // - frameReadyInDisplayOrder(dispInfo): triggered **in display order** when a
175+ // frame is ready to be "displayed" (returned). When it is triggered, we look
176+ // up the corresponding frame/slot in the frameBuffer_, using
177+ // dispInfo->picture_index to match it against a given BEING_DECODED slotId.
178+ // We mark that frame/slot as READY_FOR_OUTPUT.
179+ // Crucially, this callback also tells us the pts of that frame. We store
180+ // the pts and other relevant info in the slot.
181+ //
182+ // Said differently, from the perspective of the frameBuffer_, at any point in
183+ // time a slot/frame in the frameBuffer_ can be in one of 3 states:
184+ // - empty: no slot for that slotId exists in the frameBuffer_
185+ // - BEING_DECODED: frameReadyForDecoding was triggered for that frame, and the
186+ // frame was sent to NVDEC for async decoding. We don't know its pts because
187+ // the parser didn't trigger frameReadyInDisplayOrder() for that frame yet.
188+ // - READY_FOR_OUTPUT: frameReadyInDisplayOrder was triggered for that frame, it
189+ // is decoded and ready to be mapped and returned. We know its pts.
190+ //
191+ // Because frameReadyInDisplayOrder is triggered in display order, we know that
192+ // if a slot is READY_FOR_OUTPUT, then all frames with a lower pts are also
193+ // READY_FOR_OUTPUT, or already returned. So when receiveFrame() is called, we
194+ // just need to look for the READY_FOR_OUTPUT slot with the lowest pts, and
195+ // return that frame. This guarantees that receiveFrame() returns frames in
196+ // display order. If no slot is READY_FOR_OUTPUT, then we return EAGAIN to
197+ // indicate the caller should send more packets.
198+ //
199+ // Simple, innit?
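
The slot life cycle described in the note can be sketched in isolation. The following is a minimal sketch, not the code from this commit: `FrameBufferSketch`, `onReadyForDecoding`, `onReadyInDisplayOrder` and the `std::optional` return value are hypothetical stand-ins for the real Slot map, the two NVCUVID callbacks and the AVERROR(EAGAIN) return path. It assumes that a slot is erased once its frame has been returned and that a display callback for an unknown slotId is ignored.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <unordered_map>

// Hypothetical, simplified slot: the real Slot also references the frame.
enum class SlotState { BEING_DECODED, READY_FOR_OUTPUT };

struct Slot {
  SlotState state = SlotState::BEING_DECODED;
  int64_t pts = -1; // unknown until the display-order callback fires
};

class FrameBufferSketch {
 public:
  // Stand-in for frameReadyForDecoding: called in decode order.
  // The slot exists from now on, but its pts is not known yet.
  void onReadyForDecoding(int slotId) {
    slots_[slotId] = Slot{};
  }

  // Stand-in for frameReadyInDisplayOrder: called in display order.
  // It marks the slot as ready and records its pts.
  void onReadyInDisplayOrder(int slotId, int64_t pts) {
    auto it = slots_.find(slotId);
    if (it == slots_.end()) {
      return; // assumption: silently ignore unknown slots in this sketch
    }
    it->second.state = SlotState::READY_FOR_OUTPUT;
    it->second.pts = pts;
  }

  // Stand-in for receiveFrame: returns the pts of the READY_FOR_OUTPUT slot
  // with the lowest pts, or std::nullopt where the real code returns EAGAIN.
  std::optional<int64_t> receiveFrame() {
    auto best = slots_.end();
    for (auto it = slots_.begin(); it != slots_.end(); ++it) {
      if (it->second.state != SlotState::READY_FOR_OUTPUT) {
        continue;
      }
      if (best == slots_.end() || it->second.pts < best->second.pts) {
        best = it;
      }
    }
    if (best == slots_.end()) {
      return std::nullopt;
    }
    int64_t pts = best->second.pts;
    slots_.erase(best); // the slot goes back to the "empty" state
    return pts;
  }

 private:
  std::unordered_map<int, Slot> slots_;
};
```

With the stream from the note (decode order I0 P2 B1), calling `onReadyForDecoding` for three slots and then `onReadyInDisplayOrder` for I0 and B1 lets `receiveFrame()` return those two frames in increasing pts order, while P2 stays BEING_DECODED until its own display callback arrives.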