@@ -889,6 +889,58 @@ VideoDecoder::FrameBatchOutput VideoDecoder::getFramesPlayedInRange(
   return frameBatchOutput;
 }
 
+// Note [Audio Decoding Design]
+// This note explains why audio decoding is implemented the way it is, and why
+// it inherently differs from video decoding.
+//
+// Like for video, FFmpeg exposes the concept of a frame for audio streams. An
+// audio frame is a contiguous sequence of samples, where a sample consists of
+// `numChannels` values. An audio frame, or a sequence thereof, is always
+// converted into a tensor of shape `(numChannels, numSamplesPerChannel)`.
+//
+// The notion of 'frame' in audio isn't what users want to interact with. Users
+// want to interact with samples. The C++ and core APIs return frames, because
+// we want those to be close to FFmpeg concepts, but the higher-level public
+// APIs expose samples. As a result:
+// - We don't expose index-based APIs for audio, because that would mean
+//   exposing the concept of audio frame. For now, we think exposing time-based
+//   APIs is more natural.
+// - We never perform a scan for audio streams. We don't need to, since we won't
+//   be converting timestamps to indices. That's why we enforce the seek_mode
+//   to be "approximate" (which is slightly misleading, because technically the
+//   output samples will be at their exact positions. But this incongruence is
+//   only exposed at the C++/core private levels).
+//
+// Audio frames are of variable dimensions: in the same stream, a frame can
+// contain 1024 samples and the next one may contain 512 [1]. This makes it
+// impossible to stack audio frames in the same way we can stack video frames.
+// This is one of the main reasons we cannot reuse the same pre-allocation logic
+// we have for videos in getFramesPlayedInRange(): pre-allocating a batch
+// requires constant (and known) frame dimensions. That's also why audio frames
+// are *concatenated* along the samples dimension, not stacked.
+//
+// [IMPORTANT!] There is one key invariant that we must respect when decoding
+// audio frames:
+//
+// BEFORE DECODING FRAME i, WE MUST DECODE ALL FRAMES j < i.
+//
+// Always. Why? We don't know. What we know is that if we don't, we get clipped,
+// incorrect audio as output [2]. All other (correct) libraries like TorchAudio
+// or Decord do something similar, whether it was intended or not. This has a
+// few implications:
+// - The **only** place we're allowed to seek to in an audio stream is the
+//   stream's beginning. This ensures that if we need a frame, we'll have
+//   decoded all previous frames.
+// - Because of that, we don't allow the public APIs to seek. Public APIs can
+//   call next() and `getFramesPlayedInRangeAudio()`, but they cannot manually
+//   seek.
+// - We try not to seek when we can avoid it. Typically, if the next frame we
+//   need is in the future, we don't seek back to the beginning; we just decode
+//   all the frames in-between.
+//
+// [2] If you're brave and curious, you can read the long "Seek offset for
+// audio" note in https://github.com/pytorch/torchcodec/pull/507/files, which
+// sums up past (and failed) attempts at working around this issue.
 VideoDecoder::AudioFramesOutput VideoDecoder::getFramesPlayedInRangeAudio(
     double startSeconds,
     std::optional<double> stopSecondsOptional) {
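The variable frame sizes described in the note above are the reason audio frames must be concatenated rather than stacked. A minimal sketch (hypothetical frame sizes; NumPy standing in for torch, since `np.stack`/`np.concatenate` behave like `torch.stack`/`torch.cat` here):

```python
import numpy as np

# Two hypothetical decoded audio frames from the same stereo stream:
# same channel count, but different numbers of samples per channel.
frame_a = np.zeros((2, 1024))  # (numChannels, numSamplesPerChannel)
frame_b = np.zeros((2, 512))

# Stacking requires identical shapes, so it fails for audio frames:
try:
    np.stack([frame_a, frame_b])
except ValueError:
    stack_failed = True  # shapes (2, 1024) and (2, 512) cannot be stacked

# Concatenating along the samples dimension (axis 1) always works,
# which is why the audio path returns one concatenated tensor:
samples = np.concatenate([frame_a, frame_b], axis=1)
print(samples.shape)  # (2, 1536)
```

This also illustrates why the video-style batch pre-allocation cannot be reused: the output width is only known once every frame has been decoded.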
@@ -915,7 +967,7 @@ VideoDecoder::AudioFramesOutput VideoDecoder::getFramesPlayedInRangeAudio(
          streamInfo.lastDecodedAvFrameDuration) {
     // If we need to seek backwards, then we have to seek back to the beginning
     // of the stream.
-    // TODO-AUDIO: document why this is needed in a big comment.
+    // See [Audio Decoding Design].
     setCursorPtsInSecondsInternal(INT64_MIN);
   }
 
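The seek policy enforced by this hunk, together with the "decode all frames j < i" invariant from the design note, can be sketched as a standalone toy (hypothetical function and frame indices; the real logic is spread across `setCursorPtsInSecondsInternal()` and `canWeAvoidSeeking()`):

```python
def frames_to_decode(last_decoded_index, target_index):
    """Toy model of the audio seek policy: producing frame i requires
    having decoded every frame j < i, in order. If the target is ahead
    of the cursor we simply decode the frames in between; if it is
    behind, the only allowed seek destination is the stream's
    beginning, so everything up to the target is decoded again."""
    if target_index > last_decoded_index:
        # Forward: no seek, just decode the in-between frames.
        return list(range(last_decoded_index + 1, target_index + 1))
    # Backward: rewind to the start, then decode up to the target.
    return list(range(0, target_index + 1))

print(frames_to_decode(4, 7))  # forward: [5, 6, 7]
print(frames_to_decode(7, 2))  # backward: [0, 1, 2]
```

The backward branch is why the cost of a backwards request grows with the target's position in the stream, and why the code above prefers the forward branch whenever it can.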
@@ -964,6 +1016,8 @@ VideoDecoder::AudioFramesOutput VideoDecoder::getFramesPlayedInRangeAudio(
 // --------------------------------------------------------------------------
 
 void VideoDecoder::setCursorPtsInSeconds(double seconds) {
+  // We don't allow public audio decoding APIs to seek, see [Audio Decoding
+  // Design].
   validateActiveStream(AVMEDIA_TYPE_VIDEO);
   setCursorPtsInSecondsInternal(seconds);
 }
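The guard added in this hunk makes public seeking a video-only operation. A small sketch of the same contract (hypothetical class and method names; in the real code the check is `validateActiveStream(AVMEDIA_TYPE_VIDEO)`):

```python
class ToyDecoder:
    """Minimal sketch of the public seek guard: seeking is rejected on
    audio streams, which must instead be consumed sequentially via the
    next() / getFramesPlayedInRangeAudio()-style APIs."""

    def __init__(self, media_type):
        self.media_type = media_type  # "video" or "audio"
        self.cursor_pts = 0.0

    def set_cursor_pts_in_seconds(self, seconds):
        if self.media_type != "video":
            raise ValueError("seeking is only supported on video streams")
        self.cursor_pts = seconds

ToyDecoder("video").set_cursor_pts_in_seconds(3.5)   # ok
try:
    ToyDecoder("audio").set_cursor_pts_in_seconds(3.5)
except ValueError:
    pass  # audio seek rejected, as the public API contract requires
```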
@@ -1004,6 +1058,7 @@ bool VideoDecoder::canWeAvoidSeeking() const {
   if (streamInfo.avMediaType == AVMEDIA_TYPE_AUDIO) {
     // For audio, we only need to seek if a backwards seek was requested within
     // getFramesPlayedInRangeAudio(), when setCursorPtsInSeconds() was called.
+    // For more context, see [Audio Decoding Design].
     return !cursorWasJustSet_;
   }
   int64_t lastDecodedAvFramePts =