Commit 7ed3779

Add note about audio decoding design (#581)
1 parent 611421e commit 7ed3779

File tree

1 file changed: +56 -1 lines changed

src/torchcodec/decoders/_core/VideoDecoder.cpp

Lines changed: 56 additions & 1 deletion
@@ -889,6 +889,58 @@ VideoDecoder::FrameBatchOutput VideoDecoder::getFramesPlayedInRange(
   return frameBatchOutput;
 }
 
+// Note [Audio Decoding Design]
+// This note explains why audio decoding is implemented the way it is, and why
+// it inherently differs from video decoding.
+//
+// Like for video, FFmpeg exposes the concept of a frame for audio streams. An
+// audio frame is a contiguous sequence of samples, where a sample consists of
+// `numChannels` values. An audio frame, or a sequence thereof, is always
+// converted into a tensor of shape `(numChannels, numSamplesPerChannel)`.
+//
+// The notion of 'frame' in audio isn't what users want to interact with. Users
+// want to interact with samples. The C++ and core APIs return frames, because
+// we want those to be close to FFmpeg concepts, but the higher-level public
+// APIs expose samples. As a result:
+// - We don't expose index-based APIs for audio, because that would mean
+//   exposing the concept of audio frame. For now, we think exposing time-based
+//   APIs is more natural.
+// - We never perform a scan for audio streams. We don't need to, since we won't
+//   be converting timestamps to indices. That's why we enforce the seek_mode
+//   to be "approximate" (which is slightly misleading, because technically the
+//   output samples will be at their exact positions. But this incongruence is
+//   only exposed at the C++/core private levels).
+//
+// Audio frames are of variable dimensions: in the same stream, a frame can
+// contain 1024 samples and the next one may contain 512 [1]. This makes it
+// impossible to stack audio frames in the same way we can stack video frames.
+// This is one of the main reasons we cannot reuse the same pre-allocation logic
+// we have for videos in getFramesPlayedInRange(): pre-allocating a batch
+// requires constant (and known) frame dimensions. That's also why audio
+// frames are *concatenated* along the samples dimension, not stacked.
+//
+// [IMPORTANT!] There is one key invariant that we must respect when decoding
+// audio frames:
+//
+// BEFORE DECODING FRAME i, WE MUST DECODE ALL FRAMES j < i.
+//
+// Always. Why? We don't know. What we know is that if we don't, we get clipped,
+// incorrect audio as output [2]. All other (correct) libraries like TorchAudio
+// or Decord do something similar, whether it was intended or not. This has a
+// few implications:
+// - The **only** place we're allowed to seek to in an audio stream is the
+//   stream's beginning. This ensures that if we need a frame, we'll have
+//   decoded all previous frames.
+// - Because of that, we don't allow the public APIs to seek. Public APIs can
+//   call next() and `getFramesPlayedInRangeAudio()`, but they cannot manually
+//   seek.
+// - We try not to seek, when we can avoid it. Typically if the next frame we
+//   need is in the future, we don't seek back to the beginning, we just decode
+//   all the frames in-between.
+//
+// [2] If you're brave and curious, you can read the long "Seek offset for
+// audio" note in https://github.com/pytorch/torchcodec/pull/507/files, which
+// sums up past (and failed) attempts at working around this issue.
 VideoDecoder::AudioFramesOutput VideoDecoder::getFramesPlayedInRangeAudio(
     double startSeconds,
     std::optional<double> stopSecondsOptional) {
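
The note above says that every audio frame becomes a `(numChannels, numSamplesPerChannel)` tensor and that frames of varying length are concatenated along the samples dimension rather than stacked. Below is a minimal, self-contained sketch of that idea using libtorch; it is not the decoder's actual code, and `decodeNextAudioFrameAsTensor` is a hypothetical stand-in for the real decoding path.

#include <iostream>
#include <vector>

#include <torch/torch.h>

// Hypothetical stand-in for decoding one audio frame: the result always has
// shape (numChannels, numSamplesPerChannel), but numSamplesPerChannel can
// change from one frame to the next.
torch::Tensor decodeNextAudioFrameAsTensor(int64_t numChannels, int64_t numSamples) {
  return torch::randn({numChannels, numSamples});
}

int main() {
  int64_t numChannels = 2;
  std::vector<torch::Tensor> frames = {
      decodeNextAudioFrameAsTensor(numChannels, 1024),
      decodeNextAudioFrameAsTensor(numChannels, 512),
      decodeNextAudioFrameAsTensor(numChannels, 1024)};

  // torch::stack() would require identical shapes, so it cannot handle the
  // 1024-sample and 512-sample frames together. Concatenating along the
  // samples dimension (dim=1) always works:
  torch::Tensor samples = torch::cat(frames, /*dim=*/1);
  std::cout << samples.sizes() << std::endl;  // prints [2, 2560]
  return 0;
}

This is also why the pre-allocation strategy used for video batches does not carry over: without constant, known frame dimensions there is nothing to pre-allocate.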
@@ -915,7 +967,7 @@ VideoDecoder::AudioFramesOutput VideoDecoder::getFramesPlayedInRangeAudio(
           streamInfo.lastDecodedAvFrameDuration) {
     // If we need to seek backwards, then we have to seek back to the beginning
     // of the stream.
-    // TODO-AUDIO: document why this is needed in a big comment.
+    // See [Audio Decoding Design].
     setCursorPtsInSecondsInternal(INT64_MIN);
   }
 
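
The hunk above is where the rewind-to-the-beginning happens. As a rough illustration of that policy (hypothetical names, plain C++ standing in for the FFmpeg calls, not the decoder's actual implementation): a backwards range request forces a seek to the stream's beginning, while a forward request just keeps decoding frames in order.

#include <cstddef>
#include <iostream>
#include <utility>
#include <vector>

// One decoded audio frame, reduced to its timing information.
struct FakeFrame {
  double ptsSeconds;
  double durationSeconds;
};

class FakeAudioStream {
 public:
  explicit FakeAudioStream(std::vector<FakeFrame> frames)
      : frames_(std::move(frames)) {}

  // "Decode" every frame that overlaps [startSeconds, stopSeconds).
  void decodeRange(double startSeconds, double stopSeconds) {
    if (nextFrameIndex_ > 0) {
      const FakeFrame& last = frames_[nextFrameIndex_ - 1];
      if (startSeconds < last.ptsSeconds + last.durationSeconds) {
        // Backwards request: rewind to the very beginning, never to the
        // middle, so that all frames j < i are decoded before frame i.
        nextFrameIndex_ = 0;
      }
    }
    while (nextFrameIndex_ < frames_.size() &&
           frames_[nextFrameIndex_].ptsSeconds < stopSeconds) {
      const FakeFrame& frame = frames_[nextFrameIndex_++];
      // Decode the frame unconditionally; keep it only if it overlaps the
      // requested range.
      if (frame.ptsSeconds + frame.durationSeconds > startSeconds) {
        std::cout << "keeping frame starting at " << frame.ptsSeconds << "s\n";
      }
    }
  }

 private:
  std::vector<FakeFrame> frames_;
  size_t nextFrameIndex_ = 0;
};

int main() {
  FakeAudioStream stream(
      {{0.00, 0.02}, {0.02, 0.02}, {0.04, 0.02}, {0.06, 0.02}});
  stream.decodeRange(0.03, 0.07);  // forward: no rewind needed
  stream.decodeRange(0.00, 0.03);  // backwards: rewinds to the beginning first
  return 0;
}

The real decoder works in terms of FFmpeg packets and pts/duration in the stream's time base, but the shape of the control flow is similar: the only seek target for audio is the beginning of the stream.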
@@ -964,6 +1016,8 @@ VideoDecoder::AudioFramesOutput VideoDecoder::getFramesPlayedInRangeAudio(
 // --------------------------------------------------------------------------
 
 void VideoDecoder::setCursorPtsInSeconds(double seconds) {
+  // We don't allow public audio decoding APIs to seek, see [Audio Decoding
+  // Design]
   validateActiveStream(AVMEDIA_TYPE_VIDEO);
   setCursorPtsInSecondsInternal(seconds);
 }
@@ -1004,6 +1058,7 @@ bool VideoDecoder::canWeAvoidSeeking() const {
   if (streamInfo.avMediaType == AVMEDIA_TYPE_AUDIO) {
     // For audio, we only need to seek if a backwards seek was requested within
     // getFramesPlayedInRangeAudio(), when setCursorPtsInSeconds() was called.
+    // For more context, see [Audio Decoding Design]
     return !cursorWasJustSet_;
   }
   int64_t lastDecodedAvFramePts =
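
Tying the last two hunks together, here is a tiny sketch of how a "cursor was just set" flag can gate seeking for audio (hypothetical class and member names that only mirror, and do not reproduce, the decoder's fields): a seek happens only right after the cursor was explicitly moved, which for audio only ever happens when getFramesPlayedInRangeAudio() rewinds to the beginning.

#include <iostream>

class CursorGate {
 public:
  // Called when the decoding path explicitly moves the cursor, e.g. when the
  // audio path rewinds to the beginning of the stream.
  void setCursor(double seconds) {
    cursorSeconds_ = seconds;
    cursorWasJustSet_ = true;
  }

  // Mirrors the audio branch in the diff above: we can skip seeking unless
  // the cursor was just moved.
  bool canWeAvoidSeeking() const {
    return !cursorWasJustSet_;
  }

  void maybeSeek() {
    if (!canWeAvoidSeeking()) {
      std::cout << "seeking to " << cursorSeconds_ << "s\n";
      cursorWasJustSet_ = false;
    }
  }

 private:
  double cursorSeconds_ = 0.0;
  bool cursorWasJustSet_ = false;
};

int main() {
  CursorGate gate;
  gate.maybeSeek();     // no seek: keep decoding forward
  gate.setCursor(0.0);  // the audio path rewound to the beginning
  gate.maybeSeek();     // now a real seek is performed
  return 0;
}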
