Conversation

@scotts
Contributor

@scotts scotts commented Oct 30, 2025

I forked this from #1003, which is about the public API for decoder-native transforms. When working on that I realized we didn't have proper resize transform tests, and when working on those tests, I realized we needed to change transform behavior. I want to keep #1003 about the public API, so this PR is about those tests and the behavior change.

Because I've flip-flopped on this point, I want to make the status explicit: all transforms are applied after the conversion to the output pixel format and colorspace. In other words, all transforms happen in the output pixel format and colorspace (which is currently only RGB24). We want the output from TorchCodec's decoder-native transforms to be as close as possible to passing untransformed frames from TorchCodec to TorchVision transforms.
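
To make this concrete, here is a minimal sketch of the invariant using the core ops exercised in the tests below. The import path, asset name, and dimensions are illustrative assumptions, not the repo's exact layout:

# Sketch only: the import path, file name, and sizes are assumptions.
from torchcodec._core import add_video_stream, create_from_file, get_frame_at_index
from torchvision.transforms import v2

decoder_native = create_from_file("video.mp4")
add_video_stream(decoder_native, transform_specs="resize, 135, 240")
frame_native, *_ = get_frame_at_index(decoder_native, frame_index=0)

decoder_full = create_from_file("video.mp4")
add_video_stream(decoder_full)
frame_full, *_ = get_frame_at_index(decoder_full, frame_index=0)
frame_tv = v2.functional.resize(frame_full, size=(135, 240))

# frame_native and frame_tv should be very close, though not bit-identical.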

This PR does three things:

  1. Adds proper resize tests which compare TorchCodec using the decoder-native resize against:
     a. Checked-in frames generated by the ffmpeg CLI with an explicit scale filter (see the sketch after this list). We expect bit-for-bit equality on CPU and Linux.
     b. Passing a full, untransformed frame from TorchCodec to torchvision.transforms.v2.functional.resize(). We expect some difference here, but it's small: 99% of all values are within 1.
  2. Removes all special handling for resize. On main, if there is a single transform and it's a resize, we call sws_scale() directly instead of going through filtergraph. We remove that special handling here. When there are no transforms, we can still use sws_scale() for our format conversion. The reason we remove this special handling is that when sws_scale() is called once to do both a resize and a colorspace conversion, it does the colorspace conversion first. That does not match what we do when calling filtergraph. To get the same behavior as with filtergraph, we would have to call sws_scale() twice: once to do the resize, and again to do the colorspace conversion. We will need to benchmark whether doing so is faster than calling filtergraph. I have created issue Resize transform optimization: use swscale when appropriate #1018 to track that.
  3. Removes scaling from all of the existing color conversion library tests, as the swscale path no longer supports it.
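
For item 1a, the reference frames come from invoking the ffmpeg CLI with an explicit scale filter. A hypothetical sketch of such an invocation; the paths, dimensions, and interpolation flag are placeholders, not the exact command used:

import subprocess

# Sketch only: input/output paths and scale parameters are placeholders.
subprocess.run(
    ["ffmpeg", "-i", "input.mp4",
     "-vf", "scale=240:135:flags=bilinear",
     "-frames:v", "1", "resized.bmp"],
    check=True,
)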

@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) Oct 30, 2025
// we could achieve that by calling sws_scale() twice: once to do the resize
// and another time to do the format conversion. But that will be slower,
// which goes against the whole point of calling sws_scale() directly.
std::string filters_ = "format=rgb24";
Contributor Author

@NicolasHug, do you think I should pull this explanation out into a named section, and then refer to it in a few different places?

Contributor

I think it's fine here. I'd just suggest removing (or rewriting?) the last paragraph because it seems redundant with the second.


Is there any difference between "copy" and "format=rgb24" when passing AV_PIX_FMT_RGB24 as the output format? I think this PR makes that slight change now.

filtersContext(
      avFrame->width,
      avFrame->height,
      avFrameFormat,
      avFrame->sample_aspect_ratio,
      outputDims.width,
      outputDims.height,
      /*outputFormat=*/AV_PIX_FMT_RGB24,
      filters_,  // "copy" vs "format=rgb24"?
      timeBase_);

I might suggest leaving the default as "copy"?

Contributor Author

I used "copy" before because the empty string is not a valid filtergraph (I tried). So we needed to provide something, and "copy" is the closest I could find to a no-op. That, however, was back when the transforms were okay to happen in the input colorspace; we were okay with the automatically inserted format conversions.

Now that we want the transforms to happen in RGB24, we need to specify the format conversion explicitly. In the no-transformations case, there is no difference between explicitly specifying "format=rgb24" and not. When there are transformations, however, the explicit "format=rgb24" ensures that the transforms happen in the RGB24 colorspace. Without it, filtergraph will automatically insert a format conversion after our transforms to reach the AV_PIX_FMT_RGB24 we specified on the output node.
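
As an illustration (dimensions and interpolation flag are placeholders), the two possible graphs for a single resize transform look like this:

# Explicit conversion first (what this PR does): the resize runs on RGB24.
"format=rgb24,scale=240:135:flags=bilinear"

# Transforms only: the output sink is AV_PIX_FMT_RGB24, so libavfilter
# auto-inserts the rgb24 conversion AFTER the scale, and the resize runs
# in the source colorspace instead.
"scale=240:135:flags=bilinear"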

With all that said, I don't love the current logic, as it requires this extra context to understand. I'm going to try to make it clearer without introducing "copy".

return "scale=" + std::to_string(outputDims_.width) + ":" +
std::to_string(outputDims_.height) +
":sws_flags=" + toFilterGraphInterpolation(interpolationMode_);
":flags=" + toFilterGraphInterpolation(interpolationMode_);
Contributor Author

From the FFmpeg docs:

Libavfilter will automatically insert scale filters where format conversion is required. It is possible to specify swscale flags for those automatically inserted scalers by prepending sws_flags=flags; to the filtergraph description.

Whereas flags is the parameter specific to scale. They end up being semantically equivalent, but it's clearer to use the scale option here.
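
For example, both of these graphs request bilinear scaling (dimensions are placeholders); the first sets a graph-level default for scalers, the second sets the option on the explicit scale filter:

"sws_flags=bilinear;format=rgb24,scale=240:135"
"format=rgb24,scale=240:135:flags=bilinear"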

# As a consequence, they are not going to be identical.
assert_tensor_close_on_at_least(
    frame_resize, frame_tv, percentage=85, atol=3
)
Contributor Author

@NicolasHug, is this tolerable?

Contributor

This is promising. I'd suggest being a bit more strict:

  • Test on TEST_SRC_2_720P instead of the nasa video. The nasa vid is really small, and in parts mostly pitch black.
  • Increase the percentage. Looks like we can push it to 99.9.
  • Assert an upper bound on the max abs diff. Looks like the max abs diff is 4. It's higher than I hoped, but on 0.1% of the pixels it's probably OK.
  • Assert that those tests don't pass when comparing against antialias=False.

I wanted to check for all of these myself so let me share my diff:

diff --git a/test/test_transform_ops.py b/test/test_transform_ops.py
index 818de0a..62606e2 100644
--- a/test/test_transform_ops.py
+++ b/test/test_transform_ops.py
@@ -31,6 +31,7 @@ from .utils import (
     H265_VIDEO,
     NASA_VIDEO,
     needs_cuda,
+    TEST_SRC_2_720P,
 )
 
 torch._dynamo.config.capture_dynamic_output_shape_ops = True
@@ -171,31 +172,48 @@ class TestCoreVideoDecoderTransformOps:
         ((1.5, 1.31), (0.5, 0.71), (0.7, 1.31), (1.5, 0.71), (1.0, 1.0), (2.0, 2.0)),
     )
     def test_resize_torchvision(self, height_scaling_factor, width_scaling_factor):
-        height = int(NASA_VIDEO.get_height() * height_scaling_factor)
-        width = int(NASA_VIDEO.get_width() * width_scaling_factor)
+        asset = TEST_SRC_2_720P
+        height = int(asset.get_height() * height_scaling_factor)
+        width = int(asset.get_width() * width_scaling_factor)
         resize_spec = f"resize, {height}, {width}"
 
-        decoder_resize = create_from_file(str(NASA_VIDEO.path))
+        decoder_resize = create_from_file(str(asset.path))
         add_video_stream(decoder_resize, transform_specs=resize_spec)
 
-        decoder_full = create_from_file(str(NASA_VIDEO.path))
+        decoder_full = create_from_file(str(asset.path))
         add_video_stream(decoder_full)
 
-        for frame_index in [0, 10, 17, 100, 230, 389]:
-            expected_shape = (NASA_VIDEO.get_num_color_channels(), height, width)
+        for frame_index in [0, 10, 17, 20, 40, 50]:
+            expected_shape = (asset.get_num_color_channels(), height, width)
             frame_resize, *_ = get_frame_at_index(
                 decoder_resize, frame_index=frame_index
             )
 
             frame_full, *_ = get_frame_at_index(decoder_full, frame_index=frame_index)
             frame_tv = v2.functional.resize(frame_full, size=(height, width))
+            frame_tv_no_antialias = v2.functional.resize(
+                frame_full, size=(height, width), antialias=False
+            )
 
             assert frame_resize.shape == expected_shape
             assert frame_tv.shape == expected_shape
+            assert frame_tv_no_antialias.shape == expected_shape
 
             assert_tensor_close_on_at_least(
-                frame_resize, frame_tv, percentage=99, atol=1
+                frame_resize, frame_tv, percentage=99.9, atol=1
             )
+            torch.testing.assert_allclose(frame_resize, frame_tv, rtol=0, atol=4)
+
+            if height_scaling_factor < 1 or width_scaling_factor < 1:
+                # Antialias only relevant when down-scaling!
+                with pytest.raises(AssertionError, match="Expected at least"):
+                    assert_tensor_close_on_at_least(
+                        frame_resize, frame_tv_no_antialias, percentage=99, atol=1
+                    )
+                with pytest.raises(AssertionError, match="Tensor-likes are not close"):
+                    torch.testing.assert_allclose(
+                        frame_resize, frame_tv_no_antialias, rtol=0, atol=4
+                    )
 
     def test_resize_ffmpeg(self):
         height = 135

Contributor Author

@NicolasHug, see code below: for the frames I'm testing against in the video, I had to increase atol from 4 to 6.
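
The final assertion isn't shown in this excerpt, but as a sketch, the upper bound from the diff above becomes something like:

torch.testing.assert_allclose(frame_resize, frame_tv, rtol=0, atol=6)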

@scotts scotts marked this pull request as ready for review October 31, 2025 02:38
@scotts scotts marked this pull request as draft October 31, 2025 18:00
@scotts scotts changed the title Transforms now happen in input frame colorspace Proper resize tests; remove swscale resize Nov 1, 2025
@scotts scotts marked this pull request as ready for review November 3, 2025 03:21
assert_frames_equal(frame, frame_tv)
assert_frames_equal(frame, frame_ref)
assert_frames_equal(frame_crop, frame_ref)
assert_frames_equal(frame_crop, frame_tv)
Contributor Author

Pointing out that here we get bit-for-bit equality with TorchVision's crop.
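
As a sketch of that comparison (the crop offsets, sizes, and variable names are illustrative, not the test's exact values):

# Illustrative offsets and sizes only.
frame_tv = v2.functional.crop(frame_full, top=50, left=50, height=100, width=100)
assert_frames_equal(frame_crop, frame_tv)  # bit-for-bit equality expected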

// Note that we ensure that the transforms come AFTER the format conversion.
// This means that the transforms are applied in the output pixel format and
// colorspace.
filters_ += "," + filters.str();
Contributor

In my other comment I'm suggesting that maybe we could leave the default of filters_ as "copy", because I didn't understand this comment at first. I think the comment is only clear if one knows that the default of filters_ is the format conversion. If we keep "copy" as the default, then this becomes

Suggested change:
- filters_ += "," + filters.str();
+ filters_ = "format=rgb24," + filters.str();

and this comment is more obvious.

I don't have a strong opinion and if I'm not making any sense, please feel free to ignore (it's the end of the day!)

@scotts scotts merged commit afd5aba into meta-pytorch:main Nov 5, 2025
65 of 66 checks passed
@scotts scotts deleted the proper_resize_test branch November 5, 2025 18:08