Proper resize tests; remove swscale resize #1013
Conversation
```cpp
// we could achieve that by calling sws_scale() twice: once to do the resize
// and another time to do the format conversion. But that will be slower,
// which goes against the whole point of calling sws_scale() directly.
std::string filters_ = "format=rgb24";
```
@NicolasHug, do you think I should pull this explanation out into a named section, and then refer to it in a few different places?
I think it's fine here. I'd just suggest to remove (or re-write?) the last paragraph because it seems redundant with the second.
Is there any difference between "copy" and "format=rgb24" when passing AV_PIX_FMT_RGB24? I think we have made this slight change now.
```cpp
filtersContext(
    avFrame->width,
    avFrame->height,
    avFrameFormat,
    avFrame->sample_aspect_ratio,
    outputDims.width,
    outputDims.height,
    /*outputFormat=*/AV_PIX_FMT_RGB24,
    filters_,  // "copy" vs "format=rgb24" ????
    timeBase_);
```

I might suggest to leave the default to "copy"?
I used "copy" before because the empty string is not a valid filtergraph (I tried). So we needed to provide something, and "copy" is the closest I could find to a no-op. That, however, was back when the transforms were okay to happen in the input colorspace; we were okay with the automatically inserted format conversions.
Now that we want the transforms to happen in RGB24, we need to explicitly specify the format conversion. For the no-transformations case, there is no difference between explicitly specifying `format=rgb24` and not. However, when there are transformations, the explicit `format=rgb24` makes sure that the transforms happen in the RGB24 colorspace. Without it, filtergraph will automatically insert a `format` conversion after our transforms to do the conversion we specified when we said the output node is AV_PIX_FMT_RGB24.
With all that said, I don't love the current logic as it does require more knowledge. I'm going to try to make it more clear without introducing "copy".
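To make the ordering concrete, here is a hypothetical illustration (the `crop` transform and the sizes are made up for this example, not taken from the actual code) of the two filtergraph descriptions:

```cpp
// Hypothetical example: a user-requested crop, with the output node declared
// as AV_PIX_FMT_RGB24.

// Explicit conversion first: the crop runs on RGB24 data.
std::string filtersExplicit = "format=rgb24,crop=100:100";

// No explicit conversion: libavfilter auto-inserts the format conversion at
// the end to satisfy the RGB24 output node, so the crop effectively runs in
// the source pixel format (e.g. YUV420P) instead.
std::string filtersImplicit = "crop=100:100";
```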
```diff
 return "scale=" + std::to_string(outputDims_.width) + ":" +
        std::to_string(outputDims_.height) +
-       ":sws_flags=" + toFilterGraphInterpolation(interpolationMode_);
+       ":flags=" + toFilterGraphInterpolation(interpolationMode_);
```
From the FFmpeg docs:
> Libavfilter will automatically insert `scale` filters where format conversion is required. It is possible to specify swscale flags for those automatically inserted scalers by prepending `sws_flags=flags;` to the filtergraph description.

Whereas `flags` is the specific parameter to `scale`. They end up being semantically equivalent, but it's more clear to use the `scale` option here.
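For example (illustrative values, not from this PR), these two descriptions request bilinear scaling in equivalent ways:

```cpp
// Global sws_flags, which also applies to any auto-inserted scale filters:
std::string viaSwsFlags = "sws_flags=bilinear;scale=640:360";

// flags as an option of this particular scale filter:
std::string viaScaleFlags = "scale=640:360:flags=bilinear";
```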
```python
# As a consequence, they are not going to be identical.
assert_tensor_close_on_at_least(
    frame_resize, frame_tv, percentage=85, atol=3
)
```
@NicolasHug, is this tolerable?
This is promising, I'd suggest to be a bit more strict:
- test on `TEST_SRC_2_720P` instead of the nasa video. The nasa vid is really small, and in parts mostly pitch black.
- Increase the percentage. Looks like we can push it to 99.9.
- assert an upper bound on the max abs diff. Looks like the max abs diff is 4. It's higher than I hoped, but on 0.1% of the pixels it's probably OK.
- assert that those tests don't pass when comparing against `antialias=False`.
I wanted to check for all of these myself so let me share my diff:
```diff
diff --git a/test/test_transform_ops.py b/test/test_transform_ops.py
index 818de0a..62606e2 100644
--- a/test/test_transform_ops.py
+++ b/test/test_transform_ops.py
@@ -31,6 +31,7 @@ from .utils import (
     H265_VIDEO,
     NASA_VIDEO,
     needs_cuda,
+    TEST_SRC_2_720P,
 )
 torch._dynamo.config.capture_dynamic_output_shape_ops = True
@@ -171,31 +172,48 @@ class TestCoreVideoDecoderTransformOps:
         ((1.5, 1.31), (0.5, 0.71), (0.7, 1.31), (1.5, 0.71), (1.0, 1.0), (2.0, 2.0)),
     )
     def test_resize_torchvision(self, height_scaling_factor, width_scaling_factor):
-        height = int(NASA_VIDEO.get_height() * height_scaling_factor)
-        width = int(NASA_VIDEO.get_width() * width_scaling_factor)
+        asset = TEST_SRC_2_720P
+        height = int(asset.get_height() * height_scaling_factor)
+        width = int(asset.get_width() * width_scaling_factor)
         resize_spec = f"resize, {height}, {width}"
-        decoder_resize = create_from_file(str(NASA_VIDEO.path))
+        decoder_resize = create_from_file(str(asset.path))
         add_video_stream(decoder_resize, transform_specs=resize_spec)
-        decoder_full = create_from_file(str(NASA_VIDEO.path))
+        decoder_full = create_from_file(str(asset.path))
         add_video_stream(decoder_full)
-        for frame_index in [0, 10, 17, 100, 230, 389]:
-            expected_shape = (NASA_VIDEO.get_num_color_channels(), height, width)
+        for frame_index in [0, 10, 17, 20, 40, 50]:
+            expected_shape = (asset.get_num_color_channels(), height, width)
             frame_resize, *_ = get_frame_at_index(
                 decoder_resize, frame_index=frame_index
             )
             frame_full, *_ = get_frame_at_index(decoder_full, frame_index=frame_index)
             frame_tv = v2.functional.resize(frame_full, size=(height, width))
+            frame_tv_no_antialias = v2.functional.resize(
+                frame_full, size=(height, width), antialias=False
+            )
             assert frame_resize.shape == expected_shape
             assert frame_tv.shape == expected_shape
+            assert frame_tv_no_antialias.shape == expected_shape
             assert_tensor_close_on_at_least(
-                frame_resize, frame_tv, percentage=99, atol=1
+                frame_resize, frame_tv, percentage=99.9, atol=1
             )
+            torch.testing.assert_allclose(frame_resize, frame_tv, rtol=0, atol=4)
+
+            if height_scaling_factor < 1 or width_scaling_factor < 1:
+                # Antialias only relevant when down-scaling!
+                with pytest.raises(AssertionError, match="Expected at least"):
+                    assert_tensor_close_on_at_least(
+                        frame_resize, frame_tv_no_antialias, percentage=99, atol=1
+                    )
+                with pytest.raises(AssertionError, match="Tensor-likes are not close"):
+                    torch.testing.assert_allclose(
+                        frame_resize, frame_tv_no_antialias, rtol=0, atol=4
+                    )
     def test_resize_ffmpeg(self):
         height = 135
```
@NicolasHug, see code below: for the frames I'm testing against in the video, I had to increase atol from 4 to 6.
```python
assert_frames_equal(frame, frame_tv)
assert_frames_equal(frame, frame_ref)
assert_frames_equal(frame_crop, frame_ref)
assert_frames_equal(frame_crop, frame_tv)
```
Pointing out that here we get bit-for-bit equality with TorchVision's crop.
```cpp
// Note that we ensure that the transforms come AFTER the format conversion.
// This means that the transforms are applied in the output pixel format and
// colorspace.
filters_ += "," + filters.str();
```
In another comment I'm suggesting that maybe we could leave the default of `filters_` to `"copy"`, because I didn't understand this comment at first. I think the comment is only clear if one knows that the default of `filters_` is the format conversion. If we keep `"copy"` as the default, then this becomes
```diff
-filters_ += "," + filters.str();
+filters_ = "format=rgb24," + filters.str();
```
and this comment is more obvious.
I don't have a strong opinion and if I'm not making any sense, please feel free to ignore (it's the end of the day!)
I forked this from #1003, which is about the public API for decoder-native transforms. When working on that I realized we didn't have proper resize transform tests, and when working on those tests, I realized we needed to change transform behavior. I want to keep #1003 about the public API, so this PR is about those tests and the behavior change.
Because I've flip-flopped on this point, I want to make this status explicit: all transforms are applied after the conversion to their output pixel format and colorspace. In other words, all transforms happen in the output pixel format and colorspace (which is now only RGB24). We want the output from TorchCodec's decoder-native transforms to be as close as possible to passing untransformed frames from TorchCodec to TorchVision transforms.
This PR does two things:
1. It adds proper resize transform tests. We test our resize against:
   - the `ffmpeg` CLI with an explicit `scale` filter. We expect bit-for-bit equality on CPU and Linux.
   - `torchvision.transforms.v2.functional.resize()`. We expect some difference here, but it's small: 99% of all values are within 1.
2. Previously, resizes were handled by calling `sws_scale()` directly instead of going through filtergraph. We remove that special handling here. When there are no transforms, we can still use `sws_scale()` for our format conversion. The reason we remove this special handling is that when calling `sws_scale()` once to do both a resize and colorspace conversion, it will do the colorspace conversion first. That does not match what we do when calling filtergraph. In order to get the same behavior as with filtergraph, we will have to call `sws_scale()` twice: once to do the resize, and another time to do the colorspace conversion (a rough sketch of this two-pass approach is included at the end). We will need to benchmark whether doing so is faster than calling filtergraph. I have created issue Resize transform optimization: use swscale when appropriate (#1018) to track.
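For reference, here is a rough sketch of what the two-pass `sws_scale()` approach mentioned above could look like. This is not code from this PR: the function name, the fixed bilinear flag, and the frame handling are illustrative, and all error handling is omitted.

```cpp
extern "C" {
#include <libavutil/frame.h>
#include <libswscale/swscale.h>
}

// Pass 1 resizes in the source pixel format; pass 2 converts the resized
// frame to RGB24, i.e. "resize first, then colorspace conversion".
AVFrame* resizeThenConvertToRGB24(const AVFrame* src, int dstW, int dstH) {
  // Pass 1: resize, keeping the source pixel format.
  AVFrame* resized = av_frame_alloc();
  resized->width = dstW;
  resized->height = dstH;
  resized->format = src->format;
  av_frame_get_buffer(resized, 0);
  SwsContext* resizeCtx = sws_getContext(
      src->width, src->height, static_cast<AVPixelFormat>(src->format),
      dstW, dstH, static_cast<AVPixelFormat>(src->format),
      SWS_BILINEAR, nullptr, nullptr, nullptr);
  sws_scale(resizeCtx, src->data, src->linesize, 0, src->height,
            resized->data, resized->linesize);
  sws_freeContext(resizeCtx);

  // Pass 2: convert the resized frame to RGB24.
  AVFrame* rgb = av_frame_alloc();
  rgb->width = dstW;
  rgb->height = dstH;
  rgb->format = AV_PIX_FMT_RGB24;
  av_frame_get_buffer(rgb, 0);
  SwsContext* convertCtx = sws_getContext(
      dstW, dstH, static_cast<AVPixelFormat>(resized->format),
      dstW, dstH, AV_PIX_FMT_RGB24,
      SWS_BILINEAR, nullptr, nullptr, nullptr);
  sws_scale(convertCtx, resized->data, resized->linesize, 0, dstH,
            rgb->data, rgb->linesize);
  sws_freeContext(convertCtx);
  av_frame_free(&resized);
  return rgb;
}
```

Whether something along these lines is actually faster than letting filtergraph do the same work is exactly what #1018 is meant to benchmark.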