fix: partition pruning stats pruning when multiple values are present #18923

Mark1626 · 2025-11-25T05:20:14Z

Closes Incorrectly result from PartitionPruningStatistics when multiple values are present #18922

…e present Signed-off-by: Nimalan <nimalan.m@protonmail.com>

dqkqd

Thank you @Mark1626. This change makes sense to me.

dqkqd · 2025-11-25T21:48:09Z

datafusion/common/src/pruning.rs

+
+        let values = HashSet::from([ScalarValue::from(1i32), ScalarValue::from(3i32)]);
+        let contained_a = partition_stats.contained(&column_a, &values).unwrap();
+        let expected_contained_a = BooleanArray::from(vec![true, true]);


The stats:

| a | b |
|---|---|
| 1 | 2 |
| 3 | 4 |

Running contained on a with values = [1, 3] returns [true, true].

I wonder if we need some cases that return false, or None as well.

The earlier case covers false. I've added a test case for the null scenario and changed the or to or_kleene so that it works for null.
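For readers unfamiliar with the difference between a null-propagating OR and a Kleene OR, here is a minimal sketch in plain Rust. It models nullable booleans with `Option<bool>` rather than Arrow arrays, and the function names (`null_propagating_or`, `kleene_or`, `any_match`) are hypothetical, chosen only for illustration. It assumes the per-value comparison results are combined with OR, which is what the comment above suggests but is not shown in this thread.

```rust
/// OR that yields null (`None`) whenever either input is null.
fn null_propagating_or(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x || y),
        _ => None,
    }
}

/// Kleene (three-valued) OR: a known `true` on either side wins,
/// even when the other side is null.
fn kleene_or(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(true), _) | (_, Some(true)) => Some(true),
        (Some(false), Some(false)) => Some(false),
        _ => None,
    }
}

/// Combine the per-value comparison results for one container
/// using Kleene OR, starting from `false` (no match yet).
fn any_match(per_value: &[Option<bool>]) -> Option<bool> {
    per_value.iter().copied().fold(Some(false), kleene_or)
}
```

With these definitions, `kleene_or(Some(true), None)` is `Some(true)`, while `null_propagating_or(Some(true), None)` is `None`, so a single null comparison no longer hides a definite match.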

adriangb

Thanks for the fix!

alamb

Thanks for this contribution @Mark1626 and for the reviews @adriangb and @dqkqd

alamb · 2025-11-30T22:53:56Z

datafusion/common/src/pruning.rs

+
+        let column_a = Column::new_unqualified("a");
+
+        let values = HashSet::from([ScalarValue::from(1i32), ScalarValue::from(3i32)]);


This doesn't seem right to me 🤔

According to
https://docs.rs/datafusion/latest/datafusion/common/pruning/trait.PruningStatistics.html#tymethod.contained

The returned array has one row for each container, with the following meanings:

true if the values in column ONLY contain values from values

false if the values in column are NOT ANY of values

null if the neither of the above holds or is unknown.

I think this test has a with values 1 and 2, so the result of contained for values (1, 3) should be NULL, as 3 is not in values...

Maybe I am missing something here

Given no other tests fail, we clearly have some sort of test coverage gap

I see now that @dqkqd had the same question here: #18922 (comment)
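The three-way meaning of `contained` quoted from the docs above can be sketched in plain Rust. This is only an illustration of the documented semantics for a single container whose full set of column values is known. The function `contained_one` and its `Option<&HashSet<i32>>` input are assumptions made for this example and are not part of the DataFusion API.

```rust
use std::collections::HashSet;

/// Result for ONE container. `container` is the set of values the column
/// takes in that container, or `None` when the contents are unknown.
fn contained_one(
    container: Option<&HashSet<i32>>,
    values: &HashSet<i32>,
) -> Option<bool> {
    // Unknown contents: the answer is null.
    let container = container?;
    if container.is_subset(values) {
        // The column ONLY contains values from `values`.
        Some(true)
    } else if container.is_disjoint(values) {
        // The column contains NONE of `values`.
        Some(false)
    } else {
        // Partial overlap: neither of the above holds.
        None
    }
}
```

Under this reading, a container holding {1, 2} checked against values {1, 3} yields null, which is the result expected in the comment above under the (mistaken) assumption that column a held both 1 and 2 in one container.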

Given the partition columns are (a, b), the partition_values in the test represent two partitions, (a=1, b=2) and (a=3, b=4).

contained is evaluated against the array of column a values [1, 3], not against a single tuple (a=1, b=2), which is why the result is [true, true] in this case.

I see -- I thought the stats represented this (which is wrong).

| a | b |
|---|---|
| 1 | 3 |
| 2 | 4 |

I'll try and make a follow on PR with some more comments to try and make this clearer
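To make the layout discussed in this thread concrete, here is a small standalone sketch. It treats each inner vector as one partition (one container) holding exactly one value per partition column, so (a=1, b=2) and (a=3, b=4) are two rows. The helper `contained_for_column` is hypothetical and uses plain `i32` values instead of `ScalarValue`; it is not the `PartitionPruningStatistics` implementation and it ignores nulls.

```rust
use std::collections::HashSet;

/// Returns one boolean per partition: whether that partition's value for
/// the column at `column_index` is one of `values`.
fn contained_for_column(
    partition_values: &[Vec<i32>],
    column_index: usize,
    values: &HashSet<i32>,
) -> Vec<bool> {
    partition_values
        .iter()
        .map(|partition| values.contains(&partition[column_index]))
        .collect()
}
```

For partitions `[[1, 2], [3, 4]]`, checking column a (index 0) against {1, 3} gives `[true, true]`, matching the test in the diff, while checking against {1} alone gives `[true, false]`.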

alamb · 2025-12-01T14:37:42Z

Thanks @Mark1626 @adriangb @dqkqd and @xudong963

alamb · 2025-12-01T14:53:40Z

Here is one small PR

Add documentation example for PartitionPruningStatistics #19020

alamb · 2025-12-01T15:35:11Z

I also made a PR with an additional test:

Add additional test coverage of multi-value PartitionPruningStats #19021

…9021)

## Which issue does this PR close?

- Follow on to #18923

## Rationale for this change

I was confused about some of the tests for `PartitionPruningStatistics` so let's add some more comments to explain what it is doing, and add additional coverage for multi-value columns

## What changes are included in this PR?

Add a new test

## Are these changes tested?

Only tests

## Are there any user-facing changes?

No

fix: partition pruning stats incorrect result when multiple values ar…

d03f31a

…e present Signed-off-by: Nimalan <nimalan.m@protonmail.com>

github-actions bot added the common Related to common crate label Nov 25, 2025

dqkqd approved these changes Nov 25, 2025


fix: Use or_kleene, add test case when value is null

0777351

adriangb approved these changes Nov 26, 2025


alamb reviewed Nov 30, 2025


xudong963 approved these changes Dec 1, 2025


alamb added this pull request to the merge queue Dec 1, 2025

Merged via the queue into apache:main with commit 0f1133e Dec 1, 2025
28 checks passed

alamb mentioned this pull request Dec 1, 2025

Add documentation example for PartitionPruningStatistics #19020

Open

alamb mentioned this pull request Dec 1, 2025

Add additional test coverage of multi-value PartitionPruningStats #19021

Merged


