@Mark1626 Mark1626 commented Nov 25, 2025

…e present

Signed-off-by: Nimalan <nimalan.m@protonmail.com>
@github-actions github-actions bot added the common Related to common crate label Nov 25, 2025
Contributor

@dqkqd dqkqd left a comment

Thank you @Mark1626. This change makes sense to me.


let values = HashSet::from([ScalarValue::from(1i32), ScalarValue::from(3i32)]);
let contained_a = partition_stats.contained(&column_a, &values).unwrap();
let expected_contained_a = BooleanArray::from(vec![true, true]);
Contributor

The stats:

a b
1 2
3 4

Running contained on a with values = [1, 3] returns [true, true].

I wonder if we need some cases that return false, or None as well.

Contributor Author

The earlier case covers false. I've added a test case for the null scenario and changed the or to or_kleene so that it works for nulls.
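For readers unfamiliar with the difference, here is a minimal sketch of the two behaviours on a single pair of nullable booleans. It uses plain `Option<bool>` (with `None` standing in for null) rather than Arrow arrays, and the function names are hypothetical stand-ins chosen for illustration, not the Arrow kernels themselves.

```rust
/// Three-valued (Kleene) OR: a known `true` on either side wins,
/// even if the other side is null (`None`).
fn or_kleene_scalar(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(true), _) | (_, Some(true)) => Some(true),
        (Some(false), Some(false)) => Some(false),
        // At least one side is null and no side is known to be true.
        _ => None,
    }
}

/// Null-propagating OR: the result is null as soon as either side is null.
fn or_null_propagating_scalar(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x || y),
        _ => None,
    }
}
```

With a null-propagating OR, a single null operand turns a known match into null; with the Kleene variant, `true OR null` stays `true`.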

Contributor

@adriangb adriangb left a comment

Thanks for the fix!

Contributor

@alamb alamb left a comment

Thanks for this contribution @Mark1626 and for the reviews @adriangb and @dqkqd


let column_a = Column::new_unqualified("a");

let values = HashSet::from([ScalarValue::from(1i32), ScalarValue::from(3i32)]);
Contributor

This doesn't seem right to me 🤔

According to
https://docs.rs/datafusion/latest/datafusion/common/pruning/trait.PruningStatistics.html#tymethod.contained

The returned array has one row for each container, with the following meanings:

  • true if the values in column ONLY contain values from values
  • false if the values in column are NOT ANY of values
  • null if the neither of the above holds or is unknown.
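One way to read the three cases above, as a rough sketch for a single container: the helper `contained_one` and its use of `HashSet<i32>` are assumptions made for illustration and are not DataFusion's implementation, which works on `ScalarValue`s and returns a `BooleanArray` with one row per container.

```rust
use std::collections::HashSet;

/// Hypothetical helper: the `contained` answer for ONE container.
/// `container` holds every value the column takes in that container,
/// or `None` if the contents are unknown.
fn contained_one(container: Option<&HashSet<i32>>, values: &HashSet<i32>) -> Option<bool> {
    let container = container?; // unknown contents -> null
    if container.is_subset(values) {
        // The column ONLY contains values from `values`.
        Some(true)
    } else if container.is_disjoint(values) {
        // The column contains NONE of `values`.
        Some(false)
    } else {
        // Some values match and some do not: neither case holds.
        None
    }
}
```

Under this reading, a container whose column holds both 1 and 2 yields null for `values = {1, 3}`, while a container holding only 1 yields true.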

I think column a in this test has the values 1 and 2, so the result of contained for values (1, 3) should be NULL, as 3 is not in values...

Maybe I am missing something here

Contributor

Given no other tests fail, we clearly have some sort of test coverage gap

Contributor

I see now that @dqkqd had the same question here: #18922 (comment)

Contributor Author

@Mark1626 Mark1626 Dec 1, 2025

Given that the partition columns are (a, b), the partition_values in the test represent two partitions: (a=1, b=2) and (a=3, b=4).

contained is evaluated over the array of column a values, [1, 3], with one value per partition, and not over a single tuple (a=1, b=2). That is why the result is [true, true] in this case.
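The layout described above can be sketched as follows. This is a simplified stand-in under stated assumptions: `contained_per_partition` is a hypothetical function over plain `i32` values, each inner `Vec` is one partition (one container) holding exactly one value per partition column, and nulls are ignored.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the check discussed here: each element of
/// `partition_values` is one partition, and `column_index` selects the
/// partition column (0 for `a`, 1 for `b`). Returns one answer per partition.
fn contained_per_partition(
    partition_values: &[Vec<i32>],
    column_index: usize,
    values: &HashSet<i32>,
) -> Vec<bool> {
    partition_values
        .iter()
        // A partition has a single value for the column, so the answer is
        // simply whether that one value is a member of `values`.
        .map(|partition| values.contains(&partition[column_index]))
        .collect()
}
```

With partitions `[[1, 2], [3, 4]]`, checking column `a` against `{1, 3}` gives one `true` per partition, whereas checking column `b` against the same set gives `false` for both.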

Contributor

I see -- I thought the stats represented this (which is wrong).

a b
1 3
2 4

I'll try and make a follow on PR with some more comments to try and make this clearer

Contributor

alamb commented Dec 1, 2025

Thanks @Mark1626 @adriangb @dqkqd and @xudong963

@alamb alamb added this pull request to the merge queue Dec 1, 2025
Merged via the queue into apache:main with commit 0f1133e Dec 1, 2025
28 checks passed
Contributor

alamb commented Dec 1, 2025

I also made a PR with an additional test:

github-merge-queue bot pushed a commit that referenced this pull request Dec 1, 2025
…9021)

## Which issue does this PR close?

- Follow on to #18923

## Rationale for this change

I was confused about some of the tests for `PartitionPruningStatistics`,
so let's add some more comments to explain what it is doing, and add
additional coverage for multi-value columns.


## What changes are included in this PR?

Add a new test 

## Are these changes tested?

Only tests

## Are there any user-facing changes?

No
Development

Successfully merging this pull request may close these issues.

Incorrectly result from PartitionPruningStatistics when multiple values are present

5 participants