fix: partition pruning stats pruning when multiple values are present #18923
…e present Signed-off-by: Nimalan <nimalan.m@protonmail.com>
dqkqd
left a comment
Thank you @Mark1626. This change makes sense to me.
let values = HashSet::from([ScalarValue::from(1i32), ScalarValue::from(3i32)]);
let contained_a = partition_stats.contained(&column_a, &values).unwrap();
let expected_contained_a = BooleanArray::from(vec![true, true]);
The stats:
| a | b |
|---|---|
| 1 | 2 |
| 3 | 4 |
Running contained on a with values = [1, 3] returns [true, true].
I wonder if we need some cases that return false, or None as well.
The earlier case covers false. I've added a test case for the null scenario and changed the `or` to `or_kleene` so it works for nulls.
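To illustrate why the switch matters, here is a minimal scalar sketch of the two OR flavors, with `Option<bool>` standing in for a nullable boolean. The function names `null_propagating_or` and `kleene_or` are hypothetical and only model the behavior; they are not the Arrow kernels themselves.

```rust
/// Hypothetical model of a null-propagating OR:
/// any null (None) input makes the result null.
fn null_propagating_or(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x || y),
        _ => None,
    }
}

/// Hypothetical model of Kleene (three-valued) OR:
/// a known `true` on either side wins even if the other side is null.
fn kleene_or(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(true), _) | (_, Some(true)) => Some(true),
        (Some(false), Some(false)) => Some(false),
        _ => None,
    }
}
```

Under this model, combining a matching comparison (`Some(true)`) with a null comparison (`None`) yields null with the first function but `true` with the second, which is the difference the null test case would be sensitive to.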
adriangb
left a comment
Thanks for the fix!
alamb
left a comment
let column_a = Column::new_unqualified("a");
let values = HashSet::from([ScalarValue::from(1i32), ScalarValue::from(3i32)]);
This doesn't seem right to me 🤔
According to
https://docs.rs/datafusion/latest/datafusion/common/pruning/trait.PruningStatistics.html#tymethod.contained
The returned array has one row for each container, with the following meanings:
- true if the values in column ONLY contain values from values
- false if the values in column are NOT ANY of values
- null if the neither of the above holds or is unknown.
I think this test has column `a` with values 1 and 2. So the result of `contained` for values (1, 3) should be NULL as 3 is not in values...
Maybe I am missing something here
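The three-way meaning quoted from the docs can be sketched for a single container that holds several values. This is only a model of the documented semantics, assuming `i32` values; `contained_in_container` is a hypothetical helper, not a DataFusion API, and treating an empty container as unknown is an assumption of the sketch.

```rust
use std::collections::HashSet;

/// Hypothetical model of the documented `contained` result for ONE container.
/// Some(true):  every value in the container is in `values`
/// Some(false): no value in the container is in `values`
/// None:        mixed, or nothing is known (empty container, by assumption)
fn contained_in_container(container: &[i32], values: &HashSet<i32>) -> Option<bool> {
    if container.is_empty() {
        return None;
    }
    let hits = container.iter().filter(|v| values.contains(*v)).count();
    if hits == container.len() {
        Some(true)
    } else if hits == 0 {
        Some(false)
    } else {
        None
    }
}
```

Under this reading, a container holding 1 and 2 checked against the values (1, 3) is a mixed case, so the model returns null for it.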
Given no other tests fail, we clearly have some sort of test coverage gap
I see now that @dqkqd had the same question here: #18922 (comment)
Given the partition columns are (a, b), the `partition_values` in the test represent two partitions: (a=1, b=2) and (a=3, b=4).
`contained` is evaluated on the array of column `a` values across partitions, [1, 3], and not on a single tuple (a=1, b=2), which is why the result is [true, true] in this case.
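The layout described above can be sketched in plain Rust, assuming each partition is one row of values in column order (a, b). The helpers `column_values` and `contained_per_partition` are hypothetical stand-ins for illustration and ignore nulls; they are not the actual `PartitionPruningStatistics` code.

```rust
use std::collections::HashSet;

/// Hypothetical helper: pick column `col` out of row-oriented partition values,
/// giving one value per partition (container).
fn column_values(partitions: &[Vec<i32>], col: usize) -> Vec<i32> {
    partitions.iter().map(|p| p[col]).collect()
}

/// Hypothetical helper: each partition holds exactly one value for the column,
/// so the result per partition is simply "is that value in `values`?".
fn contained_per_partition(column: &[i32], values: &HashSet<i32>) -> Vec<bool> {
    column.iter().map(|v| values.contains(v)).collect()
}
```

With partitions (a=1, b=2) and (a=3, b=4), column `a` becomes [1, 3]. Checking it against the values (1, 3) gives one `true` per partition, while checking against only (1) would give `true` for the first partition and `false` for the second.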
I see -- I thought the stats represented this (which is wrong).
| a | b |
|---|---|
| 1 | 3 |
| 2 | 4 |
I'll try and make a follow on PR with some more comments to try and make this clearer
Thanks @Mark1626 @adriangb @dqkqd and @xudong963
Here is one small PR
I also made a PR with an additional test:
…9021)

## Which issue does this PR close?

- Follow on to #18923

## Rationale for this change

I was confused about some of the tests for `PartitionPruningStatistics` so let's add some more comments to explain what it is doing, and add additional coverage for multi-value columns

## What changes are included in this PR?

Add a new test

## Are these changes tested?

Only tests

## Are there any user-facing changes?

No