-
Notifications
You must be signed in to change notification settings - Fork 14
Add a tool to check if row groups .min / .max are strictly increasing within a parquet file
#40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
fec818b to
2914f42
Compare
|
TBD: is it expected that urls are sorted in between the parquet files, ie should |
sebastian-nagel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @damian0815!
Would you mind adding some context to the description of the PR? Namely checking for #12 and a short description how the tool works. The latter could be also in the README or the command-line help of the tool.
Since the tools checks only the row group metadata whether the min/max values of a single column overlap, its name is_table_sorted.py is not quite precise resp. may raise undeliverable expectations. Maybe the name and corresponding function names can be adjusted?
I've successfully tested the tool on data from CC-MAIN-2022-05 (#12 not yet fixed) and CC-MAIN-2022-21 (#12 fixed):
- it failed to detect that the column url_surtkey is not properly sorted on some input files of the first crawl. Definitely, if there is only a single row group. That's not unlikely for the robots.txt partition, e.g. this file.
- but if run over more or all files the test works.
…hecking Signed-off-by: Damian Stewart <ot@damianstewart.com>
.min / .max are strictly increasing within a parquet file
|
I have updated the title and description to better correspond with what the tool does. |
Determined: this is not intended, ie part-00001.max may be out of order w.r.t part-00002.min |
|
@damian0815 This is waiting on @sebastian-nagel to re-review with your changes, correct? |
|
@jenenglish that's correct yes |
sebastian-nagel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @damian0815, thanks! Please, see the inline comments...
| for row_group_index in range(pf.num_row_groups): | ||
| row_group = pf.metadata.row_group(row_group_index) | ||
| column = row_group.column(sort_column_index) | ||
| if prev_max is not None and prev_max > column.statistics.min: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After thinking about this strict condition: values in url_surtkey are not unique and it may happen (although the probability is low) that two rows with the same SURT key end up in two row groups. Than the tool reports an error, although the column might be perfectly sorted.
| for row_group_index in range(pf.num_row_groups): | ||
| row_group = pf.metadata.row_group(row_group_index) | ||
| column = row_group.column(sort_column_index) | ||
| if prev_max is not None and prev_max > column.statistics.min: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Must prepared that no min/max statistics are available in a row group because of overlong URLs / SURT keys. Cf. PARQUET-1685:
if prev_max is not None and prev_max > column.statistics.min:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: '>' not supported between instances of 'str' and 'NoneType'
Add a tool to check if row groups
.min/.maxfor a particular column (egurl_surtkey) are strictly increasing within a particular parquet file or collection of parquet files; see README for more information and limitations - in particular, this does not check of the rows are sorted, just that the row groups min/max within a single parquet file are strictly increasing. The tool is intended to help check for #12.