-
Notifications
You must be signed in to change notification settings - Fork 128
feat!: Add read_parquet_schema API to ParquetHandler #1498
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat!: Add read_parquet_schema API to ParquetHandler #1498
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1498 +/- ##
==========================================
+ Coverage 84.77% 84.84% +0.07%
==========================================
Files 126 126
Lines 35755 35967 +212
Branches 35755 35967 +212
==========================================
+ Hits 30310 30515 +205
+ Misses 3973 3966 -7
- Partials 1472 1486 +14 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
kernel/src/lib.rs
Outdated
| /// - The file is not a valid Parquet file | ||
| /// - The footer cannot be read or parsed | ||
| /// - The schema cannot be converted to Delta Kernel's format | ||
| fn get_parquet_schema(&self, file: &FileMeta) -> DeltaResult<SchemaRef>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we only ever need to read one footer or should this be a batch operation?
Second question, is on performance/caching. I assume we will retrieve the footer and then subsequently do a read on the files. Are we just assuming the engines will do any caching necessary here? Did you consider the alternative to do add an API like:
trait ParquetFile {
get_schema() -> SchemaRef
}
...
fn read_files {
&self,
files: &[Box<dyn ParquetFile>],
physical_schema: SchemaRef,
predicate: Option<PredicateRef>,
}
To make it so engines have an easier time caching the necessary info. Maybe more of a question for @nicklan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- For now, this api is used to fetch checkpoint schema so we don't need to batch.
- This is a great idea. I'll leave it as a followup for now since this requires introducing ParquetFiles / schemas to Kernel.
kernel/src/engine/default/parquet.rs
Outdated
| // Verify this is a checkpoint schema with expected fields | ||
| let field_names: Vec<&String> = schema.fields().map(|f| f.name()).collect(); | ||
| assert!( | ||
| field_names.iter().any(|&name| name == "txn"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should probably test more explicitly on the exact schema we expect instead of just the presence of these fields? In particular nesting.
emkornfield
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Main question is on API design and field IDs (which if included we should add tests on). Secondary concern is making the tests more robust for the implementations to ensure we are reading exactly what we expect rather then just field presence.
kernel/src/lib.rs
Outdated
| /// [`StructField`]: crate::schema::StructField | ||
| /// [`StructField::get_config_value`]: crate::schema::StructField::get_config_value | ||
| /// [`ColumnMetadataKey::ParquetFieldId`]: crate::schema::ColumnMetadataKey::ParquetFieldId | ||
| fn read_parquet_schema(&self, file: &FileMeta) -> DeltaResult<SchemaRef>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we will probably want to get row-group stats at some point. Happy to defer this until we need it, but I think fn read_parquet_footer -> DeltaResult<Footer>,
struct Footer {
schema : SchemaRef
}
Could allow for less code churn in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good point, @nicklan does this make sense to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think that makes sense and is a good change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed function return ParquetFooter
nicklan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mostly good, just a couple small things
kernel/src/lib.rs
Outdated
| /// [`StructField`]: crate::schema::StructField | ||
| /// [`StructField::get_config_value`]: crate::schema::StructField::get_config_value | ||
| /// [`ColumnMetadataKey::ParquetFieldId`]: crate::schema::ColumnMetadataKey::ParquetFieldId | ||
| fn read_parquet_schema(&self, file: &FileMeta) -> DeltaResult<SchemaRef>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think that makes sense and is a good change
kernel/src/engine/default/parquet.rs
Outdated
| .clone() | ||
| }; | ||
|
|
||
| let top_level_fields = ["txn", "add", "remove", "metaData", "protocol"]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we just put all the validation for the schema in test-utils and call it in both the default and sync client so we don't have so much code duplication?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, will do
What changes are proposed in this pull request?
Adds a new read_parquet_schema method to the ParquetHandler trait that reads the Parquet file footer to extract the schema.
read_parquet_schemaforSyncParquetHandler(synchronous file system reads)read_parquet_schemaforDefaultParquetHandler(async with object store and presigned URL support)How was this change tested?
Unit tests testing we can read checkpoint schema