Skip to content

Conversation

@DrakeLin
Copy link
Collaborator

@DrakeLin DrakeLin commented Nov 19, 2025

What changes are proposed in this pull request?

Adds a new read_parquet_schema method to the ParquetHandler trait that reads the Parquet file footer to extract the schema.

  • Implemented read_parquet_schema for SyncParquetHandler (synchronous file system reads)
  • Implemented read_parquet_schema for DefaultParquetHandler (async with object store and presigned URL support)

How was this change tested?

Unit tests testing we can read checkpoint schema

@DrakeLin DrakeLin changed the title feat!: Get Parquet Schema feat!: Add get_parquet_schema API to ParquetHandler Nov 19, 2025
@codecov
Copy link

codecov bot commented Nov 19, 2025

Codecov Report

❌ Patch coverage is 86.38498% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.84%. Comparing base (87d2844) to head (93050c0).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/engine/default/parquet.rs 85.86% 10 Missing and 3 partials ⚠️
kernel/src/utils.rs 80.00% 7 Missing and 2 partials ⚠️
kernel/src/engine/sync/parquet.rs 90.78% 1 Missing and 6 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1498      +/-   ##
==========================================
+ Coverage   84.77%   84.84%   +0.07%     
==========================================
  Files         126      126              
  Lines       35755    35967     +212     
  Branches    35755    35967     +212     
==========================================
+ Hits        30310    30515     +205     
+ Misses       3973     3966       -7     
- Partials     1472     1486      +14     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@DrakeLin DrakeLin requested a review from nicklan November 20, 2025 23:23
@DrakeLin DrakeLin requested review from emkornfield and scovich and removed request for scovich November 20, 2025 23:34
/// - The file is not a valid Parquet file
/// - The footer cannot be read or parsed
/// - The schema cannot be converted to Delta Kernel's format
fn get_parquet_schema(&self, file: &FileMeta) -> DeltaResult<SchemaRef>;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we only ever need to read one footer or should this be a batch operation?

Second question, is on performance/caching. I assume we will retrieve the footer and then subsequently do a read on the files. Are we just assuming the engines will do any caching necessary here? Did you consider the alternative to do add an API like:

trait ParquetFile {
    get_schema() -> SchemaRef
}
...
fn read_files {
   &self,
    files: &[Box<dyn ParquetFile>],
        physical_schema: SchemaRef,
        predicate: Option<PredicateRef>,
}

To make it so engines have an easier time caching the necessary info. Maybe more of a question for @nicklan

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. For now, this api is used to fetch checkpoint schema so we don't need to batch.
  2. This is a great idea. I'll leave it as a followup for now since this requires introducing ParquetFiles / schemas to Kernel.

// Verify this is a checkpoint schema with expected fields
let field_names: Vec<&String> = schema.fields().map(|f| f.name()).collect();
assert!(
field_names.iter().any(|&name| name == "txn"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably test more explicitly on the exact schema we expect instead of just the presence of these fields? In particular nesting.

Copy link
Collaborator

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main question is on API design and field IDs (which if included we should add tests on). Secondary concern is making the tests more robust for the implementations to ensure we are reading exactly what we expect rather then just field presence.

@DrakeLin DrakeLin requested a review from emkornfield December 9, 2025 00:06
@DrakeLin DrakeLin changed the title feat!: Add get_parquet_schema API to ParquetHandler feat!: Add read_parquet_schema API to ParquetHandler Dec 9, 2025
/// [`StructField`]: crate::schema::StructField
/// [`StructField::get_config_value`]: crate::schema::StructField::get_config_value
/// [`ColumnMetadataKey::ParquetFieldId`]: crate::schema::ColumnMetadataKey::ParquetFieldId
fn read_parquet_schema(&self, file: &FileMeta) -> DeltaResult<SchemaRef>;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we will probably want to get row-group stats at some point. Happy to defer this until we need it, but I think fn read_parquet_footer -> DeltaResult<Footer>,

struct Footer {
   schema : SchemaRef
}

Could allow for less code churn in the future.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point, @nicklan does this make sense to you?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think that makes sense and is a good change

Copy link
Collaborator Author

@DrakeLin DrakeLin Dec 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed function return ParquetFooter

Copy link
Member

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly good, just a couple small things

/// [`StructField`]: crate::schema::StructField
/// [`StructField::get_config_value`]: crate::schema::StructField::get_config_value
/// [`ColumnMetadataKey::ParquetFieldId`]: crate::schema::ColumnMetadataKey::ParquetFieldId
fn read_parquet_schema(&self, file: &FileMeta) -> DeltaResult<SchemaRef>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think that makes sense and is a good change

.clone()
};

let top_level_fields = ["txn", "add", "remove", "metaData", "protocol"];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just put all the validation for the schema in test-utils and call it in both the default and sync client so we don't have so much code duplication?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, will do

@DrakeLin DrakeLin added the breaking-change Change that require a major version bump label Dec 9, 2025
@DrakeLin DrakeLin merged commit 3bf36d5 into delta-io:main Dec 9, 2025
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

breaking-change Change that require a major version bump

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants