feat!: Add read_parquet_schema API to ParquetHandler #1498

DrakeLin · 2025-11-19T01:50:48Z

What changes are proposed in this pull request?

Adds a new read_parquet_schema method to the ParquetHandler trait that reads the Parquet file footer to extract the schema.

Implemented read_parquet_schema for SyncParquetHandler (synchronous file system reads)
Implemented read_parquet_schema for DefaultParquetHandler (async with object store and presigned URL support)
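For readers unfamiliar with where the footer lives: an unencrypted Parquet file ends with a 4-byte little-endian metadata length followed by the magic bytes `PAR1`, and the serialized file metadata (which contains the schema) sits immediately before that trailer. The sketch below only locates those bytes in an in-memory buffer. It is an illustration of the file layout, not the code in this PR; `footer_metadata_range` and `FooterError` are hypothetical names, and decoding the Thrift metadata is left out.

```rust
use std::ops::Range;

/// Magic bytes that open and close an unencrypted Parquet file.
const MAGIC: &[u8; 4] = b"PAR1";
/// Trailer = 4-byte little-endian metadata length + 4-byte magic.
const TRAILER_LEN: usize = 8;

#[derive(Debug, PartialEq)]
enum FooterError {
    TooShort,
    BadMagic,
    LengthOutOfBounds,
}

/// Returns the byte range holding the serialized footer metadata.
/// Hypothetical helper for illustration; not part of the kernel API.
fn footer_metadata_range(file: &[u8]) -> Result<Range<usize>, FooterError> {
    if file.len() < MAGIC.len() + TRAILER_LEN {
        return Err(FooterError::TooShort);
    }
    if !file.starts_with(MAGIC) || !file.ends_with(MAGIC) {
        return Err(FooterError::BadMagic);
    }
    // The length field occupies the 4 bytes just before the trailing magic.
    let len_start = file.len() - TRAILER_LEN;
    let mut len_bytes = [0u8; 4];
    len_bytes.copy_from_slice(&file[len_start..len_start + 4]);
    let metadata_len = u32::from_le_bytes(len_bytes) as usize;
    // The metadata must fit between the leading magic and the trailer.
    if metadata_len > len_start - MAGIC.len() {
        return Err(FooterError::LengthOutOfBounds);
    }
    Ok((len_start - metadata_len)..len_start)
}
```

A handler reading from an object store would typically fetch only the tail of the file rather than the whole buffer; that optimization is omitted here.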

How was this change tested?

Unit tests verify that we can read the schema of a checkpoint file.

codecov · 2025-11-19T01:52:59Z

Codecov Report

❌ Patch coverage is 86.38498% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.84%. Comparing base (87d2844) to head (93050c0).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
kernel/src/engine/default/parquet.rs	85.86%	10 Missing and 3 partials ⚠️
kernel/src/utils.rs	80.00%	7 Missing and 2 partials ⚠️
kernel/src/engine/sync/parquet.rs	90.78%	1 Missing and 6 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1498      +/-   ##
==========================================
+ Coverage   84.77%   84.84%   +0.07%     
==========================================
  Files         126      126              
  Lines       35755    35967     +212     
  Branches    35755    35967     +212     
==========================================
+ Hits        30310    30515     +205     
+ Misses       3973     3966       -7     
- Partials     1472     1486      +14


emkornfield · 2025-12-03T04:33:57Z

kernel/src/lib.rs

+    /// - The file is not a valid Parquet file
+    /// - The footer cannot be read or parsed
+    /// - The schema cannot be converted to Delta Kernel's format
+    fn get_parquet_schema(&self, file: &FileMeta) -> DeltaResult<SchemaRef>;


Do we only ever need to read one footer or should this be a batch operation?

Second question is on performance/caching. I assume we will retrieve the footer and then subsequently do a read on the files. Are we just assuming the engines will do any caching necessary here? Did you consider the alternative of adding an API like:

trait ParquetFile { fn get_schema(&self) -> SchemaRef; }

... fn read_files(&self, files: &[Box<dyn ParquetFile>], physical_schema: SchemaRef, predicate: Option<PredicateRef>)

This would give engines an easier time caching the necessary info. Maybe more of a question for @nicklan.

For now, this API is only used to fetch the checkpoint schema, so we don't need to batch.

This is a great idea. I'll leave it as a followup for now since this requires introducing ParquetFiles / schemas to Kernel.
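To make the caching idea concrete, here is a minimal sketch of what an engine-side `ParquetFile` might look like if it parsed the footer once and reused the result. Everything except the trait shape quoted above is an assumption: `Schema` is a stand-in for the kernel's schema type, `CachedParquetFile` and `load_schema` are hypothetical, and the footer read is simulated with a counter rather than real I/O.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};

/// Stand-in for the kernel's schema type; only field names are modeled.
#[derive(Debug, PartialEq)]
struct Schema {
    field_names: Vec<String>,
}
type SchemaRef = Arc<Schema>;

/// The proposed abstraction: a file that can report its own schema.
trait ParquetFile {
    fn get_schema(&self) -> SchemaRef;
}

/// Hypothetical engine-side implementation that parses the footer once.
struct CachedParquetFile {
    path: String,
    schema: OnceLock<SchemaRef>,
    footer_reads: AtomicUsize, // counts simulated footer reads
}

impl CachedParquetFile {
    fn new(path: &str) -> Self {
        CachedParquetFile {
            path: path.to_string(),
            schema: OnceLock::new(),
            footer_reads: AtomicUsize::new(0),
        }
    }

    /// Placeholder for real footer I/O and parsing.
    fn load_schema(&self) -> SchemaRef {
        self.footer_reads.fetch_add(1, Ordering::SeqCst);
        let _ = &self.path; // a real implementation would open this path
        Arc::new(Schema {
            field_names: vec!["add".to_string(), "remove".to_string()],
        })
    }
}

impl ParquetFile for CachedParquetFile {
    fn get_schema(&self) -> SchemaRef {
        // The first call loads the schema; later calls reuse the cached Arc.
        self.schema.get_or_init(|| self.load_schema()).clone()
    }
}
```

With this shape, a later `read_files` call that receives the same `ParquetFile` objects could reuse the already-parsed footer instead of fetching it again.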

kernel/src/engine/sync/parquet.rs

emkornfield · 2025-12-03T04:37:19Z

kernel/src/engine/default/parquet.rs

+        // Verify this is a checkpoint schema with expected fields
+        let field_names: Vec<&String> = schema.fields().map(|f| f.name()).collect();
+        assert!(
+            field_names.iter().any(|&name| name == "txn"),


We should probably test against the exact schema we expect instead of just the presence of these fields. In particular, the nesting should be checked.
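A small sketch of why the distinction matters: a presence-only check passes even when a nested field is missing, while comparing against a fully specified expected schema catches it. The types below are simplified stand-ins rather than the kernel's schema types, and the two nested `txn` fields are chosen for illustration only.

```rust
/// Stand-in types; the kernel's real schema types are richer than this.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Long,
    String,
    Struct(Vec<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

fn field(name: &str, data_type: DataType) -> Field {
    Field { name: name.to_string(), data_type }
}

/// Presence-only check, similar in spirit to the assertion under review.
fn has_top_level_field(schema: &[Field], name: &str) -> bool {
    schema.iter().any(|f| f.name == name)
}

/// Expected shape of an illustrative `txn` column with two nested fields.
fn expected_schema() -> Vec<Field> {
    vec![field(
        "txn",
        DataType::Struct(vec![
            field("appId", DataType::String),
            field("version", DataType::Long),
        ]),
    )]
}

/// Same top-level name, but the nested `version` field is missing.
fn schema_missing_nested_field() -> Vec<Field> {
    vec![field(
        "txn",
        DataType::Struct(vec![field("appId", DataType::String)]),
    )]
}
```

Here `has_top_level_field(&schema_missing_nested_field(), "txn")` is true, yet the schema is not equal to `expected_schema()`, so only the equality assertion would flag the regression.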

kernel/src/lib.rs

emkornfield

Main question is on API design and field IDs (which, if included, we should add tests for). Secondary concern is making the tests more robust for the implementations, to ensure we are reading exactly what we expect rather than just checking field presence.

emkornfield · 2025-12-09T19:12:44Z

kernel/src/lib.rs

+    /// [`StructField`]: crate::schema::StructField
+    /// [`StructField::get_config_value`]: crate::schema::StructField::get_config_value
+    /// [`ColumnMetadataKey::ParquetFieldId`]: crate::schema::ColumnMetadataKey::ParquetFieldId
+    fn read_parquet_schema(&self, file: &FileMeta) -> DeltaResult<SchemaRef>;


I think we will probably want to get row-group stats at some point. Happy to defer this until we need it, but I think something like

fn read_parquet_footer(&self, file: &FileMeta) -> DeltaResult<Footer>

struct Footer { schema: SchemaRef }

could allow for less code churn in the future.

This is a good point, @nicklan does this make sense to you?

Yeah, I think that makes sense and is a good change

Changed the function to return ParquetFooter.
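The design choice discussed in this thread can be sketched as follows. Returning a wrapper struct means new footer-derived data can later be added as fields without changing the trait method's signature. All types here are stand-ins defined for the example (the kernel's real `FileMeta`, `SchemaRef`, and `DeltaResult` differ), and `FakeHandler` is a toy that does no I/O.

```rust
use std::sync::Arc;

// Stand-ins for kernel types, defined here so the sketch is self-contained.
#[derive(Debug, PartialEq)]
struct Schema {
    field_names: Vec<String>,
}
type SchemaRef = Arc<Schema>;
type DeltaResult<T> = Result<T, String>;
struct FileMeta {
    location: String,
}

/// Footer wrapper: in this sketch it carries only the schema.
struct ParquetFooter {
    schema: SchemaRef,
    // Later additions (for example row-group statistics) would go here
    // without changing the method signature below.
}

trait ParquetHandler {
    fn read_parquet_footer(&self, file: &FileMeta) -> DeltaResult<ParquetFooter>;
}

/// Toy handler that fabricates a footer instead of reading a file.
struct FakeHandler;

impl ParquetHandler for FakeHandler {
    fn read_parquet_footer(&self, file: &FileMeta) -> DeltaResult<ParquetFooter> {
        if !file.location.ends_with(".parquet") {
            return Err(format!("not a parquet file: {}", file.location));
        }
        let schema = Arc::new(Schema {
            field_names: vec!["add".to_string()],
        });
        Ok(ParquetFooter { schema })
    }
}
```

Callers that only need the schema read `footer.schema` and are unaffected when further fields are added to the struct.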

nicklan

Mostly good, just a couple of small things.

nicklan · 2025-12-09T20:17:43Z

kernel/src/lib.rs

+    /// [`StructField`]: crate::schema::StructField
+    /// [`StructField::get_config_value`]: crate::schema::StructField::get_config_value
+    /// [`ColumnMetadataKey::ParquetFieldId`]: crate::schema::ColumnMetadataKey::ParquetFieldId
+    fn read_parquet_schema(&self, file: &FileMeta) -> DeltaResult<SchemaRef>;


Yeah, I think that makes sense and is a good change

kernel/src/engine/default/parquet.rs

nicklan · 2025-12-09T20:25:48Z

kernel/src/engine/default/parquet.rs

+                .clone()
+        };
+
+        let top_level_fields = ["txn", "add", "remove", "metaData", "protocol"];


Can we just put all the validation for the schema in test-utils and call it from both the default and sync clients, so we don't have so much code duplication?

Good point, will do

vibed

0ce372f

github-actions bot assigned DrakeLin Nov 19, 2025

DrakeLin changed the title ~~feat!: Get Parquet Schema~~ feat!: Add get_parquet_schema API to ParquetHandler Nov 19, 2025

DrakeLin added 6 commits November 19, 2025 02:40

test

9335a66

fix

c3b540b

tests

69766a0

smaller

62eee21

fix

8dcf0ce

parquet

131f2ec

DrakeLin requested a review from nicklan November 20, 2025 23:23

DrakeLin added 2 commits November 20, 2025 23:25

fix

61d4905

rm

ecb6d0c

DrakeLin requested review from emkornfield and scovich and removed request for scovich November 20, 2025 23:34

emkornfield reviewed Dec 3, 2025

kernel/src/engine/sync/parquet.rs (outdated, resolved)

emkornfield reviewed Dec 3, 2025

kernel/src/lib.rs (outdated, resolved)

emkornfield requested changes Dec 3, 2025


DrakeLin added 2 commits December 8, 2025 23:39

fix

825646f

read

adbe7db

DrakeLin requested a review from emkornfield December 9, 2025 00:06

DrakeLin changed the title ~~feat!: Add get_parquet_schema API to ParquetHandler~~ feat!: Add read_parquet_schema API to ParquetHandler Dec 9, 2025

emkornfield reviewed Dec 9, 2025


emkornfield approved these changes Dec 9, 2025


fix

b134b6d

nicklan approved these changes Dec 9, 2025


address reviewrs

1006b86

DrakeLin added 3 commits December 9, 2025 21:52

fix

ffe67bf

refactor

4ca12e0

Merge branch 'main' into drake-lin_data/parquet_schema_api

93050c0

DrakeLin added the breaking-change Change that require a major version bump label Dec 9, 2025

DrakeLin merged commit 3bf36d5 into delta-io:main Dec 9, 2025
22 checks passed
