feat(core): Add support for _file column
#1824
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -96,4 +96,5 @@ mod utils; | |
| pub mod writer; | ||
|
|
||
| mod delete_vector; | ||
| pub mod metadata_columns; | ||
| pub mod puffin; | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,97 @@ | ||
| // Licensed to the Apache Software Foundation (ASF) under one | ||
| // or more contributor license agreements. See the NOTICE file | ||
| // distributed with this work for additional information | ||
| // regarding copyright ownership. The ASF licenses this file | ||
| // to you under the Apache License, Version 2.0 (the | ||
| // "License"); you may not use this file except in compliance | ||
| // with the License. You may obtain a copy of the License at | ||
| // | ||
| // http://www.apache.org/licenses/LICENSE-2.0 | ||
| // | ||
| // Unless required by applicable law or agreed to in writing, | ||
| // software distributed under the License is distributed on an | ||
| // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| // KIND, either express or implied. See the License for the | ||
| // specific language governing permissions and limitations | ||
| // under the License. | ||
|
|
||
| //! Metadata columns (virtual/reserved fields) for Iceberg tables. | ||
| //! | ||
| //! This module defines metadata columns that can be requested in projections | ||
| //! but are not stored in data files. Instead, they are computed on-the-fly | ||
| //! during reading. Examples include the _file column (file path) and future | ||
| //! columns like partition values or row numbers. | ||
|
|
||
| use crate::{Error, ErrorKind, Result}; | ||
|
|
||
| /// Reserved field ID for the file path (_file) column per Iceberg spec | ||
| pub const RESERVED_FIELD_ID_FILE: i32 = 2147483646; | ||
|
|
||
| /// Reserved column name for the file path metadata column | ||
| pub const RESERVED_COL_NAME_FILE: &str = "_file"; | ||
|
|
||
| /// Returns the column name for a metadata field ID. | ||
| /// | ||
| /// # Arguments | ||
| /// * `field_id` - The metadata field ID | ||
| /// | ||
| /// # Returns | ||
| /// The name of the metadata column, or an error if the field ID is not recognized | ||
| pub fn get_metadata_column_name(field_id: i32) -> Result<&'static str> { | ||
| match field_id { | ||
| RESERVED_FIELD_ID_FILE => Ok(RESERVED_COL_NAME_FILE), | ||
| _ => { | ||
| // Iceberg reserves field IDs above 2147483447 (i32::MAX - 200) for metadata columns. | ||
| if field_id > 2147483447 { | ||
| Err(Error::new( | ||
| ErrorKind::Unexpected, | ||
| format!("Unsupported metadata field ID: {field_id}"), | ||
| )) | ||
| } else { | ||
| Err(Error::new( | ||
| ErrorKind::Unexpected, | ||
| format!("Field ID {field_id} is not a metadata field"), | ||
| )) | ||
| } | ||
| } | ||
| } | ||
| } | ||
|
|
||
| /// Returns the field ID for a metadata column name. | ||
| /// | ||
| /// # Arguments | ||
| /// * `column_name` - The metadata column name | ||
| /// | ||
| /// # Returns | ||
| /// The field ID of the metadata column, or an error if the column name is not recognized | ||
| pub fn get_metadata_field_id(column_name: &str) -> Result<i32> { | ||
| match column_name { | ||
| RESERVED_COL_NAME_FILE => Ok(RESERVED_FIELD_ID_FILE), | ||
|
wish that we could somehow reuse the mapping from
Author
Possible, but I don't think it pays off for a handful of fields. |
||
| _ => Err(Error::new( | ||
| ErrorKind::Unexpected, | ||
| format!("Unknown/unsupported metadata column name: {column_name}"), | ||
| )), | ||
| } | ||
| } | ||
|
|
||
| /// Checks if a field ID is a metadata field. | ||
| /// | ||
| /// # Arguments | ||
| /// * `field_id` - The field ID to check | ||
| /// | ||
| /// # Returns | ||
| /// `true` if the field ID is a (currently supported) metadata field, `false` otherwise | ||
| pub fn is_metadata_field(field_id: i32) -> bool { | ||
| field_id == RESERVED_FIELD_ID_FILE | ||
|
||
| // Additional metadata fields can be checked here in the future | ||
| } | ||
|
|
||
| /// Checks if a column name is a metadata column. | ||
| /// | ||
| /// # Arguments | ||
| /// * `column_name` - The column name to check | ||
| /// | ||
| /// # Returns | ||
| /// `true` if the column name is a (currently supported) metadata column, `false` otherwise | ||
|
||
| pub fn is_metadata_column_name(column_name: &str) -> bool { | ||
| get_metadata_field_id(column_name).is_ok() | ||
| } | ||
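The review remark about reusing one mapping for both lookup directions could be sketched roughly as follows. This is only an illustration under stated assumptions, not the PR's code: `METADATA_COLUMNS`, `LAST_NON_RESERVED_ID`, `LookupError`, `column_name`, and `field_id` are hypothetical names, and a small local error enum stands in for the crate's `Error`/`ErrorKind`.

```rust
/// Hypothetical single source of truth: (field ID, column name) pairs.
/// Only `_file` is listed, mirroring the constants in the diff above.
const METADATA_COLUMNS: &[(i32, &str)] = &[(2147483646, "_file")];

/// Highest field ID outside the reserved metadata range (i32::MAX - 200).
const LAST_NON_RESERVED_ID: i32 = i32::MAX - 200;

/// Stand-in error type for this sketch (assumption, not the crate's `Error`).
#[derive(Debug, PartialEq)]
enum LookupError {
    /// The ID is in the reserved range but has no entry in the table.
    UnsupportedMetadataId(i32),
    /// The ID is an ordinary (non-reserved) field ID.
    NotMetadataId(i32),
    /// The name has no entry in the table.
    UnknownName(String),
}

/// Field ID -> column name, driven by the shared table.
fn column_name(field_id: i32) -> Result<&'static str, LookupError> {
    METADATA_COLUMNS
        .iter()
        .find(|(id, _)| *id == field_id)
        .map(|(_, name)| *name)
        .ok_or_else(|| {
            if field_id > LAST_NON_RESERVED_ID {
                LookupError::UnsupportedMetadataId(field_id)
            } else {
                LookupError::NotMetadataId(field_id)
            }
        })
}

/// Column name -> field ID, driven by the same table.
fn field_id(column_name: &str) -> Result<i32, LookupError> {
    METADATA_COLUMNS
        .iter()
        .find(|(_, name)| *name == column_name)
        .map(|(id, _)| *id)
        .ok_or_else(|| LookupError::UnknownName(column_name.to_string()))
}
```

With this shape, adding a metadata column would mean adding one row to the table. As the author notes, that may not pay off while only a handful of fields exist; the two `match` statements in the diff are arguably just as readable at this size.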
Should we in general encode constant columns as REE? Or should we make this custom per field? For the file path it definitely makes sense to run-end encode.
It seems a bit random to me that only strings use REE in `primitive_literal_to_arrow_type` below. Yes, they might use the most memory otherwise, but other types have a similar kind of redundancy that suits REE.
Yes, that is why I am pointing it out here. If the reviewers are fine with it, I'd go with REE everywhere.
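For readers unfamiliar with run-end encoding (REE), the idea discussed in this thread can be illustrated with a plain-std sketch. It does not use the `arrow` crate or the PR's `primitive_literal_to_arrow_type`; the `RunEndEncoded` struct and its methods are hypothetical, and the file path in the usage note is made up.

```rust
/// Minimal run-end encoded string column (illustrative sketch only).
#[derive(Debug, PartialEq)]
struct RunEndEncoded {
    /// Exclusive logical end index of each run, strictly increasing.
    run_ends: Vec<i32>,
    /// One value per run.
    values: Vec<String>,
}

impl RunEndEncoded {
    /// A constant column of `len` rows collapses to a single run.
    fn constant(value: &str, len: i32) -> Self {
        if len <= 0 {
            return Self { run_ends: Vec::new(), values: Vec::new() };
        }
        Self { run_ends: vec![len], values: vec![value.to_string()] }
    }

    /// Encodes arbitrary rows by merging adjacent equal values into runs.
    fn encode(rows: &[&str]) -> Self {
        let mut out = Self { run_ends: Vec::new(), values: Vec::new() };
        for (i, row) in rows.iter().enumerate() {
            if out.values.last().map(String::as_str) == Some(*row) {
                // Same value as the previous row: extend the current run.
                *out.run_ends.last_mut().unwrap() = i as i32 + 1;
            } else {
                // New value: start a new run ending just after this row.
                out.values.push(row.to_string());
                out.run_ends.push(i as i32 + 1);
            }
        }
        out
    }

    /// Logical number of rows.
    fn len(&self) -> usize {
        self.run_ends.last().map_or(0, |&end| end as usize)
    }

    /// Value at a logical row, found by binary search over the run ends.
    fn get(&self, row: usize) -> Option<&str> {
        // Index of the first run whose end is strictly greater than `row`.
        let run = self.run_ends.partition_point(|&end| (end as usize) <= row);
        self.values.get(run).map(String::as_str)
    }
}
```

Calling `RunEndEncoded::constant("s3://bucket/data/file-a.parquet", 1000)` stores the path once plus a single run end, however many rows the batch has. The same saving applies to any constant value, which is the point raised above about using REE for more than strings.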