-
Notifications
You must be signed in to change notification settings - Fork 127
feat: Add CheckpointManifestReader to process sidecar files #1500
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| )) | ||
| ); | ||
|
|
||
| let sidecars: Vec<_> = self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type annotation shouldn't be necessary here because AfterManifest::Sidecars forces it
| let sidecars: Vec<_> = self | |
| let sidecars = self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly the sidecars.is_empty() somehow doesn't let the type inference work :(
698da9b to
5db45f5
Compare
e54ba02 to
58604ff
Compare
|
Note: Seems that arrow stuff is causing errors in the CI. Will fix in a separate Pr. |
nicklan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looks great. I do wonder if there could be some way to avoid the pattern of "read the iter -> then call a final method to get the sidecars". I think it would be nice if it could just be that the final element of the iterator is the sidecars, but I realize the types don't quite line up. Maybe worth thinking about if we could use Either or something else to make that flow work, and if you think that's better.
kernel/src/log_reader/manifest.rs
Outdated
| /// Phase that processes single-part checkpoint. This also treats the checkpoint as a manifest file | ||
| /// and extracts the sidecar actions during iteration. | ||
| #[allow(unused)] | ||
| pub(crate) struct ManifestPhase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love the name ManifestPhase, because it doesn't tell you what it's a phase of. Could this just be a CheckpointManifestProcessor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went with CheckpointManifestReader
kernel/src/log_reader/manifest.rs
Outdated
| /// - `Sidecars`: Extracted sidecar files | ||
| /// - `Done`: No sidecars found | ||
| #[allow(unused)] | ||
| pub(crate) fn finalize(self) -> DeltaResult<AfterManifest> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise, it's not clear what we're "finalizing" here. I'd maybe prefer to just call this get_extracted_sidecars or something like that, and it can return Option<Vec<FileMeta>>, unless we think this will return something beyond sidecars or "done" in the future.
Regardless let's make very clear in the doc comment that the iterator needs to be exhausted before you can call this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point, simplified to just Vec<FileMeta>. Added docs as well
a6e206c to
563a6d2
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1500 +/- ##
========================================
Coverage 84.77% 84.78%
========================================
Files 126 128 +2
Lines 35679 35897 +218
Branches 35679 35897 +218
========================================
+ Hits 30246 30434 +188
- Misses 3965 3972 +7
- Partials 1468 1491 +23 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
77c7f93 to
a534df5
Compare
kernel/src/log_reader/commit.rs
Outdated
| use std::sync::Arc; | ||
| use url::Url; | ||
|
|
||
| fn load_test_table( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: this was moved to utils, but also the semantics changed a bit. Now it tries treating the test file as zipped. If that fails, it tries reading as if it isn't zipped.
nicklan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, thanks!
| let log_root = log_segment.log_root.clone(); | ||
| assert_eq!(log_segment.checkpoint_parts.len(), 1); | ||
| let checkpoint_file = &log_segment.checkpoint_parts[0]; | ||
| let mut manifest_phase = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: variable name doesn't make sense anymore :)
| let mut manifest_phase = | |
| let mut checkpoint_manifest_reader = |
a534df5 to
011d91c
Compare
🥞 Stacked PR
Use this link to review incremental changes.
What changes are proposed in this pull request?
This PR introduces the CheckpointManifestReader. This is a LogReader that processes manifest files and extracts sidecar paths for later processing. Upon completion, the CheckpointManifestReader returns either:
How was this change tested?
The following cases are checked: