@spilchen
Contributor

@spilchen spilchen commented Nov 6, 2025

Implements CombineFileInfo(), a coordinator utility that aggregates SST metadata from distributed workers and determines merge task spans based on sampled keys.

The function combines SST file metadata from multiple workers and uses their row samples to split schema spans into merge task spans. This will be used by the new distributed merge pipeline.
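As a rough illustration of the idea described above (not the code in this PR), the sketch below splits one span at sampled keys. The `Span` type and the `splitSpanAtSamples` helper are hypothetical names invented for this example; the real `CombineFileInfo()` signature and types in `pkg/sql/bulksst` may differ.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// Span is a hypothetical half-open key range [Start, End).
// It stands in for whatever span type the real code uses.
type Span struct {
	Start, End []byte
}

// splitSpanAtSamples is a hypothetical helper that cuts span into
// contiguous sub-spans, using every sample key that falls strictly
// inside the span as a boundary. Samples outside the span and
// duplicate samples are ignored.
func splitSpanAtSamples(span Span, samples [][]byte) []Span {
	// Sort a copy so the caller's slice is left untouched.
	sorted := make([][]byte, len(samples))
	copy(sorted, samples)
	sort.Slice(sorted, func(i, j int) bool {
		return bytes.Compare(sorted[i], sorted[j]) < 0
	})

	var out []Span
	start := span.Start
	for _, k := range sorted {
		// Skip keys at or before the current start (this also drops
		// duplicates) and keys at or past the end of the span.
		if bytes.Compare(k, start) <= 0 || bytes.Compare(k, span.End) >= 0 {
			continue
		}
		out = append(out, Span{Start: start, End: k})
		start = k
	}
	// The final piece runs from the last boundary to the span's end.
	return append(out, Span{Start: start, End: span.End})
}

func main() {
	span := Span{Start: []byte("a"), End: []byte("z")}
	samples := [][]byte{[]byte("m"), []byte("f"), []byte("m"), []byte("zz")}
	for _, s := range splitSpanAtSamples(span, samples) {
		fmt.Printf("[%s, %s)\n", s.Start, s.End)
	}
}
```

Because the samples come from rows the workers actually wrote, boundaries chosen this way tend to follow the data distribution, which is presumably why they are useful as merge task break points.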

Resolves: #156662
Epic: CRDB-48845

Release note: none

Co-authored by: @jeffswenson

@spilchen spilchen self-assigned this Nov 6, 2025
@spilchen spilchen force-pushed the gh-156662/251105/1157/sst-combine-and-sample/pr-ready branch from f3ed057 to 51fed3f Compare November 6, 2025 17:07
@spilchen spilchen marked this pull request as ready for review November 6, 2025 19:55
@spilchen spilchen requested a review from a team as a code owner November 6, 2025 19:55
@spilchen spilchen requested review from jeffswenson, kev-cao and mw5h and removed request for a team November 6, 2025 19:55
Contributor

@mw5h mw5h left a comment

@mw5h reviewed 1 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson and @kev-cao)


pkg/sql/bulksst/combine_file_info.go line 17 at r1 (raw file):

)

// CombineFileInfo combines SST file metadata and determines merge task spans based on key samples.

Perhaps we should describe what we're producing here and why? There's not really an in-code description of what the samples are, so it's not necessarily clear why we want to use them as break points for our merge spans.

@spilchen spilchen force-pushed the gh-156662/251105/1157/sst-combine-and-sample/pr-ready branch from 51fed3f to 1fc8f79 Compare November 13, 2025 12:22
Contributor Author

@spilchen spilchen left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson, @kev-cao, and @mw5h)


pkg/sql/bulksst/combine_file_info.go line 17 at r1 (raw file):

Previously, mw5h (Matt White) wrote…

Perhaps we should describe what we're producing here and why? There's not really an in-code description of what the samples are, so it's not necessarily clear why we want to use them as break points for our merge spans.

Good idea. I beefed up the function comment.

Collaborator

@jeffswenson jeffswenson left a comment

LGTM

@spilchen spilchen requested a review from mw5h November 17, 2025 18:07
@spilchen
Contributor Author

@mw5h ready for another look

Contributor

@mw5h mw5h left a comment

LGTM!

@mw5h reviewed 2 of 4 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kev-cao)

@spilchen
Contributor Author

TFTRs!

bors r+

@craig
Contributor

craig bot commented Nov 18, 2025

@craig craig bot merged commit a325c69 into cockroachdb:master Nov 18, 2025
24 checks passed

Development

Successfully merging this pull request may close these issues.

sql/bulksst: add SST metadata combination and sampling utilities

4 participants