
Conversation

@alexlarsson
Contributor

Based on ideas from #141

This is an initial version of ostree support. This allows pulling
from local and remote ostree repos, which will create a set of
regular file content objects, as well as a blob containing all the
remaining ostree objects. From the blob we can create an image.

When pulling a commit, a base blob (i.e. "the previous version") can be
specified. Any objects in that base blob will not be downloaded. If a
name is given for the pulled commit, then pre-existing blobs with the
same name will automatically be used as a base blob.
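
A minimal sketch of the base-blob filtering described here (names and types are illustrative, not the actual API): objects already recorded in the base are simply excluded from the download set.

use std::collections::HashSet;

// Illustrative only: given the object set of the wanted commit and the
// object set recorded in the base blob, only fetch the difference.
fn objects_to_fetch<'a>(
    wanted: &'a HashSet<String>,
    base: &HashSet<String>,
) -> Vec<&'a String> {
    wanted.iter().filter(|obj| !base.contains(*obj)).collect()
}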

This is an initial version and there are several things missing:

  • Pull operations are completely serial
  • There is no support for ostree summary files
  • There is no support for ostree delta files
  • There is no caching of local file availability (other than base blob)
  • Local ostree repos only support archive mode

@alexlarsson alexlarsson force-pushed the ostree-support branch 2 times, most recently from e0e827f to 9c5b086 on June 17, 2025 06:54
Collaborator

@allisonkarlitskaya allisonkarlitskaya left a comment


I love this! Thanks for working on it!

I made some comments on the first round of commits. Feel free to adjust those and PR them separately: we can merge those now without further discussion.

The blobs thing is going to need a call.

I didn't review the crate addition in any detail at all. That's probably also going to need a call :)

}

symlinkat(relative, &self.repository, name)
// Atomically replace existing symlink
Collaborator

Thanks for doing this. flatpak-rs currently has a workaround for exactly this reason, which I'd love to rip out: https://github.com/allisonkarlitskaya/flatpak-rs/blob/8e1741d06b18a450536297a334fc347b010639ba/src/install.rs#L19

Can't we get some sort of kernel love here? I've generally tried to avoid this sort of nonsense (by way of O_TMPFILE)...

Contributor Author

Atomic rename over symlinks is the canonical way to do this atm. Don't think there is anything better.

Comment on lines 438 to 439
let mut count = 0;
let tmp_name = loop {
Collaborator

Feels a bit like this could be better handled with a for loop. I like the let tmp_name = loop { } thing, of course (which you can't do with for) but after you break the loop, the only thing you do is a renameat()... so that could probably be brought into the loop body.

You could also make a utility function that was responsible for creating the symlink and used return to escape the for loop, returning the value. Having this inline already feels like "a bit too much" here...
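
A rough sketch of such a helper (names and the rustix-style calls are illustrative, not the actual code): the rename happens inside the for loop, the function returns from within it, and the temporary symlink is cleaned up if the rename fails.

use rustix::fs::{renameat, symlinkat, unlinkat, AtFlags};
use rustix::io::Errno;

// Sketch only: pick a temporary name in a bounded loop, create the symlink,
// rename it over the final name, and clean up the temporary on failure.
fn install_symlink(
    dir_fd: &impl rustix::fd::AsFd,
    target: &str,
    name: &str,
) -> rustix::io::Result<()> {
    for count in 0..16 {
        let tmp_name = format!(".{name}.tmp{count}"); // hypothetical scheme
        match symlinkat(target, dir_fd, &tmp_name) {
            Err(err) if err == Errno::EXIST => continue, // name taken, try the next one
            Err(err) => return Err(err),
            Ok(()) => {
                let renamed = renameat(dir_fd, &tmp_name, dir_fd, name);
                if renamed.is_err() {
                    // Best-effort cleanup of the temporary symlink.
                    let _ = unlinkat(dir_fd, &tmp_name, AtFlags::empty());
                }
                return renamed;
            }
        }
    }
    Err(Errno::EXIST)
}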

}
};

renameat(&self.repository, &tmp_name, &self.repository, name)
Collaborator

No cleanup of the temporary here in case the rename fails...

let file = File::from(if let Some(verity_hash) = verity {
self.open_with_verity(&filename, verity_hash)?
self.open_with_verity(&filename, verity_hash)
.with_context(|| format!("Opening ref 'streams/{name}'"))?
Collaborator

We'll never escape anyhow 🤣

Agreed that having this is better than not, though.

let mut objects = HashSet::new();

let category_fd = self.openat(category, OFlags::RDONLY | OFlags::DIRECTORY)?;
let category_fd = match self.openat(category, OFlags::RDONLY | OFlags::DIRECTORY) {
Collaborator

Maybe consider lifting this from flatpak-rs into our own utils module?

https://github.com/allisonkarlitskaya/flatpak-rs/blob/8e1741d06b18a450536297a334fc347b010639ba/src/sandbox/util.rs#L34

I'm sure we could find a couple more uses around the code for this...
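
For reference, a minimal sketch of what such a shared utility might look like (the function name and exact signature are invented here):

use rustix::fd::{AsFd, OwnedFd};
use rustix::fs::{openat, Mode, OFlags};
use rustix::io::Errno;

// Sketch only: an openat wrapper that turns ENOENT into None, so a missing
// category directory is treated as "no objects" rather than an error.
fn openat_optional(
    dirfd: &impl AsFd,
    path: &str,
    flags: OFlags,
) -> rustix::io::Result<Option<OwnedFd>> {
    match openat(dirfd, path, flags, Mode::empty()) {
        Ok(fd) => Ok(Some(fd)),
        Err(err) if err == Errno::NOENT => Ok(None),
        Err(err) => Err(err),
    }
}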

@@ -1,3 +1,4 @@
pub mod blob;
Collaborator

I'm not keen on this addition. We store OCI config json in splitstreams as well (as a single inline block) specifically because I wanted to avoid creating a separate file format. If there is a missing feature there, I'd rather extend splitstreams a bit rather than adding yet another object type to the repository. We'll need to talk about this...


[features]
default = ['pre-6.15', 'oci']
default = ['pre-6.15', 'oci','ostree']
Collaborator

Missing space. Surprised some sort of linter didn't pick up on that...

pub fn has_named_blob(&self, name: &str) -> bool {
let blob_path = format!("blobs/refs/{}", name);

match readlinkat(&self.repository, &blob_path, []) {
Collaborator

This goes in a previous commit, obviously...

@alexlarsson
Contributor Author

Hmmm, thinking more about this. We probably want a "content type" magic thing in the splitstream header as well, so we can error out if the wrapped thing is of the wrong type.
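
A sketch of the idea (constants, field names, and values are all invented for illustration): a content-type value in the header lets readers refuse a splitstream that wraps the wrong kind of data.

// Illustrative only; not the actual header layout.
const SPLITSTREAM_MAGIC: [u8; 4] = *b"SplS";              // hypothetical
const CONTENT_TYPE_OSTREE_COMMIT: u64 = 0x4f53_5452_0001; // hypothetical

struct SplitstreamHeader {
    magic: [u8; 4],
    content_type: u64,
    // ... remaining header fields ...
}

fn check_header(header: &SplitstreamHeader, expected_type: u64) -> anyhow::Result<()> {
    anyhow::ensure!(header.magic == SPLITSTREAM_MAGIC, "not a splitstream");
    anyhow::ensure!(
        header.content_type == expected_type,
        "splitstream wraps content type {:#x}, expected {:#x}",
        header.content_type,
        expected_type
    );
    Ok(())
}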

@alexlarsson alexlarsson force-pushed the ostree-support branch 2 times, most recently from 2ed83a2 to c041afe on June 19, 2025 09:11
@alexlarsson
Contributor Author

Ok, reworked this to use splitstreams for object maps and commits. By using an object mapping to find the object map, the content of the commit's splitstream is just the commit data, so the sha256 of that splitstream matches the ostree commit id.
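
A small sketch of that invariant (assuming the sha2 and hex crates): hashing the raw commit object reproduces the ostree commit id.

use sha2::{Digest, Sha256};

// Because the commit splitstream's content is exactly the raw commit
// object, its sha256 is the ostree commit checksum.
fn matches_commit_id(commit_object: &[u8], commit_id: &str) -> bool {
    hex::encode(Sha256::digest(commit_object)) == commit_id
}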

@alexlarsson
Contributor Author

@allisonkarlitskaya There is still lots to do here. But have a look at this approach and see what you think.

@alexlarsson
Contributor Author

Added some further changes. We now validate all objects when pulling and all non-file objects when creating images. It's hard to efficiently validate file objects during create-image though, since we would like to avoid re-reading the external object files to compute the sha256.

Remaining things to do:

  • Stream larger objects into repo
  • Support summaries and summary branches for remote repos
  • Support deltas when remote pulling
  • Parallelize downloads of objects
  • Report pull progress in some sane way
  • Use some kind of local cache for available objects other than just those from "previous version"
  • Handle GPG validation of commit objects

@alexlarsson alexlarsson force-pushed the ostree-support branch 4 times, most recently from 481e604 to e88573d on June 30, 2025 14:26
@alexlarsson
Contributor Author

I started working on the delta support, but it failed because of an issue in gvariant-rs.

Collaborator

@allisonkarlitskaya allisonkarlitskaya left a comment

It occurs to me that it might be interesting not to sort the table of fs-verity references, and it might also be interesting to permit duplicate items.

On the topic of deferring writing of objects to a background thread, this would allow us to write "external object #123" based on a sequential index to the splitstream without actually knowing the hash value yet, and then fill in the actual values in the header at the end when we're writing: it helps there that the fs-verity references aren't compressed and therefore not part of the stream...
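
A rough sketch of how that could look (all names invented, not the actual splitstream writer API): the stream body records a sequential index per external object, and the digests are patched into the uncompressed reference table once the background writes have finished.

// Illustrative only.
struct DeferredRefs {
    digests: Vec<Option<[u8; 32]>>, // one slot per "external object #N"
}

impl DeferredRefs {
    // Reserve a slot and return the index to record in the stream body.
    fn add_external(&mut self) -> usize {
        self.digests.push(None);
        self.digests.len() - 1
    }

    // Called once an object's fs-verity digest is known; the header table
    // is written out only after every slot has been filled in.
    fn resolve(&mut self, index: usize, digest: [u8; 32]) {
        self.digests[index] = Some(digest);
    }
}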

@cgwalters
Collaborator

It seems like we should get at least the splitstream changes in 0f6d69e in sooner rather than later? Can you file a separate PR?

alexlarsson added a commit to alexlarsson/composefs-rs that referenced this pull request Sep 29, 2025
This changes the splitstream format a bit, with the goal of allowing
splitstreams to support ostree files as well (see containers#144)

The primary differences are:

 * The header is not compressed
 * All referenced fs-verity objects are stored in the header, including
   external chunks, mapped splitstreams and (a new feature) references
   that are not used in chunks.
 * The mapping table is separate from the reference table (and generally
   smaller), and indexes into it.
 * There is a magic value to detect the file format.
 * There is a magic content type to detect the type wrapped in the stream.
 * We store a tag for what ObjectID format is used
 * The total size of the stream is stored in the header.

The ability to reference file objects in the repo even if they are not
part of the splitstream "content" will be useful for the ostree
support to reference file content objects.

This change also allows more efficient GC enumeration, because we
don't have to parse the entire splitstream to find the referenced
objects.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
@alexlarsson alexlarsson force-pushed the ostree-support branch 2 times, most recently from c788da2 to 2ee193a on October 6, 2025 14:58
We need some new features in systemd-repart and mkfs.ext4.  We were
pulling those from feature branches and commit IDs in the past, but
these features are now available in stable releases.  Build those
release versions instead.

Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
This patch is machine-generated.

Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
allisonkarlitskaya and others added 10 commits November 20, 2025 13:50
This is only used from tests and not exported, so conditionalize it.

Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
This wasn't yet stabilized when the code was first written but newer
patches have already added the use of this function in other parts of
the same file, so this ought to be safe by now.

Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
This comment is an overview, so move it to a higher level.  Change it a
bit to make it more accurate.

Also: move the declaration of the buffer outside of the loop body to
avoid having to re-zero it each time.

Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
We're going to start referring to OCI images by their names starting
with `sha256:` (as podman names them in the `--iidfile`) soon.  Skopeo
doesn't like that, so add a workaround.

This will soon let us get rid of some of the hacking about we do in our
`examples/` build scripts.

Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
Let's just have users write the padding as a separate inline section
after they write the external data.  This makes things a lot easier and
reduces thrashing of the internal buffer.

Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
This is a substantial change to the splitstream file format to add more
features (required for ostree support) and to add forwards- and
backwards- compatibility mechanisms for future changes.  This change
aims to finalize the file format so we can start shipping this to the
systems of real users without future "breaks everything" changes.

This change itself breaks everything: you'll need to delete your
repository and start over.  Hopefully this is the last time.

The file format is substantially more fleshed-out at this point.  Here's
an overview of the changes:

  - there is a header with a magic value, a version and flags field, and the
    fs-verity algorithm number and block size in use

  - everything else in the file can be freely located which will help if
    we ever want to create a version of the writer that streams data to
    disk as it goes: in that case we may want to store the stream before
    the associated metadata

  - there is an expandable "info" section which contains most other
    information about the stream and is intended to be used as the primary
    mechanism for making compatible changes to the file format in the
    future

  - the info section stores the total decompressed/reassembled stream
    size and a unique identifier value for the file type stored in the
    stream

  - the referenced external objects and splitstreams are now stored in a
    flat array of binary fs-verity hash values to improve the performance
    of garbage collection operations in large repositories (informed by
    Alex's battlescars from dealing with GC on Flathub)

  - it is possible to add arbitrary external object and stream references

  - the "sha256 mapping" has been replaced with a more flexible "named
    stream refs" mechanism that allows assigning arbitrary names to
    associated streams.  This will be useful if we ever want to support
    formats that are based on anything other than SHA-256 (including
    future OCI versions which may start using SHA-512 or something else).

  - whereas the previous implementation concerned itself with ensuring
    the correct SHA-256 content hash of the stream and creating a link to
    the stream with that hash value from the `streams/` directory, the new
    implementation requires that the user perform whatever hashing they
    consider appropriate and name their streams with a "content
    identifier".

    This change, taken together with the above change, removes all SHA-256
    specific logic from the implementation.

    The main reason for this change is that a SHA-256 content hash over a
    file isn't a sufficiently unique identifier to locate the relevant
    splitstream for that file.  Each different file type is split into a
    splitstream in a different way.  It just so happens that OCI JSON
    documents, `.tar` files, and GVariant OSTree commit objects have no
    possible overlaps (which means that SHA-256 content hashes have
    uniquely identified the files up to this point), but this is mostly a
    coincidence.  Each file type is now responsible for naming its streams
    with a sufficiently unique "content identifier" based on the component
    name, the file name, and a content hash, for example:

      - `oci-commit-sha256:...`
      - `oci-layer-sha256:...`
      - `ostree-commit-...`
      - &c.

    Having the repository itself no longer care about the content hashes
    means that the OCI code can now trust the SHA-256 verification
    performed by skopeo, and we don't need to recompute it, which is a
    nice win.

Update the file format documentation.

Update the repository code and the users of splitstream (ie: OCI) to
adjust to the post-sha256-hardcoded future.

Adjust the way we deal with verification of OCI objects when we lack
fs-verity digests: instead of having an "open" operation which verifies
everything and a "shallow open" which doesn't, just have the open
operation verify only the config and move the verification of the layers
to when we access them.

Co-authored-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
The `fdatasync()` per written object thing is an unmitigated performance
disaster and we need to get rid of it.  It forces the filesystem to
create thousands upon thousands of tiny commits.

I tried another approach where we would reopen and `fsync()` each object
file referred to from a splitstream *after* all of the files were
written, but before we wrote the splitstream data, but it was also quite
bad.

syncfs() is a really really dumb hammer, and it could get us into
trouble if other users on the system have massive outstanding amounts of
IO.  On the other hand, it works: it's almost instantaneous, and is a
massive performance improvement over what we were doing before.  Let's
just go with that for now.

Maybe some day filesystems will have a mechanism for this which isn't
horrible.

Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
Signed-off-by: Alexander Larsson <alexl@redhat.com>
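
A minimal sketch of that approach (assuming rustix's syncfs wrapper and a directory fd for the repository):

use rustix::fd::AsFd;
use rustix::fs::syncfs;

// After a batch of object files has been written, issue one syncfs() on the
// repository filesystem instead of an fdatasync() per object.
fn flush_batch(repo_fd: &impl AsFd) -> rustix::io::Result<()> {
    syncfs(repo_fd)
}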
This lets you look up a ref digest from the splitstream by index
and is needed by the ostree code.
Based on ideas from containers#141

This is an initial version of ostree support. This allows pulling from
local and remote ostree repos, which will create a set of regular file
content objects, as well as a commit splitstream containing all the
remaining ostree objects and file data. From the splitstream we can
create an image.

When pulling a commit, a base commit (i.e. "the previous version") can be
specified. Any objects in that base commit will not be downloaded. If a
name is given for the pulled commit, then pre-existing blobs with the
same name will automatically be used as a base commit.

This is an initial version and there are several things missing:
 * Pull operations are completely serial
 * There is no support for ostree summary files
 * There is no support for ostree delta files
 * There is no caching of local file availability (other than base commit)
 * Local ostree repos only support archive mode
 * There is no GPG validation on ostree pull

Signed-off-by: Alexander Larsson <alexl@redhat.com>