doc: Add a explanation of Git's data model #1981
base: master
|
There are issues in commit 31993be: |
|
There are issues in commit c3ff12a: |
|
There are issues in commit bfcc916: |
f7eadcf to fcbd21b
|
/submit |
|
Submitted as pull.1981.git.1759512876284.gitgitgadget@gmail.com |
|
On the Git mailing list, "Kristoffer Haugsbakk" wrote (reply to this): On Fri, Oct 3, 2025, at 19:34, Julia Evans via GitGitGadget wrote:
> From: Julia Evans <julia@jvns.ca>
>
> Git very often uses the terms "object", "reference", or "index" in its
> documentation.
>
> However, it's hard to find a clear explanation of these terms and how
> they relate to each other in the documentation. The closest candidates
> currently are:
>
> 1. `gitglossary`. This makes a good effort, but it's an alphabetically
> ordered dictionary and a dictionary is not a good way to learn
> concepts. You have to jump around too much and it's not possible to
> present the concepts in the order that they should be explained.
> 2. `gitcore-tutorial`. This explains how to use the "core" Git commands.
> This is a nice document to have, but it's not necessary to learn how
> `update-index` works to understand Git's data model, and we should
> not be requiring users to learn how to use the "plumbing" commands
> if they want to learn what the term "index" or "object" means.
> 3. `gitrepository-layout`. This is a great resource, but it includes a
> lot of information about configuration and internal implementation
> details which are not related to the data model. It also does
> not explain how commits work.
>
> The result of this is that Git users (even users who have been using
> Git for 15+ years) struggle to read the documentation because they don't
> know what the core terms mean, and it's not possible to add links
> to help them learn more.
>
> Add an explanation of Git's data model. Some choices I've made in
> deciding what "core data model" means:
>
> 1. Omit pseudorefs like `FETCH_HEAD`, because it's not clear to me
> if those are intended to be user facing or if they're more like
> internal implementation details.
> 2. Don't talk about submodules other than by mentioning how they
> relate to trees. This is because Git has a lot of special features,
> and explaining how they all work exhaustively could quickly go
> down a rabbit hole which would make this document less useful for
> understanding Git's core behaviour.
> 3. Don't discuss the structure of a commit message
> (first line, trailers, GPG signatures, etc).
> Perhaps this should change.
>
> Some other choices I've made:
>
> 1. Mention packed refs only in a note.
I don’t think it’s worth mentioning this at all. More on that later.
> 2. Don't mention that the full name of the branch `main` is
> technically `refs/heads/main`. This should likely change but I
> haven't worked out how to do it in a clear way yet.
I think this is worth getting into. This is a pretty
user-facing concept.
> 3. Mostly avoid referring to the `.git` directory, because the exact
> details of how things are stored change over time.
> This should perhaps change from "mostly" to "entirely"
> but I haven't worked out how to do that in a clear way yet.
I think that’s good. I mean, I think us users don’t need that level of
detail and shouldn’t be “inspired” to muck with the internals. If that
makes sense. (See later)
>
> Signed-off-by: Julia Evans <julia@jvns.ca>
> ---
> doc: Add a explanation of Git's data model
>[snip]
> diff --git a/Documentation/Makefile b/Documentation/Makefile
>[snip]
> diff --git a/Documentation/gitdatamodel.adoc
> b/Documentation/gitdatamodel.adoc
> new file mode 100644
> index 0000000000..4b2cb167dc
> --- /dev/null
> +++ b/Documentation/gitdatamodel.adoc
> @@ -0,0 +1,226 @@
> +gitdatamodel(7)
> +===============
> +
> +NAME
> +----
> +gitdatamodel - Git's core data model
> +
> +DESCRIPTION
> +-----------
> +
> +It's not necessary to understand Git's data model to use Git, but it's
> +very helpful when reading Git's documentation so that you know what it
> +means when the documentation says "object" "reference" or "index".
I haven’t gone hunting through the docs to see if this is covered
elsewhere. But the thrust of all the things here definitely feels to me
like something that should be presented and documented in such a way.
> +
> +Git's core operations use 4 kinds of data:
Maybe small numerals should be spelled as words in running text?
> +
> +1. <<objects,Objects>>: commits, trees, blobs, and tag objects
> +2. <<references,References>>: branches, tags,
> + remote-tracking branches, etc
> +3. <<index,The index>>, also known as the staging area
> +4. <<reflogs,Reflogs>>
Reflogs is certainly auxiliary ref data. What makes it qualify as
one-of-the-four? I am open to it being both, to be clear.
> +
> +[[objects]]
> +OBJECTS
> +-------
> +
> +Commits, trees, blobs, and tag objects are all stored in Git's object
> database.
> +Every object has:
> +
> +1. an *ID*, which is the SHA-1 hash of its contents.
> + It's fast to look up a Git object using its ID.
> + The ID is usually represented in hexadecimal, like
> + `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`.
> +2. a *type*. There are 4 types of objects:
> + <<commit,commits>>, <<tree,trees>>, <<blob,blobs>>,
> + and <<tag-object,tag objects>>.
> +3. *contents*. The structure of the contents depends on the type.
> +
> +Once an object is created, it can never be changed.
> +Here are the 4 types of objects:
As a curious Git user this seems correct.
> +
> +[[commit]]
> +commits::
> + A commit contains:
> ++
> +1. Its *parent commit ID(s)*. The first commit in a repository has 0
> parents,
Maybe this is a subjective style thing but is it necessary to use “(s)”
when the context makes clear that it could be zero to many?
Its *parent commit IDs*. ...
> + regular commits have 1 parent, merge commits have 2+ parents
s/2+/two or more/ ?
Same point as the “numeral” one above.
> +2. A *commit message*
> +3. All the *files* in the commit, stored as a *<<tree,tree>>*
> +4. An *author* and the time the commit was authored
> +5. A *committer* and the time the commit was committed
> ++
> +Here's how an example commit is stored:
> ++
> +----
> +tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a
> +parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647
> +author Maya <maya@example.com> 1759173425 -0400
> +committer Maya <maya@example.com> 1759173425 -0400
> +
> +Add README
> +----
> ++
> +Like all other objects, commits can never be changed after they're
> created.
> +For example, "amending" a commit with `git commit --amend` creates a
> new commit.
> +The old commit will eventually be deleted by `git gc`.
Maybe this could be moved to a part about what happens (eventually) to
unreachable objects?
Mentioning `git gc` and how things will get deleted raises
questions naturally. Like why would they be deleted? Okay
that’s clear: the previous commit will be replaced by the
amended one. Then when it is not reachable by anything
(even the reflog) it will get garbage collected.
It all follows. But is the reader necessarily mature enough
in their understanding to make the inference?
This is a long-winded way of saying: if you’re gonna discuss
`git gc` you might need to go into all of these concepts.
> +
> +[[tree]]
> +trees::
> + A tree is how Git represents a directory. It lists, for each item
> in
> + the tree:
> ++
> +1. The *permissions*, for example `100644`
> +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
> + or <<commit,`commit`>> (a Git submodule)
> +3. The *object ID*
> +4. The *filename*
> ++
> +For example, this is how a tree containing one directory (`src`) and
> one file
> +(`README.md`) is stored:
> ++
> +----
> +100644 blob 8728a858d9d21a8c78488c8b4e70e531b659141f README.md
> +040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src
> +----
> ++
> +*NOTE:* The permissions are in the same format as UNIX permissions, but
> +the only allowed permissions for files (blobs) are 644 and 755.
> +
Makes sense.
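To make the tree listing above concrete, here is a minimal Python sketch (not part of the patch) that parses lines in that four-column shape. The `TreeEntry` type, the `MODE_KINDS` table, and the `parse_tree_listing` helper are names invented for illustration; the mode values listed are the ones Git is known to write into trees.

```python
from typing import NamedTuple


class TreeEntry(NamedTuple):
    mode: str
    type: str
    object_id: str
    name: str


# What each mode means for a tree entry (illustrative lookup table).
MODE_KINDS = {
    "100644": "regular file",
    "100755": "executable file",
    "120000": "symbolic link",
    "040000": "directory",
    "160000": "submodule",
}

LISTING = """\
100644 blob 8728a858d9d21a8c78488c8b4e70e531b659141f README.md
040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src
"""


def parse_tree_listing(text: str) -> list[TreeEntry]:
    """Split each line into mode, type, object ID, and name."""
    entries = []
    for line in text.splitlines():
        # maxsplit=3 keeps names that contain spaces in one piece.
        mode, obj_type, object_id, name = line.split(None, 3)
        entries.append(TreeEntry(mode, obj_type, object_id, name))
    return entries


for entry in parse_tree_listing(LISTING):
    print(entry.name, "is a", MODE_KINDS[entry.mode])
```

The sketch only reads the human-readable listing; it does not attempt to decode the binary form in which Git actually stores trees.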
> +[[blob]]
> +blobs::
> + A blob is how Git represents a file. A blob object contains the
> + file's contents.
> ++
> +Storing a new blob for every new version of a file can get big, so
> +`git gc` periodically compresses objects for efficiency in
> `.git/objects/pack`.
This gets into mentioning implementation files(?) like you mentioned in
the commit message.
1. That it’s a packfile and where it is might be too much detail for
this doc
2. I vaguely recall documents discussing what happens to “storing every
version” discussing deltas instead of packs? Again, I am not a Git
developer though.
> +
> +[[tag-object]]
> +tag objects::
> + Tag objects (also known as "annotated tags") contain:
> ++
> +1. The *tagger* and tag date
> +2. A *tag message*, similar to a commit message
> +3. The *ID* of the object (often a commit) that they reference
s/often/typically/ ?
I know it can get tedious to caveat the 99% cases with things that are
technically possible. Maybe if it gets “bad enough” there could be a
part that explains/distinguishes the high-level/porcelain Git use and
what is technically possible: you make a `git tag -a`, which is on a
commit... except if you accidentally run it on top of an existing
tag. Then even the porcelain won’t protect you from making a
tag-on-tag. (But it will issue a warning I guess.) Hmm. Now I don’t know.
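The tag-on-tag case can be illustrated with a small Python sketch. It assumes a toy in-memory object table with made-up short IDs (`aaaa`, `bbbb`, `cccc`) and a hypothetical `peel` helper; it is not how Git stores or reads objects, only a model of following a tag object to whatever it ultimately references.

```python
# Toy object table: "aaaa" is a tag that points at another tag ("bbbb"),
# which in turn points at a commit ("cccc"). All IDs are made up.
OBJECTS = {
    "aaaa": {"type": "tag", "object": "bbbb"},
    "bbbb": {"type": "tag", "object": "cccc"},
    "cccc": {"type": "commit"},
}


def peel(objects: dict, object_id: str) -> str:
    """Follow tag objects until reaching something that is not a tag."""
    while objects[object_id]["type"] == "tag":
        object_id = objects[object_id]["object"]
    return object_id


print(peel(OBJECTS, "aaaa"))
```

In the common case the loop runs once, because an annotated tag usually references a commit directly.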
> +
> +[[references]]
> +REFERENCES
> +----------
> +
> +References are a way to give a name to a commit.
> +It's easier to remember "the changes I'm working on are on the `turtle`
> +branch" than "the changes are in commit bb69721404348e".
> +Git often uses "ref" as shorthand for "reference".
Good.
> +
> +References that you create are stored in the `.git/refs` directory,
> +and Git has a few special internal references like `HEAD` that are
> stored
> +in the base `.git` directory.
Implementation file details.
You also mention `.git/refs/heads/<name>` below. But refs aren’t stored
as files if you are using the *reftable* backend. And that backend will
become the default for new repositories in Git 3.0, I think.
How does reftable work? I don’t know. But I don’t think we need to
know after reading this doc. :)
To be clear: how files are stored might not matter here.
> +
> +References can either be:
> +
> +1. References to an object ID, usually a <<commit,commit>> ID
> +2. References to another reference. This is called a "symbolic
> reference".
You seem to have used `**` when introducing terms:
This is a *symbolic reference*
>[snip ref stuff]
> +
> +[[HEAD]]
> +HEAD: `.git/HEAD`::
> + `HEAD` is where Git stores your current <<branch,branch>>.
> + `HEAD` is normally a symbolic reference to your current branch, for
> + example `ref: refs/heads/main` if your current branch is `main`.
> + `HEAD` can also be a direct reference to a commit ID,
> + that's called "detached HEAD state".
> +
> +[[remote-tracking-branch]]
> +remote tracking branches: `.git/refs/remotes/<remote>/<branch>`::
> + A remote-tracking branch is a name for a commit ID.
> + It's how Git stores the last-known state of a branch in a remote
> + repository. `git fetch` updates remote-tracking branches. When
> + `git status` says "you're up to date with origin/main", it's looking at
> + this.
Looks good.
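As a rough model of the HEAD description above, the following Python sketch treats references as a plain name-to-value mapping, independent of how any backend stores them. The `REFS` table, the sample commit ID, and the `resolve` and `is_detached` helpers are assumptions made for illustration only.

```python
SAMPLE_COMMIT = "4ccb6d7b8869a86aae2e84c56523f8705b50c647"

# A value is either an object ID or "ref: <name>" for a symbolic reference.
REFS = {
    "HEAD": "ref: refs/heads/main",
    "refs/heads/main": SAMPLE_COMMIT,
    "refs/remotes/origin/main": SAMPLE_COMMIT,
}


def resolve(refs: dict[str, str], name: str, max_depth: int = 5) -> str:
    """Follow symbolic references until an object ID is reached."""
    for _ in range(max_depth):
        value = refs[name]
        if not value.startswith("ref: "):
            return value
        name = value[len("ref: "):]
    raise ValueError("too many levels of symbolic references")


def is_detached(refs: dict[str, str]) -> bool:
    """HEAD is 'detached' when it holds an object ID directly."""
    return not refs["HEAD"].startswith("ref: ")


print(resolve(REFS, "HEAD"))
print(is_detached(REFS))
```

Replacing the `HEAD` value with an object ID models the detached HEAD state: `resolve` then returns that ID without consulting any branch.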
> +
> +[[other-refs]]
> +Other references::
> + Git tools may create references in any subdirectory of `.git/refs`.
> + For example, linkgit:git-stash[1], linkgit:git-bisect[1],
> + and linkgit:git-notes[1] all create their own references
> + in `.git/refs/stash`, `.git/refs/bisect`, etc.
> + Third-party Git tools may also create their own references.
> ++
> +Git may also create references in the base `.git` directory
> +other than `HEAD`, like `ORIG_HEAD`.
> +
> +*NOTE:* As an optimization, references may be stored as packed
> +refs instead of in `.git/refs`. See linkgit:git-pack-refs[1].
I don’t know if this is relevant for both ref backends. And does it
matter?
> +
> +[[index]]
> +THE INDEX
> +---------
> +
> +The index, also known as the "staging area", contains the current
> staged
> +version of every file in your Git repository. When you commit, the
> files
> +in the index are used as the files in the next commit.
> +
> +Unlike a tree, the index is a flat list of files.
> +Each index entry has 4 fields:
> +
> +1. The *permissions*
> +2. The *<<blob,blob>> ID* of the file
> +3. The *filename*
> +4. The *number*. This is normally 0, but if there's a merge conflict
> + there can be multiple versions (with numbers 0, 1, 2, ..)
> + of the same filename in the index.
> +
> +It's extremely uncommon to look at the index directly: normally you'd
> +run `git status` to see a list of changes between the index and
> <<HEAD,HEAD>>.
> +But you can use `git ls-files --stage` to see the index.
> +Here's the output of `git ls-files --stage` in a repository with 2
> files:
> +
> +----
> +100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md
> +100644 665c637a360874ce43bf74018768a96d2d4d219a 0 src/hello.py
> +----
> +
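The "flat list versus tree" distinction can be sketched in Python. This is an illustration only: it parses the sample `git ls-files --stage` lines quoted above and regroups the flat paths into nested dictionaries, which is roughly the shape a set of tree objects would have. The helper names `parse_index` and `nest` are hypothetical, and conflict entries (non-zero numbers) are not handled.

```python
INDEX_LISTING = """\
100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md
100644 665c637a360874ce43bf74018768a96d2d4d219a 0 src/hello.py
"""


def parse_index(text: str) -> list[dict]:
    """Parse each line into its four fields: mode, blob ID, number, path."""
    entries = []
    for line in text.splitlines():
        mode, blob_id, stage, path = line.split(None, 3)
        entries.append(
            {"mode": mode, "blob": blob_id, "stage": int(stage), "path": path}
        )
    return entries


def nest(entries: list[dict]) -> dict:
    """Regroup flat paths into nested dicts, one dict per directory."""
    root: dict = {}
    for entry in entries:
        *dirs, filename = entry["path"].split("/")
        node = root
        for directory in dirs:
            node = node.setdefault(directory, {})
        node[filename] = entry["blob"]
    return root


print(nest(parse_index(INDEX_LISTING)))
```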
> +[[reflogs]]
> +REFLOGS
> +-------
> +
> +Git stores the history of branch, tag, and HEAD refs in a reflog
> +(you should read "reflog" as "ref log"). Not every ref is logged by
You’ve heard of the re-flog too?
> +default, but any ref can be logged.
> +
> +Each reflog entry has:
> +
> +1. *Before/after *commit IDs*
> +2. *User* who made the change, for example `Maya <maya@example.com>`
> +3. *Timestamp*
> +4. *Log message*, for example `pull: Fast-forward`
> +
> +Reflogs only log changes made in your local repository.
> +They are not shared with remotes.
Makes sense.
> +
> +GIT
> +---
> +Part of the linkgit:git[1] suite
I appreciate that this is the first version and you might have plans
after this one. But I wonder if this doc could use a fair number of
`linkgit` to branch out to all the other parts. Like git-reflog(1),
gitglossary(7).
Thanks for starting on a whole new doc. That must take quite
some effort. |
|
On the Git mailing list, Junio C Hamano wrote (reply to this): "Julia Evans via GitGitGadget" <gitgitgadget@gmail.com> writes:
> +MAN7_TXT += gitdatamodel.adoc
> MAN7_TXT += gitdiffcore.adoc
> ...
> +gitdatamodel(7)
> +===============
> +
> +NAME
> +----
> +gitdatamodel - Git's core data model
> +
> +DESCRIPTION
> +-----------
The above causes doc-lint to barf.
https://github.com/git/git/actions/runs/18265502271/job/51999236907#step:4:655
gitdatamodel.adoc:226: has no required 'SYNOPSIS' section!
LINT MAN SEC giteveryday.adoc
make[1]: *** [Makefile:498: .build/lint-docs/man-section-order/gitdatamodel.ok] Error 1
You can check locally with "make check-docs" without waiting for my
integration cycle to push to GitHub CI.
Thanks. |
|
This patch series was integrated into seen via git@56f8416. |
|
On the Git mailing list, "Julia Evans" wrote (reply to this): > The above causes doc-lint to barf.
>
> https://github.com/git/git/actions/runs/18265502271/job/51999236907#step:4:655
>
> gitdatamodel.adoc:226: has no required 'SYNOPSIS' section!
> LINT MAN SEC giteveryday.adoc
> make[1]: *** [Makefile:498:
> .build/lint-docs/man-section-order/gitdatamodel.ok] Error 1
>
>
> You can check locally with "make check-docs" without waiting for my
> integration cycle to push to GitHub CI.
Thanks, will fix. |
|
On the Git mailing list, "Julia Evans" wrote (reply to this): Thanks for the review!
>> 2. Don't mention that the full name of the branch `main` is
>> technically `refs/heads/main`. This should likely change but I
>> haven't worked out how to do it in a clear way yet.
>
> I think this is worth getting into. This is a pretty
> user-facing concept.
I think I'll see if I can figure out a way to mention this and at the
same time remove most of the rest of the references to the `.git`
directory when explaining references (which you talked about
further down), including packed refs.
>> +
>> +1. <<objects,Objects>>: commits, trees, blobs, and tag objects
>> +2. <<references,References>>: branches, tags,
>> + remote-tracking branches, etc
>> +3. <<index,The index>>, also known as the staging area
>> +4. <<reflogs,Reflogs>>
>
> Reflogs is certainly auxiliary ref data. What makes it qualify as
> one-of-the-four? I am open to it being both, to be clear.
The reason I like to talk about reflogs is that it gives you a
way to "undo" Git operations that can be really useful.
And any Git command that updates refs can update that
ref's reflog.
Understanding how reflogs work helps to understand what the
limitations of using reflogs to undo mistakes are: for example
the index is not a ref, so you can't use the reflog to undo
changes to the index.
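The point about reflogs and undo can be modelled with a short Python sketch. It is a toy, not Git's implementation: the `Repo` class, its `update_ref` and `previous_value` methods, and the shortened IDs are all invented here. The only behaviour it tries to capture is that each ref update records the before and after values, so the previous value of a ref can be looked up, while anything that is not a ref (such as the index) leaves no such trail.

```python
from dataclasses import dataclass, field


@dataclass
class ReflogEntry:
    old_id: str
    new_id: str
    user: str
    timestamp: int
    message: str


@dataclass
class Repo:
    refs: dict = field(default_factory=dict)
    reflogs: dict = field(default_factory=dict)

    def update_ref(self, name, new_id, user, timestamp, message):
        """Change a ref and append a before/after record to its reflog."""
        old_id = self.refs.get(name, "0" * 40)  # all zeros: ref did not exist
        self.refs[name] = new_id
        entry = ReflogEntry(old_id, new_id, user, timestamp, message)
        self.reflogs.setdefault(name, []).append(entry)

    def previous_value(self, name):
        """Return what the ref pointed at before its latest update."""
        return self.reflogs[name][-1].old_id


repo = Repo()
maya = "Maya <maya@example.com>"
repo.update_ref("refs/heads/main", "aaaa", maya, 1759173425, "commit: Add README")
repo.update_ref("refs/heads/main", "bbbb", maya, 1759173500, "pull: Fast-forward")
print(repo.previous_value("refs/heads/main"))
```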
>> +2. A *commit message*
>> +3. All the *files* in the commit, stored as a *<<tree,tree>>*
>> +4. An *author* and the time the commit was authored
>> +5. A *committer* and the time the commit was committed
>> ++
>> +Here's how an example commit is stored:
>> ++
>> +----
>> +tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a
>> +parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647
>> +author Maya <maya@example.com> 1759173425 -0400
>> +committer Maya <maya@example.com> 1759173425 -0400
>> +
>> +Add README
>> +----
>> ++
>> +Like all other objects, commits can never be changed after they're
>> created.
>> +For example, "amending" a commit with `git commit --amend` creates a
>> new commit.
>
>> +The old commit will eventually be deleted by `git gc`.
>
> Maybe this could be moved to a part about what happens (eventually) to
> unreachable objects?
>
> Mentioning `git gc` and how things will get deleted raises
> questions naturally. Like why would they be deleted? Okay
> that’s clear: the previous commit will be replaced by the
> amended one. Then when it is not reachable by anything
> (even the reflog) it will get garbage collected.
>
> It all follows. But is the reader necessarily mature enough
> in their understanding to make the inference?
>
> This is a long-winded way of saying: if you’re gonna discuss
> `git gc` you might need to go into all of these concepts.
If folks here think this is a reasonable document to add to
Git I'll try to get some beta readers to read this, see which parts
folks find confusing, and address those, keeping the `git gc`
stuff in mind.
Similarly for the style comments.
>> +blobs::
>> + A blob is how Git represents a file. A blob object contains the
>> + file's contents.
>> ++
>> +Storing a new blob for every new version of a file can get big, so
>> +`git gc` periodically compresses objects for efficiency in
>> `.git/objects/pack`.
>
> This gets into mentioning implementation files(?) like you mentioned in
> the commit message.
That's true! The reason I think this is important to mention is that I find
that people often "reject" information that they find implausible, even
if it comes from a credible source. ("that can't be true! I must be
not understanding correctly. Oh well, I'll just ignore that!")
I sometimes hear from users that "commits can't be snapshots", because
it would take up too much disk space to store every version of
every commit. So I find that sometimes explaining a little bit about the
implementation can make the information more memorable.
Certainly I'm not able to remember details that don't make sense
with my mental model of how computers work and I don't expect other
people to either, so I think it's important to give an explanation that
handles the biggest "objections".
> 1. That it’s a packfile and where it is might be too much detail for
> this doc
> 2. I vaguely recall documents discussing what happens to “storing every
> version” discussing deltas instead of packs? Again, I am not a Git
> developer though.
I could be wrong about the details here, I'm not a Git developer either.
From https://git-scm.com/book/en/v2/Git-Internals-Packfiles
it looks like packfiles are implemented using deltas.
>> +
>> +References can either be:
>> +
>> +1. References to an object ID, usually a <<commit,commit>> ID
>> +2. References to another reference. This is called a "symbolic
>> reference".
>
> You seem to have used `**` when introducing terms:
>
> This is a *symbolic reference*
Thanks, will take a look at that.
>> +[[reflogs]]
>> +REFLOGS
>> +-------
>> +
>> +Git stores the history of branch, tag, and HEAD refs in a reflog
>> +(you should read "reflog" as "ref log"). Not every ref is logged by
>
> You’ve heard of the re-flog too?
haha exactly, I just want folks to understand why it's called that :)
> I appreciate that this is the first version and you might have plans
> after this one. But I wonder if this doc could use a fair number of
> `linkgit` to branch out to all the other parts. Like git-reflog(1),
> gitglossary(7).
That's reasonable. Do you often use the "See also" section of
man pages? I've never looked at them so I'm always curious about
how people are actually using them in practice.
I also need to think about what else could link *to* this, because
without attention to discoverability probably nobody will find it.
My main idea so far is actually to add it to
https://git-scm.com/learn
but I wanted to send it here instead of adding it to the website
directly because I thought it could benefit from a more detailed
review.
> Thanks for starting on a whole new doc. That must take quite
> some effort.
All the work on documentation takes a lot of effort, in some
ways it's easier to write something new than to edit something
existing :) |
|
On the Git mailing list, "D. Ben Knoble" wrote (reply to this): On Mon, Oct 6, 2025 at 3:37 PM Julia Evans <julia@jvns.ca> wrote:
>
> Thanks for the review!
>
> >> 2. Don't mention that the full name of the branch `main` is
> >> technically `refs/heads/main`. This should likely change but I
> >> haven't worked out how to do it in a clear way yet.
> >
> > I think this is worth getting into. This is a pretty
> > user-facing concept.
>
> I think I'll see if I can figure out a way to mention this and at the
> same time remove most of the rest of the references to the `.git`
> directory when explaining references (which you talked about
> further down), including packed refs.
A colleague will be explaining reflog for an audience tomorrow, and
decided to briefly explain refs, too—which tells me this is
much-needed.
For refs themselves, perhaps "git for-each-ref" is a reasonable place
to start? Since it tells you the refs you have and how to spell them
explicitly regardless of how they are stored?
--
D. Ben Knoble |
|
On the Git mailing list, "Julia Evans" wrote (reply to this): On Mon, Oct 6, 2025, at 5:44 PM, D. Ben Knoble wrote:
> On Mon, Oct 6, 2025 at 3:37 PM Julia Evans <julia@jvns.ca> wrote:
>>
>> Thanks for the review!
>>
>> >> 2. Don't mention that the full name of the branch `main` is
>> >> technically `refs/heads/main`. This should likely change but I
>> >> haven't worked out how to do it in a clear way yet.
>> >
>> > I think this is worth getting into. This is a pretty
>> > user-facing concept.
>>
>> I think I'll see if I can figure out a way to mention this and at the
>> same time remove most of the rest of the references to the `.git`
>> directory when explaining references (which you talked about
>> further down), including packed refs.
>
> A colleague will be explaining reflog for an audience tomorrow, and
> decided to briefly explain refs, too—which tells me this is
> much-needed.
>
> For refs themselves, perhaps "git for-each-ref" is a reasonable place
> to start? Since it tells you the refs you have and how to spell them
> explicitly regardless of how they are stored?
Interesting, do you use git for-each-ref?
What do you use it for?
> --
> D. Ben Knoble |
|
On the Git mailing list, "D. Ben Knoble" wrote (reply to this): On Mon, Oct 6, 2025 at 5:47 PM Julia Evans <julia@jvns.ca> wrote:
>
>
>
> On Mon, Oct 6, 2025, at 5:44 PM, D. Ben Knoble wrote:
> > On Mon, Oct 6, 2025 at 3:37 PM Julia Evans <julia@jvns.ca> wrote:
> >>
> >> Thanks for the review!
> >>
> >> >> 2. Don't mention that the full name of the branch `main` is
> >> >> technically `refs/heads/main`. This should likely change but I
> >> >> haven't worked out how to do it in a clear way yet.
> >> >
> >> > I think this is worth getting into. This is a pretty
> >> > user-facing concept.
> >>
> >> I think I'll see if I can figure out a way to mention this and at the
> >> same time remove most of the rest of the references to the `.git`
> >> directory when explaining references (which you talked about
> >> further down), including packed refs.
> >
> > A colleague will be explaining reflog for an audience tomorrow, and
> > decided to briefly explain refs, too—which tells me this is
> > much-needed.
> >
> > For refs themselves, perhaps "git for-each-ref" is a reasonable place
> > to start? Since it tells you the refs you have and how to spell them
> > explicitly regardless of how they are stored?
>
> Interesting, do you use git for-each-ref?
> What do you use it for?
Ah, yes, but primarily for scripting.
What I should have clarified is that "the tool (I know of) to
interrogate the refs you currently have is git-for-each-ref" (like how
git-ls-remote is the tool to interrogate a remote's refs). It avoids
the issues with assuming "tree .git/refs" or similar will capture the
actual data.
--
D. Ben Knoble |
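For readers who want to try the `git for-each-ref` suggestion from a script, here is a hedged Python sketch. It relies on the documented default output format of `git for-each-ref` (object name, a space, object type, a tab, then the ref name). The helper names `parse_for_each_ref` and `list_refs` and the sample output are made up for illustration; `list_refs` needs a Git installation and is only defined here, not called.

```python
import subprocess

SAMPLE_OUTPUT = (
    "4ccb6d7b8869a86aae2e84c56523f8705b50c647 commit\trefs/heads/main\n"
    "4ccb6d7b8869a86aae2e84c56523f8705b50c647 commit\trefs/remotes/origin/main\n"
)


def parse_for_each_ref(output: str) -> dict[str, tuple[str, str]]:
    """Map each full ref name to its (object ID, object type)."""
    refs = {}
    for line in output.splitlines():
        left, refname = line.split("\t", 1)
        object_id, object_type = left.split(" ", 1)
        refs[refname] = (object_id, object_type)
    return refs


def list_refs(repo_path: str = ".") -> dict[str, tuple[str, str]]:
    """Ask Git for the refs, regardless of how the backend stores them."""
    result = subprocess.run(
        ["git", "-C", repo_path, "for-each-ref"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_for_each_ref(result.stdout)


print(sorted(parse_for_each_ref(SAMPLE_OUTPUT)))
```

Going through the command rather than reading files keeps the script independent of the ref storage format, which is the concern raised above.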
|
This patch series was integrated into seen via git@0f619ba. |
|
On the Git mailing list, "Kristoffer Haugsbakk" wrote (reply to this): On Mon, Oct 6, 2025, at 05:32, Junio C Hamano wrote:
> "Julia Evans via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> +MAN7_TXT += gitdatamodel.adoc
>> MAN7_TXT += gitdiffcore.adoc
>> ...
>> +gitdatamodel(7)
>> +===============
>> +
>> +NAME
>> +----
>> +gitdatamodel - Git's core data model
>> +
>> +DESCRIPTION
>> +-----------
>
> The above causes doc-lint to barf.
>[snip]
> You can check locally with "make check-docs" without waiting for my
> integration cycle to push to GitHub CI.
I think you meant `make lint-docs` for both of these. |
|
On the Git mailing list, Patrick Steinhardt wrote (reply to this): On Fri, Oct 03, 2025 at 05:34:36PM +0000, Julia Evans via GitGitGadget wrote:
> diff --git a/Documentation/gitdatamodel.adoc b/Documentation/gitdatamodel.adoc
> new file mode 100644
> index 0000000000..4b2cb167dc
> --- /dev/null
> +++ b/Documentation/gitdatamodel.adoc
> @@ -0,0 +1,226 @@
> +gitdatamodel(7)
> +===============
> +
> +NAME
> +----
> +gitdatamodel - Git's core data model
> +
> +DESCRIPTION
> +-----------
> +
> +It's not necessary to understand Git's data model to use Git, but it's
> +very helpful when reading Git's documentation so that you know what it
> +means when the documentation says "object" "reference" or "index".
There's a missing comma after "object".
> +
> +Git's core operations use 4 kinds of data:
> +
> +1. <<objects,Objects>>: commits, trees, blobs, and tag objects
> +2. <<references,References>>: branches, tags,
> + remote-tracking branches, etc
> +3. <<index,The index>>, also known as the staging area
> +4. <<reflogs,Reflogs>>
This list makes sense to me. There are of course more data structures in
Git, but all the other data structures shouldn't really matter to users
at all as they are mostly caches or internal details of the on-disk
format.
There's potentially one exception though, namely the Git configuration.
I'd claim that Git "uses" the Git configuration similarly to how it uses
the others, but I get why it's not explicitly mentioned here.
> +[[objects]]
> +OBJECTS
> +-------
> +
> +Commits, trees, blobs, and tag objects are all stored in Git's object database.
> +Every object has:
> +
> +1. an *ID*, which is the SHA-1 hash of its contents.
I think this needs to be adapted to not single out SHA-1 as the only
hashing algorithm. We already support SHA-256, so we should definitely
say that the algorithm can be swapped. Maybe something like:
An *object ID*, which is the cryptographic hash of its contents. By
default, Git uses SHA-1 as object hash, but alternative hashes like
SHA-256 are supported.
> + It's fast to look up a Git object using its ID.
> + The ID is usually represented in hexadecimal, like
> + `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`.
> +2. a *type*. There are 4 types of objects:
> + <<commit,commits>>, <<tree,trees>>, <<blob,blobs>>,
> + and <<tag-object,tag objects>>.
> +3. *contents*. The structure of the contents depends on the type.
Nit: every object also has an object size. Not sure though whether it's
fine to imply that with "contents".
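Both points (the hash algorithm and the object size) show up when computing an object ID by hand. The Python sketch below is an illustration, not code from Git: the `object_id` helper is a made-up name. It hashes a small header of the form `<type> <size>` followed by a NUL byte and then the contents, which is how Git derives the ID of an object, and it lets the hash algorithm be swapped.

```python
import hashlib


def object_id(obj_type: str, contents: bytes, algo: str = "sha1") -> str:
    """Hash '<type> <size>' + NUL + contents and return the hex digest."""
    header = f"{obj_type} {len(contents)}".encode() + b"\0"
    return hashlib.new(algo, header + contents).hexdigest()


# Same result as `git hash-object` on a file containing "hello world\n":
print(object_id("blob", b"hello world\n"))  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
# The empty blob:
print(object_id("blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
# With SHA-256 the ID is 64 hexadecimal characters instead of 40:
print(len(object_id("blob", b"", algo="sha256")))
```

Because the type and size are part of what gets hashed, a blob and a tree with identical bytes would still get different IDs.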
> +Once an object is created, it can never be changed.
> +Here are the 4 types of objects:
> +
> +[[commit]]
> +commits::
> + A commit contains:
> ++
> +1. Its *parent commit ID(s)*. The first commit in a repository has 0 parents,
> + regular commits have 1 parent, merge commits have 2+ parents
I'd say "at least two parents" instead of "2+ parents".
> +2. A *commit message*
> +3. All the *files* in the commit, stored as a *<<tree,tree>>*
> +4. An *author* and the time the commit was authored
> +5. A *committer* and the time the commit was committed
> ++
> +Here's how an example commit is stored:
> ++
> +----
> +tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a
> +parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647
> +author Maya <maya@example.com> 1759173425 -0400
> +committer Maya <maya@example.com> 1759173425 -0400
> +
> +Add README
> +----
In practice, commits can have other headers that are ignored by Git. But
that's certainly not part of Git's core data model, so I don't think we
should mention that here.
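A short Python sketch may help show the commit layout and why "amending" yields a new commit. It is a simplification under stated assumptions: the `parse_commit` and `commit_id` helpers are hypothetical names, only single-line headers are handled, and the example assumes SHA-1. It reuses the sample commit quoted above.

```python
import hashlib

RAW = (
    b"tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a\n"
    b"parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647\n"
    b"author Maya <maya@example.com> 1759173425 -0400\n"
    b"committer Maya <maya@example.com> 1759173425 -0400\n"
    b"\n"
    b"Add README\n"
)


def parse_commit(raw: bytes) -> dict:
    """Split headers from the message; collect every 'parent' header."""
    head, _, message = raw.partition(b"\n\n")
    commit = {"tree": None, "parents": [], "message": message.decode()}
    for line in head.decode().splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)
        else:
            commit[key] = value
    return commit


def commit_id(raw: bytes) -> str:
    """Hash 'commit <size>' + NUL + contents, assuming SHA-1."""
    return hashlib.sha1(b"commit %d\0" % len(raw) + raw).hexdigest()


# Changing any byte (here, the message) produces a different ID,
# so the "amended" commit is a new object rather than an edit.
amended = RAW.replace(b"Add README", b"Add README file")
print(parse_commit(RAW)["parents"])
print(commit_id(RAW) != commit_id(amended))
```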
> +Like all other objects, commits can never be changed after they're created.
> +For example, "amending" a commit with `git commit --amend` creates a new commit.
> +The old commit will eventually be deleted by `git gc`.
If we mention git-gc(1) I think it would make sense to use
`linkgit:git-gc[1]` instead to provide a link to its man page.
> +[[tree]]
> +trees::
> + A tree is how Git represents a directory. It lists, for each item in
> + the tree:
> ++
> +1. The *permissions*, for example `100644`
I think we should rather call these "mode bits". These bits are
permissions indeed when you have a blob, but for subtrees, symlinks and
submodules they aren't.
> +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
> + or <<commit,`commit`>> (a Git submodule)
There are also symlinks.
> +3. The *object ID*
> +4. The *filename*
> ++
> +For example, this is how a tree containing one directory (`src`) and one file
> +(`README.md`) is stored:
> ++
> +----
> +100644 blob 8728a858d9d21a8c78488c8b4e70e531b659141f README.md
> +040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src
> +----
> ++
> +*NOTE:* The permissions are in the same format as UNIX permissions, but
> +the only allowed permissions for files (blobs) are 644 and 755.
> +
> +[[blob]]
> +blobs::
> + A blob is how Git represents a file. A blob object contains the
> + file's contents.
> ++
> +Storing a new blob for every new version of a file can get big, so
> +`git gc` periodically compresses objects for efficiency in `.git/objects/pack`.
I would claim that it's not necessary to mention object compression.
This should be a low-level detail that users don't ever have to worry
about. Furthermore, packing objects isn't only relevant in the context
of blobs: trees for example also tend to compress very well as there
typically are only small incremental updates to trees.
> +[[tag-object]]
> +tag objects::
> + Tag objects (also known as "annotated tags") contain:
> ++
> +1. The *tagger* and tag date
> +2. A *tag message*, similar to a commit message
> +3. The *ID* of the object (often a commit) that they reference
They can also be signed, if we want to mention that.
> +[[references]]
> +REFERENCES
> +----------
> +
> +References are a way to give a name to a commit.
> +It's easier to remember "the changes I'm working on are on the `turtle`
> +branch" than "the changes are in commit bb69721404348e".
> +Git often uses "ref" as shorthand for "reference".
> +
> +References that you create are stored in the `.git/refs` directory,
> +and Git has a few special internal references like `HEAD` that are stored
> +in the base `.git` directory.
This isn't true anymore with the introduction of the reftable backend,
which is slated to become the default backend. I'd argue that this is
another implementation detail that the user shouldn't have to worry
about.
> +References can either be:
> +
> +1. References to an object ID, usually a <<commit,commit>> ID
> +2. References to another reference. This is called a "symbolic reference".
> +
> +Git handles references differently based on which subdirectory of
> +`.git/refs` they're stored in.
So instead of saying "subdirectory", I'd rather say "reference
hierarchy".
In general, I think we should explain that references are laid out
in a hierarchy. This is somewhat obvious with the "files" backend, as we
use directories there. But as we move on to the "reftable" backend this
may become less obvious over time.
> +Here are the main types:
> +
> +[[branch]]
> +branches: `.git/refs/heads/<name>`::
Here and in the other cases we should then strip the `.git/` prefix.
> + A branch is a name for a commit ID.
> + That commit is the latest commit on the branch.
> + Branches are stored in the `.git/refs/heads/` directory.
> ++
> +To get the history of commits on a branch, Git will start at the commit
> +ID the branch references, and then look at the commit's parent(s),
> +the parent's parent, etc.
> +
> +[[tag]]
> +tags: `.git/refs/tags/<name>`::
> + A tag is a name for a commit ID, tag object ID, or other object ID.
> + Tags are stored in the `refs/tags/` directory.
> ++
> +Even though branches and commits are both "a name for a commit ID", Git
> +treats them very differently.
> +Branches are expected to be regularly updated as you work on the branch,
> +but it's expected that a tag will never change after you create it.
This sounds a bit like the user itself needs to update the branch. How
about this instead:
Even though branches and tags are both "a name for a commit ID", Git
treats them very differently:
- Branches can be checked out directly. If so, creating a new
commit will automatically update the checked-out branch to
point to the new commit.
- Tags cannot be checked out directly and don't move when
creating a new commit. Instead, one can only check out the
commit that a tag points to. This is called "detached
HEAD", and the effect is that a new commit will not update
> +[[HEAD]]
> +HEAD: `.git/HEAD`::
> + `HEAD` is where Git stores your current <<branch,branch>>.
> + `HEAD` is normally a symbolic reference to your current branch, for
> + example `ref: refs/heads/main` if your current branch is `main`.
> + `HEAD` can also be a direct reference to a commit ID,
> + that's called "detached HEAD state".
> +
> +[[remote-tracking-branch]]
> +remote tracking branches: `.git/refs/remotes/<remote>/<branch>`::
> + A remote-tracking branch is a name for a commit ID.
> + It's how Git stores the last-known state of a branch in a remote
> + repository. `git fetch` updates remote-tracking branches. When
> + `git status` says "you're up to date with origin/main", it's looking at
> + this.
This misses "refs/remotes/<remote>/HEAD". This reference is a symbolic
reference that indicates the default branch on the remote side.
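To make the symbolic-reference point concrete, here is a minimal Python sketch of how a name like `refs/remotes/origin/HEAD` could be resolved to a commit ID. It models references as a plain dictionary and reuses the `ref: ` notation from the patch; the function name `resolve_ref`, the `max_depth` limit, and the sample reference values are assumptions made for illustration, not a description of how Git implements this.

```python
def resolve_ref(refs: dict, name: str, max_depth: int = 5) -> str:
    """Follow symbolic references until a direct reference (an object ID) is found."""
    start = name
    for _ in range(max_depth):
        value = refs[name]
        if not value.startswith("ref: "):
            return value              # direct reference: an object ID
        name = value[len("ref: "):]   # symbolic reference: follow it
    raise ValueError(f"too many levels of symbolic references from {start!r}")


# Hypothetical reference store; the commit ID is borrowed from the example
# commit elsewhere in this thread.
refs = {
    "HEAD": "ref: refs/heads/main",
    "refs/heads/main": "4ccb6d7b8869a86aae2e84c56523f8705b50c647",
    "refs/remotes/origin/HEAD": "ref: refs/remotes/origin/main",
    "refs/remotes/origin/main": "4ccb6d7b8869a86aae2e84c56523f8705b50c647",
}

print(resolve_ref(refs, "refs/remotes/origin/HEAD"))
```

In this model `refs/remotes/origin/HEAD` behaves like the local `HEAD`: it names another reference rather than a commit.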
> +[[other-refs]]
> +Other references::
> + Git tools may create references in any subdirectory of `.git/refs`.
> + For example, linkgit:git-stash[1], linkgit:git-bisect[1],
> + and linkgit:git-notes[1] all create their own references
> + in `.git/refs/stash`, `.git/refs/bisect`, etc.
> + Third-party Git tools may also create their own references.
> ++
> +Git may also create references in the base `.git` directory
> +other than `HEAD`, like `ORIG_HEAD`.
Let's mention that such references are typically spelt all-uppercase
with underscores between. You shouldn't ever create a reference that is
for example called ".git/foo".
We only enforce this restriction inconsistently, but I don't think that
should keep us from spelling out the common rule.
> +*NOTE:* As an optimization, references may be stored as packed
> +refs instead of in `.git/refs`. See linkgit:git-pack-refs[1].
I'd drop this note. It's an internal implementation detail and only true
for the "files" backend. The "reftable" backend stores references quite
differently and doesn't really "pack" references.
> +[[index]]
> +THE INDEX
> +---------
> +
> +The index, also known as the "staging area", contains the current staged
Honestly, I always forget which of these two nouns we are supposed to
use nowadays. I think consensus was to use "index" and avoid using
"staging area"? Not sure though, but I think we should only mention
one of these.
> +version of every file in your Git repository. When you commit, the files
> +in the index are used as the files in the next commit.
> +
> +Unlike a tree, the index is a flat list of files.
> +Each index entry has 4 fields:
> +
> +1. The *permissions*
> +2. The *<<blob,blob>> ID* of the file
> +3. The *filename*
> +4. The *number*. This is normally 0, but if there's a merge conflict
I think we don't call this "number", but "stage".
> + there can be multiple versions (with numbers 0, 1, 2, ..)
> + of the same filename in the index.
> +
> +It's extremely uncommon to look at the index directly: normally you'd
> +run `git status` to see a list of changes between the index and <<HEAD,HEAD>>.
> +But you can use `git ls-files --stage` to see the index.
> +Here's the output of `git ls-files --stage` in a repository with 2 files:
> +
> +----
> +100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md
> +100644 665c637a360874ce43bf74018768a96d2d4d219a 0 src/hello.py
> +----
> +
> +[[reflogs]]
> +REFLOGS
> +-------
> +
> +Git stores the history of branch, tag, and HEAD refs in a reflog
> +(you should read "reflog" as "ref log"). Not every ref is logged by
> +default, but any ref can be logged.
If we mention this here, do we maybe want to mention how the user can
decide which references are logged?
> +Each reflog entry has:
> +
> +1. *Before/after *commit IDs*
This will probably misformat as we have three asterisks here, not two.
> +2. *User* who made the change, for example `Maya <maya@example.com>`
> +3. *Timestamp*
Suggestion: "*Timestamp* when that change has been made".
> +4. *Log message*, for example `pull: Fast-forward`
> +
> +Reflogs only log changes made in your local repository.
> +They are not shared with remotes.
We may want to mention that you can reference reflog entries via
`refs/heads/<branch>@{<reflog-nr>}`.
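As a rough illustration of the `@{<n>}` lookup, here is a small Python sketch that models a reflog as a list of entries with the four fields from the patch. The `ReflogEntry` and `reflog_lookup` names, the oldest-first ordering of the list, and the placeholder IDs are all assumptions for this example; the only behaviour relied on is that `@{0}` is the current value of the reference and `@{1}` the value before that.

```python
from typing import NamedTuple


class ReflogEntry(NamedTuple):
    old_id: str      # commit ID before the change
    new_id: str      # commit ID after the change
    user: str
    timestamp: int
    message: str


def reflog_lookup(entries, n: int) -> str:
    """Return the value of <ref>@{n}; `entries` is ordered oldest first."""
    if n < 0 or n >= len(entries):
        raise IndexError(f"reflog has no entry @{{{n}}}")
    return entries[-1 - n].new_id


# Hypothetical history of one branch: created at A, then moved to B, then to C.
A, B, C = "a" * 40, "b" * 40, "c" * 40
log = [
    ReflogEntry("0" * 40, A, "Maya <maya@example.com>", 1759173425, "branch: Created from HEAD"),
    ReflogEntry(A, B, "Maya <maya@example.com>", 1759173500, "commit: Add README"),
    ReflogEntry(B, C, "Maya <maya@example.com>", 1759173600, "pull: Fast-forward"),
]

print(reflog_lookup(log, 0) == C)  # the current value
print(reflog_lookup(log, 1) == B)  # the value before the last change
```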
In general, one thing that I think would be important to highlight in
this document is revisions. Most of the commands tend to not accept
references, but revisions instead, which are a lot more flexible. They
use our do-what-I-mean mechanism to resolve, but also allow the user to
specify commits relative to one another. It's probably sufficient though
to mention them briefly and then redirect to gitrevisions(7).
Thanks for working on this!
Patrick
On the Git mailing list, Junio C Hamano wrote (reply to this): "Kristoffer Haugsbakk" <kristofferhaugsbakk@fastmail.com> writes:
> On Mon, Oct 6, 2025, at 05:32, Junio C Hamano wrote:
>> "Julia Evans via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>>> +MAN7_TXT += gitdatamodel.adoc
>>> MAN7_TXT += gitdiffcore.adoc
>>> ...
>>> +gitdatamodel(7)
>>> +===============
>>> +
>>> +NAME
>>> +----
>>> +gitdatamodel - Git's core data model
>>> +
>>> +DESCRIPTION
>>> +-----------
>>
>> The above causes doc-lint to barf.
>>[snip]
>> You can check locally with "make check-docs" without waiting for my
>> integration cycle to push to GitHub CI.
>
> I think you meant `make lint-docs` for both of these.
The former is a typo for "causes lint-docs to barf", but I did mean
"make check-docs" as the recipe for local checking.
You could also do "make -C Documentation lint-docs", but that is a
lot more to type ;-).
Thanks.
On the Git mailing list, Junio C Hamano wrote (reply to this): Patrick Steinhardt <ps@pks.im> writes:
>> +Git's core operations use 4 kinds of data:
>> +
>> +1. <<objects,Objects>>: commits, trees, blobs, and tag objects
>> +2. <<references,References>>: branches, tags,
>> + remote-tracking branches, etc
>> +3. <<index,The index>>, also known as the staging area
>> +4. <<reflogs,Reflogs>>
>
> This list makes sense to me. There's of course more data structures in
> Git, but all the other data structures shouldn't really matter to users
> at all as they are mostly caches or internal details of the on-disk
> format.
>
> There's potentially one exception though, namely the Git configuration.
> I'd claim that Git "uses" the Git configuration similarly to how it uses
> the others, but I get why it's not explicitly mentioned here.
The core operations do not use Git configuration any more than they
use what is specified by the command line arguments.
>> +[[objects]]
>> +OBJECTS
>> +-------
>> +
>> +Commits, trees, blobs, and tag objects are all stored in Git's object database.
>> +Every object has:
>> +
>> +1. an *ID*, which is the SHA-1 hash of its contents.
>
> I think this needs to be adapted to not single out SHA-1 as the only
> hashing algorithm. We already support SHA-256, so we should definitely
> say that the algorithm can be swapped. Maybe something like:
Good point. Also officially they are called "object name".
> An *object ID*, which is the cryptographic hash of its contents. By
> default, Git uses SHA-1 as object hash, but alternative hashes like
> SHA-256 are supported.
I'd avoid "object name is the result of hashing X" which historically
was a source of question: "why does 'sha1sum README.md' give different
hash from 'git add README.md && git ls-files -s README.md'?"
It is an irrelevant implementation detail (and you'd eventually end
up having to say "X is <type> SP <length> NUL <contents>").
An object name, which is derived cryptographically from its
type, size and contents. All versions of Git can use SHA-1 hash
function, but more recent versions of Git can also use SHA-256
hash function.
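For readers who do hit the "why does `sha1sum` disagree?" question, a minimal Python sketch of the `<type> SP <length> NUL <contents>` framing mentioned above may help. It assumes a SHA-1 repository and a blob object; the helper name `git_object_name` is made up for this example and is not part of Git.

```python
import hashlib


def git_object_name(obj_type: str, contents: bytes, algo: str = "sha1") -> str:
    """Hash "<type> SP <length> NUL <contents>" and return the hex digest."""
    header = f"{obj_type} {len(contents)}\0".encode("ascii")
    return hashlib.new(algo, header + contents).hexdigest()


data = b"hello world\n"

# Object name of a blob holding these bytes.
print(git_object_name("blob", data))   # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad

# Plain SHA-1 of the same bytes: a different value, because there is no header.
print(hashlib.sha1(data).hexdigest())
```

The same bytes give two different digests, which is why hashing the file contents alone does not reproduce the ID shown by `git ls-files -s`.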
>> +commits::
>> + A commit contains:
>> ++
>> +1. Its *parent commit ID(s)*. The first commit in a repository has 0 parents,
>> + regular commits have 1 parent, merge commits have 2+ parents
>
> I'd say "at least two parents" instead of "2+ parents".
Yup, that reads much better.
>> +tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a
>> +parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647
>> +author Maya <maya@example.com> 1759173425 -0400
>> +committer Maya <maya@example.com> 1759173425 -0400
>> +
>> +Add README
>> +----
>
> In practice, commits can have other headers that are ignored by Git. But
> that's certainly not part of Git's core data model, so I don't think we
> should mention that here.
Third-party software can add truly garbage ones that do not have any
meaning, and Git tolerates by ignoring them. But there are others
that Git does pay attention to, like encoding, gpgsig, etc., which
may be worth mentioning (in the form that "these four are what you typically
see, but there may be others" without even naming any).
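To illustrate the "these four are typical, but there may be others" idea, here is a small Python sketch that splits the example commit from the patch into headers and a message without assuming a fixed set of header names. The function name `parse_commit` is hypothetical, and the handling of continuation lines (lines starting with a space, as used by multi-line headers such as signatures) is a simplification.

```python
def parse_commit(raw: str):
    """Split a commit's text into (headers, message).

    Headers are kept as a list of (key, value) pairs because a key such as
    `parent` can appear more than once; unknown keys are kept, not rejected.
    """
    head, _, message = raw.partition("\n\n")
    headers = []
    for line in head.split("\n"):
        if line.startswith(" ") and headers:
            # Continuation of the previous header's multi-line value.
            key, value = headers[-1]
            headers[-1] = (key, value + "\n" + line[1:])
        else:
            key, _, value = line.partition(" ")
            headers.append((key, value))
    return headers, message


example = (
    "tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a\n"
    "parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647\n"
    "author Maya <maya@example.com> 1759173425 -0400\n"
    "committer Maya <maya@example.com> 1759173425 -0400\n"
    "\n"
    "Add README\n"
)

headers, message = parse_commit(example)
parents = [value for key, value in headers if key == "parent"]
print(parents, repr(message))
```

Counting the `parent` headers distinguishes a root commit (none), a regular commit (one), and a merge commit (at least two).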
On the Git mailing list, "D. Ben Knoble" wrote (reply to this): On Tue, Oct 7, 2025 at 11:51 AM Patrick Steinhardt <ps@pks.im> wrote:
>
> On Fri, Oct 03, 2025 at 05:34:36PM +0000, Julia Evans via GitGitGadget wrote:
[snip]
> > + A branch is a name for a commit ID.
> > + That commit is the latest commit on the branch.
> > + Branches are stored in the `.git/refs/heads/` directory.
> > ++
> > +To get the history of commits on a branch, Git will start at the commit
> > +ID the branch references, and then look at the commit's parent(s),
> > +the parent's parent, etc.
> > +
> > +[[tag]]
> > +tags: `.git/refs/tags/<name>`::
> > + A tag is a name for a commit ID, tag object ID, or other object ID.
> > + Tags are stored in the `refs/tags/` directory.
> > ++
> > +Even though branches and commits are both "a name for a commit ID", Git
> > +treats them very differently.
> > +Branches are expected to be regularly updated as you work on the branch,
> > +but it's expected that a tag will never change after you create it.
>
> This sounds a bit like the user itself needs to update the branch. How
> about this instead:
>
> Even though branches and commits are both "a name for a commit ID", Git
> treats them very differently:
>
> - Branches can be checked out directly. If so, creating a new
> commit will automatically update the checked-out branch to
> point to the new commit.
>
> - Tags cannot be checked out directly and don't move when
> creating a new commit. Instead, one can only check out the
> commit that a branch points to. This is called "detached
> HEAD", and the effect is that a new commit will not update
missing "the tag."?
On the Git mailing list, "Julia Evans" wrote (reply to this): On Tue, Oct 7, 2025, at 10:32 AM, Patrick Steinhardt wrote:
> On Fri, Oct 03, 2025 at 05:34:36PM +0000, Julia Evans via GitGitGadget wrote:
>> diff --git a/Documentation/gitdatamodel.adoc b/Documentation/gitdatamodel.adoc
>> new file mode 100644
>> index 0000000000..4b2cb167dc
>> --- /dev/null
>> +++ b/Documentation/gitdatamodel.adoc
>> @@ -0,0 +1,226 @@
>> +gitdatamodel(7)
>> +===============
>> +
>> +NAME
>> +----
>> +gitdatamodel - Git's core data model
>> +
>> +DESCRIPTION
>> +-----------
>> +
>> +It's not necessary to understand Git's data model to use Git, but it's
>> +very helpful when reading Git's documentation so that you know what it
>> +means when the documentation says "object" "reference" or "index".
>
> There's a missing comma after "object".
Will fix.
>> +
>> +Git's core operations use 4 kinds of data:
>> +
>> +1. <<objects,Objects>>: commits, trees, blobs, and tag objects
>> +2. <<references,References>>: branches, tags,
>> + remote-tracking branches, etc
>> +3. <<index,The index>>, also known as the staging area
>> +4. <<reflogs,Reflogs>>
>
> This list makes sense to me. There's of course more data structures in
> Git, but all the other data structures shouldn't really matter to users
> at all as they are mostly caches or internal details of the on-disk
> format.
>
> There's potentially one exception though, namely the Git configuration.
> I'd claim that Git "uses" the Git configuration similarly to how it uses
> the others, but I get why it's not explicitly mentioned here.
>
>> +[[objects]]
>> +OBJECTS
>> +-------
>> +
>> +Commits, trees, blobs, and tag objects are all stored in Git's object database.
>> +Every object has:
>> +
>> +1. an *ID*, which is the SHA-1 hash of its contents.
>
> I think this needs to be adapted to not single out SHA-1 as the only
> hashing algorithm. We already support SHA-256, so we should definitely
> say that the algorithm can be swapped. Maybe something like:
>
> An *object ID*, which is the cryptographic hash of its contents. By
> default, Git uses SHA-1 as object hash, but alternative hashes like
> SHA-256 are supported.
Makes sense. I might just say "cryptographic hash of its type and contents"
and leave it at that. I'm not sure it's worth getting into details
of the exact hash function.
>> + It's fast to look up a Git object using its ID.
>> + The ID is usually represented in hexadecimal, like
>> + `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`.
>> +2. a *type*. There are 4 types of objects:
>> + <<commit,commits>>, <<tree,trees>>, <<blob,blobs>>,
>> + and <<tag-object,tag objects>>.
>> +3. *contents*. The structure of the contents depends on the type.
>
> Nit: every object also has an object size. Not sure though whether it's
> fine to imply that with "contents".
I think it is.
>> +Once an object is created, it can never be changed.
>> +Here are the 4 types of objects:
>> +
>> +[[commit]]
>> +commits::
>> + A commit contains:
>> ++
>> +1. Its *parent commit ID(s)*. The first commit in a repository has 0 parents,
>> + regular commits have 1 parent, merge commits have 2+ parents
>
> I'd say "at least two parents" instead of "2+ parents".
>
>> +2. A *commit message*
>> +3. All the *files* in the commit, stored as a *<<tree,tree>>*
>> +4. An *author* and the time the commit was authored
>> +5. A *committer* and the time the commit was committed
>> ++
>> +Here's how an example commit is stored:
>> ++
>> +----
>> +tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a
>> +parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647
>> +author Maya <maya@example.com> 1759173425 -0400
>> +committer Maya <maya@example.com> 1759173425 -0400
>> +
>> +Add README
>> +----
>
> In practice, commits can have other headers that are ignored by Git. But
> that's certainly not part of Git's core data model, so I don't think we
> should mention that here.
>
>> +Like all other objects, commits can never be changed after they're created.
>> +For example, "amending" a commit with `git commit --amend` creates a new commit.
>> +The old commit will eventually be deleted by `git gc`.
>
> If we mention git-gc(1) I think it would make sense to use
> `linkgit:git-gc[1]` instead to provide a link to its man page.
Agreed.
>> +[[tree]]
>> +trees::
>> + A tree is how Git represents a directory. It lists, for each item in
>> + the tree:
>> ++
>> +1. The *permissions*, for example `100644`
>
> I think we should rather call these "mode bits". These bits are
> permissions indeed when you have a blob, but for subtrees, symlinks and
> submodules they aren't.
I think it's a bit strange to call them mode bits since I thought they were stored
as ASCII strings and it's basically an enum of 5 options, but I see your point.
I think "file mode" will work and that's used elsewhere.
I wonder if it would make sense to list all of the possible file modes if
this isn't documented anywhere else, my impression is that it's a short
list and that it's unlikely to change much in the future.
And listing them all might make it more clear that Git's file modes don't
have much in common with Unix file modes.
I looked for where this is documented and it looks like the only place is
in `man git-fast-import`. That man page says that there are just 5 options
(040000, 160000, 100644, 100755, 120000).
>> +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
>> + or <<commit,`commit`>> (a Git submodule)
>
> There's also symlinks.
I created a test symlink and it looks like symlinks are stored as type "blob".
I might say which type corresponds to which file mode,
though I'm not sure what type corresponds to the "gitlink" mode (commit?).
I think these are the 5 modes and what they mean / what type they
should have. Not sure about the gitlink mode though.
- `100644`: regular file (with type `blob`)
- `100755`: executable file (with type `blob`)
- `120000`: symbolic link (with type `blob`)
- `040000`: directory (with type `tree`)
- `160000`: gitlink, for use with submodules (with type `commit`)
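The mode-to-type pairing above can be written down as a small lookup table. The following Python sketch checks printed tree entries (in the form shown in the patch, not the on-disk encoding) against that table; the names `FILE_MODES` and `describe_tree_entry` are invented for this example, and the pairing of `160000` with type `commit` follows the patch text.

```python
# The five file modes listed above, mapped to the object type each one is
# expected to carry. The descriptions are informal labels.
FILE_MODES = {
    "100644": ("blob", "regular file"),
    "100755": ("blob", "executable file"),
    "120000": ("blob", "symbolic link"),
    "040000": ("tree", "directory"),
    "160000": ("commit", "gitlink (submodule)"),
}


def describe_tree_entry(line: str):
    """Parse one printed tree entry: "<mode> <type> <object ID> <filename>"."""
    mode, obj_type, object_id, name = line.split(None, 3)
    if mode not in FILE_MODES:
        raise ValueError(f"unknown file mode: {mode}")
    expected_type, description = FILE_MODES[mode]
    if obj_type != expected_type:
        raise ValueError(f"mode {mode} expects type {expected_type}, got {obj_type}")
    return name, description


for entry in [
    "100644 blob 8728a858d9d21a8c78488c8b4e70e531b659141f README.md",
    "040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src",
]:
    print(describe_tree_entry(entry))
```

A mode outside the table, such as `100664`, is rejected, which matches the intent of presenting the list as exhaustive.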
>> +3. The *object ID*
>> +4. The *filename*
>> ++
>> +For example, this is how a tree containing one directory (`src`) and one file
>> +(`README.md`) is stored:
>> ++
>> +----
>> +100644 blob 8728a858d9d21a8c78488c8b4e70e531b659141f README.md
>> +040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src
>> +----
>> ++
>> +*NOTE:* The permissions are in the same format as UNIX permissions, but
>> +the only allowed permissions for files (blobs) are 644 and 755.
>> +
>> +[[blob]]
>> +blobs::
>> + A blob is how Git represents a file. A blob object contains the
>> + file's contents.
>> ++
>> +Storing a new blob for every new version of a file can get big, so
>> +`git gc` periodically compresses objects for efficiency in `.git/objects/pack`.
>
> I would claim that it's not necessary to mention object compression.
> This should be a low-level detail that users don't ever have to worry
> about. Furthermore, packing objects isn't only relevant in the context
> of blobs: trees for example also tend to compress very well as there
> typically is only small incremental updates to trees.
I discussed why I think this is important in another reply,
https://lore.kernel.org/all/51e0a55c-1f1d-4cae-9459-8c2b9220e52d@app.fastmail.com/,
will paste what I said here. I'll think about this more though.
paste follows:
That's true! The reason I think this is important to mention is that I find
that people often "reject" information that they find implausible, even
if it comes from a credible source. ("that can't be true! I must be
not understanding correctly. Oh well, I'll just ignore that!")
I sometimes hear from users that "commits can't be snapshots", because
it would take up too much disk space to store every version of
every commit. So I find that sometimes explaining a little bit about the
implementation can make the information more memorable.
Certainly I'm not able to remember details that don't make sense
with my mental model of how computers work and I don't expect other
people to either, so I think it's important to give an explanation that
handles the biggest "objections".
>> +[[tag-object]]
>> +tag objects::
>> + Tag objects (also known as "annotated tags") contain:
>> ++
>> +1. The *tagger* and tag date
>> +2. A *tag message*, similar to a commit message
>> +3. The *ID* of the object (often a commit) that they reference
>
> They can also be signed, if we want to mention that.
I guess that's true for commit objects too. Not sure whether to
mention it either, can add it if others think it's important.
>> +[[references]]
>> +REFERENCES
>> +----------
>> +
>> +References are a way to give a name to a commit.
>> +It's easier to remember "the changes I'm working on are on the `turtle`
>> +branch" than "the changes are in commit bb69721404348e".
>> +Git often uses "ref" as shorthand for "reference".
>> +
>> +References that you create are stored in the `.git/refs` directory,
>> +and Git has a few special internal references like `HEAD` that are stored
>> +in the base `.git` directory.
>
> This isn't true anymore with the introduction of the reftable backend,
> which is slated to become the default backend. I'd argue that this is
> another implementation detail that the user shouldn't have to worry
> about.
Makes sense, will fix. (as well as other references to the .git prefix and
"subdirectories").
>> +References can either be:
>> +
>> +1. References to an object ID, usually a <<commit,commit>> ID
>> +2. References to another reference. This is called a "symbolic reference".
>> +
>> +Git handles references differently based on which subdirectory of
>> +`.git/refs` they're stored in.
>
> So instead of saying "subdirectory", I'd rather say "reference
> hierarchy".
>
> In general, I think we should explain that references are laid out
> in a hierarchy. This is somewhat obvious with the "files" backend, as we
> use directories there. But as we move on to the "reftable" backend this
> may become less obvious over time.
That makes sense.
>> +[[tag]]
>> +tags: `.git/refs/tags/<name>`::
>> + A tag is a name for a commit ID, tag object ID, or other object ID.
>> + Tags are stored in the `refs/tags/` directory.
>> ++
>> +Even though branches and commits are both "a name for a commit ID", Git
>> +treats them very differently.
>> +Branches are expected to be regularly updated as you work on the branch,
>> +but it's expected that a tag will never change after you create it.
>
> This sounds a bit like the user itself needs to update the branch. How
> about this instead:
>
> Even though branches and commits are both "a name for a commit ID", Git
> treats them very differently:
>
> - Branches can be checked out directly. If so, creating a new
> commit will automatically update the checked-out branch to
> point to the new commit.
>
> - Tags cannot be checked out directly and don't move when
> creating a new commit. Instead, one can only check out the
> commit that a branch points to. This is called "detached
> HEAD", and the effect is that a new commit will not update
I think mentioning that branches can be checked out and that tags can't
is a good idea.
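To sketch the difference in behaviour, here is a toy Python model of what happens to references when a commit is created. It is only a model: the `TinyRepo` class, its method names, and the short commit labels (`c1`, `c2`, `c3`) are hypothetical, and real Git does much more than this.

```python
class TinyRepo:
    """Toy model: HEAD is either "ref: <branch ref>" or a bare commit ID."""

    def __init__(self):
        self.refs = {}
        self.head = "ref: refs/heads/main"

    def checkout_branch(self, name: str):
        # HEAD becomes a symbolic reference to the branch.
        self.head = f"ref: refs/heads/{name}"

    def checkout_detached(self, ref_name: str):
        # HEAD becomes a direct reference to the commit the ref points to.
        self.head = self.refs[ref_name]

    def commit(self, new_id: str):
        if self.head.startswith("ref: "):
            # A branch is checked out: the branch moves to the new commit.
            self.refs[self.head[len("ref: "):]] = new_id
        else:
            # Detached HEAD: only HEAD moves; no branch or tag is updated.
            self.head = new_id


repo = TinyRepo()
repo.commit("c1")
repo.refs["refs/tags/v1"] = "c1"          # tag the first commit
repo.commit("c2")                          # main moves, the tag does not
repo.checkout_detached("refs/tags/v1")
repo.commit("c3")                          # neither main nor the tag moves
print(repo.refs, repo.head)
```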
>> +[[HEAD]]
>> +HEAD: `.git/HEAD`::
>> + `HEAD` is where Git stores your current <<branch,branch>>.
>> + `HEAD` is normally a symbolic reference to your current branch, for
>> + example `ref: refs/heads/main` if your current branch is `main`.
>> + `HEAD` can also be a direct reference to a commit ID,
>> + that's called "detached HEAD state".
>> +
>> +[[remote-tracking-branch]]
>> +remote tracking branches: `.git/refs/remotes/<remote>/<branch>`::
>> + A remote-tracking branch is a name for a commit ID.
>> + It's how Git stores the last-known state of a branch in a remote
>> + repository. `git fetch` updates remote-tracking branches. When
>> + `git status` says "you're up to date with origin/main", it's looking at
>> + this.
>
> This misses "refs/remotes/<remote>/HEAD". This reference is a symbolic
> reference that indicates the default branch on the remote side.
Is "refs/remotes/<remote>/HEAD" a remote-tracking branch?
I've never thought about that reference and I'm not sure what to call it.
>> +[[other-refs]]
>> +Other references::
>> + Git tools may create references in any subdirectory of `.git/refs`.
>> + For example, linkgit:git-stash[1], linkgit:git-bisect[1],
>> + and linkgit:git-notes[1] all create their own references
>> + in `.git/refs/stash`, `.git/refs/bisect`, etc.
>> + Third-party Git tools may also create their own references.
>> ++
>> +Git may also create references in the base `.git` directory
>> +other than `HEAD`, like `ORIG_HEAD`.
>
> Let's mention that such references are typically spelt all-uppercase
> with underscores between. You shouldn't ever create a reference that is
> for example called ".git/foo".
>
> We enforce this restriction inconsistently, only, but I don't think that
> should keep us from spelling out the common rule.
That makes sense. I'm also not sure whether third-party
Git tools are "supposed" to create references outside of "refs/",
or whether that's common.
>> +*NOTE:* As an optimization, references may be stored as packed
>> +refs instead of in `.git/refs`. See linkgit:git-pack-refs[1].
>
> I'd drop this note. It's an internal implementation detail and only true
> for the "files" backend. The "reftable" backend stores references quite
> differently and doesn't really "pack" references.
>
>> +[[index]]
>> +THE INDEX
>> +---------
>> +
>> +The index, also known as the "staging area", contains the current staged
>
> Honestly, I always forget which of these two nouns we are supposed to
> use nowadays. I think consensus was to use "index" and avoid using
> "staging area"? Not sure though, but I think we should only mention
> one of these.
>
>> +version of every file in your Git repository. When you commit, the files
>> +in the index are used as the files in the next commit.
>> +
>> +Unlike a tree, the index is a flat list of files.
>> +Each index entry has 4 fields:
>> +
>> +1. The *permissions*
>> +2. The *<<blob,blob>> ID* of the file
>> +3. The *filename*
>> +4. The *number*. This is normally 0, but if there's a merge conflict
>
> I think we don't call this "number", but "stage".
Thanks, I see that it's sometimes called "stage number" which is a little
easier to search for so I'll call it that.
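As a sketch of how the stage number could be read, the following Python snippet parses lines in the form printed by `git ls-files --stage` and reports which paths have entries with a non-zero stage number. The helper names and the object IDs in the conflict example are made up for illustration; only the two clean entries come from the patch.

```python
def parse_index_line(line: str) -> dict:
    """Parse "<mode> <blob ID> <stage number> <filename>"."""
    mode, blob_id, stage, path = line.split(None, 3)
    return {"mode": mode, "blob_id": blob_id, "stage": int(stage), "path": path}


def conflicted_paths(lines):
    """Return the paths that have at least one entry with a non-zero stage number."""
    entries = [parse_index_line(line) for line in lines]
    return sorted({e["path"] for e in entries if e["stage"] != 0})


# The two entries from the patch: everything is at stage number 0.
clean = [
    "100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0\tREADME.md",
    "100644 665c637a360874ce43bf74018768a96d2d4d219a 0\tsrc/hello.py",
]

# Hypothetical index during a merge conflict in src/hello.py (IDs are placeholders).
conflict = [
    "100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0\tREADME.md",
    "100644 " + "1" * 40 + " 1\tsrc/hello.py",
    "100644 " + "2" * 40 + " 2\tsrc/hello.py",
    "100644 " + "3" * 40 + " 3\tsrc/hello.py",
]

print(conflicted_paths(clean))
print(conflicted_paths(conflict))
```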
>> + there can be multiple versions (with numbers 0, 1, 2, ..)
>> + of the same filename in the index.
>> +
>> +It's extremely uncommon to look at the index directly: normally you'd
>> +run `git status` to see a list of changes between the index and <<HEAD,HEAD>>.
>> +But you can use `git ls-files --stage` to see the index.
>> +Here's the output of `git ls-files --stage` in a repository with 2 files:
>> +
>> +----
>> +100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md
>> +100644 665c637a360874ce43bf74018768a96d2d4d219a 0 src/hello.py
>> +----
>> +
>> +[[reflogs]]
>> +REFLOGS
>> +-------
>> +
>> +Git stores the history of branch, tag, and HEAD refs in a reflog
>> +(you should read "reflog" as "ref log"). Not every ref is logged by
>> +default, but any ref can be logged.
>
> If we mention this here, do we maybe want to mention how the user can
> decide which references are logged?
Do you mean by using the setting `core.logAllRefUpdates`?
>> +Each reflog entry has:
>> +
>> +1. *Before/after *commit IDs*
>
> This will probably misformat as we have three asterisks here, not two.
>
>> +2. *User* who made the change, for example `Maya <maya@example.com>`
>> +3. *Timestamp*
>
> Suggestion: "*Timestamp* when that change has been made".
Makes sense.
>> +4. *Log message*, for example `pull: Fast-forward`
>> +
>> +Reflogs only log changes made in your local repository.
>> +They are not shared with remotes.
>
> We may want to mention that you can reference reflog entries via
> `refs/heads/<branch>@{<reflog-nr>}`.
>
> In general, one thing that I think would be important to highlight in
> this document is revisions. Most of the commands tend to not accept
> references, but revisions instead, which are a lot more flexible. They
> use our do-what-I-mean mechanism to resolve, but also allow the user to
> specify commits relative to one another. It's probably sufficient though
> to mention them briefly and then redirect to gitrevisions(7).
Will think about this, I'm not sure how to best incorporate that.
Maybe under the commits section.
> Thanks for working on this!
Thanks for the review!
- Julia
On the Git mailing list, Junio C Hamano wrote (reply to this): "Julia Evans" <julia@jvns.ca> writes:
>>> +tree::
>>> + A tree is how Git represents a directory.
>>> + It can contain files or other trees (which are subdirectories).
>>> + It lists, for each item in the tree:
>>> ++
>>> +1. The *filename*, for example `hello.py`
>>> +2. The *file mode*. Git has these file modes. which are only
>>
>> "has these" -> "uses only these" to clarify that this is an
>> exhaustive enumeration and users cannot invent 100664 and others,
>> which is a mistake Git itself used to make/allow.
>
> I like the idea to make it more explicit that this is an exhaustive
> enumeration. I'll try changing it to this instead: "These are all of the file
> modes in Git (which are only spiritually related to Unix file modes):"
The primary reason why I suggested "uses only these" was because I
thought it would strongly hint that random additions beyond the set
is unwelcome. As long as that implication is not lost, I do not
have strong preference between "we only use these and nothing else"
and your "these are all that we use".
>>> +[[tag-object]]
>>> +tag object::
>>> + Tag objects contain these required fields
>>> + (though there are other optional fields):
>>> ++
>>> +1. The object *ID* it references
>>> +2. The object *type*
>>
>> I would rephrase these to
>>
>> 1. The *ID* of the object it references
>> 2. The *type* of the object it references
>>
>> because (1) a tag object references another object, not ID. To name
>> the object it references, it uses the object name of it, but just
>> like your name is not you, object name is not the object (it merely
>> is *one* way to refer to it). (2) unless it is very clear to readers
>> that "The object" in 1. and 2. refer to the same object, 2. invites
>> a question "type of which object?".
>
> That makes sense to me, will change it to that.
>
>>> +[[branch]]
>>> +branches: `refs/heads/<name>`::
>>> + A branch refers to a commit ID.
>>
>> A branch refers to a commit object (by its ID). Ditto for tags.
>
> What's the goal of this? I can't tell what misconception you're
> trying to avoid here.
This comes from the same place as the suggestion for the tag object
above, i.e. "a tag object references another object, not ID.".
Exactly the same reasoning applies here. A branch refers to a
commit, and to name the object it references, it uses the object
name of it, but just like your name is not you, object name is not
the object itself.
Thanks. |
Git very often uses the terms "object", "reference", or "index" in its
documentation.
However, it's hard to find a clear explanation of these terms and how
they relate to each other in the documentation. The closest candidates
currently are:
1. `gitglossary`. This makes a good effort, but it's an alphabetically
ordered dictionary and a dictionary is not a good way to learn
concepts. You have to jump around too much and it's not possible to
present the concepts in the order that they should be explained.
2. `gitcore-tutorial`. This explains how to use the "core" Git commands.
This is a nice document to have, but it's not necessary to learn how
`update-index` works to understand Git's data model, and we should
not be requiring users to learn how to use the "plumbing" commands
if they want to learn what the term "index" or "object" means.
3. `gitrepository-layout`. This is a great resource, but it includes a
lot of information about configuration and internal implementation
details which are not related to the data model. It also does
not explain how commits work.
The result of this is that Git users (even users who have been using
Git for 15+ years) struggle to read the documentation because they don't
know what the core terms mean, and it's not possible to add links
to help them learn more.
Add an explanation of Git's data model. Some choices I've made in
deciding what "core data model" means:
1. Omit pseudorefs like `FETCH_HEAD`, because it's not clear to me
if those are intended to be user facing or if they're more like
internal implementation details.
2. Don't talk about submodules other than by mentioning how they
relate to trees. This is because Git has a lot of special features,
and explaining how they all work exhaustively could quickly go
down a rabbit hole which would make this document less useful for
understanding Git's core behaviour.
3. Don't discuss the structure of a commit message
(first line, trailers etc).
4. Don't mention configuration.
5. Don't mention the `.git` directory, to avoid getting too much into
implementation details
Signed-off-by: Julia Evans <julia@jvns.ca>
|
On the Git mailing list, "Julia Evans" wrote (reply to this): On Mon, Nov 3, 2025, at 8:34 PM, Junio C Hamano wrote:
> "Julia Evans" <julia@jvns.ca> writes:
>
>>>> +tree::
>>>> + A tree is how Git represents a directory.
>>>> + It can contain files or other trees (which are subdirectories).
>>>> + It lists, for each item in the tree:
>>>> ++
>>>> +1. The *filename*, for example `hello.py`
>>>> +2. The *file mode*. Git has these file modes. which are only
>>>
>>> "has these" -> "uses only these" to clarify that this is an
>>> exhaustive enumeration and users cannot invent 100664 and others,
>>> which is a mistake Git itself used to make/allow.
>>
>> I like the idea to make it more explicit that this is an exhaustive
>> enumeration. I'll try changing it to this instead: "These are all of the file
>> modes in Git (which are only spiritually related to Unix file modes):"
>
> The primary reason why I suggested "uses only these" was because I
> thought it would strongly hint that random additions beyond the set
> are unwelcome. As long as that implication is not lost, I do not
> have strong preference between "we only use these and nothing else"
> and your "these are all that we use".
>
>>>> +[[tag-object]]
>>>> +tag object::
>>>> + Tag objects contain these required fields
>>>> + (though there are other optional fields):
>>>> ++
>>>> +1. The object *ID* it references
>>>> +2. The object *type*
>>>
>>> I would rephrase these to
>>>
>>> 1. The *ID* of the object it references
>>> 2. The *type* of the object it references
>>>
>>> because (1) a tag object references another object, not ID. To name
>>> the object it reference, it uses the object name of it, but just
>>> like your name is not you, object name is not the object (it merely
>>> is *one* way to refer to it). (2) unless it is very clear to readers
>>> that "The object" in 1. and 2. refer to the same object, 2. invites
>>> a question "type of which object?".
>>
>> That makes sense to me, will change it to that.
>>
>>>> +[[branch]]
>>>> +branches: `refs/heads/<name>`::
>>>> + A branch refers to a commit ID.
>>>
>>> A branch refers to a commit object (by its ID). Ditto for tags.
>>
>> What's the goal of this? I can't tell what misconception you're
>> trying to avoid here.
>
> This comes from the same place as the suggestion for the tag object
> above, i.e. "a tag object references another object, not ID.".
>
> Exactly the same reasoning applies here. A branch refers to a
> commit, and to name the object it references, it uses the object
> name of it, but just like your name is not you, object name is not
> the object itself.
I agree the ID of a commit is not the same as the commit itself.
The reason I said "refers to a commit ID" is that it's a very concise
explanation and I don't see any risk that the reader will be
confused by it.
Unlike with my name, commit IDs uniquely identify commits, so
I think it will be clear to the reader that the commit ID is going to
be used to retrieve the commit object.
The problem with "A branch refers to a commit object (by its ID)." is
that it introduces some more potential for confusion: it makes it
sound like there might be other ways to refer to a commit object
than by its ID.
Maybe there's another option? To me this introduces the potential
for more confusion and does not solve any specific problem. |
|
This patch series was integrated into seen via git@591e964. |
|
On the Git mailing list, Junio C Hamano wrote (reply to this): "Julia Evans" <julia@jvns.ca> writes:
> The problem with "A branch refers to a commit object (by its ID)." is
Ah, I didn't mean to say "you must use exactly that phrase".
But a branch refers to a commit object; it does not refer to the name
of a commit object.
Perhaps "a branch ref records the object name of a commit object",
would be better? The untold implication of the phrasing is that
anybody who reads what is recorded by that ref can then use the
result to refer to (find) the commit object.
> it introduces some more potential for confusion: it makes it
> sound like there might be other ways to refer to a commit object
> than by its ID.
Yes, there is an unbounded number of ways to refer to a commit object.
$ git show-ref refs/heads/maint
bb5c624209fcaebd60b9572b2cc8c61086e39b57 refs/heads/maint
The branch ref lets you refer to a commit object by recording its
commit object name bb5c6242, but for humans, it is much easier to
refer to the same commit as "v2.51.2^{commit}", which is far more
memorable. Of course I can use master~32^2 to call the same commit
object, which is less memorable but gives us a hint that the tip of
master fully contains that maintenance release. What's more useful
depends on how the name will be used, and the hexadecimal object
names happen to be how refs record the objects they refer to.
|
|
On the Git mailing list, "Julia Evans" wrote (reply to this): On Tue, Nov 4, 2025, at 3:53 PM, Junio C Hamano wrote:
> "Julia Evans" <julia@jvns.ca> writes:
>
>> The problem with "A branch refers to a commit object (by its ID)." is
>
> Ah, I didn't mean to say "you must use exactly that phrase".
>
> But a branch refers to a commit object; it does not refer to the name
> of a commit object.
>
> Perhaps "a branch ref records the object name of a commit object",
> would be better? The untold implication of the phrasing is that
> anybody who reads what is recorded by that ref can then use the
> result to refer to (find) the commit object.
>
>> it introduces some more potential for confusion: it makes it
>> sound like there might be other ways to refer to a commit object
>> than by its ID.
>
> Yes, there is an unbounded number of ways to refer to a commit object.
>
> $ git show-ref refs/heads/maint
> bb5c624209fcaebd60b9572b2cc8c61086e39b57 refs/heads/maint
>
> The branch ref lets you refer to a commit object by recording its
> commit object name bb5c6242, but for humans, it is much easier to
> refer to the same commit as "v2.51.2^{commit}", which is far more
> memorable. Of course I can use master~32^2 to call the same commit
> object, which is less memorable but gives us a hint that the tip of
> master fully contains that maintenance release. What's more useful
> depends on how the name will be used, and the hexadecimal object
> names happen to be how refs record the objects they refer to.
I'm aware that there are other ways to refer to a commit other than its ID, but
as far as I know literally every other way to refer to a commit eventually ends
up going through the commit ID to retrieve the commit.
For example you could use `master~32`, but presumably what that does is
find `master`, look up the commit ID for `master`, and then go through 32
parents until it finds the appropriate commit ID and then looks up the object
corresponding to that ID.
I do not see the point of implying that the commit ID is not "special", or that
it's only one of many ways to find a commit because to me it seems very special,
since there is no way I know of to retrieve a commit that doesn't ultimately
end up using the commit ID at some point. (though that ID might not be encoded
in hexadecimal) |
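The lookup described in this message (find the branch, read the commit ID it records, then walk parents one at a time) can be sketched as a toy model. This is only an illustration under stated assumptions: the IDs `c1`..`c3`, the `objects` and `refs` dictionaries, and the `resolve`/`lookup` helpers are all hypothetical, and only the `<branch>~<n>` form is handled. It is not how Git itself implements revision parsing.

```python
# Toy model: every name is resolved down to a commit ID, and the ID is
# then used to fetch the commit from the object store.
# All IDs and names below are made up for illustration.
objects = {
    "c3": {"type": "commit", "parents": ["c2"]},
    "c2": {"type": "commit", "parents": ["c1"]},
    "c1": {"type": "commit", "parents": []},
}
refs = {"refs/heads/master": "c3"}


def resolve(name: str) -> str:
    """Resolve '<branch>' or '<branch>~<n>' to a commit ID."""
    branch, _, count = name.partition("~")
    commit_id = refs[f"refs/heads/{branch}"]          # the ref records an ID
    for _ in range(int(count or 0)):
        commit_id = objects[commit_id]["parents"][0]  # follow the first parent
    return commit_id


def lookup(name: str) -> dict:
    """Return the commit object that a name ultimately refers to."""
    return objects[resolve(name)]
```

In this sketch every path to a commit object ends with a dictionary lookup by ID, which is the point being argued here; whether that makes the ID "special" in the documentation's wording is the open question in the thread.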
|
On the Git mailing list, Junio C Hamano wrote (reply to this): "Julia Evans" <julia@jvns.ca> writes:
> I do not see the point of implying that the commit ID is not "special", or that
> it's only one of many ways to find a commit because to me it seems very special,
> since there is no way I know of to retrieve a commit that doesn't ultimately
> end up using the commit ID at some point. (though that ID might not be encoded
> in hexadecimal)
That is not what I am trying to say. The hexadecimal name is the
most neutral way to refer to a commit object, and in that sense it
is special. It is the way ref subsystem uses to record the name of
objects, and that makes it special enough.
But that does not mean that the name _is_ the object. The
hexadecimal name is a way you use to name the object, but is not the
object itself, and the special-ness of that name does not change it. |
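The distinction between an object and its name can be made concrete with a small sketch. It assumes a SHA-1 repository (Git also supports SHA-256), and the `store` dictionary and the ref name `refs/example/greeting` are hypothetical stand-ins for the object database and the ref subsystem, not Git's actual storage.

```python
import hashlib


def object_id(obj_type: str, content: bytes) -> str:
    """SHA-1 of '<type> <size>', a NUL byte, then the content."""
    header = f"{obj_type} {len(content)}".encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


store = {}                      # toy object store: ID -> object content
content = b"hello world\n"
blob_id = object_id("blob", content)
store[blob_id] = content

# A ref records the ID as text; the ref name here is hypothetical.
refs = {"refs/example/greeting": blob_id}

recorded_name = refs["refs/example/greeting"]  # what the ref records: a name
found = store[recorded_name]                   # what the name leads to: the object
```

Here `recorded_name` is a 40-character hexadecimal string and `found` is the object content. The ref holds the former, and anybody who reads it can use it to find the latter.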
|
On the Git mailing list, "Julia Evans" wrote (reply to this): On Tue, Nov 4, 2025, at 6:45 PM, Junio C Hamano wrote:
> "Julia Evans" <julia@jvns.ca> writes:
>
>> I do not see the point of implying that the commit ID is not "special", or that
>> it's only one of many ways to find a commit because to me it seems very special,
>> since there is no way I know of to retrieve a commit that doesn't ultimately
>> end up using the commit ID at some point. (though that ID might not be encoded
>> in hexadecimal)
>
> That is not what I am trying to say. The hexadecimal name is the
> most neutral way to refer to a commit object, and in that sense it
> is special. It is the way ref subsystem uses to record the name of
> objects, and that makes it special enough.
>
> But that does not mean that the name _is_ the object. The
> hexadecimal name is a way you use to name the object, but is not the
> object itself, and the special-ness of that name does not change it.
Okay. I still do not understand at all why this is so important to you
(for the reasons I mentioned before) but I'll see if there's anything I can do. |
|
On the Git mailing list, Ben Knoble wrote (reply to this): > Le 4 nov. 2025 à 19:02, Julia Evans <julia@jvns.ca> a écrit :
>
>
>
>> On Tue, Nov 4, 2025, at 6:45 PM, Junio C Hamano wrote:
>> "Julia Evans" <julia@jvns.ca> writes:
>>> I do not see the point of implying that the commit ID is not "special", or that
>>> it's only one of many ways to find a commit because to me it seems very special,
>>> since there is no way I know of to retrieve a commit that doesn't ultimately
>>> end up using the commit ID at some point. (though that ID might not be encoded
>>> in hexadecimal)
>> That is not what I am trying to say. The hexadecimal name is the
>> most neutral way to refer to a commit object, and in that sense it
>> is special. It is the way ref subsystem uses to record the name of
>> objects, and that makes it special enough.
>> But that does not mean that the name _is_ the object. The
>> hexadecimal name is a way you use to name the object, but is not the
>> object itself, and the special-ness of that name does not change it.
>
> Okay. I still do not understand at all why this is so important to you
> (for the reasons I mentioned before) but I'll see if there's anything I can do.
Perhaps one way to look at it is, what diagram would I draw given different textual explanations?
The diagram we _want_ folks to draw (?) is the one where a branch points at a commit [a circle, perhaps], which points to a tree [triangle] and recursively blobs [squares], like I’ve seen Stolee draw for GitHub blogs.
We might also want folks to label the arrows with names, or not.
One way to interpret the “branch refers to a commit ID” might be to draw a diagram where the branch points to an ID label, and to find the circle you have to separately consult a different part of the diagram.
Both seem useful to me, though as the former has fewer moving pieces, it might be better for the model this document describes? I dunno. |
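The first diagram (branch, then commit, then tree, then blobs) can also be written down as a data structure. The following is a minimal sketch with made-up IDs and a hypothetical `list_files` helper. In it the arrows of the diagram are stored as IDs, so following an arrow means one more lookup in the object store.

```python
# Toy object store keyed by made-up IDs; real IDs are hashes.
objects = {
    "c1": {"type": "commit", "tree": "t1", "parents": []},
    "t1": {"type": "tree", "entries": {"hello.py": "b1", "lib": "t2"}},
    "t2": {"type": "tree", "entries": {"util.py": "b2"}},
    "b1": {"type": "blob", "data": b"print('hello')\n"},
    "b2": {"type": "blob", "data": b"# helpers\n"},
}
refs = {"refs/heads/main": "c1"}


def list_files(tree_id: str, prefix: str = "") -> dict:
    """Walk a tree recursively and map each path to its blob ID."""
    files = {}
    for name, child_id in objects[tree_id]["entries"].items():
        child = objects[child_id]
        if child["type"] == "tree":
            files.update(list_files(child_id, f"{prefix}{name}/"))
        else:
            files[prefix + name] = child_id
    return files


commit = objects[refs["refs/heads/main"]]   # branch -> commit
snapshot = list_files(commit["tree"])       # commit -> tree -> blobs
```

Both readings of the diagram are visible in this sketch: conceptually the branch leads to the commit, while mechanically it holds the string `"c1"` that has to be looked up.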
|
This patch series was integrated into seen via git@b695592. |
|
On the Git mailing list, "Julia Evans" wrote (reply to this): On Tue, Nov 4, 2025, at 10:21 PM, Ben Knoble wrote:
>> Le 4 nov. 2025 à 19:02, Julia Evans <julia@jvns.ca> a écrit :
>>
>>
>>
>>> On Tue, Nov 4, 2025, at 6:45 PM, Junio C Hamano wrote:
>>> "Julia Evans" <julia@jvns.ca> writes:
>>>> I do not see the point of implying that the commit ID is not "special", or that
>>>> it's only one of many ways to find a commit because to me it seems very special,
>>>> since there is no way I know of to retrieve a commit that doesn't ultimately
>>>> end up using the commit ID at some point. (though that ID might not be encoded
>>>> in hexadecimal)
>>> That is not what I am trying to say. The hexadecimal name is the
>>> most neutral way to refer to a commit object, and in that sense it
>>> is special. It is the way ref subsystem uses to record the name of
>>> objects, and that makes it special enough.
>>> But that does not mean that the name _is_ the object. The
>>> hexadecimal name is a way you use to name the object, but is not the
>>> object itself, and the special-ness of that name does not change it.
>>
>> Okay. I still do not understand at all why this is so important to you
>> (for the reasons I mentioned before) but I'll see if there's anything I can do.
>
> Perhaps one way to look at is, what diagram would I draw given
> different textual explanations?
>
> The diagram we _want_ folks to draw (?) is the one where a branch
> points at a commit [a circle, perhaps], which points to a tree
> [triangle] and recursively blobs [squares], like I’ve seen Stolee draw
> for GitHub blogs.
>
> We might also want folks to label the arrows with names, or not.
>
> One way to interpret the “branch refers to a commit ID” might be to
> draw a diagram where the branch points to an ID label, and to find the
> circle you have to separately consult a different part of the diagram.
Yes, the most common type of Git diagram I see is something like this:
https://git-scm.com/book/en/v2/images/head-to-master.png
which only includes references, commits, and HEAD.
That's the diagram I have in mind when writing this text, and I think it's
a useful and accurate diagram to keep in mind, and it's one that you see
very often when using Git tools, including in `git log --graph`. (it's not
a _complete_ diagram of every type of object, but diagrams do not need to be
complete to be accurate)
I personally would not use a graph diagram to explain how commits relate to
trees and blobs (normally I use `git cat-file -p` instead, like I did in this
`gitdatamodel` document. You can see this comic for a "visual" example of how
I've approached discussing trees and blobs in the past with `git cat-file -p`
https://wizardzines.com/comics/explore-a-commit/).
> Both seem useful to me, though as the former has fewer moving pieces
> might be better for the model this document describes? I dunno. |
|
This patch series was integrated into seen via git@7ac718c. |
|
On the Git mailing list, Ben Knoble wrote (reply to this): > Le 5 nov. 2025 à 11:27, Julia Evans <julia@jvns.ca> a écrit :
>
>
>
> On Tue, Nov 4, 2025, at 10:21 PM, Ben Knoble wrote:
>>>> Le 4 nov. 2025 à 19:02, Julia Evans <julia@jvns.ca> a écrit :
>>>
>>>
>>>
>>>> On Tue, Nov 4, 2025, at 6:45 PM, Junio C Hamano wrote:
>>>> "Julia Evans" <julia@jvns.ca> writes:
>>>>> I do not see the point of implying that the commit ID is not "special", or that
>>>>> it's only one of many ways to find a commit because to me it seems very special,
>>>>> since there is no way I know of to retrieve a commit that doesn't ultimately
>>>>> end up using the commit ID at some point. (though that ID might not be encoded
>>>>> in hexadecimal)
>>>> That is not what I am trying to say. The hexadecimal name is the
>>>> most neutral way to refer to a commit object, and in that sense it
>>>> is special. It is the way ref subsystem uses to record the name of
>>>> objects, and that makes it special enough.
>>>> But that does not mean that the name _is_ the object. The
>>>> hexadecimal name is a way you use to name the object, but is not the
>>>> object itself, and the special-ness of that name does not change it.
>>>
>>> Okay. I still do not understand at all why this is so important to you
>>> (for the reasons I mentioned before) but I'll see if there's anything I can do.
>>
>> Perhaps one way to look at is, what diagram would I draw given
>> different textual explanations?
>>
>> The diagram we _want_ folks to draw (?) is the one where a branch
>> points at a commit [a circle, perhaps], which points to a tree
>> [triangle] and recursively blobs [squares], like I’ve seen Stolee draw
>> for GitHub blogs.
>>
>> We might also want folks to label the arrows with names, or not.
>>
>> One way to interpret the “branch refers to a commit ID” might be to
>> draw a diagram where the branch points to an ID label, and to find the
>> circle you have to separately consult a different part of the diagram.
>
> Yes, the most common type of Git diagram I see is something like this:
> https://git-scm.com/book/en/v2/images/head-to-master.png
> which only includes references, commits, and HEAD.
>
> That's the diagram I have in mind when writing this text, and I think it's
> a useful and accurate diagram to keep in mind, and it's one that you see
> very often when using Git tools, including in `git log --graph`. (it's not
> a _complete_ diagram of every type of object, but diagrams do not need to be
> complete to be accurate)
>
> I personally would not use a graph diagram to explain how commits relate to
> trees and blobs (normally I use `git cat-file -p` instead, like I did in this
> `gitdatamodel` document. You can see this comic for a "visual" example of how
> I've approached discussing trees and blobs in the past with `git cat-file -p`
> https://wizardzines.com/comics/explore-a-commit/).
Fair enough. Here’s a post I “stole” ;) the shapes from, for posterity:
https://github.blog/open-source/git/commits-are-snapshots-not-diffs
My larger point was: since these are the diagrams I’m imagining we want to convey to a reader, perhaps ID can be omitted for brevity? IOW, the relationship between objects is the thing to highlight.
OTOH, when exploring the data, especially at the plumbing level it seems we have to do the “pointer-chasing” ourselves (see cat-file).
So idk. |
|
There was a status update in the "Cooking" section about the branch Add a new manual that describes the data model. Expecting a (hopefully small and final) reroll? cf. <aQhcZwv0PdwNc6RW@pks.im> source: <pull.1981.v5.git.1761856336360.gitgitgadget@gmail.com> |
|
This patch series was integrated into seen via git@6dc2e88. |
|
This patch series was integrated into seen via git@53cd7c1. |
|
This patch series was integrated into seen via git@2c12d3a. |
|
This patch series was integrated into seen via git@20a4d60. |
|
This patch series was integrated into seen via git@401bbf4. |
|
There was a status update in the "Cooking" section about the branch Add a new manual that describes the data model. Expecting a (hopefully small and final) reroll? cf. <aQhcZwv0PdwNc6RW@pks.im> source: <pull.1981.v5.git.1761856336360.gitgitgadget@gmail.com> |
|
/submit |
|
Submitted as pull.1981.v6.git.1762545177204.gitgitgadget@gmail.com To fetch this version into To fetch this version to local tag |
|
On the Git mailing list, Junio C Hamano wrote (reply to this): "Julia Evans via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Julia Evans <julia@jvns.ca>
>
> Git very often uses the terms "object", "reference", or "index" in its
> documentation.
Not about the updated text (which I haven't carefully read yet), but
we'd need this squashed in to avoid xml that does not validate when
using AsciiDoc (not Asciidoctor) to format gitdatamodel.7
documentation.
XMLTO gitdatamodel.7
xmlto: /home/gitster/w/git.git/Documentation/gitdatamodel.xml does not validate (status 3)
xmlto: Fix document syntax or use --skip-validation option
Document /home/gitster/w/git.git/Documentation/gitdatamodel.xml does not validate
Perhaps I forgot to send this after queuing the previous round, even
though it was queued on top of the previous round in 'seen'. The
patch still applies cleanly to this version, and seems to fix the
breakage for me.
... goes and looks ...
Ah, no, I did not forget. The same patch is in the review thread of
the previous round:
https://lore.kernel.org/git/xmqqcy62213a.fsf@gitster.g/
Documentation/gitdatamodel.adoc | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/Documentation/gitdatamodel.adoc b/Documentation/gitdatamodel.adoc
index 1cefbb4833..eaab3f800b 100644
--- a/Documentation/gitdatamodel.adoc
+++ b/Documentation/gitdatamodel.adoc
@@ -18,13 +18,13 @@ means when the documentation says "object", "reference" or "index".
Git's core operations use 4 kinds of data:
-1. <<objects,Objects>>: commits, trees, blobs, and tag objects
+1. <<object,Objects>>: commits, trees, blobs, and tag objects
2. <<references,References>>: branches, tags,
remote-tracking branches, etc
3. <<index,The index>>, also known as the staging area
4. <<reflogs,Reflogs>>: logs of changes to references ("ref log")
-[[objects]]
+[[object]]
OBJECTS
-------
--
2.52.0-rc1-455-g30608eb744
|
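The validation failure is described later in the thread as a broken xml id/idref pair. A minimal sketch of a check for that class of problem follows, assuming it shows up as an AsciiDoc cross-reference `<<id,label>>` whose target has no matching `[[id]]` anchor. The regular expressions and the `dangling_xrefs` helper are hypothetical and cover only these two syntaxes; this is not part of Git's documentation build.

```python
import re

ANCHOR = re.compile(r"\[\[([^\]]+)\]\]")       # [[id]]
XREF = re.compile(r"<<([^,>]+)(?:,[^>]*)?>>")  # <<id>> or <<id,label>>


def dangling_xrefs(text: str) -> set:
    """Return cross-reference targets that have no matching anchor."""
    anchors = set(ANCHOR.findall(text))
    targets = set(XREF.findall(text))
    return targets - anchors
```

For example, a document containing `<<object,object type>>` but only the anchor `[[objects]]` would report `{"object"}`, and an empty set once the anchor is renamed to `[[object]]`.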
|
On the Git mailing list, Junio C Hamano wrote (reply to this): "Julia Evans via GitGitGadget" <gitgitgadget@gmail.com> writes:
> changes in v6:
>
> * Make punctuation more consistent (from Patrick's review)
Good.
> * Explain more about when exactly amended commits will get deleted
> (when their reflog entry expires), from Junio's review
Looked good.
> * Be more explicit that there are only 5 file modes in Git (from
> Junio's review)
I find "These are all of the file modes in Git" hard to read and
understand. More importantly, it does not imply strongly enough that
we won't be adding any others, compared to something like "Git uses
only the following modes to represent the objects it stores".
> * Make tag object description clearer (from Junio's review)
OK.
> * We had a long discussion about the phrasing of "A branch refers to a
> commit ID" but I didn't come up with any ideas for how to improve the
> phrasing so I left it as is.
I gave you something that is clearly an improvement there, though.
Just like a tag object records "the ID of the object it references",
a branch records "the ID of the commit it references".
Another thing we discussed and a better alternative offered during
the last round was "base directory", to which Patrick mentioned
"we rather consistently use 'root tree'"
cf. https://lore.kernel.org/git/aQhcbHJjiI5GtV6Y@pks.im/
Other than a few minor points I pointed out above, and the broken
xml id/idref that does not validate, this round looks good to me.
Thanks.
|
|
On the Git mailing list, "Julia Evans" wrote (reply to this): On Fri, Nov 7, 2025, at 4:23 PM, Junio C Hamano wrote:
> "Julia Evans via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> changes in v6:
>>
>> * Make punctuation more consistent (from Patrick's review)
>
> Good.
>
>> * Explain more about when exactly amended commits will get deleted
>> (when their reflog entry expires), from Junio's review
>
> Looked good.
>
>> * Be more explicit that there are only 5 file modes in Git (from
>> Junio's review)
>
> I find "These are all of the file modes in Git" hard to read and
> understand, and more importantly, does not imply that we won't be
> adding any others strongly enough, than something like "Git uses
> only the following modes to represent the objects it stores".
>
>> * Make tag object description clearer (from Junio's review)
I wonder if it would help to de-emphasize the octal representation
of the file modes, and instead give them names, since (from a
data model perspective) Git's file modes are really more like an enum with
5 values than like Unix file modes.
Something like this:
Git has 5 file modes:
- *regular file* (with <<object,object type>> `blob`)
- *executable file* (with type `blob`)
- *symbolic link* (with type `blob`)
- *directory* (with type `tree`)
- *gitlink*, for use with submodules (with type `commit`)
NOTE: Git normally displays file modes in the same format as Unix file modes
(100644, 100755, 120000, 040000, and 160000 respectively), but file modes are
only spiritually related to Unix file modes.
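The "enum with 5 values" idea can be sketched directly. The class name `FileMode`, its member names, and the `parse_mode` helper are hypothetical. The octal strings and the object types are the ones listed in the proposal above.

```python
from enum import Enum


class FileMode(Enum):
    """The five entry kinds listed above, with their displayed octal modes."""
    REGULAR_FILE = "100644"
    EXECUTABLE_FILE = "100755"
    SYMBOLIC_LINK = "120000"
    DIRECTORY = "040000"
    GITLINK = "160000"

    @property
    def object_type(self) -> str:
        """Type of the object a tree entry with this mode points at."""
        if self is FileMode.DIRECTORY:
            return "tree"
        if self is FileMode.GITLINK:
            return "commit"
        return "blob"


def parse_mode(text: str) -> FileMode:
    """Accept only the five known modes; anything else raises ValueError."""
    return FileMode(text)
```

Treating the modes as a closed enum also conveys the "only these and nothing else" point from the review: a value such as `100664` is rejected rather than interpreted as Unix permission bits.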
> OK.
>
>> * We had a long discussion about the phrasing of "A branch refers to a
>> commit ID" but I didn't come up with any ideas for how to improve the
>> phrasing so I left it as is.
>
> I gave you something that is clearly an improvement there, though.
> Just like a tag object records "the ID of the object it references",
> a branch records "the ID of the commit it references".
To me an "improvement" is something that helps the reader understand Git's
data model, and I do not understand in what way this rephrasing helps the
reader, or how you think the current phrasing might cause confusion for the
reader.
From my point of view "a branch refers to a commit ID" clearly means the exact
same thing as "a branch records the ID of the commit it references" and
"a branch records the ID of the commit it references" is just a less clear and
more indirect way to communicate that.
> Another thing we discussed and a better alternative offered during
> the last round was "base directory", to which Patrick mentioned
> "we rather consistently use 'root tree'"
>
> cf. https://lore.kernel.org/git/aQhcbHJjiI5GtV6Y@pks.im/
I think it would be better to stick with "directory" here, because I've gotten
several reader comments saying that they do not understand the
term "tree" when it is used as a synonym for "directory".
Maybe "root directory"?
> Other than a few minor points I pointed out above, and the broken
> xml id/idref that does not validate, this round looks good to me.
Will fix the broken XML. |
|
On the Git mailing list, Junio C Hamano wrote (reply to this): "Julia Evans" <julia@jvns.ca> writes:
> I wonder if it would help to de-emphasize the octal representation
> of the file modes, and instead give them names since (from a
> data model section Git's file modes are really more like an enum with
> 5 values than )
>
> Something like this:
>
> Git has 5 file modes:
>
> - *regular file* (with <<object,object type>> `blob`)
> - *executable file* (with type `blob`)
> - *symbolic link* (with type `blob`)
> - *directory* (with type `tree`)
> - *gitlink*, for use with submodules (with type `commit`)
>
> NOTE: Git normally displays file modes in the same format as Unix file modes
> (100644, 100755, 120000, 040000, and 160000 respectively), but file modes are
> only spiritually related to Unix file modes.
Then, I would suggest de-emphasizing the "file modes" even
more.
* Git stores/tracks 5 different file types, which are
non-executable files, executable files, symbolic links,
directories, and gitlinks.
* Git uses one bitpattern each to mark these 5 different kinds
of things in tree objects. These bitpatterns were loosely
modelled after UNIX file mode bits.
The first half entirely avoids saying "mode" and that is very
deliberate.
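To illustrate "loosely modelled after UNIX file mode bits", here is a small sketch that classifies the five bit patterns using the file-type bits that Python's standard `stat` module understands. The `MODES` table, the `GITLINK_TYPE` constant name, and the `kind` helper are hypothetical. The octal values are the ones quoted earlier in this thread.

```python
import stat

# Displayed modes from the discussion, as octal integers.
MODES = {
    "non-executable file": 0o100644,
    "executable file": 0o100755,
    "symbolic link": 0o120000,
    "directory": 0o040000,
    "gitlink": 0o160000,
}

GITLINK_TYPE = 0o160000  # not a standard Unix file type; the name is ours


def kind(mode: int) -> str:
    """Classify one of the five bit patterns; reject everything else."""
    if mode not in MODES.values():
        raise ValueError(f"not a mode Git uses: {mode:06o}")
    if stat.S_IFMT(mode) == GITLINK_TYPE:   # compare the file-type bits
        return "gitlink"
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISLNK(mode):
        return "symbolic link"
    # Only the two regular-file patterns remain; the execute bits differ.
    return "executable file" if mode & 0o111 else "non-executable file"
```

Four of the five patterns line up with ordinary Unix file types, while the gitlink pattern has no Unix counterpart, which is one reason to describe the relationship as loose.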
> ... I do not understand in what way this rephrasing helps the
> reader, or how you think the current phrasing might cause confusion for the
> reader.
A branch (or any ref) does *not* *REFERENCE* an ID. They refer to
objects by *recording* an ID. The distinction is not clear with
your wording.
>> Another thing we discussed and a better alternative offered during
>> the last round was "base directory", to which Patrick mentioned
>> "we rather consistently use 'root tree'"
>>
>> cf. https://lore.kernel.org/git/aQhcbHJjiI5GtV6Y@pks.im/
>
> I think it would be better to stick with "directory" here, because I've gotten
> several reader comments saying that they do not understand the
> term "tree" when it is used as a synonym for "directory".
>
> Maybe "root directory"?
I am OK with "root" but that is conditional; only if it is not used
together with the word "directory". We are not talking about "root
directory" where common directories like /usr, /etc, /dev and /tmp
hang immediately below. If we use the word "directory", I'd
strongly prefer to see it with adjective like "top-level" that
implies that it is something different from "root directory" but is
relative to the project in question.
Thanks. |
|
On the Git mailing list, Junio C Hamano wrote (reply to this): Junio C Hamano <gitster@pobox.com> writes:
> "Julia Evans" <julia@jvns.ca> writes:
>
>> I wonder if it would help to de-emphasize the octal representation
>> of the file modes, and instead give them names since (from a
>> data model section Git's file modes are really more like an enum with
>> 5 values than )
>>
>> Something like this:
>>
>> Git has 5 file modes:
>>
>> - *regular file* (with <<object,object type>> `blob`)
>> - *executable file* (with type `blob`)
>> - *symbolic link* (with type `blob`)
>> - *directory* (with type `tree`)
>> - *gitlink*, for use with submodules (with type `commit`)
>>
>> NOTE: Git normally displays file modes in the same format as Unix file modes
>> (100644, 100755, 120000, 040000, and 160000 respectively), but file modes are
>> only spiritually related to Unix file modes.
>
> Then, I would suggest further deemphasizing the "file modes" even
> more.
>
> * Git stores/tracks 5 different file types, which are
> non-executable files, executable files, symbolic links,
> directories, and gitlinks.
>
> * Git uses one bitpattern each to mark these 5 different kinds
> of things in tree objects. These bitpatterns were loosely
> modelled after UNIX file mode bits.
>
> The first half entirely avoids saying "mode" and that is very
> deliberate.
>
>>> Another thing we discussed and a better alternative offered during
>>> the last round was "base directory", to which Patrick mentioned
>>> "we rather consistently use 'root tree'"
>>>
>>> cf. https://lore.kernel.org/git/aQhcbHJjiI5GtV6Y@pks.im/
>>
>> I think it would be better to stick with "directory" here, because I've gotten
>> several reader comments saying that they do not understand the
>> term "tree" when it is used as a synonym for "directory".
>>
>> Maybe "root directory"?
>
> I am OK with "root" but that is conditional; only if it is not used
> together with the word "directory". We are not talking about "root
> directory" where common directories like /usr, /etc, /dev and /tmp
> hang immediately below. If we use the word "directory", I'd
> strongly prefer to see it with an adjective like "top-level" that
> implies that it is something different from "root directory" but is
> relative to the project in question.
The above two points should probably be trivial to address. I've
already squashed in the xml validation fixes to [v6], so let's
finish the rest quickly.
I have no more words to offer somebody, who says she does not know
why saying "branch records ID of the commit it refers to" is an
improvement over "branch refers to ID of the commit", when she
already accepts that "The *ID* of the object it references" is a
better way than "The object *ID* it references" to describe one of
the fields in an annotated tag object. So I wouldn't mind if v7
still said "branch refers to commit id". We can update it with
follow-up series as needed, and it is not worth blocking the rest of
the document.
Refs (including branches) refer to objects exactly the same way an
annotated tag refers to another object, or a tree entry in a tree
object refers to a blob, tree, or a commit object. Recording the
hexadecimal hash is an implementation detail of how they
reference the object, and the phrasing used for the tag field in an
annotated tag reflects that by clearly distinguishing
- recording the ID
- referring to the object
as two separate things. The former is merely a means to the end
which is the latter, i.e. the purpose of refs, tree-entry in a tree,
tag field in a tag object, and all other things that refer to an
object by recording its ID.
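
The distinction between recording an ID and referring to an object can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Git stores anything: `objects`, `refs`, `store`, and `resolve` are hypothetical names, the commit body is a placeholder rather than a real commit, and the ID is computed as the SHA-1 of a `<type> <size>\0` header plus the content, which matches Git's default object hashing.

```python
import hashlib

objects = {}  # object ID -> (type, content); a stand-in for the object store
refs = {}     # ref name -> object ID; a ref records an ID and nothing else


def store(obj_type: str, content: bytes) -> str:
    """Add an object to the toy store and return its ID."""
    header = f"{obj_type} {len(content)}".encode() + b"\0"
    oid = hashlib.sha1(header + content).hexdigest()
    objects[oid] = (obj_type, content)
    return oid


def resolve(ref_name: str):
    """Return the object a ref refers to, found via the ID the ref records."""
    recorded_id = refs[ref_name]   # the means: a recorded ID
    return objects[recorded_id]    # the end: the object itself


# Placeholder content, not a valid commit body.
commit_id = store("commit", b"placeholder commit body\n")
refs["refs/heads/main"] = commit_id
```

Here `refs["refs/heads/main"]` holds only a hexadecimal string, while `resolve("refs/heads/main")` returns the object that string identifies.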
|
Changes in v2:
The biggest change is to remove all mentions of the `.git` directory, and explain references in a way that doesn't refer to "directories" at all, and instead talks about the "hierarchy" (from Kristoffer and Patrick's reviews).
Also:
`git gc` a little higher level and took some ideas from Patrick's suggested wording (from Patrick's and Kristoffer's reviews)
`git gc`, since it perhaps opens up too much of a rabbit hole: "how does `git gc` decide which commits to clean up?". (from Kristoffer's review)
`man git-config`
non-changes:
`tag v1.0.0`) but I didn't mention it yet because I couldn't figure out what the purpose of that field is (I thought the tag name was stored in the reference, why is it duplicated in the tag object?)
Changes in v3:
I asked for feedback from Git users on Mastodon and got 220 pieces of feedback from 48 different users. People seemed very excited to read about Git's data model. Usually I judge explanations by what folks report learning from them. Here people reported learning:
`git add` a file, Git will create an object
Also (of course) there were quite a few points of confusion! The main 4 pieces of feedback were
`.git` directory, which I'd removed in v2. This seems most important for `.git/refs`, so I added a hopefully accurate note about how refs are stored by default, with a comment about one of the major implications. I did not discuss where objects or the index are stored, because I don't think the implementation details of how objects are stored are as important, and there are better tools for viewing the "raw" state of objects and the index (with `git cat-file -p` or `git ls-files --staged`).
Here's every other change I made in response to the feedback, as well as a few comments that I did not address.
intro:
objects:
`git ls-files --stage` as a way to view the index, so add `git cat-file -p` as well in a note
commits:
`git cherry-pick` that I'm not 100% happy with (what if the reader doesn't know what cherry-pick does?). There might be a better example to give here.
trees:
tag objects:
tags:
`git tag -f`). Say instead that tags are "usually" not changed.
HEAD:
`git switch`). I don't think we can get into all of that here, so refer to the DETACHED HEAD section of `git-checkout` instead. I'm not totally happy with the current version of that section but that seems like the most practical solution right now.
remote-tracking branches:
`refs/remotes/<remote>/HEAD`.
the index:
reflogs
Not fixed:
`HEAD: HEAD` thing looks weird, it made more sense when it was `HEAD: .git/HEAD`. Will think about this.
`git reflog show` doesn't list the user who made the change. `git reflog show <refname> --format="%h | %gd | %gn <%ge> | %gs" --date=iso` seems to work but it's really a mouthful, not sure it's useful to include all that.
IDREF attribute linkend references an unknown ID "tree")
changes in v4:
This is a combination of trying to make some of the intro text a little more "friendly" for someone new to Git's data model, avoiding implying things that are false, and removing information that isn't relevant to the data model.
intro:
objects:
`git cat-file -p` does, since it might be misleading and if people want to know they can read the man page (from Junio's review)
commits:
`git show`" (from Junio's review)
trees:
blobs:
branches:
`.git` (from Junio's review)
HEAD:
index:
reflog:
`git reflog main` in the example instead of the contents of the reflog file, to avoid showing the user and before commit ID
changes in v5:
Mostly smaller tweaks this time. The only major addition is to add a note about how unreachable objects may be deleted.
From Junio's review:
`git ls-files`
changes in v6:
cc: "Kristoffer Haugsbakk" kristofferhaugsbakk@fastmail.com
cc: "D. Ben Knoble" ben.knoble@gmail.com
cc: Patrick Steinhardt ps@pks.im