Commit 54fe83a

bk2204 authored and gitster committed
rust: add a new binary loose object map format
Our current loose object format has a few problems. First, it is not efficient: the list of object IDs is not sorted and even if it were, there would not be an efficient way to look up objects in both algorithms.

Second, we need to store mappings for things which are not technically loose objects but are not packed objects, either, and so cannot be stored in a pack index. These kinds of things include shallows, their parents, and their trees, as well as submodules. Yet we also need to implement a sensible way to store the kind of object so that we can prune unneeded entries. For instance, if the user has updated the shallows, we can remove the old values.

For these reasons, introduce a new binary loose object map format. The careful reader will notice that it resembles very closely the pack index v3 format. Add an in-memory loose object map as well, and allow enabling writing to a batched map, which can then be written later as one of the binary loose object maps. Include several tests for round tripping and data lookup across algorithms.

Note that the use of this code elsewhere in Git will involve some C code and some C-compatible code in Rust that will be introduced in a future commit. Thus, for example, we ignore the fact that if there is no current batch and the caller asks for data to be written, this code does nothing, mostly because this code also does not involve itself with opening or manipulating files. The C code that we will add later will implement this functionality at a higher level and take care of this, since the code which is necessary for writing to the object store is deeply involved with our C abstractions and it would require extensive work (which would not be especially valuable at this point) to port those to Rust.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
1 parent bfcd218 commit 54fe83a

5 files changed, +1019 -0 lines changed

Documentation/gitformat-loose.adoc

Lines changed: 104 additions & 0 deletions
@@ -10,6 +10,8 @@ SYNOPSIS
 --------
 [verse]
 $GIT_DIR/objects/[0-9a-f][0-9a-f]/*
+$GIT_DIR/objects/loose-object-idx
+$GIT_DIR/objects/loose-map/map-*.map
 
 DESCRIPTION
 -----------
@@ -48,6 +50,108 @@ stored under
 Similarly, a blob containing the contents `abc` would have the uncompressed
 data of `blob 3\0abc`.
 
+== Loose object mapping
+
+When the `compatObjectFormat` option is used, Git needs to store a mapping
+between the repository's main algorithm and the compatibility algorithm. There
+are two formats for this: the legacy mapping and the modern mapping.
+
+=== Legacy mapping
+
+The compatibility mapping is stored in a file called
+`$GIT_DIR/objects/loose-object-idx`. The format of this file looks like this:
+
+# loose-object-idx
+(main-name SP compat-name LF)*
+
+`main-name` refers to hexadecimal object ID of the object in the main
+repository format and `compat-name` refers to the same thing, but for the
+compatibility format.
+
+This format is read if it exists but is not written.
+
+Note that carriage returns are not permitted in this file, regardless of the
+host system or configuration.
+
+=== Modern mapping
+
+The modern mapping consists of a set of files under `$GIT_DIR/objects/loose`
+ending in `.map`. The portion of the filename before the extension is that of
+the hash checksum in hex format.
+
+`git pack-objects` will repack existing entries into one file, removing any
+unnecessary objects, such as obsolete shallow entries or loose objects that
+have been packed.
+
+==== Mapping file format
+
+- A header appears at the beginning and consists of the following:
+* A 4-byte mapping signature: `LMAP`
+* 4-byte version number: 1
+* 4-byte length of the header section.
+* 4-byte number of objects declared in this map file.
+* 4-byte number of object formats declared in this map file.
+* For each object format:
+** 4-byte format identifier (e.g., `sha1` for SHA-1)
+** 4-byte length in bytes of shortened object names. This is the
+shortest possible length needed to make names in the shortened
+object name table unambiguous.
+** 8-byte integer, recording where tables relating to this format
+are stored in this index file, as an offset from the beginning.
+* 8-byte offset to the trailer from the beginning of this file.
+* Zero or more additional key/value pairs (4-byte key, 4-byte value), which
+may optionally declare one or more chunks. No chunks are currently
+defined. Readers must ignore unrecognized keys.
+- Zero or more NUL bytes. These are used to improve the alignment of the
+4-byte quantities below.
+- Tables for the first object format:
+* A sorted table of shortened object names. These are prefixes of the names
+of all objects in this file, packed together without offset values to
+reduce the cache footprint of the binary search for a specific object name.
+* A sorted table of full object names.
+* A table of 4-byte metadata values.
+* Zero or more chunks. A chunk starts with a four-byte chunk identifier and
+a four-byte parameter (which, if unneeded, is all zeros) and an eight-byte
+size (not including the identifier, parameter, or size), plus the chunk
+data.
+- Zero or more NUL bytes.
+- Tables for subsequent object formats:
+* A sorted table of shortened object names. These are prefixes of the names
+of all objects in this file, packed together without offset values to
+reduce the cache footprint of the binary search for a specific object name.
+* A table of full object names in the order specified by the first object format.
+* A table of 4-byte values mapping object name order to the order of the
+first object format. For an object in the table of sorted shortened object
+names, the value at the corresponding index in this table is the index in
+the previous table for that same object.
+* Zero or more NUL bytes.
+- The trailer consists of the following:
+* Hash checksum of all of the above.
+
+The lower six bits of each metadata table contain a type field indicating the
+reason that this object is stored:
+
+0::
+Reserved.
+1::
+This object is stored as a loose object in the repository.
+2::
+This object is a shallow entry. The mapping refers to a shallow value
+returned by a remote server.
+3::
+This object is a submodule entry. The mapping refers to the commit stored
+representing a submodule.
+
+Other data may be stored in this field in the future. Bits that are not used
+must be zero.
+
+All 4-byte numbers are in network order and must be 4-byte aligned in the file,
+so the NUL padding may be required in some cases.
+
+Note that the hash at the end of the file is in whatever the repository's main
+algorithm is. In the usual case when there are multiple algorithms, the main
+algorithm will be SHA-256 and the compatibility algorithm will be SHA-1.
+
 GIT
 ---
 Part of the linkgit:git[1] suite
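
To make the header layout from the "Mapping file format" section concrete, here is a minimal Rust sketch of a reader for the fixed header fields. It is not the code added in `src/loose.rs`, which this page does not show. The names `Reader`, `FormatEntry`, `MapHeader`, and `parse_header` are hypothetical, and the sketch stops after the trailer offset, leaving the optional key/value pairs unparsed.

```rust
/// One per-format record from the header (hypothetical type name).
#[derive(Debug, PartialEq)]
struct FormatEntry {
    id: [u8; 4],        // 4-byte format identifier, e.g. b"sha1"
    short_len: u32,     // length in bytes of shortened object names
    tables_offset: u64, // offset of this format's tables from the file start
}

/// Fixed part of the map header (hypothetical type name).
#[derive(Debug, PartialEq)]
struct MapHeader {
    version: u32,
    header_len: u32,
    num_objects: u32,
    formats: Vec<FormatEntry>,
    trailer_offset: u64,
}

/// Bounds-checked cursor over a byte slice.
struct Reader<'a> {
    buf: &'a [u8],
    pos: usize,
}

impl<'a> Reader<'a> {
    /// Read exactly N bytes, or return None if the buffer is too short.
    fn take<const N: usize>(&mut self) -> Option<[u8; N]> {
        let end = self.pos.checked_add(N)?;
        let bytes = self.buf.get(self.pos..end)?;
        let mut out = [0u8; N];
        out.copy_from_slice(bytes);
        self.pos = end;
        Some(out)
    }

    /// All multi-byte integers are in network (big-endian) order.
    fn u32(&mut self) -> Option<u32> {
        self.take::<4>().map(u32::from_be_bytes)
    }

    fn u64(&mut self) -> Option<u64> {
        self.take::<8>().map(u64::from_be_bytes)
    }
}

/// Parse the signature, version, counts, per-format records, and trailer offset.
/// Returns None on a short buffer, a wrong signature, or a version other than 1.
fn parse_header(buf: &[u8]) -> Option<MapHeader> {
    let mut r = Reader { buf, pos: 0 };
    let sig: [u8; 4] = r.take()?;
    if &sig != b"LMAP" {
        return None;
    }
    let version = r.u32()?;
    if version != 1 {
        return None;
    }
    let header_len = r.u32()?;
    let num_objects = r.u32()?;
    let num_formats = r.u32()?;
    let mut formats = Vec::new();
    for _ in 0..num_formats {
        formats.push(FormatEntry {
            id: r.take()?,
            short_len: r.u32()?,
            tables_offset: r.u64()?,
        });
    }
    let trailer_offset = r.u64()?;
    Some(MapHeader { version, header_len, num_objects, formats, trailer_offset })
}
```

A real reader would additionally verify the trailer checksum and the 4-byte alignment rule before trusting any offsets; the sketch omits both.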

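The table layout in the format description is what allows lookups in both directions. The following Rust sketch models those tables in memory and shows one way the lookup could work, including decoding the type field from the lower six bits of a metadata value. It is an illustration under stated assumptions, not the implementation in `src/loose.rs`: the names `EntryKind`, `entry_kind`, `MapTables`, `build`, `first_to_second`, and `second_to_first` are hypothetical, every name is assumed to be at least `short_len` bytes long, and `short_len` is assumed to be long enough that second-format prefixes are unique.

```rust
/// Reason an entry is stored (hypothetical enum for type values 1 to 3).
#[derive(Debug, PartialEq, Clone, Copy)]
enum EntryKind {
    Loose,
    Shallow,
    Submodule,
}

/// Decode the lower six bits; 0 (reserved) and unassigned values yield None.
fn entry_kind(meta: u32) -> Option<EntryKind> {
    match meta & 0x3f {
        1 => Some(EntryKind::Loose),
        2 => Some(EntryKind::Shallow),
        3 => Some(EntryKind::Submodule),
        _ => None,
    }
}

/// Hypothetical in-memory copy of the tables for two object formats.
struct MapTables {
    short_len: usize,
    first_full: Vec<Vec<u8>>,   // first format: sorted full names
    meta: Vec<u32>,             // metadata, same order as `first_full`
    second_short: Vec<Vec<u8>>, // second format: sorted shortened names
    second_full: Vec<Vec<u8>>,  // second format: full names in `first_full` order
    order: Vec<u32>,            // index in `second_short` -> index in `second_full`
}

impl MapTables {
    /// Build from (first-format name, second-format name, metadata) triples.
    fn build(mut entries: Vec<(Vec<u8>, Vec<u8>, u32)>, short_len: usize) -> Self {
        entries.sort(); // orders entries by first-format name
        let first_full: Vec<Vec<u8>> = entries.iter().map(|e| e.0.clone()).collect();
        let second_full: Vec<Vec<u8>> = entries.iter().map(|e| e.1.clone()).collect();
        let meta: Vec<u32> = entries.iter().map(|e| e.2).collect();
        let mut order: Vec<u32> = (0..entries.len() as u32).collect();
        order.sort_by(|&a, &b| second_full[a as usize].cmp(&second_full[b as usize]));
        let second_short = order
            .iter()
            .map(|&i| second_full[i as usize][..short_len].to_vec())
            .collect();
        MapTables { short_len, first_full, meta, second_short, second_full, order }
    }

    /// First format to second: binary search the sorted full names, then
    /// read the same index in the second format's full-name table.
    fn first_to_second(&self, name: &[u8]) -> Option<(&[u8], Option<EntryKind>)> {
        let j = self.first_full.binary_search_by(|p| p.as_slice().cmp(name)).ok()?;
        Some((self.second_full[j].as_slice(), entry_kind(self.meta[j])))
    }

    /// Second format to first: binary search the shortened names, follow the
    /// order table, confirm the full name, then read the first format's name.
    fn second_to_first(&self, name: &[u8]) -> Option<(&[u8], Option<EntryKind>)> {
        let prefix = name.get(..self.short_len)?;
        let i = self.second_short.binary_search_by(|p| p.as_slice().cmp(prefix)).ok()?;
        let j = self.order[i] as usize;
        if self.second_full[j].as_slice() != name {
            return None; // prefix matched, but the full name differs
        }
        Some((self.first_full[j].as_slice(), entry_kind(self.meta[j])))
    }
}
```

The full-name comparison in `second_to_first` matters because a shortened name only identifies a candidate: a query that shares a prefix with a stored object but differs later must not be reported as a match.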
Makefile

Lines changed: 1 addition & 0 deletions
@@ -1530,6 +1530,7 @@ UNIT_TEST_OBJS += $(UNIT_TEST_DIR)/test-lib.o
 
 RUST_SOURCES += src/hash.rs
 RUST_SOURCES += src/lib.rs
+RUST_SOURCES += src/loose.rs
 RUST_SOURCES += src/varint.rs
 
 GIT-VERSION-FILE: FORCE

src/lib.rs

Lines changed: 1 addition & 0 deletions
@@ -1,2 +1,3 @@
 pub mod hash;
+pub mod loose;
 pub mod varint;
