From e6207b56f4227c8eae2cd1130cebc000edda14f8 Mon Sep 17 00:00:00 2001
From: Daniel Lemire
Date: Sun, 6 Jul 2025 13:42:48 -0400
Subject: [PATCH] add Roaring bitmap definitions.

Roaring bitmaps are compressed bitmaps that are widely used for search
engine or database indexing (see https://roaringbitmap.org). There are
independent implementations in many languages such as C, Java, Go and Rust.
---
 database/roaringbitmap.ksy   | 201 +++++++++++++++++++++++++++++++++++
 database/roaringbitmap64.ksy |  35 ++++++
 2 files changed, 236 insertions(+)
 create mode 100644 database/roaringbitmap.ksy
 create mode 100644 database/roaringbitmap64.ksy

diff --git a/database/roaringbitmap.ksy b/database/roaringbitmap.ksy
new file mode 100644
index 000000000..9d2c47e53
--- /dev/null
+++ b/database/roaringbitmap.ksy
@@ -0,0 +1,201 @@
+doc: |
+  Roaring bitmaps are compressed bitmaps which tend to outperform
+  conventional compressed bitmaps such as WAH, EWAH or Concise.
+
+  They are used by several important systems:
+
+  * Apache Lucene and derivative systems such as Solr and Elastic
+  * Metamarkets' Druid
+  * Apache Spark
+  * Apache Hive
+  * Apache Tez
+  * Netflix Atlas
+  * LinkedIn Pinot
+  * and many others
+
+  Roaring bitmaps are designed to store sets of 32-bit unsigned integers efficiently.
+  They use a two-level structure: the 32-bit integers are split into
+  16-bit "chunks" (most significant bits + least significant bits).
+  Each chunk is stored in a container, of which there are three types:
+  array, bitset, and run containers.
+
+meta:
+  id: roaringbitmap
+  title: Roaring Bitmap Portable Format
+  license: Apache-2.0
+  endian: le
+
+seq:
+  - id: magic
+    type: u2
+    enum: cookie
+    doc: |
+      Magic cookie value that identifies the type of Roaring Bitmap format.
+      12346 (SERIAL_COOKIE_NO_RUNCONTAINER) means no run containers are used.
+      12347 (SERIAL_COOKIE) means run containers may be present.
+
+  - id: header
+    type:
+      switch-on: magic
+      cases:
+        'cookie::no_runs': header_no_runs
+        'cookie::with_runs': header_with_runs
+    doc: |
+      Header structure that follows the magic cookie value.
+      The structure differs depending on whether run containers are used or not.
+
+  - id: container_meta
+    type: container_meta
+    repeat: expr
+    repeat-expr: num_containers
+    doc: |
+      Descriptive header containing the key (16 most significant bits) and
+      cardinality of each container.
+
+  - id: offset_header
+    if: (not has_runs) or (num_containers >= 4)
+    type: u4
+    repeat: expr
+    repeat-expr: num_containers
+    doc: |
+      Offset header containing the byte offsets of each container from the beginning
+      of the stream. This is included if either:
+      1. No run containers are present, or
+      2. There are at least NO_OFFSET_THRESHOLD (4) containers
+
+  - id: containers
+    type:
+      switch-on: >
+        (
+          has_runs
+          ? (
+            header.as<header_with_runs>.run_bitset[_index / 8] & (1 << (_index % 8))
+          ) != 0
+          : false
+        )
+        ? 1
+        : container_meta[_index].cardinality_minus_1 + 1 <= 4096 ? 2 : 3
+      cases:
+        1: run_container
+        2: array_container(container_meta[_index].cardinality_minus_1 + 1)
+        3: bitset_container
+    repeat: expr
+    repeat-expr: num_containers
+    doc: |
+      The actual container data. The type is determined by:
+      - If run containers are allowed and the run_bitset indicates this container is a run container, use run_container (type 1)
+      - Otherwise, if the container's cardinality is <= 4096, use array_container (type 2)
+      - Otherwise, use bitset_container (type 3)
+
+instances:
+  has_runs:
+    value: magic == cookie::with_runs
+    doc: |
+      Computed field that indicates whether this Roaring bitmap may contain run containers.
+
+  num_containers:
+    value: >
+      has_runs ? (header.as<header_with_runs>.num_containers_minus_1 + 1) : header.as<header_no_runs>.num_containers
+    doc: |
+      Computed field that returns the number of containers in the bitmap.
+      For backwards compatibility, the encoding differs depending on whether run containers are present.
+
+types:
+  header_no_runs:
+    doc: |
+      Header format for bitmaps that don't use run containers.
+      This contains a 32-bit value (SERIAL_COOKIE_NO_RUNCONTAINER)
+      followed by 32 bits for the number of containers.
+    seq:
+      - contents: [0, 0]
+        doc: Two zeros to complete the 32-bit cookie (after the initial 16-bit magic)
+      - id: num_containers
+        type: u4
+        doc: Number of containers in this bitmap
+
+  header_with_runs:
+    doc: |
+      Header format for bitmaps that may use run containers.
+      The 16 most significant bits of the 32-bit cookie store the number of containers minus 1.
+      A run container bitset follows, with a 1 bit indicating the container is a run container.
+    seq:
+      - id: num_containers_minus_1
+        type: u2
+        doc: Number of containers minus 1 (to allow encoding 65536 containers)
+      - id: run_bitset
+        size: (num_containers_minus_1 + 1 + 7) / 8
+        doc: |
+          Bitset indicating which containers are run containers (1 bit) vs array/bitset (0 bit).
+          The least significant bit of the first byte corresponds to the first container.
+
+  container_meta:
+    doc: |
+      Metadata for a single container, consisting of its key (16 most significant bits)
+      and its cardinality minus 1 (to allow encoding full 65536 cardinality).
+    seq:
+      - id: key
+        type: u2
+        doc: Container key (16 most significant bits of the integers in this container)
+      - id: cardinality_minus_1
+        type: u2
+        doc: |
+          Container cardinality minus 1. This is used to determine whether a
+          container is an array container (cardinality <= 4096) or a bitset container.
+
+  run_container:
+    doc: |
+      Run container format, storing sorted runs of consecutive integers.
+      More space-efficient for data with long consecutive runs of values.
+      Runs are non-overlapping and sorted.
+
+    seq:
+      - id: num_runs
+        type: u2
+        doc: Number of runs in this container
+      - id: runs
+        type: run
+        repeat: expr
+        repeat-expr: num_runs
+        doc: Array of runs (start value and length pairs)
+
+  run:
+    doc: |
+      A run of consecutive integers, represented by a starting value and a length.
+    seq:
+      - id: start_idx
+        type: u2
+        doc: Starting value of the run
+      - id: count_minus_1
+        type: u2
+        doc: |
+          Length of the run minus 1. For example, a run of [11,12,13,14,15]
+          would be encoded as start_idx=11, count_minus_1=4.
+
+  array_container:
+    doc: |
+      Array container storing a sorted array of 16-bit integers.
+      Used when the container has relatively few values (cardinality <= 4096).
+    params:
+      - id: num_values
+        type: u2
+        doc: Number of values in this array container
+    seq:
+      - id: values
+        type: u2
+        repeat: expr
+        repeat-expr: num_values
+        doc: Sorted array of 16-bit values in this container
+
+  bitset_container:
+    doc: |
+      Bitset container using a 65536-bit bitset (8KB) to represent which
+      values are present. Used when the container has many values (cardinality > 4096).
+    seq:
+      - id: bitset
+        size: 8 * 1024
+        doc: |
+          A dense 8KB bitset (2^16 bits).
+
+enums:
+  cookie:
+    12346: no_runs
+    12347: with_runs
diff --git a/database/roaringbitmap64.ksy b/database/roaringbitmap64.ksy
new file mode 100644
index 000000000..2a28e82d3
--- /dev/null
+++ b/database/roaringbitmap64.ksy
@@ -0,0 +1,35 @@
+doc: |
+  Some Roaring bitmap implementations may offer a 64-bit implementation. This section proposes a portable format,
+  compatible with some (but not all) 64-bit implementations. This format is naturally compatible with implementations
+  based on a conventional red-black-tree (as the serialization format is similar to the in-memory layout). The keys
+  would be 32-bit integers representing the most significant 32 bits of elements whereas the values of the tree are
+  32-bit Roaring bitmaps. The 32-bit Roaring bitmaps represent the least significant bits of a set of elements.
+
+meta:
+  id: roaringbitmap64
+  title: Roaring Bitmap Portable 64-Bit Format
+  license: Apache-2.0
+  endian: le
+  imports:
+    - roaringbitmap
+
+
+seq:
+  - id: num_buckets
+    type: u8
+    doc: The number of sub-buckets (32-bit Roaring bitmaps).
+  - id: buckets
+    type: bucket
+    repeat: expr
+    repeat-expr: num_buckets
+
+types:
+  bucket:
+    doc: Each sub-bucket stores the upper 32 bits of its elements, followed by a 32-bit Roaring bitmap.
+    seq:
+      - id: key
+        type: u4
+        doc: The upper 32 bits of the bucket.
+      - id: bitmap
+        type: roaringbitmap
+
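
Reviewer note: to sanity-check the 32-bit layout described in roaringbitmap.ksy, here is a minimal hand-written Python sketch that walks the same fields in the same order (cookie, header, container metadata, optional offsets, containers). It is not generated by the Kaitai compiler and is not part of the patch. The function name `parse_roaring32` is hypothetical, and the bit order inside a bitset container (bit k of byte b stands for value 8*b + k) is an assumption taken from the little-endian 64-bit-word layout of the Roaring format, since the .ksy leaves the bitset opaque.

```python
import struct

SERIAL_COOKIE_NO_RUNCONTAINER = 12346
SERIAL_COOKIE = 12347
NO_OFFSET_THRESHOLD = 4


def parse_roaring32(buf):
    """Decode a 32-bit Roaring bitmap (portable format) into a sorted list of ints."""
    (magic,) = struct.unpack_from("<H", buf, 0)
    pos = 2
    run_bitset = b""
    if magic == SERIAL_COOKIE:
        # header_with_runs: u2 container count minus 1, then the run bitset.
        (n_minus_1,) = struct.unpack_from("<H", buf, pos)
        pos += 2
        n = n_minus_1 + 1
        run_bitset = buf[pos:pos + (n + 7) // 8]
        pos += len(run_bitset)
    elif magic == SERIAL_COOKIE_NO_RUNCONTAINER:
        # header_no_runs: two zero bytes complete the 32-bit cookie, then a u4 count.
        if buf[pos:pos + 2] != b"\x00\x00":
            raise ValueError("bad cookie padding")
        pos += 2
        (n,) = struct.unpack_from("<I", buf, pos)
        pos += 4
    else:
        raise ValueError(f"unknown cookie {magic}")
    has_runs = magic == SERIAL_COOKIE

    # container_meta: one (key, cardinality - 1) pair per container.
    meta = [struct.unpack_from("<HH", buf, pos + 4 * i) for i in range(n)]
    pos += 4 * n

    # offset_header: present under the same condition as in the .ksy.
    # It is skipped here because the containers are read sequentially.
    if not has_runs or n >= NO_OFFSET_THRESHOLD:
        pos += 4 * n

    out = []
    for i, (key, card_minus_1) in enumerate(meta):
        base = key << 16
        card = card_minus_1 + 1
        is_run = has_runs and (run_bitset[i // 8] >> (i % 8)) & 1
        if is_run:
            (num_runs,) = struct.unpack_from("<H", buf, pos)
            pos += 2
            for _ in range(num_runs):
                start, count_minus_1 = struct.unpack_from("<HH", buf, pos)
                pos += 4
                out.extend(range(base + start, base + start + count_minus_1 + 1))
        elif card <= 4096:
            values = struct.unpack_from(f"<{card}H", buf, pos)
            pos += 2 * card
            out.extend(base + v for v in values)
        else:
            bits = buf[pos:pos + 8192]
            pos += 8192
            for byte_idx, byte in enumerate(bits):
                for bit in range(8):
                    if (byte >> bit) & 1:
                        out.append(base + 8 * byte_idx + bit)
    return out


# One array container (key 0) holding {1, 2, 3}; the single offset entry is 16.
no_runs = struct.pack("<IIHHI3H", 12346, 1, 0, 2, 16, 1, 2, 3)
print(parse_roaring32(no_runs))  # [1, 2, 3]

# One run container (key 1) holding the run start_idx=11, count_minus_1=4.
with_runs = struct.pack("<HHBHHHHH", 12347, 0, 1, 1, 4, 1, 11, 4)
print(parse_roaring32(with_runs))  # [65547, 65548, 65549, 65550, 65551]
```

The second sample reuses the run example from the `run` type's doc string, so the decoded values are 65536 + 11 through 65536 + 15. With fewer than four containers and the run cookie, no offset header is emitted, which is the case the `if:` expression on `offset_header` encodes.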