|
| 1 | +Binary Interpolative Coding |
| 2 | +==== |
| 3 | + |
| 4 | +A C++ library implementing the *Binary Interpolative Coding* compression algorithm invented by Alistair Moffat and Lang Stuiver [1]. |
| 5 | + |
| 6 | +The algorithm can be used to compress sorted integer sequences (here, |
| 7 | +assumed to be increasing). |
| 8 | + |
| 9 | +The implementation comes in different flavours: |
| 10 | +it can use |
| 11 | +simple *binary* codes, *left-most minimal* codes, or *centered minimal* codes. |
| 12 | +Additionally, the implementation is *run-aware*, i.e., |
| 13 | +it optimizes encoding/decoding of runs of consecutive identifiers. |
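To make the idea concrete, here is a minimal, self-contained sketch of the recursion behind binary interpolative coding with plain binary codes. It is not this library's implementation: the names `bits_for`, `write_bits`, `read_bits`, `bic_encode`, and `bic_decode` are hypothetical, the bit stream is a plain `std::vector<bool>`, and the decoder is assumed to know the sequence length and the bounds `[lo, hi]` in advance.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bits needed by a plain binary code for a value in [0, r-1] (0 bits if r == 1).
uint32_t bits_for(uint64_t r) {
    uint32_t b = 0;
    while ((uint64_t(1) << b) < r) ++b;
    return b;
}

// Append the b low-order bits of v to the stream, least significant first.
void write_bits(std::vector<bool>& out, uint64_t v, uint32_t b) {
    for (uint32_t i = 0; i < b; ++i) out.push_back((v >> i) & 1);
}

// Read b bits starting at pos and advance pos.
uint64_t read_bits(std::vector<bool> const& in, std::size_t& pos, uint32_t b) {
    uint64_t v = 0;
    for (uint32_t i = 0; i < b; ++i) v |= uint64_t(in[pos++]) << i;
    return v;
}

// Encode the strictly increasing values s[0..n), all known to lie in [lo, hi].
void bic_encode(uint32_t const* s, int64_t n, int64_t lo, int64_t hi,
                std::vector<bool>& out) {
    if (n == 0) return;
    if (hi - lo + 1 == n) return;  // a full run: the values are implied
    int64_t m = n / 2;             // index of the middle element
    int64_t x = s[m];
    int64_t min_x = lo + m;            // m smaller values must fit below x
    int64_t max_x = hi - (n - 1 - m);  // n-1-m larger values must fit above x
    assert(min_x <= x && x <= max_x);
    write_bits(out, uint64_t(x - min_x), bits_for(uint64_t(max_x - min_x + 1)));
    bic_encode(s, m, lo, x - 1, out);                  // left half
    bic_encode(s + m + 1, n - m - 1, x + 1, hi, out);  // right half
}

// Mirror of bic_encode: rebuild n values in [lo, hi] from the bit stream.
void bic_decode(uint32_t* s, int64_t n, int64_t lo, int64_t hi,
                std::vector<bool> const& in, std::size_t& pos) {
    if (n == 0) return;
    if (hi - lo + 1 == n) {  // a full run: regenerate it without reading bits
        for (int64_t i = 0; i < n; ++i) s[i] = static_cast<uint32_t>(lo + i);
        return;
    }
    int64_t m = n / 2;
    int64_t min_x = lo + m;
    int64_t max_x = hi - (n - 1 - m);
    uint32_t b = bits_for(uint64_t(max_x - min_x + 1));
    int64_t x = min_x + static_cast<int64_t>(read_bits(in, pos, b));
    s[m] = static_cast<uint32_t>(x);
    bic_decode(s, m, lo, x - 1, in, pos);
    bic_decode(s + m + 1, n - m - 1, x + 1, hi, in, pos);
}
```

In this sketch, each middle element is written relative to the narrowest interval it can fall in, so dense regions cost few bits. A range that is completely filled costs no bits at all, which is the kind of case a run-aware implementation can short-circuit.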
| 14 | + |
| 15 | +##### Table of contents |
| 16 | +* [Compiling the code](#compiling-the-code) |
| 17 | +* [Quick Start](#quick-start) |
| 18 | +* [Encoding/decoding a collection of sequences](#encodingdecoding-a-collection-of-sequences) |
| 19 | +* [Benchmark](#benchmark) |
| 20 | +* [Author](#author) |
| 21 | +* [References](#references) |
| 22 | + |
| 23 | +Compiling the code |
| 24 | +------------------ |
| 25 | + |
| 26 | +The code is tested on Linux with `gcc` 7.3.0 and on macOS 10.14 with `clang` 10.0.0. |
| 27 | +To build the code, [`CMake`](https://cmake.org/) is required. |
| 28 | + |
| 29 | +Clone the repository with |
| 30 | + |
| 31 | +    $ git clone --recursive https://github.com/jermp/interpolative_coding.git |
| 32 | + |
| 33 | +If you have cloned the repository without `--recursive`, run the following commands before |
| 34 | +compiling: |
| 35 | + |
| 36 | +    $ git submodule init |
| 37 | +    $ git submodule update |
| 38 | + |
| 39 | +To compile the code for a release environment *and* best performance (see file `CMakeLists.txt` for the compilation flags used), do: |
| 40 | + |
| 41 | +    $ mkdir build |
| 42 | +    $ cd build |
| 43 | +    $ cmake .. -DRUNAWARE=On |
| 44 | +    $ make |
| 45 | + |
| 46 | +Hint: Use `make -j4` to compile the library in parallel using, e.g., 4 jobs. |
| 47 | + |
| 48 | +For a testing environment, use the following instead: |
| 49 | + |
| 50 | +    $ mkdir debug_build |
| 51 | +    $ cd debug_build |
| 52 | +    $ cmake .. -DCMAKE_BUILD_TYPE=Debug -DUSE_SANITIZERS=On |
| 53 | +    $ make |
| 54 | + |
| 55 | +Quick Start |
| 56 | +------- |
| 57 | + |
| 58 | +For a quick start, see the source file `test/example.cpp`. |
| 59 | +After compilation, run this example with |
| 60 | + |
| 61 | +    $ ./example |
| 62 | + |
| 63 | +A simpler variation is shown below. |
| 64 | + |
| 65 | +```C++ |
| 66 | +#include <iostream> |
| 67 | + |
| 68 | +#include "interpolative_coding.hpp" |
| 69 | +using namespace bic; |
| 70 | + |
| 71 | +template <typename BinaryCode> |
| 72 | +void test(std::vector<uint32_t> const& in) { |
| 73 | +    std::cout << "to be encoded:\n"; |
| 74 | +    for (auto x : in) { |
| 75 | +        std::cout << x << " "; |
| 76 | +    } |
| 77 | +    std::cout << std::endl; |
| 78 | + |
| 79 | +    uint32_t n = in.size(); |
| 80 | + |
| 81 | +    encoder<typename BinaryCode::writer> enc; |
| 82 | +    enc.encode(in.data(), n); |
| 83 | + |
| 84 | +    std::vector<uint32_t> out(n); |
| 85 | +    decoder<typename BinaryCode::reader> dec; |
| 86 | +    uint32_t m = dec.decode(enc.bits().data(), out.data()); |
| 87 | +    assert(m == n); |
| 88 | + |
| 89 | +    std::cout << "decoded " << m << " values" << std::endl; |
| 90 | +    std::cout << "total bits " << enc.num_bits() << std::endl; |
| 91 | +    std::cout << static_cast<double>(enc.num_bits()) / m << " bits per integer" |
| 92 | +              << std::endl; |
| 93 | + |
| 94 | +    std::cout << "decoded:\n"; |
| 95 | +    for (auto x : out) { |
| 96 | +        std::cout << x << " "; |
| 97 | +    } |
| 98 | +    std::cout << std::endl; |
| 99 | +} |
| 100 | + |
| 101 | +int main(int argc, char** argv) { |
| 102 | +    if (argc < 2) { |
| 103 | +        std::cerr << argv[0] << " binary_code_type" << std::endl; |
| 104 | +        return 1; |
| 105 | +    } |
| 106 | + |
| 107 | +    std::vector<uint32_t> in = {3, 4, 7, 13, 14, 15, 21, 25, 36, 38, 54, 62}; |
| 108 | + |
| 109 | +    std::string type(argv[1]); |
| 110 | + |
| 111 | +    if (type == "binary") { |
| 112 | +        test<binary>(in); |
| 113 | +    } else if (type == "leftmost_minimal") { |
| 114 | +        test<leftmost_minimal>(in); |
| 115 | +    } else if (type == "centered_minimal") { |
| 116 | +        test<centered_minimal>(in); |
| 117 | +    } else { |
| 118 | +        std::cerr << "unknown type '" << type << "'" << std::endl; |
| 119 | +        return 1; |
| 120 | +    } |
| 121 | + |
| 122 | +    return 0; |
| 123 | +} |
| 124 | +``` |
| 125 | + |
| 126 | +Encoding/decoding a collection of sequences |
| 127 | +---------------------------------- |
| 128 | + |
| 129 | +Typically, we want to encode all the sequences of |
| 130 | +a collection. |
| 131 | +In this case, we assume that the input collection |
| 132 | +is a binary file with all the sequences being written |
| 133 | +as 32-bit integers. In this library, we follow the |
| 134 | +input data format of the [`ds2i`](https://github.com/ot/ds2i) library: |
| 135 | +each sequence is prefixed by an additional |
| 136 | +32-bit integer representing the size of the sequence. |
| 137 | +The collection file starts with a singleton sequence |
| 138 | +containing the universe of representation of the sequences, i.e., the maximum representable value. |
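The layout described above can be read with a few lines of C++. The following is a minimal sketch, not part of this library: the `collection` struct and the helpers `read_u32`, `read_sequence`, `is_strictly_increasing`, `read_collection`, and `read_collection_file` are hypothetical names, and the 32-bit integers are assumed to be stored in the machine's native byte order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <fstream>
#include <functional>
#include <istream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical in-memory view of a collection file.
struct collection {
    uint32_t universe = 0;                         // value of the leading singleton
    std::vector<std::vector<uint32_t>> sequences;  // all remaining sequences
};

// Read one 32-bit integer; returns false if the stream has no more data.
bool read_u32(std::istream& in, uint32_t& x) {
    return static_cast<bool>(in.read(reinterpret_cast<char*>(&x), sizeof(x)));
}

// Read one length-prefixed sequence. Returns false when no further
// length prefix can be read; throws if the payload is cut short.
bool read_sequence(std::istream& in, std::vector<uint32_t>& seq) {
    uint32_t n = 0;
    if (!read_u32(in, n)) return false;
    seq.resize(n);
    std::streamsize bytes = static_cast<std::streamsize>(uint64_t(n) * sizeof(uint32_t));
    if (n > 0 && !in.read(reinterpret_cast<char*>(seq.data()), bytes)) {
        throw std::runtime_error("truncated sequence");
    }
    return true;
}

// True if every value is larger than its predecessor.
bool is_strictly_increasing(std::vector<uint32_t> const& seq) {
    return std::adjacent_find(seq.begin(), seq.end(),
                              std::greater_equal<uint32_t>()) == seq.end();
}

// Parse a whole collection: first the singleton with the universe,
// then every remaining sequence until the stream ends.
collection read_collection(std::istream& in) {
    collection c;
    std::vector<uint32_t> seq;
    if (!read_sequence(in, seq) || seq.size() != 1) {
        throw std::runtime_error("missing universe singleton");
    }
    c.universe = seq[0];
    while (read_sequence(in, seq)) {
        assert(is_strictly_increasing(seq));  // the format assumes increasing sequences
        c.sequences.push_back(seq);
    }
    return c;
}

// Convenience wrapper that opens the file in binary mode.
collection read_collection_file(std::string const& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    return read_collection(in);
}
```

Under these assumptions, a file holding the integers `1 100 3 3 4 7 2 10 20` would be parsed as universe 100 followed by the two sequences `3 4 7` and `10 20`.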
| 139 | + |
| 140 | +We also assume all sequences are *increasing*. |
| 141 | + |
| 142 | +The file `data/test_collection.docs` is an example of |
| 143 | +this format. |
| 144 | + |
| 145 | +To encode all the sequences from this file, do: |
| 146 | + |
| 147 | +    $ ./encode leftmost_minimal ../data/test_collection.docs -o test.bin |
| 148 | + |
| 149 | +To decode all the sequences from the encoded file `test.bin`, do: |
| 150 | + |
| 151 | +    $ ./decode leftmost_minimal test.bin |
| 152 | + |
| 153 | +To check correctness of the implementation, use: |
| 154 | + |
| 155 | +    $ ./check leftmost_minimal ../data/test_collection.docs test.bin |
| 156 | + |
| 157 | +which will compare every decoded integer against the input collection. |
| 158 | + |
| 159 | +Benchmark |
| 160 | +------ |
| 161 | +For this benchmark we used the whole Gov2 dataset, containing |
| 162 | +5,742,630,292 integers in 35,636,425 sequences. |
| 163 | + |
| 164 | +We report the average number of bits per integer (bpi) |
| 165 | +and nanoseconds spent per decoded integer (with and without the |
| 166 | +run-aware optimization). |
| 167 | + |
| 168 | +Time measurements were taken using a Linux 4.4.0 server machine with |
| 169 | +an Intel i7-7700 CPU (@3.6 GHz) and 64 GB of RAM. |
| 170 | +The code was compiled with gcc 7.3.0 with all optimizations |
| 171 | +(see also `CMakeLists.txt`). |
| 172 | + |
| 173 | +|**Method** |**bpi** | **ns/int (run-aware)** | **ns/int (not run-aware)**| |
| 174 | +|:-----------------|:------:|:-----------------------:|:-------------------------:| |
| 175 | +|simple |3.532 | 3.45 | 4.65 | |
| 176 | +|left-most minimal |3.362 | 5.78 | 7.07 | |
| 177 | +|centered minimal |3.361 | 5.78 | 7.07 | |
| 178 | + |
| 179 | +Author |
| 180 | +------ |
| 181 | +* [Giulio Ermanno Pibiri](http://pages.di.unipi.it/pibiri/), <giulio.ermanno.pibiri@isti.cnr.it> |
| 182 | + |
| 183 | +References |
| 184 | +------- |
| 185 | +* [1] Alistair Moffat and Lang Stuiver. 2000. Binary Interpolative Coding for Effective Index Compression. Information Retrieval Journal 3, 1 (2000), 25 – 47. |