22Ultra-fast, zero-allocation string compression library. Up to 50% memory reduction.
33
44Fast! Compressing a 10 MB string takes only a few milliseconds. Check out the benchmarks.<br />
5- Well tested! See the test directory for usage examples and edge cases.
6-
5+ Thoroughly tested! See the test directory for usage examples and edge cases.
76``` java
87String data = "Assume this is a 100 MB string...";
98byte[] c;
@@ -35,26 +34,26 @@ c = new SixBitAsciiCompressor().compress(data); // c is 75 MB.
3534``` java
3635implementation("io.github.dannemann:java-string-compressor:1.0.0")
3736```
38- Or download the lastest JAR from: https://github.com/Dannemann/java-string-compressor/releases
37+ Or download the latest JAR from: https://github.com/Dannemann/java-string-compressor/releases
3938
4039## Documentation
41- This library exits to quickly compress a massive volume of strings.
42- Very useful if you need massive data allocated in memory for quick access or compacted for storage.
43- We achieve this by removing all unnecessary bits from a character. But how?
40+ This library exists to quickly compress massive volumes of strings.
41+ It is very useful when you need large datasets allocated in memory for quick access or compacted for storage.
42+ We achieve this by removing all unnecessary bits from each character. But how?
4443
4544An ASCII character is represented by 8 bits: `00000000` to `11111111`.
46- This gives us 128 different slots to represent characters.
47- But a lot of times we do not need all those characters, only a small sub-set of them .
48- For example, if your data only has numbers (0-9) and a few punctuations , 16 different characters can be enough to
45+ Standard ASCII uses only 128 of those 256 possible values to represent characters.
46+ However, we often need only a small subset of those characters rather than the entire set.
47+ For example, if your data only contains numbers (0-9) and a few punctuation marks, 16 different characters can be enough to
4948represent them, and we only need 4 bits (`0000` to `1111`) to represent 16 characters.
50- But if your data only has letters (A-Z, like customer names), a set of 32 different characters is enough , which can be
49+ If your data only contains letters (A-Z, like customer names), a set of 32 different characters is sufficient, which can be
5150represented by 5 bits.
52- And if you need both, 6 bits are enough .
51+ But if you need both letters and numbers, 6 bits are sufficient.
5352This way we can remove those unnecessary bits and store only the ones we need.
54- And this is exactly was this library do .
53+ This is exactly what this library does.
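
The idea above can be sketched in a few lines. This is a minimal illustration, not this library's implementation: the class `FourBitPackingSketch`, its 16-character set, and the `pack`/`unpack` helpers are all assumptions made for the example. It stores two 4-bit codes per byte, which is where a 50% saving would come from.

```java
import java.nio.charset.StandardCharsets;

public class FourBitPackingSketch {
    // Hypothetical 16-character set (digits plus a few punctuation marks), in ASCII order.
    public static final byte[] CHARSET = "#+,-./0123456789".getBytes(StandardCharsets.US_ASCII);

    // Packs two 4-bit codes into each output byte.
    public static byte[] pack(String text) {
        byte[] out = new byte[(text.length() + 1) / 2];
        for (int i = 0; i < text.length(); i++) {
            int code = indexOf((byte) text.charAt(i));
            // Even positions fill the high nibble, odd positions the low nibble.
            out[i / 2] |= (i % 2 == 0) ? (code << 4) : code;
        }
        return out;
    }

    // Restores the text; the original length is needed because an odd-length
    // input leaves the last low nibble unused.
    public static String unpack(byte[] packed, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            int b = packed[i / 2] & 0xFF;
            int code = (i % 2 == 0) ? (b >>> 4) : (b & 0x0F);
            sb.append((char) CHARSET[code]);
        }
        return sb.toString();
    }

    // Maps a character to its 4-bit code (its position in CHARSET).
    static int indexOf(byte c) {
        for (int i = 0; i < CHARSET.length; i++) {
            if (CHARSET[i] == c) return i;
        }
        throw new IllegalArgumentException("Unsupported character: " + (char) c);
    }

    public static void main(String[] args) {
        String phone = "555-0100";
        byte[] packed = pack(phone);
        System.out.println(phone.length() + " chars -> " + packed.length + " bytes"); // 8 chars -> 4 bytes
        System.out.println(unpack(packed, phone.length())); // 555-0100
    }
}
```

The same approach extends to 5-bit and 6-bit codes, except that codes then straddle byte boundaries and the packing loop has to track a bit offset instead of a nibble.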
5554
56- Another important feature is searching. This library not only supports compacting , but also binary searching on the
57- compacted data itself without deflating it, which will be explained later.
55+ Another important feature is searching. This library not only supports compression, but also binary searching on the
56+ compressed data itself without decompressing it, which will be explained later.
5857
5958To compress a string, you can easily use either `FourBitAsciiCompressor`, `FiveBitAsciiCompressor`, or `SixBitAsciiCompressor`.
6059
@@ -64,7 +63,7 @@ var compressor = new SixBitAsciiCompressor();
6463```
6564
6665#### Defining your custom character set
67- Each compressor have a set of default supported characters which are defined in fields
66+ Each compressor has a set of default supported characters which are defined in fields
6867`FourBitAsciiCompressor.DEFAULT_4BIT_CHARSET`, `FiveBitAsciiCompressor.DEFAULT_5BIT_CHARSET`, and `SixBitAsciiCompressor.DEFAULT_6BIT_CHARSET`.
6968If you need a custom character set, use constructors with parameter `supportedCharset`:
7069``` java
@@ -75,20 +74,20 @@ var compressor = new FourBitAsciiCompressor(myCustom4BitCharset);
7574**Important:** The order in which you list characters in this array matters, as it defines the lexicographic
7675order the binary search will follow. It's good practice to define your custom charset in standard ASCII order, like the example above.
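
A quick way to guard against an accidentally misordered charset is to check it before handing it to a compressor. The helper below is a hypothetical sketch (the class `CharsetOrderSketch` and the method `isAsciiOrdered` are not part of this library); it only verifies that the bytes are in strictly ascending ASCII order with no duplicates.

```java
import java.nio.charset.StandardCharsets;

public class CharsetOrderSketch {
    // Returns true when every byte is strictly greater than the previous one,
    // i.e. the charset is in ascending ASCII order and contains no duplicates.
    public static boolean isAsciiOrdered(byte[] charset) {
        for (int i = 1; i < charset.length; i++) {
            if ((charset[i - 1] & 0xFF) >= (charset[i] & 0xFF)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] ordered = "0123456789".getBytes(StandardCharsets.US_ASCII);
        byte[] shuffled = "9876543210".getBytes(StandardCharsets.US_ASCII);
        System.out.println(isAsciiOrdered(ordered));  // true
        System.out.println(isAsciiOrdered(shuffled)); // false
    }
}
```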
7776
78- #### Catching invalid characters (useful for testing an debugging)
77+ #### Catching invalid characters (useful for testing and debugging)
7978It’s useful to validate the input and throw errors when invalid characters are found.
8079You can enable character validation by using any constructor with the `throwException` parameter.
81- Validations aren't recommended for production because you will probably be allocating massive amounts of gigabytes, and
82- you don't want a single invalid character to halt the whole processes .
80+ Validations are not recommended for production because you will likely be processing many gigabytes of data, and
81+ you don't want a single invalid character to halt the entire process.
8382It’s better to occasionally display an incorrect character than to abort the entire operation.
8483``` java
8584public FiveBitAsciiCompressor(boolean throwException)
8685```
8786
88- #### Preserving source byte arrays (useful for testing an debugging)
89- Whenever possible, try to read straight bytes from your input source without creating ` String ` objects from them.
90- This will keep your whole compressing process zero-allocation (like this library), which boosts performance and memory saving .
91- But, by dealing directly with ` byte[] ` instead of ` Strings ` , you will notice that the compressor overwrites the original
87+ #### Preserving source byte arrays (useful for testing and debugging)
88+ Whenever possible, try to read bytes directly from your input source without creating `String` objects from them.
89+ This will keep your entire compression process zero-allocation (like this library), which boosts performance and reduces memory usage.
90+ However, when dealing directly with `byte[]` instead of `String` objects, you will notice that the compressor overwrites the original
9291input byte array to minimize memory usage, making it unusable.
9392To avoid this behavior and compress a copy of the original, enable input preservation by using any constructor with the `preserveOriginal` parameter.
9493``` java
@@ -103,37 +102,37 @@ Once the compressor is instantiated, the compress and decompress process is stra
103102String string = new String(decompressed, StandardCharsets.ISO_8859_1);
104103// String string = AsciiCompressor.getString(decompressed); // Same as above. Recommended.
105104```
106- We recommend using ` AsciiCompressor.getString(byte[]) ` because the method can be updated whenever a most efficient way to encode a ` String ` is found.
105+ We recommend using `AsciiCompressor.getString(byte[])` because the method can be updated whenever a more efficient way to build a `String` is found.
107106
108107**In case you can't work directly with byte arrays and need `String` objects for compression:**
109- To extract ASCII bytes from a ` String ` in the most efficient way (for compression), do ` AsciiCompressor.getBytes(String) ` .
110- But the overloaded version ` compressor.compress(String) ` already calls it automatically, so, just call the overloaded version.
108+ To extract ASCII bytes from a `String` in the most efficient way (for compression), use `AsciiCompressor.getBytes(String)`.
109+ However, the overloaded version `compressor.compress(String)` already calls it automatically, so just call the overloaded version.
111110
112111### Where to store the compressed data?
113- In its purest form, a ` String ` is just a byte array (` byte[] ` ), and a compressed ` String ` couldn't be different.
112+ In its purest form, a `String` is just a byte array (`byte[]`), and a compressed `String` is no different.
114113You can store it anywhere you would store a `byte[]`. If you are compressing millions of different entries, a very common
115114approach is to store each compressed string ordered in memory using a `byte[][]` (for binary search) or a B+Tree if you
116- need frequent insertions (coming in the next release). The frequency of reads and writes + business requirements will
117- tell the best media and data structure to use.
115+ need frequent insertions (coming in the next release). The frequency of reads and writes plus business requirements will
116+ determine the best storage medium and data structure to use.
118117
119- If the data is ordered before compression and stored in- memory in a ` byte[][] ` as mentioned above, you can use the full power of the binary
120- search directly in the compressed data through ` FourBitBinarySearch ` , ` FiveBitBinarySearch ` , and ` SixBitBinarySearch ` .
118+ If the data is ordered before compression and stored in memory in a `byte[][]` as mentioned above, you can use the full power of binary
119+ search directly on the compressed data through `FourBitBinarySearch`, `FiveBitBinarySearch`, and `SixBitBinarySearch`.
121120
122121### Binary search
123- Executing a binary search in compressed data is simple as:
122+ Executing a binary search on compressed data is as simple as:
124123``` java
125124byte[][] compressedData = new byte[100000000][]; // Data for 100 million customers.
126125// ...
127126SixBitBinarySearch binary = new SixBitBinarySearch(compressedData, false); // false == exact-match search.
128127int index = binary.search("key");
129128```
130- It is important to note that ``` compressedData `` ` does not need to be completely filled. It could have 70 million entries
129+ Note that `compressedData` does not need to be completely filled. It could have 70 million entries
131130and the binary search would still work. This is because the array of compressed data typically has extra space to
132- accommodate new entries (usually with some incremental ID implementation to avoid adding in the middle, but always at
131+ accommodate new entries (usually with some incremental ID implementation to avoid insertions in the middle, but always at
133132the end of the array), so unused slots (nulls) are placed at the end.
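
To see why trailing nulls do not break the search, consider the sketch below. It is an illustration under stated assumptions, not this library's code: the class `TrailingNullSearchSketch` and its `search` method are hypothetical, it compares plain (uncompressed) bytes, and it relies on `Arrays.compareUnsigned`, which requires Java 9 or later. A null slot is simply treated as greater than any key, so the search always continues in the filled part of the array.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TrailingNullSearchSketch {
    // Binary search over a sorted byte[][] whose unused slots (nulls) sit at the end.
    public static int search(byte[][] data, byte[] key) {
        int low = 0;
        int high = data.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            byte[] candidate = data[mid];
            // A null slot sorts after every real entry, so keep looking in the left half.
            int cmp = (candidate == null) ? 1 : Arrays.compareUnsigned(candidate, key);
            if (cmp < 0) {
                low = mid + 1;
            } else if (cmp > 0) {
                high = mid - 1;
            } else {
                return mid;
            }
        }
        return -(low + 1); // Not found; same convention as java.util.Arrays.binarySearch.
    }

    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[][] data = new byte[8][]; // Capacity for 8 entries, only 5 are filled.
        String[] names = {"alice", "bob", "carol", "dave", "erin"};
        for (int i = 0; i < names.length; i++) {
            data[i] = ascii(names[i]);
        }
        System.out.println(search(data, ascii("carol")));   // 2
        System.out.println(search(data, ascii("zed")) < 0); // true
    }
}
```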
134133
135134A more realistic approach is to organize your data with a unique prefix (usually an ID) and search for it. For example,
136- imagine each customer data in ``` compressedData `` ` is organized like this:
135+ imagine each customer data entry in `compressedData` is organized like this:
137136``` java
138137// ID # FullName # PhoneNumber # Address
139138
@@ -148,7 +147,7 @@ if (index >= 0) {
148147    byte[] found = compressedData[index];
149148    String decompressed = compressor.decompress(found);
150149```
151- In case you used are using a custom character set to compress the data, you need to pass it through the binary search constructor:
150+ If you are using a custom character set to compress the data, you need to pass it to the binary search constructor:
152151```java
153152public FiveBitBinarySearch(byte[][] compressedData, boolean prefixSearch, byte[] charset)
154153```
@@ -157,22 +156,22 @@ public FiveBitBinarySearch(byte[][] compressedData, boolean prefixSearch, byte[]
157156Coming in the next release.
158157
159158### Other
160- Do not forget to check the JavaDocs with further information about each member.
159+ Don't forget to check the JavaDocs for further information about each member.
161160Also check the test directory for additional examples.
162161
163162### Logging
164- If you need logging, search for libraries like ZeroLog , ChronicleLog , Log4j 2 Async Loggers , and other similar tools
165- (we did not test any of those ). You will need a fast log library, or it can become a bottleneck.
163+ If you need logging, consider libraries like ZeroLog, ChronicleLog, Log4j 2 Async Loggers, and other similar tools
164+ (we have not tested any of these). You will need a fast logging library, or it can become a bottleneck.
166165
167166### Bulk / Batch compression
168- In some rare cases you need to fetch your data in batches from a remote location or another third party actor .
169- java- string- compressor provides both, `BulkCompressor ` and `ManagedBulkCompressor ` specifically for this task.
170- They help you automatize the process of adding each batch to the correct position in the destination array where the
171- compressed data will be stored. Both currently supports `byte [][]` as destination for the compressed data.
172-
173- `BulkCompressor ` is a " lower-level" utility where you should manage where each compacted string should be added in
174- the target `byte [][]`. In the other hand, `ManagedBulkCompressor ` encapsulates and automatizes this process, avoiding you
175- from handle array positions and bounds. This is why we recommend `ManagedBulkCompressor ` (which uses a `BulkCompressor ` internally).
167+ In some cases you may need to fetch your data in batches from a remote location or another third-party service.
168+ java-string-compressor provides both `BulkCompressor` and `ManagedBulkCompressor` specifically for this task.
169+ They help you automate the process of adding each batch to the correct position in the destination array where the
170+ compressed data will be stored. Both currently support `byte[][]` as the destination for the compressed data.
171+
172+ `BulkCompressor` is a "lower-level" utility where you must manage where each compressed string should be added in
173+ the target `byte[][]`. On the other hand, `ManagedBulkCompressor` encapsulates and automates this process, freeing you
174+ from handling array positions and bounds. This is why we recommend `ManagedBulkCompressor` (which uses a `BulkCompressor` internally).
176175
177176Both bulk compressors loop through the data in parallel by calling `IntStream.range().parallel()`.
178177```java