Commit 4d8e006

Tweaks.
1 parent edebe5d commit 4d8e006

1 file changed

+45
-46
lines changed

README.md

Lines changed: 45 additions & 46 deletions
@@ -2,8 +2,7 @@
22
Ultra-fast, zero-allocation string compression library. Up to 50% memory reduction.
33

44
Fast! It takes only a few milliseconds to compress a 10 MB string. Check out the benchmarks.<br/>
5-
Well tested! See the test directory for usage examples and edge cases.
6-
5+
Thoroughly tested! See the test directory for usage examples and edge cases.
76
```java
87
String data = "Assume this is a 100 MB string...";
98
byte[] c;
@@ -35,26 +34,26 @@ c = new SixBitAsciiCompressor().compress(data); // c is 75 MB.
3534
```java
3635
implementation("io.github.dannemann:java-string-compressor:1.0.0")
3736
```
38-
Or download the lastest JAR from: https://github.com/Dannemann/java-string-compressor/releases
37+
Or download the latest JAR from: https://github.com/Dannemann/java-string-compressor/releases
3938

4039
## Documentation
41-
This library exits to quickly compress a massive volume of strings.
42-
Very useful if you need massive data allocated in memory for quick access or compacted for storage.
43-
We achieve this by removing all unnecessary bits from a character. But how?
40+
This library exists to quickly compress massive volumes of strings.
41+
It is very useful when you need large datasets allocated in memory for quick access or compacted for storage.
42+
We achieve this by removing all unnecessary bits from each character. But how?
4443

4544
An ASCII character is stored in an 8-bit byte (`00000000` to `11111111`), but standard ASCII only uses 7 of those bits.
46-
This gives us 128 different slots to represent characters.
47-
But a lot of times we do not need all those characters, only a small sub-set of them.
48-
For example, if your data only has numbers (0-9) and a few punctuations, 16 different characters can be enough to
45+
This gives us 128 different slots to represent characters.
46+
However, sometimes we need only a small subset of those characters rather than the entire set.
47+
For example, if your data only contains numbers (0-9) and a few punctuation marks, 16 different characters can be enough to
4948
represent them, and we only need 4 bits (`0000` to `1111`) to represent 16 characters.
50-
But if your data only has letters (A-Z, like customer names), a set of 32 different characters is enough, which can be
49+
If your data only contains letters (A-Z, like customer names), a set of 32 different characters is sufficient, which can be
5150
represented by 5 bits.
52-
And if you need both, 6 bits are enough.
51+
But if you need both letters and numbers, 6 bits are sufficient.
5352
This way we can remove those unnecessary bits and store only the ones we need.
54-
And this is exactly was this library do.
53+
This is exactly what this library does.
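
To make the idea of dropping unused bits concrete, here is a minimal sketch of 4-bit packing. It is not the library's implementation: the class `NibblePackingSketch`, its 16-character `CHARSET`, and the choice to pass the original length to `unpack` are all assumptions made for illustration.

```java
/** Illustrative only: packs two 4-bit codes per byte. Not the library's code. */
final class NibblePackingSketch {

    // Hypothetical 16-character set: ten digits plus six punctuation marks.
    static final String CHARSET = "0123456789 #,-./";

    /** Smallest number of bits that can index charsetSize distinct characters. */
    static int bitsPerChar(int charsetSize) {
        if (charsetSize < 1) {
            throw new IllegalArgumentException("charsetSize must be positive");
        }
        return 32 - Integer.numberOfLeadingZeros(charsetSize - 1);
    }

    /** Stores each character as its 4-bit position in CHARSET, two per byte. */
    static byte[] pack(String text) {
        byte[] out = new byte[(text.length() + 1) / 2];
        for (int i = 0; i < text.length(); i++) {
            int code = CHARSET.indexOf(text.charAt(i));
            if (code < 0) {
                throw new IllegalArgumentException("Unsupported character: " + text.charAt(i));
            }
            // Even positions fill the high nibble, odd positions the low nibble.
            out[i / 2] |= (i % 2 == 0) ? (code << 4) : code;
        }
        return out;
    }

    /** Reverses pack(); the original length is assumed to be known by the caller. */
    static String unpack(byte[] packed, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            int b = packed[i / 2] & 0xFF;
            int code = (i % 2 == 0) ? (b >>> 4) : (b & 0x0F);
            sb.append(CHARSET.charAt(code));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] packed = pack("2024-01-15");
        System.out.println(packed.length);          // 5 bytes for 10 characters
        System.out.println(unpack(packed, 10));     // 2024-01-15
    }
}
```

The same arithmetic explains the other compressors: `bitsPerChar(32)` is 5 and `bitsPerChar(64)` is 6.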
5554

56-
Another important feature is searching. This library not only supports compacting, but also binary searching on the
57-
compacted data itself without deflating it, which will be explained later.
55+
Another important feature is searching. This library not only supports compression, but also binary searching on the
56+
compressed data itself without decompressing it, which will be explained later.
5857

5958
To compress a string, you can easily use either `FourBitAsciiCompressor`, `FiveBitAsciiCompressor`, or `SixBitAsciiCompressor`.
6059

@@ -64,7 +63,7 @@ var compressor = new SixBitAsciiCompressor();
6463
```
6564

6665
#### Defining your custom character set
67-
Each compressor have a set of default supported characters which are defined in fields
66+
Each compressor has a set of default supported characters which are defined in fields
6867
`FourBitAsciiCompressor.DEFAULT_4BIT_CHARSET`, `FiveBitAsciiCompressor.DEFAULT_5BIT_CHARSET`, and `SixBitAsciiCompressor.DEFAULT_6BIT_CHARSET`.
6968
If you need a custom character set, use constructors with parameter `supportedCharset`:
7069
```java
@@ -75,20 +74,20 @@ var compressor = new FourBitAsciiCompressor(myCustom4BitCharset);
7574
**Important:** The order in which you list characters in this array matters, as it defines the lexicographic
7675
order the binary search will follow. It's good practice to define your custom charset in standard ASCII order, like the example above.
7776

78-
#### Catching invalid characters (useful for testing an debugging)
77+
#### Catching invalid characters (useful for testing and debugging)
7978
It’s useful to validate the input and throw errors when invalid characters are found.
8079
You can enable character validation by using any constructor with `throwException` parameter.
81-
Validations aren't recommended for production because you will probably be allocating massive amounts of gigabytes, and
82-
you don't want a single invalid character to halt the whole processes.
80+
Validations are not recommended for production because you will likely be allocating massive amounts of gigabytes, and
81+
you don't want a single invalid character to halt the entire process.
8382
It’s better to occasionally display an incorrect character than to abort the entire operation.
8483
```java
8584
public FiveBitAsciiCompressor(boolean throwException)
8685
```
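
The trade-off between failing fast and carrying on can be sketched as follows. This is an assumption-laden illustration, not the library's validation code: `CharsetValidationSketch` and `firstInvalid` are hypothetical names, and only the `throwException` flag mirrors the constructor parameter shown above.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: checks input bytes against a supported 7-bit ASCII set. */
final class CharsetValidationSketch {

    private final boolean[] supported = new boolean[128];
    private final boolean throwException;

    CharsetValidationSketch(byte[] supportedCharset, boolean throwException) {
        for (byte b : supportedCharset) {
            if (b < 0) {
                throw new IllegalArgumentException("Charset must be 7-bit ASCII");
            }
            supported[b] = true;
        }
        this.throwException = throwException;
    }

    /**
     * Returns the index of the first unsupported byte, or -1 if all bytes are valid.
     * When throwException is true, an invalid byte raises an exception instead.
     */
    int firstInvalid(byte[] input) {
        for (int i = 0; i < input.length; i++) {
            byte b = input[i];
            if (b < 0 || !supported[b]) {
                if (throwException) {
                    throw new IllegalArgumentException(
                            "Invalid character '" + (char) (b & 0xFF) + "' at index " + i);
                }
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] digits = "0123456789".getBytes(StandardCharsets.US_ASCII);
        byte[] input = "12a4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(new CharsetValidationSketch(digits, false).firstInvalid(input)); // 2
    }
}
```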
8786

88-
#### Preserving source byte arrays (useful for testing an debugging)
89-
Whenever possible, try to read straight bytes from your input source without creating `String` objects from them.
90-
This will keep your whole compressing process zero-allocation (like this library), which boosts performance and memory saving.
91-
But, by dealing directly with `byte[]` instead of `Strings`, you will notice that the compressor overwrites the original
87+
#### Preserving source byte arrays (useful for testing and debugging)
88+
Whenever possible, try to read bytes directly from your input source without creating `String` objects from them.
89+
This will keep your entire compression process zero-allocation (like this library), which boosts performance and reduces memory usage.
90+
However, when dealing directly with `byte[]` instead of `Strings`, you will notice that the compressor overwrites the original
9291
input byte array to minimize memory usage, making it unusable.
9392
To avoid this behavior and compress a copy of the original, enable input preservation by using any constructor with `preserveOriginal` parameter.
9493
```java
@@ -103,37 +102,37 @@ Once the compressor is instantiated, the compress and decompress process is stra
103102
String string = new String(decompressed, StandardCharsets.ISO_8859_1);
104103
// String string = AsciiCompressor.getString(decompressed); // Same as above. Recommended.
105104
```
106-
We recommend using `AsciiCompressor.getString(byte[])` because the method can be updated whenever a most efficient way to encode a `String` is found.
105+
We recommend using `AsciiCompressor.getString(byte[])` because the method can be updated whenever a more efficient way to encode a `String` is found.
107106

108107
**In case you can't work directly with byte arrays and need `String` objects for compression:**
109-
To extract ASCII bytes from a `String` in the most efficient way (for compression), do `AsciiCompressor.getBytes(String)`.
110-
But the overloaded version `compressor.compress(String)` already calls it automatically, so, just call the overloaded version.
108+
To extract ASCII bytes from a `String` in the most efficient way (for compression), use `AsciiCompressor.getBytes(String)`.
109+
However, the overloaded version `compressor.compress(String)` already calls it automatically, so just call the overloaded version.
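
For reference, the plain-JDK conversion between a `String` and single-byte data looks like the sketch below. It uses `ISO_8859_1`, the same charset as the decompression example above. Whether `AsciiCompressor.getBytes(String)` does exactly this internally is an assumption, and `AsciiBytesSketch` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: one byte per character using the JDK's ISO-8859-1 charset. */
final class AsciiBytesSketch {

    /** Encodes each character (0-255) as exactly one byte; others become '?'. */
    static byte[] toBytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Decodes one byte per character, the inverse of toBytes for ASCII text. */
    static String fromBytes(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] bytes = toBytes("ABC");
        System.out.println(bytes.length);      // 3
        System.out.println(fromBytes(bytes));  // ABC
    }
}
```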
111110

112111
### Where to store the compressed data?
113-
In its purest form, a `String` is just a byte array (`byte[]`), and a compressed `String` couldn't be different.
112+
In its purest form, a `String` is just a byte array (`byte[]`), and a compressed `String` is no different.
114113
You can store it anywhere you would store a `byte[]`. If you are compressing millions of different entries, a very common
115114
approach is to store each compressed string ordered in memory using a `byte[][]` (for binary search) or a B+Tree if you
116-
need frequent insertions (coming in the next release). The frequency of reads and writes + business requirements will
117-
tell the best media and data structure to use.
115+
need frequent insertions (coming in the next release). The frequency of reads and writes plus business requirements will
116+
determine the best storage medium and data structure to use.
118117

119-
If the data is ordered before compression and stored in-memory in a `byte[][]` as mentioned above, you can use the full power of the binary
120-
search directly in the compressed data through `FourBitBinarySearch`, `FiveBitBinarySearch`, and `SixBitBinarySearch`.
118+
If the data is ordered before compression and stored in memory in a `byte[][]` as mentioned above, you can use the full power of binary
119+
search directly on the compressed data through `FourBitBinarySearch`, `FiveBitBinarySearch`, and `SixBitBinarySearch`.
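
As a rough illustration of the "ordered `byte[][]`" layout, the sketch below sorts and searches plain (uncompressed) ASCII byte arrays with JDK utilities only. It does not use or reproduce the library's search classes, which operate on the packed data; `SortedByteArraysSketch` and its methods are hypothetical, and `Arrays.compareUnsigned` requires Java 9 or later.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;

/** Illustrative only: keeps byte arrays in lexicographic order for binary search. */
final class SortedByteArraysSketch {

    // Unsigned lexicographic order, so bytes above 0x7F would still sort last.
    static final Comparator<byte[]> ORDER = Arrays::compareUnsigned;

    /** Encodes the entries as ASCII bytes and sorts them. */
    static byte[][] sorted(String... entries) {
        byte[][] data = new byte[entries.length][];
        for (int i = 0; i < entries.length; i++) {
            data[i] = entries[i].getBytes(StandardCharsets.US_ASCII);
        }
        Arrays.sort(data, ORDER);
        return data;
    }

    /** Returns the index of key, or a negative value if it is absent. */
    static int indexOf(byte[][] data, String key) {
        return Arrays.binarySearch(data, key.getBytes(StandardCharsets.US_ASCII), ORDER);
    }

    public static void main(String[] args) {
        byte[][] data = sorted("carol", "alice", "bob");
        System.out.println(indexOf(data, "bob"));       // 1
        System.out.println(indexOf(data, "dave") < 0);  // true
    }
}
```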
121120

122121
### Binary search
123-
Executing a binary search in compressed data is simple as:
122+
Executing a binary search on compressed data is as simple as:
124123
```java
125124
byte[][] compressedData = new byte[100000000][]; // Data for 100 million customers.
126125
// ...
127126
SixBitBinarySearch binary = new SixBitBinarySearch(compressedData, false); // false == exact-match search.
128127
int index = binary.search("key");
129128
```
130-
It is important to note that ```compressedData``` does not need to be completely filled. It could have 70 million entries
129+
It is important to note that `compressedData` does not need to be completely filled. It could have 70 million entries
131130
and the binary search would still work. This is because the array of compressed data typically has extra space to
132-
accommodate new entries (usually with some incremental ID implementation to avoid adding in the middle, but always at
131+
accommodate new entries (usually with some incremental ID implementation to avoid insertions in the middle, but always at
133132
the end of the array), so unused slots (nulls) are placed at the end.
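
One way to search an array whose tail is still empty is to limit the search to the filled range. The sketch below shows this on plain byte arrays with JDK calls. How the library itself handles the trailing nulls is not stated here, so treat `PartiallyFilledSearchSketch` and its methods as hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Illustrative only: binary search over the filled prefix of a sparse array. */
final class PartiallyFilledSearchSketch {

    static final Comparator<byte[]> ORDER = Arrays::compareUnsigned;

    /** Counts the leading non-null slots, assuming nulls appear only at the end. */
    static int filledCount(byte[][] data) {
        int lo = 0;
        int hi = data.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (data[mid] != null) {
                lo = mid + 1;   // filled slot: the boundary is to the right
            } else {
                hi = mid;       // empty slot: the boundary is here or to the left
            }
        }
        return lo;
    }

    /** Searches only the slots in [0, count), so nulls are never compared. */
    static int search(byte[][] data, int count, byte[] key) {
        return Arrays.binarySearch(data, 0, count, key, ORDER);
    }
}
```

Because new entries are appended at the end, the filled count can also simply be tracked in a counter instead of being recomputed.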
134133

135134
A more realistic approach is to organize your data with a unique prefix (usually an ID) and search for it. For example,
136-
imagine each customer data in ```compressedData``` is organized like this:
135+
imagine each customer data entry in `compressedData` is organized like this:
137136
```java
138137
// ID # FullName # PhoneNumber # Address
139138

@@ -148,7 +147,7 @@ if (index >= 0) {
148147
byte[] found = compressedData[index];
149148
String decompressed = compressor.decompress(found);
150149
```
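
The idea behind prefix search can be sketched on uncompressed bytes: compare only the first `prefix.length` bytes of each entry while bisecting. This is not the library's algorithm for packed data. `PrefixSearchSketch`, `prefixSearch`, and the sample records are hypothetical, and the sketch assumes fixed-width IDs so that byte order matches ID order.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustrative only: binary search that matches on a leading key such as "1002#". */
final class PrefixSearchSketch {

    /** Returns the index of an entry starting with prefix, or a negative value. */
    static int prefixSearch(byte[][] data, int count, byte[] prefix) {
        int lo = 0;
        int hi = count - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            byte[] entry = data[mid];
            // Compare only as many bytes of the entry as the prefix has.
            int end = Math.min(entry.length, prefix.length);
            int cmp = Arrays.compareUnsigned(entry, 0, end, prefix, 0, prefix.length);
            if (cmp < 0) {
                lo = mid + 1;
            } else if (cmp > 0) {
                hi = mid - 1;
            } else {
                return mid;
            }
        }
        return -(lo + 1);
    }

    public static void main(String[] args) {
        byte[][] data = {
            "1001#Alice#555-0101".getBytes(StandardCharsets.US_ASCII),
            "1002#Bob#555-0102".getBytes(StandardCharsets.US_ASCII),
            "1003#Carol#555-0103".getBytes(StandardCharsets.US_ASCII),
        };
        byte[] key = "1002#".getBytes(StandardCharsets.US_ASCII);
        System.out.println(prefixSearch(data, data.length, key)); // 1
    }
}
```

Including the separator in the key (`1002#` rather than `1002`) keeps an ID from matching a longer ID that merely starts with the same digits.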
151-
In case you used are using a custom character set to compress the data, you need to pass it through the binary search constructor:
150+
If you are using a custom character set to compress the data, you need to pass it to the binary search constructor:
152151
```java
153152
public FiveBitBinarySearch(byte[][] compressedData, boolean prefixSearch, byte[] charset)
154153
```
@@ -157,22 +156,22 @@ public FiveBitBinarySearch(byte[][] compressedData, boolean prefixSearch, byte[]
157156
Coming in the next release.
158157

159158
### Other
160-
Do not forget to check the JavaDocs with further information about each member.
159+
Don't forget to check the JavaDocs for further information about each member.
161160
Also check the test directory for additional examples.
162161
163162
### Logging
164-
If you need logging, search for libraries like ZeroLog, ChronicleLog, Log4j 2 Async Loggers, and other similar tools
165-
(we did not test any of those). You will need a fast log library, or it can become a bottleneck.
163+
If you need logging, consider libraries like ZeroLog, ChronicleLog, Log4j 2 Async Loggers, and other similar tools
164+
(we have not tested any of these). You will need a fast logging library, or it can become a bottleneck.
166165
167166
### Bulk / Batch compression
168-
In some rare cases you need to fetch your data in batches from a remote location or another third party actor.
169-
java-string-compressor provides both, `BulkCompressor` and `ManagedBulkCompressor` specifically for this task.
170-
They help you automatize the process of adding each batch to the correct position in the destination array where the
171-
compressed data will be stored. Both currently supports `byte[][]` as destination for the compressed data.
172-
173-
`BulkCompressor` is a "lower-level" utility where you should manage where each compacted string should be added in
174-
the target `byte[][]`. In the other hand, `ManagedBulkCompressor` encapsulates and automatizes this process, avoiding you
175-
from handle array positions and bounds. This is why we recommend `ManagedBulkCompressor` (which uses a `BulkCompressor` internally).
167+
In some cases you may need to fetch your data in batches from a remote location or another third-party service.
168+
java-string-compressor provides both `BulkCompressor` and `ManagedBulkCompressor` specifically for this task.
169+
They help you automate the process of adding each batch to the correct position in the destination array where the
170+
compressed data will be stored. Both currently support `byte[][]` as the destination for the compressed data.
171+
172+
`BulkCompressor` is a "lower-level" utility where you must manage where each compressed string should be added in
173+
the target `byte[][]`. On the other hand, `ManagedBulkCompressor` encapsulates and automates this process, freeing you
174+
from handling array positions and bounds. This is why we recommend `ManagedBulkCompressor` (which uses a `BulkCompressor` internally).
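
The difference between the two styles can be sketched as follows. This is not the API of `BulkCompressor` or `ManagedBulkCompressor`: the class `BatchPlacementSketch`, its methods, and the stand-in `encode` step are hypothetical and only illustrate manual versus tracked offsets, using the JDK's `IntStream.range(...).parallel()`.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.IntStream;

/** Illustrative only: places batches into a destination byte[][] in parallel. */
final class BatchPlacementSketch {

    final byte[][] target;
    int next; // first free slot, tracked only by the "managed" style

    BatchPlacementSketch(int capacity) {
        this.target = new byte[capacity][];
    }

    /** Stand-in for a real compression step. */
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** "Lower-level" style: the caller decides where the batch starts. */
    static void addBatch(byte[][] target, int offset, List<String> batch) {
        if (offset < 0 || offset + batch.size() > target.length) {
            throw new IndexOutOfBoundsException("Batch does not fit at offset " + offset);
        }
        // Each index writes to its own slot, so the parallel writes do not collide.
        IntStream.range(0, batch.size())
                 .parallel()
                 .forEach(i -> target[offset + i] = encode(batch.get(i)));
    }

    /** "Managed" style: the offset is tracked internally between batches. */
    void add(List<String> batch) {
        addBatch(target, next, batch);
        next += batch.size();
    }
}
```

With the managed style, consecutive calls such as `add(firstBatch)` and `add(secondBatch)` fill the array from left to right without the caller computing any index.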
176175
177176
Both bulk compressors loop through the data in parallel by calling `IntStream.range().parallel()`.
178177
```java
