Commit ee68aed

committed
Tweaks.
1 parent 2fd3487 commit ee68aed

1 file changed

+26
-10
lines changed

README.md

Lines changed: 26 additions & 10 deletions
@@ -116,25 +116,41 @@ The most common approach is to store each compressed string ordered in memory us
116116
a B+Tree if you need frequent insertions (coming in the next release).
117117
The frequency of reads and writes, together with business requirements, will determine the best storage medium and data structure to use.
118118

119-
If the data is ordered before compression and stored in-memory in a `byte[][]`, you can use the full power of the binary search directly in the compressed data
120-
through `FourBitBinarySearch`, `FiveBitBinarySearch`, and `SixBitBinarySearch`.
119+
If the data is ordered before compression and stored in-memory in a `byte[][]`, you can use the full power of the binary
120+
search directly in the compressed data through `FourBitBinarySearch`, `FiveBitBinarySearch`, and `SixBitBinarySearch`.
121121

122122
### Binary search
123123
Executing a binary search on compressed data is as simple as:
124124
```java
125125
byte[][] compressedData = new byte[100000000][]; // Data for 100 million customers.
126-
127-
SixBitBinarySearch binary = new SixBitBinarySearch(compressedData, false);
126+
// ...
127+
SixBitBinarySearch binary = new SixBitBinarySearch(compressedData, false); // false == exact-match search.
128128
int index = binary.search("key");
129129
```
130-
But this is not a realistic use case. Let's walk through a real-world scenario:
130+
Note that ```compressedData``` does not need to be completely filled. It could hold 70 million entries, for example,
131+
and the binary search would still work. The array is typically allocated with spare capacity for new entries, and
132+
new entries are usually appended at the end (for example, by using incremental IDs) rather than inserted in the
133+
middle, so the unused slots (nulls) all sit at the end of the array.
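
To illustrate why trailing nulls do not disturb the search, here is a minimal sketch of a binary search over a sorted `byte[][]` whose unused slots sit at the end. It is not the library's implementation: the class `TrailingNullSearchSketch` and its methods are hypothetical names, and it compares raw ASCII bytes rather than compressed data.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; the library's own search classes may work differently.
final class TrailingNullSearchSketch {

    /** Returns the index of key in a sorted array whose unused slots (nulls) are at the end, or -1. */
    static int search(byte[][] sorted, byte[] key) {
        int low = 0;
        int high = sorted.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            byte[] candidate = sorted[mid];
            if (candidate == null) {
                // Unused slot: every real entry lies to the left of it.
                high = mid - 1;
                continue;
            }
            int cmp = compare(candidate, key);
            if (cmp < 0) {
                low = mid + 1;
            } else if (cmp > 0) {
                high = mid - 1;
            } else {
                return mid;
            }
        }
        return -1;
    }

    /** Lexicographic comparison treating bytes as unsigned values. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** Helper for the example: encodes a string as ASCII bytes. */
    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[][] data = new byte[8][]; // capacity 8, only 3 slots used
        data[0] = ascii("alice");
        data[1] = ascii("bob");
        data[2] = ascii("carol");
        System.out.println(search(data, ascii("bob")));  // prints 1
        System.out.println(search(data, ascii("dave"))); // prints -1
    }
}
```

Treating a null slot as "greater than any key" keeps the loop a plain binary search, which is one reason appending at the end (rather than leaving gaps in the middle) is convenient.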
134+
135+
A more realistic approach is to organize your data with a unique prefix (usually an ID) and search for it. For example,
136+
imagine each customer record in ```compressedData``` is organized like this:
137+
```java
138+
// ID # FullName # PhoneNumber # Address
131139

132-
Imagine the company you are working with have 70 million customers. You can't create an array with that exact number of
133-
elements because otherwise you will have no space to add further customers to your data pool (usually with some incremental
134-
ID implementation to avoid adding in the middle, but always at the end of the array). In this case, we can extend the size
135-
to accommodate incoming customers by making the array bigger, like in the example above with:
136-
```byte[][] compressedData = new byte[100000000][]; // Data for 100 million customers.```
140+
"63821623849863628763#John Doe#(555) 555-1234#123 Main Street Anytown, CA 91234-5678"
141+
```
142+
We could find it like this:
143+
```java
144+
SixBitBinarySearch binary = new SixBitBinarySearch(compressedData, true); // true == prefix search.
145+
int index = binary.search("63821623849863628763#");
146+
147+
if (index >= 0) {
148+
    byte[] found = compressedData[index];
149+
    String decompressed = compressor.decompress(found);
150+
}
151+
```
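
Conceptually, a prefix search relies on the fact that, in sorted data, all records sharing a prefix are contiguous. The sketch below shows that idea on plain, uncompressed strings. It is an assumption-laden illustration rather than the library's code: `PrefixSearchSketch` and `searchPrefix` are hypothetical names, and the sample records are made up.

```java
// Hypothetical illustration only; it works on uncompressed strings, not on the library's compressed format.
final class PrefixSearchSketch {

    /** Returns the index of some entry starting with prefix, or -1. Nulls must be at the end. */
    static int searchPrefix(String[] sorted, String prefix) {
        int low = 0;
        int high = sorted.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            String candidate = sorted[mid];
            if (candidate == null) {
                // Unused slot: real entries are to the left.
                high = mid - 1;
            } else if (candidate.startsWith(prefix)) {
                return mid;
            } else if (candidate.compareTo(prefix) < 0) {
                // Candidate sorts before the prefix, so any match is to the right.
                low = mid + 1;
            } else {
                // Candidate sorts after every string with this prefix.
                high = mid - 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] records = new String[6]; // spare capacity left as nulls
        records[0] = "1001#Ann Lee#(555) 555-0001#1 First Street";
        records[1] = "1002#Bob Ray#(555) 555-0002#2 Second Street";
        records[2] = "1003#Cy Poe#(555) 555-0003#3 Third Street";
        System.out.println(searchPrefix(records, "1002#")); // prints 1
        System.out.println(searchPrefix(records, "1004#")); // prints -1
    }
}
```

Including the `#` separator in the search key, as in the example above, prevents an ID such as `100` from matching `1001#...`.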
137152

153+
In case you used a custom character set to compress
138154

139155
### B+Tree
140156
