Commit e3f3e48

Update laion.md
1 parent ee2d16d commit e3f3e48

docs/getting-started/example-datasets/laion.md

Lines changed: 17 additions & 16 deletions
@@ -100,15 +100,17 @@ INSERT INTO laion FROM INFILE '{path_to_csv_files}/*.csv'
100100

101101
Note that the `id` column is just for illustration and is populated by the script with non-unique values.
102102

103-
## Run a brute-force ANN search (without ANN index) {#run-a-brute-force-ann-search-without-ann-index}
103+
## Run a brute-force vector similarity search {#run-a-brute-force-vector-similarity-search}
104104

105-
To run a brute-force approximate nearest neighbor search, run:
105+
To run a brute-force (exact) vector similarity search, run:
106106

107107
```sql
108108
SELECT url, caption FROM laion ORDER BY cosineDistance(image_embedding, {target:Array(Float32)}) LIMIT 10
109109
```
110110

111-
`target` is an array of 512 elements and a client parameter. A convenient way to obtain such arrays will be presented at the end of the article. For now, we can run the embedding of a random LEGO set picture as `target`.
111+
`target` is an array of 512 elements and a client parameter.
112+
A convenient way to obtain such arrays will be presented at the end of the article.
113+
For now, we can use the embedding of a random LEGO set picture as `target`.
112114

113115
**Result**
114116

@@ -129,33 +131,30 @@ SELECT url, caption FROM laion ORDER BY cosineDistance(image_embedding, {target:
129131
10 rows in set. Elapsed: 4.605 sec. Processed 100.38 million rows, 309.98 GB (21.80 million rows/s., 67.31 GB/s.)
130132
```
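
If no external embedding is at hand, one way to try the brute-force query is to reuse the embedding of a row already stored in the table as the search vector. The following is a minimal sketch under that assumption; the scalar subquery and the alias `target` are illustrative and not part of the original article.

```sql
-- Assumption: any stored row can serve as the search vector.
-- The scalar subquery picks one arbitrary image embedding from the table.
WITH (SELECT image_embedding FROM laion LIMIT 1) AS target
SELECT url, caption
FROM laion
-- Sort all rows by their cosine distance to the chosen embedding.
ORDER BY cosineDistance(image_embedding, target)
LIMIT 10
```

Because the chosen row has a distance of roughly zero to itself, it should appear at or near the top of the result.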
131133

132-
## Run a ANN with an ANN index {#run-a-ann-with-an-ann-index}
134+
## Run an approximate vector similarity search with a vector similarity index {#run-an-approximate-vector-similarity-search-with-a-vector-similarity-index}
133135

134-
Let's now define ANN indexes on the tables.
136+
Let's now define two vector similarity indexes on the table.
135137

136138
```sql
137-
SET enable_vector_similarity_index = 1;
138-
139139
ALTER TABLE laion ADD INDEX image_index image_embedding TYPE vector_similarity('hnsw', 'cosineDistance', 512, 'bf16', 64, 256)
140-
141140
ALTER TABLE laion ADD INDEX text_index text_embedding TYPE vector_similarity('hnsw', 'cosineDistance', 512, 'bf16', 64, 256)
142-
143141
```
144142

145-
Parameters and performance considerations for index creation and search are described in the [documentation](../../engines/table-engines/mergetree-family/annindexes.md). The above index definition specifies a `hnsw' index using the `cosine distance` as the distance metric with the `hnsw_max_connections_per_layer` parameter set to 64 and the `hnsw_candidate_list_size_for_construction` parameter set to 256. The index uses `bf16` as quantization to optimize memory usage.
143+
The parameters and performance considerations for index creation and search are described in the [documentation](../../engines/table-engines/mergetree-family/annindexes.md).
144+
The above index definition specifies an HNSW index that uses the cosine distance as the distance metric, with the parameter `hnsw_max_connections_per_layer` set to 64 and the parameter `hnsw_candidate_list_size_for_construction` set to 256.
145+
The index uses 16-bit brain floats (bfloat16) as quantization to reduce memory usage.
146146

147-
To build and materialize the index, execute these statements :
147+
To build and materialize the index, run these statements:
148148

149149
```sql
150150
ALTER TABLE laion MATERIALIZE INDEX image_index;
151-
152151
ALTER TABLE laion MATERIALIZE INDEX text_index;
153-
154152
```
155153

156-
Building and saving the index could take a few minutes or even hours depending on the number of rows and HNSW index parameters.
154+
Building and saving the index could take a few minutes or even hours, depending on the number of rows and HNSW index parameters.
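
Since `MATERIALIZE INDEX` runs as a mutation in the background, its progress can be followed in the `system.mutations` table. The query below is a sketch of such a check; the choice of columns is an illustration, not a recommendation from the original article.

```sql
-- List the mutations of the laion table and their progress.
-- is_done = 1 means the mutation (here: the index build) has finished;
-- parts_to_do is the number of data parts that still have to be processed.
SELECT command, is_done, parts_to_do, latest_fail_reason
FROM system.mutations
WHERE table = 'laion'
ORDER BY create_time
```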
155+
156+
To perform a vector search, just execute the same query again:
157157

158-
To now perform an ANN search, just execute the same query again :
159158
```sql
160159
SELECT url, caption FROM laion ORDER BY cosineDistance(image_embedding, {target:Array(Float32)}) LIMIT 10
161160
```
@@ -179,7 +178,9 @@ SELECT url, caption FROM laion ORDER BY cosineDistance(image_embedding, {target:
179178
10 rows in set. Elapsed: 0.019 sec. Processed 137.27 thousand rows, 24.42 MB (7.38 million rows/s., 1.31 GB/s.)
180179
```
181180

182-
The query latency decreased significantly because the nearest neighbours were retrieved using the vector index. ANN search using a vector index may return results that differ slightly from the exact KNN search results. HNSW index can potentially achieve a `recall` score close to `1` by careful selection of HNSW parameters and evaluating index quality.
181+
The query latency decreased significantly because the nearest neighbours were retrieved using the vector index.
182+
Vector similarity search using a vector similarity index may return results that differ slightly from the brute-force search results.
183+
An HNSW index can potentially achieve a recall close to 1 (the same accuracy as brute-force search) with careful selection of the HNSW parameters and evaluation of the index quality.
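
To check whether a query actually uses the vector similarity index, its plan can be inspected with `EXPLAIN indexes = 1`. The sketch below also sets `hnsw_candidate_list_size_for_search`, a search-time setting that trades speed for recall. The value 512 is an arbitrary example, not a tuned recommendation.

```sql
-- Show which indexes the query plan uses; image_index is expected to be listed
-- if the vector similarity index is picked up.
EXPLAIN indexes = 1
SELECT url, caption
FROM laion
ORDER BY cosineDistance(image_embedding, {target:Array(Float32)})
LIMIT 10
-- Assumption: a larger candidate list tends to improve recall at the cost of latency.
SETTINGS hnsw_candidate_list_size_for_search = 512
```

Comparing the top results of this query against the brute-force results for a handful of search vectors gives a rough estimate of the recall.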
183184

184185
## Creating embeddings with UDFs {#creating-embeddings-with-udfs}
185186
