docs/getting-started/example-datasets/laion.md
Lines changed: 17 additions & 16 deletions
@@ -100,15 +100,17 @@ INSERT INTO laion FROM INFILE '{path_to_csv_files}/*.csv'

 Note that the `id` column is just for illustration and is populated by the script with non-unique values.

-## Run a brute-force ANN search (without ANN index) {#run-a-brute-force-ann-search-without-ann-index}
+## Run a brute-force vector similarity search {#run-a-brute-force-vector-similarity-search}

-To run a brute-force approximate nearest neighbor search, run:
+To run a brute-force vector similarity search, run:

 ```sql
 SELECT url, caption FROM laion ORDER BY cosineDistance(image_embedding, {target:Array(Float32)}) LIMIT 10
 ```

-`target` is an array of 512 elements and a client parameter. A convenient way to obtain such arrays will be presented at the end of the article. For now, we can run the embedding of a random LEGO set picture as `target`.
+`target` is an array of 512 elements and a client parameter.
+A convenient way to obtain such arrays will be presented at the end of the article.
+For now, we can use the embedding of a random LEGO set picture as `target`.

 **Result**

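The brute-force query in the hunk above ranks every row by its cosine distance to `target` and keeps the ten closest. As a rough illustration of that idea outside ClickHouse, here is a minimal Python sketch. The toy rows, the 3-element vectors, and the helper names `cosine_distance` and `brute_force_search` are assumptions made for this example; they are not part of the documentation being changed.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_search(rows, target, limit=10):
    """Scan every row, sort by distance to target, keep the closest `limit` rows."""
    ranked = sorted(rows, key=lambda row: cosine_distance(row["embedding"], target))
    return ranked[:limit]


# Hypothetical toy data: 3-element vectors instead of the 512-element embeddings.
rows = [
    {"url": "a", "embedding": [1.0, 0.0, 0.0]},
    {"url": "b", "embedding": [0.0, 1.0, 0.0]},
    {"url": "c", "embedding": [0.9, 0.1, 0.0]},
]
top = brute_force_search(rows, target=[1.0, 0.0, 0.0], limit=2)
print([row["url"] for row in top])  # prints ['a', 'c']
```

Every row is touched once per query, which is why the cost of this approach grows linearly with the table size.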
@@ -129,33 +131,30 @@ SELECT url, caption FROM laion ORDER BY cosineDistance(image_embedding, {target:
 10 rows in set. Elapsed: 4.605 sec. Processed 100.38 million rows, 309.98 GB (21.80 million rows/s., 67.31 GB/s.)
 ```

-## Run a ANN with an ANN index {#run-a-ann-with-an-ann-index}
+## Run an approximate vector similarity search with a vector similarity index {#run-an-approximate-vector-similarity-search-with-a-vector-similarity-index}

-Let's now define ANN indexes on the tables.
+Let's now define two vector similarity indexes on the table.

 ```sql
-SET enable_vector_similarity_index = 1;
-
 ALTER TABLE laion ADD INDEX image_index image_embedding TYPE vector_similarity('hnsw', 'cosineDistance', 512, 'bf16', 64, 256)
-
 ALTER TABLE laion ADD INDEX text_index text_embedding TYPE vector_similarity('hnsw', 'cosineDistance', 512, 'bf16', 64, 256)
-
 ```

-Parameters and performance considerations for index creation and search are described in the [documentation](../../engines/table-engines/mergetree-family/annindexes.md). The above index definition specifies a `hnsw' index using the `cosine distance` as the distance metric with the `hnsw_max_connections_per_layer` parameter set to 64 and the `hnsw_candidate_list_size_for_construction` parameter set to 256. The index uses `bf16` as quantization to optimize memory usage.
+The parameters and performance considerations for index creation and search are described in the [documentation](../../engines/table-engines/mergetree-family/annindexes.md).
+The index definitions above specify an HNSW index that uses cosine distance as the distance metric, with the parameter `hnsw_max_connections_per_layer` set to 64 and the parameter `hnsw_candidate_list_size_for_construction` set to 256.
+The index uses 16-bit brain floats (bfloat16) as quantization to reduce memory usage.

-To build and materialize the index, execute these statements :
+To build and materialize the index, run these statements:

 ```sql
 ALTER TABLE laion MATERIALIZE INDEX image_index;
-
 ALTER TABLE laion MATERIALIZE INDEX text_index;
-
 ```

-Building and saving the index could take a few minutes or even hours depending on the number of rows and HNSW index parameters.
+Building and saving the index could take a few minutes or even hours, depending on the number of rows and HNSW index parameters.
+
+To perform a vector search, just execute the same query again:

-To now perform an ANN search, just execute the same query again :
 ```sql
 SELECT url, caption FROM laion ORDER BY cosineDistance(image_embedding, {target:Array(Float32)}) LIMIT 10
 ```
@@ -179,7 +178,9 @@ SELECT url, caption FROM laion ORDER BY cosineDistance(image_embedding, {target:
 10 rows in set. Elapsed: 0.019 sec. Processed 137.27 thousand rows, 24.42 MB (7.38 million rows/s., 1.31 GB/s.)
 ```

-The query latency decreased significantly because the nearest neighbours were retrieved using the vector index. ANN search using a vector index may return results that differ slightly from the exact KNN search results. HNSW index can potentially achieve a `recall` score close to `1` by careful selection of HNSW parameters and evaluating index quality.
+The query latency decreased significantly because the nearest neighbours were retrieved using the vector index.
+Vector similarity search using a vector similarity index may return results that differ slightly from the brute-force search results.
+An HNSW index can potentially achieve a recall close to 1 (same accuracy as brute-force search) through careful selection of the HNSW parameters and evaluation of the index quality.

 ## Creating embeddings with UDFs {#creating-embeddings-with-udfs}
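The last hunk above refers to recall as the measure of index quality. One common way to estimate it is to run the same query once as a brute-force search and once with the index, then compute the fraction of brute-force results that the indexed search also returned. The sketch below assumes the results are compared by `url`, because the hunk at line 100 notes that `id` holds non-unique values. The function name and the sample URLs are hypothetical.

```python
def recall_at_k(exact_keys, approx_keys):
    """Fraction of the exact top-k results that the approximate search also returned."""
    exact = set(exact_keys)
    if not exact:
        raise ValueError("exact_keys must not be empty")
    return len(exact & set(approx_keys)) / len(exact)


# Hypothetical result keys from a brute-force run and an index-backed run.
exact_top = ["u1", "u2", "u3", "u4", "u5"]
approx_top = ["u1", "u2", "u3", "u5", "u9"]
print(recall_at_k(exact_top, approx_top))  # 4 of 5 shared, prints 0.8
```

A single query gives a noisy estimate, so averaging the value over many different targets is the more informative way to judge a set of HNSW parameters.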