@@ -11,7 +11,7 @@ The dataset is an excellent starter dataset to understand vector embeddings, vec
## Dataset details {#dataset-details}
-
The dataset consists of 26 `Parquet` files located under https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/. The files are named `0.parquet`, `1.parquet`, ..., `25.parquet`. To view some example rows of the dataset, please visit https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M.
+
The dataset contains 26 `Parquet` files located under https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/. The files are named `0.parquet`, `1.parquet`, ..., `25.parquet`. To view some example rows of the dataset, please visit https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M.
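The naming scheme above can be enumerated programmatically. The following is a minimal Python sketch; `BASE_URL` and `parquet_urls` are illustrative names introduced here, not part of the dataset or of ClickHouse tooling.

```python
# Base location of the Parquet files, as given in the text above.
BASE_URL = (
    "https://huggingface.co/api/datasets/Qdrant/"
    "dbpedia-entities-openai3-text-embedding-3-large-1536-1M/"
    "parquet/default/train/"
)


def parquet_urls(n_files: int = 26) -> list[str]:
    """Return the URLs of 0.parquet ... (n_files - 1).parquet."""
    return [f"{BASE_URL}{i}.parquet" for i in range(n_files)]


urls = parquet_urls()
print(len(urls))   # number of files
print(urls[0])     # first file
print(urls[-1])    # last file
```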
## Create table {#create-table}
@@ -30,12 +30,13 @@ CREATE TABLE dbpedia
## Load table {#load-table}
-
To load the dataset from the Parquet files, run the following shell command :
+
To load the dataset from all Parquet files, run the following shell command:
20 rows in set. Elapsed: 0.261 sec. Processed 1.00 million rows, 6.22 GB (3.84 million rows/s., 23.81 GB/s.)
```
Note down the query latency so that we can compare it with the query latency of ANN (using vector index).
Also record the query latency with a cold OS file cache and with `max_threads=1` to recognize the real compute
usage and storage bandwidth usage (extrapolate it to a production dataset with millions of vectors!)
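The extrapolation suggested above can be sketched in a few lines of Python. The figures are copied from the query output shown earlier; the linear scaling of a brute-force scan and the 100 million row target are assumptions made only for illustration. The derived rates should land close to the ones printed by the client, with small differences caused by rounding in the displayed figures.

```python
# Figures copied from the brute-force query output above.
ROWS = 1_000_000       # "Processed 1.00 million rows"
DATA_GB = 6.22         # "6.22 GB"
ELAPSED_SEC = 0.261    # "Elapsed: 0.261 sec."

rows_per_sec = ROWS / ELAPSED_SEC
gb_per_sec = DATA_GB / ELAPSED_SEC


def estimate_scan_seconds(target_rows: int) -> float:
    """Estimate brute-force latency for a larger table.

    Assumption: scan time grows linearly with the row count on the
    same hardware, which ignores caching and parallelism effects.
    """
    return ELAPSED_SEC * target_rows / ROWS


print(f"{rows_per_sec / 1e6:.2f} million rows/s")
print(f"{gb_per_sec:.2f} GB/s")
# Hypothetical production size, chosen only for illustration.
print(f"{estimate_scan_seconds(100_000_000):.1f} s for 100 million rows")
```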
+
## Build Vector Similarity Index {#build-vector-similarity-index}
+
+
Run the following SQL to define and build a vector similarity index on the `vector` column:
+
+
```sql
+
ALTER TABLE dbpedia ADD INDEX vector_index vector TYPE vector_similarity('hnsw', 'cosineDistance', 1536, 'bf16', 64, 512);
+
+
ALTER TABLE dbpedia MATERIALIZE INDEX vector_index;
+
```
+
+
The parameters and performance considerations for index creation and search are described in the [documentation](../../engines/table-engines/mergetree-family/annindexes.md).
+
+
Building and saving the index could take a few minutes, depending on the number of CPU cores available and the storage bandwidth.
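For a rough sense of how much vector data the index has to hold, the following Python sketch multiplies out the numbers from the index definition. It assumes that the third and fourth arguments of `vector_similarity` (`1536` and `'bf16'`) are the vector dimension and the quantization type, as described in the linked documentation. The helper name `vector_payload_bytes` is hypothetical, and the estimate covers only the quantized vectors, not the HNSW graph links or any storage overhead.

```python
# Bytes per vector component: bf16 and f16 are 16-bit formats, f32 is 32-bit.
BYTES_PER_COMPONENT = {"f32": 4, "bf16": 2, "f16": 2}


def vector_payload_bytes(n_vectors: int, dimensions: int, quantization: str) -> int:
    """Size of the raw vector payload, excluding graph and storage overhead."""
    return n_vectors * dimensions * BYTES_PER_COMPONENT[quantization]


# 1 million rows and 1536 dimensions, as in this dataset.
bf16_bytes = vector_payload_bytes(1_000_000, 1536, "bf16")
f32_bytes = vector_payload_bytes(1_000_000, 1536, "f32")

print(f"bf16 payload: {bf16_bytes / 1e9:.3f} GB")
print(f"f32 payload:  {f32_bytes / 1e9:.3f} GB")
```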
+
+
## Perform ANN search {#perform-ann-search}
+
+
_Approximate Nearest Neighbours_ (ANN) refers to a group of techniques (e.g., special data structures like graphs and random forests) that compute results much faster than exact vector search. The result accuracy is typically "good enough" for practical use. Many approximate techniques provide parameters to tune the trade-off between result accuracy and search time.
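To make the baseline concrete, here is a minimal Python sketch of the exact (brute-force) search that ANN techniques approximate. It computes the cosine distance, taken here as 1 minus the cosine similarity, between the query and every stored vector and keeps the `k` closest. The function names and the tiny two-dimensional data are made up for illustration; this is not how ClickHouse implements the search.

```python
import heapq
import math


def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine similarity of a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exact_top_k(query, vectors, k):
    """Scan every vector and return the k (distance, key) pairs closest to query."""
    scored = ((cosine_distance(query, vec), key) for key, vec in vectors.items())
    return heapq.nsmallest(k, scored)


# Toy data, invented for this example.
vectors = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
    "d": [-1.0, 0.0],
}
result = exact_top_k([1.0, 0.1], vectors, k=2)
print(result)
```

The cost of this scan grows with the number of stored vectors, which is the work an ANN index tries to avoid by visiting only a small part of the data.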
+
+
Once the vector similarity index has been built, vector search queries will automatically use the index: