The [dbpedia dataset](https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M) contains 1 million articles from Wikipedia and their vector embeddings generated using the `text-embedding-3-large` model from OpenAI.

The dataset is an excellent starter dataset to understand vector embeddings, vector similarity search and Generative AI. We use this dataset to demonstrate [approximate nearest neighbor search](../../engines/table-engines/mergetree-family/annindexes.md) in ClickHouse and a simple but powerful Q&A application.

## Dataset details {#dataset-details}

The dataset contains 26 `Parquet` files located on [huggingface.co](https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/). The files are named `0.parquet`, `1.parquet`, ..., `25.parquet`. To view some example rows of the dataset, please visit this [Hugging Face page](https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M).

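
The file naming scheme above can be sketched as a short Python helper that enumerates all 26 file URLs. This is only an illustration: the helper name `parquet_urls` is hypothetical, and the base URL is copied from the link above.

```python
# Base URL taken from the dataset description above.
BASE_URL = (
    "https://huggingface.co/api/datasets/Qdrant/"
    "dbpedia-entities-openai3-text-embedding-3-large-1536-1M/"
    "parquet/default/train/"
)
NUM_FILES = 26  # files are named 0.parquet ... 25.parquet


def parquet_urls(base_url: str = BASE_URL, num_files: int = NUM_FILES) -> list[str]:
    """Return the URL of every Parquet file, in file-number order (hypothetical helper)."""
    return [f"{base_url}{i}.parquet" for i in range(num_files)]


urls = parquet_urls()
print(len(urls))   # number of files
print(urls[0])     # first file
print(urls[-1])    # last file
```

Such a list is handy for checking that every file was fetched before loading the table.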
## Create table {#create-table}

Create the `dbpedia` table to store the article id, title, text and embedding vector:

```sql
CREATE TABLE dbpedia
-- [...]
```

## Load table {#load-table}

To load the dataset from all Parquet files, run the following shell command:

[...]

```
20 rows in set. Elapsed: 0.261 sec. Processed 1.00 million rows, 6.22 GB (3.84 million rows/s., 23.81 GB/s.)
```

Note down the query latency so that we can compare it with the query latency of ANN (using the vector index).
Also record the query latency with a cold OS file cache and with `max_threads=1` to gauge the actual compute
and storage bandwidth usage (and extrapolate it to a production dataset with millions of vectors).

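
To see where the figures in the query statistics come from, the following back-of-the-envelope sketch recomputes them in Python. It assumes the embeddings are stored as 4-byte `Float32` values (the table definition is not fully shown here, so this is an assumption), and the 100-million-row dataset is purely hypothetical.

```python
ROWS = 1_000_000          # rows in the dbpedia dataset
DIMS = 1536               # dimensions per embedding
BYTES_PER_FLOAT32 = 4     # assumption: vectors stored as Float32

# Raw size of the vector data that a brute-force search has to read.
vector_bytes = ROWS * DIMS * BYTES_PER_FLOAT32

# Figures reported by the query above.
processed_bytes = 6.22e9
elapsed_sec = 0.261

throughput_gb_s = processed_bytes / elapsed_sec / 1e9
rows_per_sec = ROWS / elapsed_sec

# Hypothetical production dataset: data scanned per brute-force query.
hypothetical_rows = 100_000_000
hypothetical_bytes = hypothetical_rows * DIMS * BYTES_PER_FLOAT32

print(f"vector data: {vector_bytes / 1e9:.3f} GB")
print(f"throughput:  {throughput_gb_s:.2f} GB/s")
print(f"rows/s:      {rows_per_sec / 1e6:.2f} million")
print(f"100M rows:   {hypothetical_bytes / 1e9:.1f} GB per query")
```

The vector data alone accounts for about 6.14 GB of the 6.22 GB processed, which suggests that a brute-force search is dominated by reading every vector. The small differences from the reported rates come from rounding in the displayed values.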
## Build a vector similarity index {#build-vector-similarity-index}

Run the following SQL to define and build a vector similarity index on the `vector` column:

```sql
ALTER TABLE dbpedia ADD INDEX vector_index vector TYPE vector_similarity('hnsw', 'cosineDistance', 1536, 'bf16', 64, 512);
```

[...]

Building and saving the index could take a few minutes depending on number of CP[...]

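
The index definition above uses `bf16` quantization for 1536-dimensional vectors. The sketch below illustrates, under stated assumptions, what that means for precision and size. It assumes that `bf16` keeps the upper 16 bits of a 32-bit float and models this by simple truncation (real implementations may round instead), and the size estimate covers only the quantized vectors, not the HNSW graph links. The helper `to_bf16` is hypothetical.

```python
import struct


def to_bf16(x: float) -> float:
    """Approximate bfloat16 by truncating a float32 to its top 16 bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


ROWS = 1_000_000
DIMS = 1536

f32_bytes = ROWS * DIMS * 4   # unquantized Float32 vectors
bf16_bytes = ROWS * DIMS * 2  # quantized vectors (graph links not counted)

print(to_bf16(1.0))   # exactly representable, unchanged
print(to_bf16(0.1))   # loses low-order mantissa bits
print(f"{f32_bytes / 1e9:.3f} GB -> {bf16_bytes / 1e9:.3f} GB")
```

Halving the bytes per component halves the vector data the index has to hold, at the cost of a small loss of precision per component.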
_Approximate Nearest Neighbours_ or ANN refers to a group of techniques (e.g., special data structures like graphs and random forests) which compute results much faster than exact vector search. The result accuracy is typically "good enough" for practical use. Many approximate techniques provide parameters to tune the trade-off between result accuracy and search time.
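
For contrast, the exact search that ANN techniques approximate can be sketched in a few lines of Python. This is a minimal illustration on toy 2-dimensional vectors, not how ClickHouse implements it. The function names are hypothetical, cosine distance is taken as 1 minus cosine similarity, and all vectors are assumed to be non-zero.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exact_top_k(query: list[float], vectors: list[list[float]], k: int) -> list[int]:
    """Brute force: score every vector, then return the indices of the k closest."""
    scored = sorted((cosine_distance(query, v), i) for i, v in enumerate(vectors))
    return [i for _, i in scored[:k]]


# Toy data standing in for the 1536-dimensional embeddings.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
nearest = exact_top_k([1.0, 0.0], vectors, k=2)
print(nearest)
```

The exact search touches every vector for every query, so its cost grows linearly with the dataset size. An ANN index avoids most of those comparisons in exchange for a small chance of missing a true neighbour.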

Once the vector similarity index has been built, vector search queries will automatically use the index: