Commit 4eff3ac

Minor formatting fixes
1 parent caf7ac9 commit 4eff3ac

File tree

1 file changed: +21, -19 lines

docs/getting-started/example-datasets/dbpedia.md

Lines changed: 21 additions & 19 deletions
@@ -5,17 +5,17 @@ slug: /getting-started/example-datasets/dbpedia-dataset
 title: 'dbpedia dataset'
 ---
 
-The [dbpedia dataset](https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M) contains 1 million articles from Wikipedia and their vector embeddings generated using `text-embedding-3-large` model from OpenAI.
+The [dbpedia dataset](https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M) contains 1 million articles from Wikipedia and their vector embeddings generated using the `text-embedding-3-large` model from OpenAI.
 
-The dataset is an excellent starter dataset to understand vector embeddings, vector similarity search and Generative AI. We use this dataset to demonstrate [approximate nearest neighbor search](../../engines/table-engines/mergetree-family/annindexes.md) in ClickHouse and a simple but powerful Q & A application.
+The dataset is an excellent starter dataset to understand vector embeddings, vector similarity search and Generative AI. We use this dataset to demonstrate [approximate nearest neighbor search](../../engines/table-engines/mergetree-family/annindexes.md) in ClickHouse and a simple but powerful Q&A application.
 
 ## Dataset details {#dataset-details}
 
-The dataset contains 26 `Parquet` files located under https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/. The files are named `0.parquet`, `1.parquet`, ..., `25.parquet`. To view some example rows of the dataset, please visit https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M.
+The dataset contains 26 `Parquet` files located on [huggingface.co](https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/). The files are named `0.parquet`, `1.parquet`, ..., `25.parquet`. To view some example rows of the dataset, please visit this [Hugging Face page](https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M).
 
 ## Create table {#create-table}
 
-Create the `dbpedia` table to store the article id, title, text and embedding vector :
+Create the `dbpedia` table to store the article id, title, text and embedding vector:
 
 ```sql
 CREATE TABLE dbpedia
@@ -30,13 +30,13 @@ CREATE TABLE dbpedia
 
 ## Load table {#load-table}
 
-To load the dataset from all Parquet files, run the following shell command :
+To load the dataset from all Parquet files, run the following shell command:
 
 ```shell
 $ seq 0 25 | xargs -P1 -I{} clickhouse client -q "INSERT INTO dbpedia SELECT _id, title, text, \"text-embedding-3-large-1536-embedding\" FROM url('https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/{}.parquet') SETTINGS max_http_get_redirects=5,enable_url_encoding=0;"
 ```
 
-Alternatively, individual SQL statements can be run as shown below to load each of the 25 Parquet files :
+Alternatively, individual SQL statements can be run as shown below to load each of the 26 Parquet files:
 
 ```sql
 INSERT INTO dbpedia SELECT _id, title, text, "text-embedding-3-large-1536-embedding" FROM url('https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/0.parquet') SETTINGS max_http_get_redirects=5,enable_url_encoding=0;
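As a quick sanity check of the naming scheme the `seq 0 25 | xargs` pipeline relies on, the 26 per-file URLs it expands to can be generated with a minimal Python sketch (illustrative only, not part of the docs page being edited):

```python
# Build the 26 Parquet file URLs that the shell pipeline above iterates over.
BASE = ("https://huggingface.co/api/datasets/Qdrant/"
        "dbpedia-entities-openai3-text-embedding-3-large-1536-1M/"
        "parquet/default/train")

# seq 0 25 produces indices 0..25 inclusive, i.e. 26 files in total.
urls = [f"{BASE}/{i}.parquet" for i in range(26)]

print(len(urls))   # 26
print(urls[0])     # ends with /0.parquet
```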
@@ -46,7 +46,7 @@ INSERT INTO dbpedia SELECT _id, title, text, "text-embedding-3-large-1536-embedd
 
 ```
 
-Verify that 1 million rows are seen in the `dbpedia` table :
+Verify that 1 million rows are seen in the `dbpedia` table:
 
 ```sql
 SELECT count(*)
@@ -57,14 +57,15 @@ FROM dbpedia
 └─────────┘
 ```
 
-## Semantic Search {#semantic-search}
+## Semantic search {#semantic-search}
 
-Recommended reading : https://platform.openai.com/docs/guides/embeddings
+Recommended reading: ["Vector embeddings" OpenAI guide](https://platform.openai.com/docs/guides/embeddings)
 
-Semantic search (or referred to as _similarity search_) using vector embeddings involves
-the following steps :
+Semantic search (also referred to as _similarity search_) using vector embeddings involves
+the following steps:
 
-- Accept a search query from user in natural language e.g _"Tell me some scenic rail journeys"_, _"Suspense novels set in Europe"_ etc
+- Accept a search query from a user in natural language, e.g. _"Tell me about some scenic rail journeys"_, _"Suspense novels set in Europe"_, etc.
 - Generate embedding vector for the search query using the LLM model
 - Find nearest neighbours to the search embedding vector in the dataset
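The steps above can be sketched end to end in a few lines of Python. This is a toy model, not ClickHouse code: the embeddings are made-up 3-dimensional vectors standing in for the 1536-dimensional LLM embeddings, and `cosine_distance` reimplements the same definition ClickHouse's `cosineDistance` uses (1 minus cosine similarity):

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity, the same metric as ClickHouse's cosineDistance.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

# Toy "dataset" of (title, embedding) pairs; real embeddings come from the LLM.
dataset = [
    ("rail journeys",   [0.9, 0.1, 0.0]),
    ("suspense novels", [0.1, 0.9, 0.2]),
    ("cooking",         [0.0, 0.2, 0.9]),
]

# Stand-in for the embedding of the user's natural-language query.
query_vector = [0.85, 0.15, 0.05]

# Order the dataset by distance to the query vector: nearest neighbours first.
neighbours = sorted(dataset, key=lambda row: cosine_distance(row[1], query_vector))
print(neighbours[0][0])
```

With these made-up vectors the nearest neighbour is "rail journeys", since its embedding points in almost the same direction as the query vector.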

@@ -73,18 +74,18 @@ The retrieved results are the key input to Retrieval Augmented Generation (RAG)
 
 ## Run a brute-force vector similarity search {#run-a-brute-force-vector-similarity-search}
 
-KNN (k - Nearest Neighbours) search or brute force search involves calculating distance of each vector in the dataset
+KNN (k-Nearest Neighbours) search or brute-force search involves calculating the distance of each vector in the dataset
 to the search embedding vector and then ordering the distances to get the nearest neighbours. With the `dbpedia` dataset,
 a quick technique to visually observe semantic search is to use embedding vectors from the dataset itself as search
-vectors. Example :
+vectors. For example:
 
-```sql
+```sql title="Query"
 SELECT id, title
 FROM dbpedia
 ORDER BY cosineDistance(vector, ( SELECT vector FROM dbpedia WHERE id = '<dbpedia:The_Remains_of_the_Day>') ) ASC
 LIMIT 20
+```
 
-┌─id────────────────────────────────────────┬─title───────────────────────────┐
+```response title="Response"
+┌─id────────────────────────────────────────┬─title───────────────────────────┐
 1. │ <dbpedia:The_Remains_of_the_Day> │ The Remains of the Day │
 2. │ <dbpedia:The_Remains_of_the_Day_(film)> │ The Remains of the Day (film) │
 3. │ <dbpedia:Never_Let_Me_Go_(novel)> │ Never Let Me Go (novel) │
@@ -106,16 +107,17 @@ LIMIT 20
 19. │ <dbpedia:Human_Remains_(film)> │ Human Remains (film) │
 20. │ <dbpedia:Kazuo_Ishiguro> │ Kazuo Ishiguro │
 └───────────────────────────────────────────┴─────────────────────────────────┘
+#highlight-next-line
 20 rows in set. Elapsed: 0.261 sec. Processed 1.00 million rows, 6.22 GB (3.84 million rows/s., 23.81 GB/s.)
 ```
 
 Note down the query latency so that we can compare it with the query latency of ANN (using vector index).
 Also record the query latency with cold OS file cache and with `max_threads=1` to recognize the real compute
 usage and storage bandwidth usage (extrapolate it to a production dataset with millions of vectors!)
 
-## Build Vector Similarity Index {#build-vector-similarity-index}
+## Build a vector similarity index {#build-vector-similarity-index}
 
-Run the following SQL to define and build a vector similarity index on the `vector` column :
+Run the following SQL to define and build a vector similarity index on the `vector` column:
 
 ```sql
 ALTER TABLE dbpedia ADD INDEX vector_index vector TYPE vector_similarity('hnsw', 'cosineDistance', 1536, 'bf16', 64, 512);
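The `'bf16'` argument in the `ALTER TABLE` above stores the indexed vectors in bfloat16 rather than float32, halving the index's memory footprint at a small precision cost. A rough Python sketch of what bfloat16 truncation loses (illustrative only; ClickHouse's internal conversion may round rather than truncate):

```python
import struct

def bf16_truncate(x: float) -> float:
    """Approximate bfloat16 storage: keep only the top 16 bits of a float32
    (sign + 8 exponent bits + 7 mantissa bits), zeroing the low 16 bits."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    (y,) = struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))
    return y

print(bf16_truncate(1.0))       # 1.0 — exactly representable in bfloat16
print(bf16_truncate(0.123456))  # close to 0.123456, relative error < 2**-7
```

Because bfloat16 keeps the full float32 exponent range, no vector component overflows or underflows; only the mantissa precision drops, which typically changes distance rankings very little.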
@@ -132,7 +134,7 @@ Building and saving the index could take a few minutes depending on number of CP
 
 _Approximate Nearest Neighbours_ or ANN refers to a group of techniques (e.g., special data structures like graphs and random forests) which compute results much faster than exact vector search. The result accuracy is typically "good enough" for practical use. Many approximate techniques provide parameters to tune the trade-off between the result accuracy and the search time.
 
-Once the vector similarity index has been built, vector search queries will automatically use the index :
+Once the vector similarity index has been built, vector search queries will automatically use the index:
 
 ```sql
 SELECT
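For intuition on why the brute-force latency recorded earlier grows linearly with table size while the indexed query does not: a brute-force `ORDER BY cosineDistance(...) LIMIT 20` must compute a distance for every row and keep only the k best, whereas HNSW visits just a small neighbourhood of its graph. A toy Python sketch of that full scan (hypothetical random data, not ClickHouse code):

```python
import heapq
import random

def cosine_distance(a, b):
    # 1 - cosine similarity, matching ClickHouse's cosineDistance.
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return 1.0 - dot / (na * nb)

random.seed(0)
dataset = [[random.random() for _ in range(8)] for _ in range(1000)]
query = dataset[42]  # reuse a dataset vector as the search vector

# Brute force touches every row: O(N) distance computations per query,
# keeping only the k best with a heap (the LIMIT 20 of the SQL query).
top = heapq.nsmallest(20, range(len(dataset)),
                      key=lambda i: cosine_distance(dataset[i], query))
print(top[0])  # 42 — the query vector is its own nearest neighbour
```

Doubling the dataset doubles the scan cost here, which is exactly the latency scaling the docs ask you to note before switching to the vector similarity index.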
