Commit 12617ac

1 parent 36eacc5 commit 12617ac

1 file changed: +131 -10 lines changed

docs/getting-started/example-datasets/dbpedia.md

Lines changed: 131 additions & 10 deletions
@@ -11,7 +11,7 @@ The dataset is an excellent starter dataset to understand vector embeddings, vec

## Dataset details {#dataset-details}

-The dataset consists of 26 `Parquet` files located under https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/. The files are named `0.parquet`, `1.parquet`, ..., `25.parquet`. To view some example rows of the dataset, please visit https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M.
+The dataset contains 26 `Parquet` files located under https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/. The files are named `0.parquet`, `1.parquet`, ..., `25.parquet`. To view some example rows of the dataset, please visit https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M.
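
For a quick look without downloading anything, the same files can also be queried in place with ClickHouse's `url()` table function. The following is an illustrative sketch only (column names are taken from the load queries further down this page):

```sql
-- Illustrative sketch: preview a few rows of the first Parquet file in place.
SELECT _id, title, text
FROM url('https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/0.parquet')
LIMIT 3
SETTINGS max_http_get_redirects = 5, enable_url_encoding = 0;
```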

## Create table {#create-table}

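The table definition itself falls outside the hunks shown in this diff. Purely for orientation, a minimal sketch that is consistent with the columns used by the queries below (`id`, `title`, `text`, and a 1536-dimensional `vector`) might look like the following; the actual definition on the page may differ:

```sql
-- Hypothetical sketch only: the real CREATE TABLE statement is not shown in this diff.
CREATE TABLE dbpedia
(
    id     String,
    title  String,
    text   String,
    vector Array(Float32)
)
ENGINE = MergeTree
ORDER BY id;
```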
@@ -30,12 +30,13 @@ CREATE TABLE dbpedia

## Load table {#load-table}

-To load the dataset from the Parquet files, run the following shell command :
+To load the dataset from all Parquet files, run the following shell command:

```shell
$ seq 0 25 | xargs -P1 -I{} clickhouse client -q "INSERT INTO dbpedia SELECT _id, title, text, \"text-embedding-3-large-1536-embedding\" FROM url('https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/{}.parquet') SETTINGS max_http_get_redirects=5,enable_url_encoding=0;"
```

-Alternatively, individual SQL statements can be run as shown below for each of the 25 Parquet files :
+Alternatively, individual SQL statements can be run as shown below to load each of the 26 Parquet files:

```sql
INSERT INTO dbpedia SELECT _id, title, text, "text-embedding-3-large-1536-embedding" FROM url('https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/0.parquet') SETTINGS max_http_get_redirects=5,enable_url_encoding=0;
@@ -45,22 +46,35 @@ INSERT INTO dbpedia SELECT _id, title, text, "text-embedding-3-large-1536-embedd

```

-## Semantic Search
+Verify that 1 million rows are present in the `dbpedia` table:

```sql
SELECT count(*)
FROM dbpedia

┌─count()─┐
1. │ 1000000 │
└─────────┘
```

+## Semantic Search {#semantic-search}

Recommended reading: https://platform.openai.com/docs/guides/embeddings

-Semantic search (or referred to as _similarity search_) using vector embeddings involes
+Semantic search (also referred to as _similarity search_) using vector embeddings involves
the following steps:

-- Accept a search query from user in natural language e.g Tell me some scenic rail journeys”, “Suspense novels set in Europe” etc
+- Accept a search query from the user in natural language, e.g. _“Tell me some scenic rail journeys”_, _“Suspense novels set in Europe”_, etc.
- Generate an embedding vector for the search query using the embedding model
- Find the nearest neighbours to the search embedding vector in the dataset

The _nearest neighbours_ are documents, images or content that are relevant to the user query.
-The results are the key input to Retrieval Augmented Generation (RAG) in Generative AI applications.
+The retrieved results are the key input to Retrieval Augmented Generation (RAG) in Generative AI applications.

+## Run a brute-force vector similarity search {#run-a-brute-force-vector-similarity-search}

-## Experiment with KNN Search
KNN (k-Nearest Neighbours) search, or brute-force search, involves calculating the distance of each vector in the dataset
-to the search embedding vector and then ordering the distances to get the nearest neighbours. With the `dbpediai` dataset,
+to the search embedding vector and then ordering the distances to get the nearest neighbours. With the `dbpedia` dataset,
a quick technique to visually observe semantic search is to use embedding vectors from the dataset itself as search
vectors. Example:

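The example query itself sits outside the hunks shown here. As a sketch of the kind of statement meant, it follows the same pattern as the ANN query shown further below, with the search vector pulled from an existing row (the row id below is a hypothetical placeholder):

```sql
-- Illustrative sketch only: the actual example query is not part of this diff.
SELECT id, title
FROM dbpedia
ORDER BY cosineDistance(vector, (
        SELECT vector
        FROM dbpedia
        WHERE id = '<dbpedia:Some_Entity>'   -- hypothetical placeholder id
    )) ASC
LIMIT 20
```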
@@ -92,14 +106,121 @@ LIMIT 20
19. │ <dbpedia:Human_Remains_(film)> │ Human Remains (film) │
20. │ <dbpedia:Kazuo_Ishiguro> │ Kazuo Ishiguro │
└───────────────────────────────────────────┴─────────────────────────────────┘
20 rows in set. Elapsed: 0.261 sec. Processed 1.00 million rows, 6.22 GB (3.84 million rows/s., 23.81 GB/s.)
```

Note down the query latency so that we can compare it with the query latency of ANN (using a vector index).
Also record the query latency with a cold OS file cache and with `max_threads=1` to recognize the real compute
usage and storage bandwidth usage (extrapolate it to a production dataset with millions of vectors!)
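
One way to do this, as a sketch (not part of the original page): drop ClickHouse's own caches and re-run the query restricted to a single thread via a `SETTINGS` clause. Clearing the OS page cache additionally needs OS-level commands, which are omitted here.

```sql
-- Sketch: expose the raw compute and I/O cost of the brute-force scan.
SYSTEM DROP MARK CACHE;
SYSTEM DROP UNCOMPRESSED CACHE;

SELECT id, title
FROM dbpedia
ORDER BY cosineDistance(vector, (
        SELECT vector
        FROM dbpedia
        WHERE id = '<dbpedia:Glacier_Express>'
    )) ASC
LIMIT 20
SETTINGS max_threads = 1;
```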

## Build Vector Similarity Index {#build-vector-similarity-index}

Run the following SQL to define and build a vector similarity index on the `vector` column:

```sql
ALTER TABLE dbpedia ADD INDEX vector_index vector TYPE vector_similarity('hnsw', 'cosineDistance', 1536, 'bf16', 64, 512);

ALTER TABLE dbpedia MATERIALIZE INDEX vector_index;
```

The parameters and performance considerations for index creation and search are described in the [documentation](../../engines/table-engines/mergetree-family/annindexes.md).

Building and saving the index could take a few minutes depending on the number of CPU cores available and the storage bandwidth.
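
As a sketch (not from the original page), the progress of the `MATERIALIZE INDEX` mutation and the size of the finished index can be checked via system tables:

```sql
-- Sketch: MATERIALIZE INDEX runs as a mutation, so unfinished work shows up in system.mutations.
SELECT command, parts_to_do, is_done
FROM system.mutations
WHERE table = 'dbpedia' AND is_done = 0;

-- After the build completes, the on-disk size of the index is visible here.
SELECT name, type, data_compressed_bytes
FROM system.data_skipping_indices
WHERE table = 'dbpedia';
```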

## Perform ANN search {#perform-ann-search}

_Approximate Nearest Neighbours_ or ANN refers to a group of techniques (e.g., special data structures like graphs and random forests) that compute results much faster than exact vector search. The result accuracy is typically "good enough" for practical use. Many approximate techniques provide parameters to tune the trade-off between result accuracy and search time.

Once the vector similarity index has been built, vector search queries will automatically use the index:

```sql
SELECT
    id,
    title
FROM dbpedia
ORDER BY cosineDistance(vector, (
        SELECT vector
        FROM dbpedia
        WHERE id = '<dbpedia:Glacier_Express>'
    )) ASC
LIMIT 20

┌─id──────────────────────────────────────────────┬─title─────────────────────────────────┐
1. │ <dbpedia:Glacier_Express> │ Glacier Express │
2. │ <dbpedia:BVZ_Zermatt-Bahn> │ BVZ Zermatt-Bahn │
3. │ <dbpedia:Gornergrat_railway> │ Gornergrat railway │
4. │ <dbpedia:RegioExpress> │ RegioExpress │
5. │ <dbpedia:Matterhorn_Gotthard_Bahn> │ Matterhorn Gotthard Bahn │
6. │ <dbpedia:Rhaetian_Railway> │ Rhaetian Railway │
7. │ <dbpedia:Gotthard_railway> │ Gotthard railway │
8. │ <dbpedia:Furka–Oberalp_railway> │ Furka–Oberalp railway │
9. │ <dbpedia:Jungfrau_railway> │ Jungfrau railway │
10. │ <dbpedia:Monte_Generoso_railway> │ Monte Generoso railway │
11. │ <dbpedia:Montreux–Oberland_Bernois_railway> │ Montreux–Oberland Bernois railway │
12. │ <dbpedia:Brienz–Rothorn_railway> │ Brienz–Rothorn railway │
13. │ <dbpedia:Lauterbrunnen–Mürren_mountain_railway> │ Lauterbrunnen–Mürren mountain railway │
14. │ <dbpedia:Luzern–Stans–Engelberg_railway_line> │ Luzern–Stans–Engelberg railway line │
15. │ <dbpedia:Rigi_Railways> │ Rigi Railways │
16. │ <dbpedia:Saint-Gervais–Vallorcine_railway> │ Saint-Gervais–Vallorcine railway │
17. │ <dbpedia:Gatwick_Express> │ Gatwick Express │
18. │ <dbpedia:Brünig_railway_line> │ Brünig railway line │
19. │ <dbpedia:Regional-Express> │ Regional-Express │
20. │ <dbpedia:Schynige_Platte_railway> │ Schynige Platte railway │
└─────────────────────────────────────────────────┴───────────────────────────────────────┘

20 rows in set. Elapsed: 0.025 sec. Processed 32.03 thousand rows, 2.10 MB (1.29 million rows/s., 84.80 MB/s.)
```

Compare the latency and I/O resource usage of the above query with the earlier query executed
using brute-force KNN.
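
To confirm that the query is served by the vector similarity index rather than a full scan, one option (a sketch, not part of the original page) is `EXPLAIN` with index information enabled:

```sql
-- Sketch: the output should list the skip index `vector_index` for the dbpedia table.
EXPLAIN indexes = 1
SELECT id, title
FROM dbpedia
ORDER BY cosineDistance(vector, (
        SELECT vector
        FROM dbpedia
        WHERE id = '<dbpedia:Glacier_Express>'
    )) ASC
LIMIT 20;
```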

## Generating embeddings for search query {#generating-embeddings-for-search-query}

The similarity search queries seen until now use one of the existing vectors in the `dbpedia`
table as the search vector. In real-world applications, the search vector has to be
generated for a user's input query, which could be in natural language. The search vector
should be generated using the same embedding model that was used to generate the embedding vectors
for the dataset.

An example Python script is listed below to demonstrate how to programmatically call OpenAI APIs to
generate embedding vectors using the `text-embedding-3-large` model. The search embedding vector
is then passed as an argument to the `cosineDistance()` function in the `SELECT` query.

Running the script requires an OpenAI API key to be set in the environment variable `OPENAI_API_KEY`.
The OpenAI API key can be obtained after registering at https://platform.openai.com.

```python
import sys

import clickhouse_connect
from openai import OpenAI

ch_client = clickhouse_connect.get_client(compress=False)  # Pass ClickHouse credentials here
openai_client = OpenAI()  # Requires the OPENAI_API_KEY environment variable


def get_embedding(text, model):
    # Generate the embedding vector for the given text using the OpenAI embeddings API
    text = text.replace("\n", " ")
    return openai_client.embeddings.create(input=[text], model=model, dimensions=1536).data[0].embedding


while True:
    # Accept the search query from the user
    print("Enter a search query:")
    input_query = sys.stdin.readline()

    # Call the OpenAI API endpoint to get the embedding
    print("Generating the embedding for", input_query)
    embedding = get_embedding(input_query, model='text-embedding-3-large')

    # Execute the vector search query in ClickHouse
    print("Querying ClickHouse...")
    params = {'v1': embedding, 'v2': 10}
    result = ch_client.query("SELECT id, title, text FROM dbpedia ORDER BY cosineDistance(vector, %(v1)s) LIMIT %(v2)s", parameters=params)

    for row in result.result_rows:
        print(row[0], row[1], row[2])
        print("---------------")
```

## Q & A Demo Application {#q-and-a-demo-application}

@@ -116,7 +237,7 @@ The application performs the following steps :

A couple of conversation examples from running the Q & A application are listed first, followed by the code
for the Q & A application. Running the application requires an OpenAI API key to be set in the environment
-variable `OPENAI_API_KEY`. The OpenAI API key can be obtained after registering at https://platform.openapi.com.
+variable `OPENAI_API_KEY`. The OpenAI API key can be obtained after registering at https://platform.openai.com.

```shell
$ python3 QandA.py
