Commit 1b48773

Merge pull request #4305 from shankar-iyer/add_dbpedia_dataset

Add dbpedia dataset to examples

2 parents 13bfedd + 25dd5ac commit 1b48773

1 file changed: +325 -0 lines changed

@@ -0,0 +1,325 @@
---
description: 'Dataset containing 1 million articles from Wikipedia and their vector embeddings'
sidebar_label: 'dbpedia dataset'
slug: /getting-started/example-datasets/dbpedia-dataset
title: 'dbpedia dataset'
keywords: ['semantic search', 'vector similarity', 'approximate nearest neighbours', 'embeddings']
---

The [dbpedia dataset](https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M) contains 1 million articles from Wikipedia and their vector embeddings generated using the [text-embedding-3-large](https://platform.openai.com/docs/models/text-embedding-3-large) model from OpenAI.

The dataset is an excellent starter dataset to understand vector embeddings, vector similarity search and Generative AI. We use this dataset to demonstrate [approximate nearest neighbor search](../../engines/table-engines/mergetree-family/annindexes.md) in ClickHouse and a simple but powerful Q&A application.

## Dataset details {#dataset-details}

The dataset contains 26 `Parquet` files located on [huggingface.co](https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/). The files are named `0.parquet`, `1.parquet`, ..., `25.parquet`. To view some example rows of the dataset, please visit this [Hugging Face page](https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M).

## Create table {#create-table}

Create the `dbpedia` table to store the article id, title, text and embedding vector:

```sql
CREATE TABLE dbpedia
(
    id      String,
    title   String,
    text    String,
    vector  Array(Float32) CODEC(NONE)
) ENGINE = MergeTree ORDER BY (id);
```

## Load table {#load-table}

To load the dataset from all Parquet files, run the following shell command:

```shell
$ seq 0 25 | xargs -P1 -I{} clickhouse client -q "INSERT INTO dbpedia SELECT _id, title, text, \"text-embedding-3-large-1536-embedding\" FROM url('https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/{}.parquet') SETTINGS max_http_get_redirects=5,enable_url_encoding=0;"
```

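If you want to inspect the Parquet schema or preview a few rows before loading anything, here is a quick sketch that uses the same `url()` source and settings as the command above:

```sql
-- Show the columns of the first Parquet file (the same column names are used in the INSERT above)
DESCRIBE TABLE url('https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/0.parquet')
SETTINGS max_http_get_redirects = 5, enable_url_encoding = 0;

-- Preview a few rows without loading them
SELECT _id, title
FROM url('https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/0.parquet')
LIMIT 3
SETTINGS max_http_get_redirects = 5, enable_url_encoding = 0;
```
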
Alternatively, individual SQL statements can be run as shown below to load each of the 26 Parquet files:

```sql
INSERT INTO dbpedia SELECT _id, title, text, "text-embedding-3-large-1536-embedding" FROM url('https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/0.parquet') SETTINGS max_http_get_redirects=5,enable_url_encoding=0;
INSERT INTO dbpedia SELECT _id, title, text, "text-embedding-3-large-1536-embedding" FROM url('https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/1.parquet') SETTINGS max_http_get_redirects=5,enable_url_encoding=0;
...
INSERT INTO dbpedia SELECT _id, title, text, "text-embedding-3-large-1536-embedding" FROM url('https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/25.parquet') SETTINGS max_http_get_redirects=5,enable_url_encoding=0;
```

Verify that the `dbpedia` table contains 1 million rows:

```sql title="Query"
SELECT count(*)
FROM dbpedia
```

```response title="Response"
   ┌─count()─┐
1. │ 1000000 │
   └─────────┘
```

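As an extra sanity check (not part of the original loading steps), you can also confirm that every stored embedding has the expected 1536 dimensions:

```sql
SELECT length(vector) AS dimensions, count() AS cnt
FROM dbpedia
GROUP BY length(vector);
```
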
## Semantic search {#semantic-search}

Recommended reading: ["Vector embeddings" OpenAI guide](https://platform.openai.com/docs/guides/embeddings)

Semantic search (also referred to as _similarity search_) using vector embeddings involves the following steps:

- Accept a search query from a user in natural language, e.g. _“Tell me about some scenic rail journeys”_, _“Suspense novels set in Europe”_, etc.
- Generate an embedding vector for the search query using the LLM
- Find the nearest neighbours to the search embedding vector in the dataset

The _nearest neighbours_ are documents, images or content that are relevant to the user's query.
The retrieved results are the key input to Retrieval Augmented Generation (RAG) in Generative AI applications.

## Run a brute-force vector similarity search {#run-a-brute-force-vector-similarity-search}

KNN (k-Nearest Neighbours) search, or brute-force search, involves calculating the distance of each vector in the dataset
to the search embedding vector and then ordering the distances to get the nearest neighbours. With the `dbpedia` dataset,
a quick technique to visually observe semantic search is to use embedding vectors from the dataset itself as search
vectors. For example:

```sql title="Query"
SELECT id, title
FROM dbpedia
ORDER BY cosineDistance(vector, ( SELECT vector FROM dbpedia WHERE id = '<dbpedia:The_Remains_of_the_Day>') ) ASC
LIMIT 20
```

```response title="Response"
┌─id────────────────────────────────────────┬─title───────────────────────────┐
 1. │ <dbpedia:The_Remains_of_the_Day> │ The Remains of the Day │
 2. │ <dbpedia:The_Remains_of_the_Day_(film)> │ The Remains of the Day (film) │
 3. │ <dbpedia:Never_Let_Me_Go_(novel)> │ Never Let Me Go (novel) │
 4. │ <dbpedia:Last_Orders> │ Last Orders │
 5. │ <dbpedia:The_Unconsoled> │ The Unconsoled │
 6. │ <dbpedia:The_Hours_(novel)> │ The Hours (novel) │
 7. │ <dbpedia:An_Artist_of_the_Floating_World> │ An Artist of the Floating World │
 8. │ <dbpedia:Heat_and_Dust> │ Heat and Dust │
 9. │ <dbpedia:A_Pale_View_of_Hills> │ A Pale View of Hills │
10. │ <dbpedia:Howards_End_(film)> │ Howards End (film) │
11. │ <dbpedia:When_We_Were_Orphans> │ When We Were Orphans │
12. │ <dbpedia:A_Passage_to_India_(film)> │ A Passage to India (film) │
13. │ <dbpedia:Memoirs_of_a_Survivor> │ Memoirs of a Survivor │
14. │ <dbpedia:The_Child_in_Time> │ The Child in Time │
15. │ <dbpedia:The_Sea,_the_Sea> │ The Sea, the Sea │
16. │ <dbpedia:The_Master_(novel)> │ The Master (novel) │
17. │ <dbpedia:The_Memorial> │ The Memorial │
18. │ <dbpedia:The_Hours_(film)> │ The Hours (film) │
19. │ <dbpedia:Human_Remains_(film)> │ Human Remains (film) │
20. │ <dbpedia:Kazuo_Ishiguro> │ Kazuo Ishiguro │
└───────────────────────────────────────────┴─────────────────────────────────┘
#highlight-next-line
20 rows in set. Elapsed: 0.261 sec. Processed 1.00 million rows, 6.22 GB (3.84 million rows/s., 23.81 GB/s.)
```

Note down the query latency so that we can compare it with the query latency of ANN (using the vector index).
Also record the query latency with a cold OS file cache and with `max_threads=1` to recognize the real compute
usage and storage bandwidth usage (extrapolate it to a production dataset with millions of vectors!).

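For example, a minimal sketch of the single-threaded measurement (clearing the OS file cache is an OS-level step on the server and is not shown here):

```sql
SELECT id, title
FROM dbpedia
ORDER BY cosineDistance(vector, (
    SELECT vector FROM dbpedia WHERE id = '<dbpedia:The_Remains_of_the_Day>'
)) ASC
LIMIT 20
SETTINGS max_threads = 1;
```
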
## Build a vector similarity index {#build-vector-similarity-index}

Run the following SQL to define and build a vector similarity index on the `vector` column:

```sql
ALTER TABLE dbpedia ADD INDEX vector_index vector TYPE vector_similarity('hnsw', 'cosineDistance', 1536, 'bf16', 64, 512);

ALTER TABLE dbpedia MATERIALIZE INDEX vector_index SETTINGS mutations_sync = 2;
```

The parameters and performance considerations for index creation and search are described in the [documentation](../../engines/table-engines/mergetree-family/annindexes.md).

Building and saving the index could take a few minutes, depending on the number of available CPU cores and the storage bandwidth.

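To confirm that the index exists and to see how much space it occupies on disk, one option is to query the `system.data_skipping_indices` system table (a sketch; the available columns can vary slightly between ClickHouse versions):

```sql
SELECT name, type, data_compressed_bytes
FROM system.data_skipping_indices
WHERE table = 'dbpedia';
```
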
## Perform ANN search {#perform-ann-search}

_Approximate Nearest Neighbours_ or ANN refers to a group of techniques (e.g., special data structures like graphs and random forests) which compute results much faster than exact vector search. The result accuracy is typically "good enough" for practical use. Many approximate techniques provide parameters to tune the trade-off between the result accuracy and the search time.

Once the vector similarity index has been built, vector search queries will automatically use the index:

```sql title="Query"
SELECT
    id,
    title
FROM dbpedia
ORDER BY cosineDistance(vector, (
    SELECT vector
    FROM dbpedia
    WHERE id = '<dbpedia:Glacier_Express>'
)) ASC
LIMIT 20
```

```response title="Response"
┌─id──────────────────────────────────────────────┬─title─────────────────────────────────┐
 1. │ <dbpedia:Glacier_Express> │ Glacier Express │
 2. │ <dbpedia:BVZ_Zermatt-Bahn> │ BVZ Zermatt-Bahn │
 3. │ <dbpedia:Gornergrat_railway> │ Gornergrat railway │
 4. │ <dbpedia:RegioExpress> │ RegioExpress │
 5. │ <dbpedia:Matterhorn_Gotthard_Bahn> │ Matterhorn Gotthard Bahn │
 6. │ <dbpedia:Rhaetian_Railway> │ Rhaetian Railway │
 7. │ <dbpedia:Gotthard_railway> │ Gotthard railway │
 8. │ <dbpedia:Furka–Oberalp_railway> │ Furka–Oberalp railway │
 9. │ <dbpedia:Jungfrau_railway> │ Jungfrau railway │
10. │ <dbpedia:Monte_Generoso_railway> │ Monte Generoso railway │
11. │ <dbpedia:Montreux–Oberland_Bernois_railway> │ Montreux–Oberland Bernois railway │
12. │ <dbpedia:Brienz–Rothorn_railway> │ Brienz–Rothorn railway │
13. │ <dbpedia:Lauterbrunnen–Mürren_mountain_railway> │ Lauterbrunnen–Mürren mountain railway │
14. │ <dbpedia:Luzern–Stans–Engelberg_railway_line> │ Luzern–Stans–Engelberg railway line │
15. │ <dbpedia:Rigi_Railways> │ Rigi Railways │
16. │ <dbpedia:Saint-Gervais–Vallorcine_railway> │ Saint-Gervais–Vallorcine railway │
17. │ <dbpedia:Gatwick_Express> │ Gatwick Express │
18. │ <dbpedia:Brünig_railway_line> │ Brünig railway line │
19. │ <dbpedia:Regional-Express> │ Regional-Express │
20. │ <dbpedia:Schynige_Platte_railway> │ Schynige Platte railway │
└─────────────────────────────────────────────────┴───────────────────────────────────────┘
#highlight-next-line
20 rows in set. Elapsed: 0.025 sec. Processed 32.03 thousand rows, 2.10 MB (1.29 million rows/s., 84.80 MB/s.)
```

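To verify that the query above is served by the vector similarity index rather than a brute-force scan, you can inspect the query plan (a sketch; the exact plan output depends on the ClickHouse version):

```sql
EXPLAIN indexes = 1
SELECT id, title
FROM dbpedia
ORDER BY cosineDistance(vector, (
    SELECT vector FROM dbpedia WHERE id = '<dbpedia:Glacier_Express>'
)) ASC
LIMIT 20;
```
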
## Generating embeddings for search query {#generating-embeddings-for-search-query}

The similarity search queries seen until now use one of the existing vectors in the `dbpedia`
table as the search vector. In real-world applications, the search vector has to be
generated for a user input query, which could be in natural language. The search vector
should be generated using the same LLM model that was used to generate the embedding vectors
for the dataset.

An example Python script is listed below to demonstrate how to programmatically call the OpenAI API to
generate embedding vectors using the `text-embedding-3-large` model. The search embedding vector
is then passed as an argument to the `cosineDistance()` function in the `SELECT` query.

Running the script requires an OpenAI API key to be set in the environment variable `OPENAI_API_KEY`.
The OpenAI API key can be obtained after registering at https://platform.openai.com.

```python
import sys
from openai import OpenAI
import clickhouse_connect

ch_client = clickhouse_connect.get_client(compress=False) # Pass ClickHouse credentials
openai_client = OpenAI() # Set OPENAI_API_KEY environment variable

def get_embedding(text, model):
    text = text.replace("\n", " ")
    return openai_client.embeddings.create(input = [text], model=model, dimensions=1536).data[0].embedding


while True:
    # Accept the search query from user
    print("Enter a search query :")
    input_query = sys.stdin.readline()

    # Call OpenAI API endpoint to get the embedding
    print("Generating the embedding for ", input_query)
    embedding = get_embedding(input_query,
                              model='text-embedding-3-large')

    # Execute vector search query in ClickHouse
    print("Querying clickhouse...")
    params = {'v1': embedding, 'v2': 10}
    result = ch_client.query("SELECT id,title,text FROM dbpedia ORDER BY cosineDistance(vector, %(v1)s) LIMIT %(v2)s", parameters=params)

    for row in result.result_rows:
        print(row[0], row[1], row[2])
        print("---------------")
```

## Q&A demo application {#q-and-a-demo-application}

The examples above demonstrated semantic search and document retrieval using ClickHouse. A very simple but powerful example generative AI application is presented next.

The application performs the following steps:

1. Accepts a _topic_ as input from the user
2. Generates an embedding vector for the _topic_ by invoking the OpenAI API with the `text-embedding-3-large` model
3. Retrieves highly relevant Wikipedia articles/documents using vector similarity search on the `dbpedia` table
4. Accepts a free-form question in natural language from the user relating to the _topic_
5. Uses the OpenAI `gpt-3.5-turbo` Chat API to answer the question based on the knowledge in the documents retrieved in step #3.
   The documents retrieved in step #3 are passed as _context_ to the Chat API and are the key link in Generative AI.

A couple of example conversations from running the Q&A application are listed below, followed by the code
for the Q&A application. Running the application requires an OpenAI API key to be set in the environment
variable `OPENAI_API_KEY`. The OpenAI API key can be obtained after registering at https://platform.openai.com.

```shell
$ python3 QandA.py

Enter a topic : FIFA world cup 1990
Generating the embedding for 'FIFA world cup 1990' and collecting 100 articles related to it from ClickHouse...

Enter your question : Who won the golden boot
Salvatore Schillaci of Italy won the Golden Boot at the 1990 FIFA World Cup.


Enter a topic : Cricket world cup
Generating the embedding for 'Cricket world cup' and collecting 100 articles related to it from ClickHouse...

Enter your question : Which country has hosted the world cup most times
England and Wales have hosted the Cricket World Cup the most times, with the tournament being held in these countries five times - in 1975, 1979, 1983, 1999, and 2019.

$
```

Code:

```python
import sys
import time
from openai import OpenAI
import clickhouse_connect

ch_client = clickhouse_connect.get_client(compress=False) # Pass ClickHouse credentials here
openai_client = OpenAI() # Set the OPENAI_API_KEY environment variable

def get_embedding(text, model):
    text = text.replace("\n", " ")
    return openai_client.embeddings.create(input = [text], model=model, dimensions=1536).data[0].embedding

while True:
    # Take the topic of interest from the user
    print("Enter a topic : ", end="", flush=True)
    input_query = sys.stdin.readline()
    input_query = input_query.rstrip()

    # Generate an embedding vector for the search topic and query ClickHouse
    print("Generating the embedding for '" + input_query + "' and collecting 100 articles related to it from ClickHouse...")
    embedding = get_embedding(input_query,
                              model='text-embedding-3-large')

    params = {'v1': embedding, 'v2': 100}
    result = ch_client.query("SELECT id,title,text FROM dbpedia ORDER BY cosineDistance(vector, %(v1)s) LIMIT %(v2)s", parameters=params)

    # Collect all the matching articles/documents
    results = ""
    for row in result.result_rows:
        results = results + row[2]

    print("\nEnter your question : ", end="", flush=True)
    question = sys.stdin.readline()

    # Prompt for the OpenAI Chat API
    query = f"""Use the below content to answer the subsequent question. If the answer cannot be found, write "I don't know."

Content:
\"\"\"
{results}
\"\"\"

Question: {question}"""

    GPT_MODEL = "gpt-3.5-turbo"
    response = openai_client.chat.completions.create(
        messages=[
            {'role': 'system', 'content': f"You answer questions about {input_query}."},
            {'role': 'user', 'content': query},
        ],
        model=GPT_MODEL,
        temperature=0,
    )

    # Print the answer to the question!
    print(response.choices[0].message.content)
    print("\n")
```
