
Commit 2f2cd34

committed
save
1 parent 2884cc7 commit 2f2cd34

1 file changed: +59 -78 lines changed

docs/getting-started/example-datasets/dbpedia.md

Lines changed: 59 additions & 78 deletions
@@ -7,21 +7,11 @@ title: 'dbpedia dataset'
The [dbpedia dataset](https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M) contains 1 million articles from Wikipedia and their vector embeddings generated using the `text-embedding-3-large` model from OpenAI.

The dataset is an excellent starter dataset for understanding vector embeddings, vector similarity search and Generative AI. We use this dataset to demonstrate [approximate nearest neighbor search](../../engines/table-engines/mergetree-family/annindexes.md) in ClickHouse and a simple but powerful Q & A application.

## Dataset details {#dataset-details}

The dataset consists of 26 `Parquet` files located under https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/. The files are named `0.parquet`, `1.parquet`, ..., `25.parquet`. To view some example rows of the dataset, please visit https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M.

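To peek at a few rows programmatically instead of using the dataset page, a small Python sketch like the following can be used. It assumes the Hugging Face `datasets` package (`pip install datasets`), which is not otherwise required for this guide:

```python
from datasets import load_dataset  # assumption: Hugging Face datasets package is installed

# Stream the dataset instead of downloading all 26 Parquet files up front
ds = load_dataset(
    "Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M",
    split="train",
    streaming=True,
)

row = next(iter(ds))
print(row["_id"], row["title"])
print(len(row["text-embedding-3-large-1536-embedding"]))  # 1536-dimensional embedding
```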
## Create table {#create-table}

@@ -38,87 +28,78 @@ CREATE TABLE dbpedia

## Load table {#load-table}

To load the dataset from all the Parquet files, a shell loop along the following lines can be used (a sketch that runs one `INSERT` per file with `clickhouse-client`; adjust connection options as needed):

```shell
# Sketch: insert each of the 26 Parquet files via the url() table function
for i in $(seq 0 25); do
  clickhouse-client --query "INSERT INTO dbpedia SELECT _id, title, text, \"text-embedding-3-large-1536-embedding\" FROM url('https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/$i.parquet') SETTINGS max_http_get_redirects=5, enable_url_encoding=0"
done
```

Alternatively, individual SQL statements can be run as shown below for each of the 26 Parquet files:

```sql
INSERT INTO dbpedia SELECT _id, title, text, "text-embedding-3-large-1536-embedding" FROM url('https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/0.parquet') SETTINGS max_http_get_redirects=5, enable_url_encoding=0;
INSERT INTO dbpedia SELECT _id, title, text, "text-embedding-3-large-1536-embedding" FROM url('https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/1.parquet') SETTINGS max_http_get_redirects=5, enable_url_encoding=0;
...
INSERT INTO dbpedia SELECT _id, title, text, "text-embedding-3-large-1536-embedding" FROM url('https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/25.parquet') SETTINGS max_http_get_redirects=5, enable_url_encoding=0;
```

## Semantic Search {#semantic-search}

Recommended reading: https://platform.openai.com/docs/guides/embeddings
Semantic search (also referred to as _similarity search_) using vector embeddings involves the following steps:

- Accept a search query from the user in natural language, e.g. "Tell me some scenic rail journeys", "Suspense novels set in Europe", etc.
- Generate an embedding vector for the search query using the embedding model
- Find the nearest neighbours to the search embedding vector in the dataset

The _nearest neighbours_ are documents, images or other content relevant to the user query.
These results are the key input to Retrieval Augmented Generation (RAG) in Generative AI applications.

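A minimal Python sketch of these steps, assuming the `openai` and `clickhouse-connect` packages, a ClickHouse server on `localhost` with the `dbpedia` table loaded, and `OPENAI_API_KEY` set in the environment (the example query and connection details are illustrative assumptions):

```python
import clickhouse_connect
from openai import OpenAI

# Assumptions: OPENAI_API_KEY is set and ClickHouse is reachable on localhost
openai_client = OpenAI()
chclient = clickhouse_connect.get_client(host="localhost")

search_query = "Suspense novels set in Europe"

# Generate the embedding vector for the search query, matching the
# 1536-dimensional text-embedding-3-large vectors stored in the dataset
embedding = openai_client.embeddings.create(
    model="text-embedding-3-large",
    input=search_query,
    dimensions=1536,
).data[0].embedding

# Find the nearest neighbours with a brute-force cosine-distance scan
embedding_literal = "[" + ",".join(str(x) for x in embedding) + "]"
sql = (
    "SELECT title, text FROM dbpedia "
    f"ORDER BY cosineDistance(vector, {embedding_literal}) ASC "
    "LIMIT 5"
)
for title, text in chclient.query(sql).result_rows:
    print(title)
```

The same brute-force `cosineDistance` ordering can also be run entirely in SQL, as shown in the next section.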
## Experiment with KNN Search {#experiment-with-knn-search}

KNN (k-Nearest Neighbours) search, or brute force search, involves calculating the distance of each vector in the dataset to the search embedding vector and then ordering the distances to get the nearest neighbours. With the `dbpedia` dataset, a quick technique to visually observe semantic search is to use embedding vectors from the dataset itself as search vectors. For example:

```sql
SELECT id, title
FROM dbpedia
ORDER BY cosineDistance(vector, ( SELECT vector FROM dbpedia WHERE id = '<dbpedia:The_Remains_of_the_Day>') ) ASC
LIMIT 20

┌─id────────────────────────────────────────┬─title───────────────────────────┐
 1. │ <dbpedia:The_Remains_of_the_Day> │ The Remains of the Day │
 2. │ <dbpedia:The_Remains_of_the_Day_(film)> │ The Remains of the Day (film) │
 3. │ <dbpedia:Never_Let_Me_Go_(novel)> │ Never Let Me Go (novel) │
 4. │ <dbpedia:Last_Orders> │ Last Orders │
 5. │ <dbpedia:The_Unconsoled> │ The Unconsoled │
 6. │ <dbpedia:The_Hours_(novel)> │ The Hours (novel) │
 7. │ <dbpedia:An_Artist_of_the_Floating_World> │ An Artist of the Floating World │
 8. │ <dbpedia:Heat_and_Dust> │ Heat and Dust │
 9. │ <dbpedia:A_Pale_View_of_Hills> │ A Pale View of Hills │
10. │ <dbpedia:Howards_End_(film)> │ Howards End (film) │
11. │ <dbpedia:When_We_Were_Orphans> │ When We Were Orphans │
12. │ <dbpedia:A_Passage_to_India_(film)> │ A Passage to India (film) │
13. │ <dbpedia:Memoirs_of_a_Survivor> │ Memoirs of a Survivor │
14. │ <dbpedia:The_Child_in_Time> │ The Child in Time │
15. │ <dbpedia:The_Sea,_the_Sea> │ The Sea, the Sea │
16. │ <dbpedia:The_Master_(novel)> │ The Master (novel) │
17. │ <dbpedia:The_Memorial> │ The Memorial │
18. │ <dbpedia:The_Hours_(film)> │ The Hours (film) │
19. │ <dbpedia:Human_Remains_(film)> │ Human Remains (film) │
20. │ <dbpedia:Kazuo_Ishiguro> │ Kazuo Ishiguro │
└───────────────────────────────────────────┴─────────────────────────────────┘
```

Note down the query latency so that we can compare it with the query latency of ANN search (using a vector index).
Also record the query latency with a cold OS file cache and with `max_threads=1` to recognize the real compute
and storage bandwidth usage (and extrapolate it to a production dataset with millions of vectors!).

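A minimal Python sketch for timing the query, assuming the `clickhouse-connect` package and a ClickHouse server on `localhost` (dropping the OS file cache for the cold-cache measurement still has to be done separately):

```python
import time
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

sql = """
SELECT id, title
FROM dbpedia
ORDER BY cosineDistance(vector, (SELECT vector FROM dbpedia WHERE id = '<dbpedia:The_Remains_of_the_Day>')) ASC
LIMIT 20
"""

# Time the brute-force scan with default settings and with a single thread
for settings in ({}, {"max_threads": 1}):
    start = time.monotonic()
    client.query(sql, settings=settings)
    print(settings, f"{time.monotonic() - start:.3f}s")
```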

## Q & A Demo Application {#q-and-a-demo-application}

@@ -133,9 +114,9 @@ The application performs the following steps :
5. Uses the OpenAI `gpt-3.5-turbo` Chat API to answer the question based on the knowledge in the documents retrieved in step #3.
The documents retrieved in step #3 are passed as _context_ to the Chat API and are the key link in Generative AI.

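A minimal sketch of step #5, assuming the `openai` package; the actual `QandA.py` code follows later and may differ:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer(question: str, retrieved_docs: list[str]) -> str:
    # Sketch of step 5: pass the documents retrieved in step 3 as context to the Chat API
    context = "\n\n".join(retrieved_docs)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer the question using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```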
A couple of conversation examples from running the Q & A application are listed first below, followed by the code
for the Q & A application. Running the application requires an OpenAI API key to be set in the environment
variable `OPENAI_API_KEY`. The OpenAI API key can be obtained after registering at https://platform.openai.com.

```shell
$ python3 QandA.py
