# dbpedia dataset
The [dbpedia dataset](https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M) contains 1 million articles from Wikipedia and their vector embeddings, generated using the `text-embedding-3-large` model from OpenAI.
The dataset is an excellent starter dataset for understanding vector embeddings, vector similarity search, and Generative AI. We use this dataset to demonstrate [approximate nearest neighbor search](../../engines/table-engines/mergetree-family/annindexes.md) in ClickHouse and a simple but powerful Q & A application.
## Dataset details {#dataset-details}
The dataset consists of 26 `Parquet` files located under https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/. The files are named `0.parquet`, `1.parquet`, ..., `25.parquet`. To view some example rows of the dataset, please visit https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M.
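Before loading, the files can be inspected in place with ClickHouse's `url()` table function. The queries below are illustrative sketches (the `DESCRIBE` form and `Parquet` format name are standard ClickHouse; the column names come from the dataset):

```sql
-- Inspect the schema of the first Parquet file without downloading it.
DESCRIBE TABLE url('https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/0.parquet', 'Parquet');

-- Sample one row (the embedding column holds 1536 floats, so it is wide).
SELECT _id, title, substring(text, 1, 80) AS text_preview
FROM url('https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/0.parquet', 'Parquet')
LIMIT 1;
```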
## Create table {#create-table}
First, create a table to hold the article id, title, text, and embedding vector. The statement below is a sketch reconstructed from the dataset schema; the exact column types and table settings may differ:

```sql
-- Sketch: one row per article, with its 1536-dimensional embedding.
CREATE TABLE dbpedia
(
    id     String,
    title  String,
    text   String,
    vector Array(Float32)
)
ENGINE = MergeTree
ORDER BY id;
```
## Load table {#load-table}
To load the dataset from the Parquet files, run a shell command along the following lines (the loop shown is an illustrative reconstruction):
```shell
# Illustrative reconstruction of the load loop; verify the column name
# "text-embedding-3-large-1536-embedding" against the dataset before running.
seq 0 25 | xargs -P1 -I{} clickhouse client --query "
    INSERT INTO dbpedia
    SELECT _id, title, text, \"text-embedding-3-large-1536-embedding\"
    FROM url('https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/{}.parquet', 'Parquet')"
```
Alternatively, individual SQL statements can be run, as sketched below, for each of the 26 Parquet files:
```sql
-- Sketch: load one file at a time, repeating for 0.parquet through 25.parquet.
-- Column order matches the table definition above (id, title, text, vector).
INSERT INTO dbpedia
SELECT _id, title, text, "text-embedding-3-large-1536-embedding"
FROM url('https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/0.parquet', 'Parquet');
```
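With the table populated, a brute-force (exact) nearest neighbor search scans all one million vectors. The query below is a sketch against the table defined above; the `cosineDistance` function, the `search_vector` parameter name, and the `LIMIT` are illustrative choices:

```sql
-- Exact (full-scan) nearest neighbor search; no index is used.
-- {search_vector: ...} is a client-side parameter holding a
-- 1536-dimensional query embedding.
SELECT id, title
FROM dbpedia
ORDER BY cosineDistance(vector, {search_vector:Array(Float32)})
LIMIT 10;
```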
Note down the query latency so that we can compare it with the query latency of an ANN search (using a vector index). Also record the query latency with a cold OS file cache and with `max_threads=1` to gauge the real compute and storage bandwidth usage (and extrapolate the numbers to a production dataset with millions of vectors!).
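To obtain the ANN latency for that comparison, a vector index can be added to the table. The statements below are a sketch: the `vector_similarity` index type and its arguments reflect recent ClickHouse versions and may need adjusting (see the [vector index documentation](../../engines/table-engines/mergetree-family/annindexes.md)):

```sql
-- Sketch: enable and add an HNSW vector similarity index
-- (experimental; syntax varies across ClickHouse versions).
SET allow_experimental_vector_similarity_index = 1;

ALTER TABLE dbpedia
    ADD INDEX vector_index vector TYPE vector_similarity('hnsw', 'cosineDistance', 1536);

-- Build the index for already-inserted data.
ALTER TABLE dbpedia MATERIALIZE INDEX vector_index;
```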
## Q & A Demo Application {#q-and-a-demo-application}
The application performs the following steps:
5. Uses the OpenAI `gpt-3.5-turbo` Chat API to answer the question based on the knowledge in the documents retrieved in step #3.
The documents retrieved in step #3 are passed as _context_ to the Chat API and are the key link in Generative AI.
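A minimal sketch of these steps is shown below, assuming the `openai` and `clickhouse-connect` Python packages, a local ClickHouse server with the `dbpedia` table from above, and an illustrative example question (the full application code follows later):

```python
import os
import clickhouse_connect
from openai import OpenAI

# Assumptions: OPENAI_API_KEY is set, and ClickHouse runs on localhost.
ai = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
ch = clickhouse_connect.get_client(host="localhost")

question = "What is the capital of Iceland?"

# Step 2: embed the question with the same model used for the dataset.
embedding = ai.embeddings.create(
    model="text-embedding-3-large", input=question, dimensions=1536
).data[0].embedding

# Step 3: retrieve the most relevant documents by vector similarity.
rows = ch.query(
    "SELECT text FROM dbpedia ORDER BY cosineDistance(vector, %(v)s) LIMIT 5",
    parameters={"v": embedding},
).result_rows

# Step 5: answer the question using the retrieved documents as context.
context = "\n".join(row[0] for row in rows)
reply = ai.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": f"Answer using this context:\n{context}"},
        {"role": "user", "content": question},
    ],
)
print(reply.choices[0].message.content)
```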
A couple of example conversations from running the Q & A application are listed below, followed by the code for the application. Running the application requires an OpenAI API key to be set in the environment variable `OPENAI_API_KEY`. The API key can be obtained after registering at https://platform.openai.com.
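For example, in a shell session (with a placeholder key):

```shell
export OPENAI_API_KEY="sk-..."
```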