| 1 | +--- |
| 2 | +description: 'Dataset containing 1 million articles from Wikipedia and their vector embeddings'
| 3 | +sidebar_label: 'dbpedia dataset' |
| 4 | +slug: /getting-started/example-datasets/dbpedia-dataset |
| 5 | +title: 'dbpedia dataset' |
| 6 | +--- |
| 7 | + |
| 8 | +The [dbpedia dataset](https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M) contains 1 million articles from Wikipedia and their vector embeddings generated using the `text-embedding-3-large` model from OpenAI.
| 9 | + |
| 10 | +The dataset is an excellent starter dataset for understanding semantic search, vector embeddings, and Generative AI. We use this dataset to demonstrate [approximate nearest neighbor search](../../engines/table-engines/mergetree-family/annindexes.md) in ClickHouse and a simple Q & A application.
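Semantic search ranks documents by the distance between their embedding vectors. As a minimal, self-contained illustration (not part of the dataset pipeline), cosine distance, the metric used by the Q & A application later in this article, can be computed as:

```python
import math

def cosine_distance(a, b):
    # Cosine distance is 1 minus cosine similarity; identical
    # directions give 0.0, orthogonal vectors give 1.0.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0 (same direction)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal)
```

ClickHouse exposes the same metric as the `cosineDistance()` SQL function.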
| 11 | + |
| 12 | +## Data preparation {#data-preparation} |
| 13 | + |
| 14 | +The dataset consists of 26 `Parquet` files hosted in the Hugging Face repository linked above. A `download.sh` script downloads each file,
| 15 | +converts it to CSV and imports it into ClickHouse. You can invoke it for all files as follows:
| 16 | + |
| 17 | + |
| 18 | +```bash |
| 19 | +seq 0 25 | xargs -P1 -I{} bash -c './download.sh {}'
| 20 | +``` |
| 21 | + |
| 22 | +The dataset is split into 26 files which together contain ca. 1 million rows. If you would like to work with a smaller subset of the data, simply adjust the limits, e.g. `seq 0 9 | ...`.
| 23 | + |
| 24 | +(The download and conversion can be slow and memory-intensive, and the resulting CSV files are large, so be careful. If you have enough RAM, increase the `-P1` value for more parallelism. If this is still too slow, consider a different ingestion procedure, e.g. loading the Parquet files into ClickHouse directly.)
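When converting a Parquet row to CSV, the embedding must be serialized in ClickHouse's array-literal form (`[x1,x2,...]`) so that it parses into an `Array(Float32)` column. A hedged sketch of serializing one row; the column layout and the helper name are assumptions, so check them against the actual Parquet schema:

```python
import csv
import io

def to_csv_line(row_id, title, text, vector):
    # ClickHouse parses a CSV field like "[0.1,0.2,...]" as Array(Float32).
    # csv.writer quotes the field automatically because it contains commas.
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow([row_id, title, text,
                     "[" + ",".join(repr(v) for v in vector) + "]"])
    return buf.getvalue().rstrip("\r\n")

line = to_csv_line("<dbpedia:Example>", "Example", "Example body", [0.25, -0.5])
print(line)  # <dbpedia:Example>,Example,Example body,"[0.25,-0.5]"
```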
| 25 | + |
| 26 | +## Create table {#create-table} |
| 27 | + |
| 28 | +Create the `dbpedia` table to store the article id, title, text, and embedding vector:
| 29 | + |
| 30 | +```sql |
| 31 | +CREATE TABLE dbpedia |
| 32 | +( |
| 33 | + id String, |
| 34 | + title String, |
| 35 | + text String, |
| 36 | + vector Array(Float32) CODEC(NONE) |
| 37 | +) ENGINE = MergeTree ORDER BY (id); |
| 38 | + |
| 39 | +``` |
| 40 | + |
| 41 | +To load the dataset from the generated CSV files, run:
| 42 | + |
| 43 | +```sql |
| 44 | +INSERT INTO dbpedia FROM INFILE '{path_to_csv_files}/*.csv'
| 45 | +``` |
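Before importing, it can be worth sanity-checking that every CSV row carries a full-length embedding; `text-embedding-3-large` as used here produces 1536-dimensional vectors. A hedged, self-contained sketch of such a check (the function name is hypothetical, not part of any tool used above):

```python
def check_vector_field(field, expected_dims=1536):
    # field is the CSV cell holding the array literal, e.g. "[0.1,0.2,...]"
    inner = field.strip()
    if not (inner.startswith("[") and inner.endswith("]")):
        return False
    values = inner[1:-1].split(",")
    try:
        [float(v) for v in values]  # every element must be a valid float
    except ValueError:
        return False
    return len(values) == expected_dims

print(check_vector_field("[" + ",".join(["0.0"] * 1536) + "]"))  # True
print(check_vector_field("[1.0,2.0]"))                           # False
```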
| 46 | + |
| 47 | +## Run a brute-force ANN search (without ANN index) {#run-a-brute-force-ann-search-without-ann-index} |
| 48 | + |
| 49 | +To run a brute-force (exact) nearest neighbor search without an index, run:
| 50 | + |
| 51 | +```sql |
| 52 | +SELECT url, caption FROM laion ORDER BY L2Distance(image_embedding, {target:Array(Float32)}) LIMIT 8
| 53 | +``` |
| 54 | + |
| 55 | +`target` is an array of 512 elements and a client parameter. A convenient way to obtain such arrays will be presented at the end of the article. For now, we can use the embedding of a random cat picture as `target`.
| 56 | + |
| 57 | +**Result** |
| 58 | + |
| 59 | +```response
| 60 | +┌─url───────────────────────────────────────────────────────────────────────────────────────────────────────────┬─caption────────────────────────────────────────────────────────────────┐ |
| 61 | +│ https://s3.amazonaws.com/filestore.rescuegroups.org/6685/pictures/animals/13884/13884995/63318230_463x463.jpg │ Adoptable Female Domestic Short Hair │ |
| 62 | +│ https://s3.amazonaws.com/pet-uploads.adoptapet.com/8/b/6/239905226.jpg │ Adopt A Pet :: Marzipan - New York, NY │ |
| 63 | +│ http://d1n3ar4lqtlydb.cloudfront.net/9/2/4/248407625.jpg │ Adopt A Pet :: Butterscotch - New Castle, DE │ |
| 64 | +│ https://s3.amazonaws.com/pet-uploads.adoptapet.com/e/e/c/245615237.jpg │ Adopt A Pet :: Tiggy - Chicago, IL │ |
| 65 | +│ http://pawsofcoronado.org/wp-content/uploads/2012/12/rsz_pumpkin.jpg │ Pumpkin an orange tabby kitten for adoption │ |
| 66 | +│ https://s3.amazonaws.com/pet-uploads.adoptapet.com/7/8/3/188700997.jpg │ Adopt A Pet :: Brian the Brad Pitt of cats - Frankfort, IL │ |
| 67 | +│ https://s3.amazonaws.com/pet-uploads.adoptapet.com/8/b/d/191533561.jpg │ Domestic Shorthair Cat for adoption in Mesa, Arizona - Charlie │ |
| 68 | +│ https://s3.amazonaws.com/pet-uploads.adoptapet.com/0/1/2/221698235.jpg │ Domestic Shorthair Cat for adoption in Marietta, Ohio - Daisy (Spayed) │ |
| 69 | +└───────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────┘ |
| 70 | + |
| 71 | +8 rows in set. Elapsed: 6.432 sec. Processed 19.65 million rows, 43.96 GB (3.06 million rows/s., 6.84 GB/s.) |
| 72 | +``` |
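Conceptually, the query above performs a linear scan: it computes the distance from the target to every stored vector and keeps the closest n, which is why every row is processed. A small self-contained sketch with made-up 2-d vectors:

```python
import heapq
import math

def l2_distance(a, b):
    # Euclidean distance, matching ClickHouse's L2Distance()
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_top_k(rows, target, k):
    # rows: list of (id, vector); scan everything, keep the k nearest
    return heapq.nsmallest(k, rows, key=lambda r: l2_distance(r[1], target))

rows = [("a", [0.0, 0.0]), ("b", [1.0, 1.0]), ("c", [5.0, 5.0])]
print([rid for rid, _ in brute_force_top_k(rows, [0.9, 1.1], 2)])  # ['b', 'a']
```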
| 73 | + |
| 74 | +## Run an ANN search with an ANN index {#run-a-ann-with-an-ann-index}
| 75 | + |
| 76 | +Create a new table with an ANN index and insert the data from the existing table: |
| 77 | + |
| 78 | +```sql |
| 79 | +CREATE TABLE laion_annoy |
| 80 | +( |
| 81 | + `id` Int64, |
| 82 | + `url` String, |
| 83 | + `caption` String, |
| 84 | + `NSFW` String, |
| 85 | + `similarity` Float32, |
| 86 | + `image_embedding` Array(Float32), |
| 87 | + `text_embedding` Array(Float32), |
| 88 | + INDEX annoy_image image_embedding TYPE annoy(), |
| 89 | + INDEX annoy_text text_embedding TYPE annoy() |
| 90 | +) |
| 91 | +ENGINE = MergeTree |
| 92 | +ORDER BY id |
| 93 | +SETTINGS index_granularity = 8192; |
| 94 | + |
| 95 | +INSERT INTO laion_annoy SELECT * FROM laion; |
| 96 | +``` |
| 97 | + |
| 98 | +By default, Annoy indexes use the L2 distance as the metric. Further tuning knobs for index creation and search are described in the Annoy index [documentation](../../engines/table-engines/mergetree-family/annindexes.md). Let's now run the same query again:
| 99 | + |
| 100 | +```sql |
| 101 | +SELECT url, caption FROM laion_annoy ORDER BY L2Distance(image_embedding, {target:Array(Float32)}) LIMIT 8
| 102 | +``` |
| 103 | + |
| 104 | +**Result** |
| 105 | + |
| 106 | +```response |
| 107 | +┌─url──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─caption──────────────────────────────────────────────────────────────┐ |
| 108 | +│ http://tse1.mm.bing.net/th?id=OIP.R1CUoYp_4hbeFSHBaaB5-gHaFj │ bed bugs and pets can cats carry bed bugs pets adviser │ |
| 109 | +│ http://pet-uploads.adoptapet.com/1/9/c/1963194.jpg?336w │ Domestic Longhair Cat for adoption in Quincy, Massachusetts - Ashley │ |
| 110 | +│ https://thumbs.dreamstime.com/t/cat-bed-12591021.jpg │ Cat on bed Stock Image │ |
| 111 | +│ https://us.123rf.com/450wm/penta/penta1105/penta110500004/9658511-portrait-of-british-short-hair-kitten-lieing-at-sofa-on-sun.jpg │ Portrait of british short hair kitten lieing at sofa on sun. │ |
| 112 | +│ https://www.easypetmd.com/sites/default/files/Wirehaired%20Vizsla%20(2).jpg │ Vizsla (Wirehaired) image 3 │ |
| 113 | +│ https://images.ctfassets.net/yixw23k2v6vo/0000000200009b8800000000/7950f4e1c1db335ef91bb2bc34428de9/dog-cat-flickr-Impatience_1.jpg?w=600&h=400&fm=jpg&fit=thumb&q=65&fl=progressive │ dog and cat image │ |
| 114 | +│ https://i1.wallbox.ru/wallpapers/small/201523/eaa582ee76a31fd.jpg │ cats, kittens, faces, tonkinese │ |
| 115 | +│ https://www.baxterboo.com/images/breeds/medium/cairn-terrier.jpg │ Cairn Terrier Photo │ |
| 116 | +└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────┘ |
| 117 | +
| 118 | +8 rows in set. Elapsed: 0.641 sec. Processed 22.06 thousand rows, 49.36 MB (91.53 thousand rows/s., 204.81 MB/s.) |
| 119 | +``` |
| 120 | + |
| 121 | +The speed increased significantly at the cost of less accurate results. This is because the ANN index only provides approximate search results. Note that the example searched for similar image embeddings, yet it is also possible to search for matching image caption embeddings instead.
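A common way to quantify the accuracy loss of an ANN index is recall: the fraction of the exact nearest neighbors that the approximate search also returned. A minimal sketch (the function name and sample ids are hypothetical):

```python
def recall_at_k(exact_ids, ann_ids):
    # Fraction of the exact top-k that the ANN search recovered
    exact, ann = set(exact_ids), set(ann_ids)
    return len(exact & ann) / len(exact)

exact = ["a", "b", "c", "d"]   # ids from the brute-force query
approx = ["a", "c", "e", "d"]  # ids from the indexed query
print(recall_at_k(exact, approx))  # 0.75
```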
| 122 | + |
| 123 | +## Q & A Demo Application {#q-and-a-demo-application} |
| 124 | + |
| 125 | +The examples above demonstrated semantic search and document retrieval using ClickHouse. We now present a very simple but high-potential Generative AI example application.
| 126 | + |
| 127 | +The application performs the following steps:
| 128 | + |
| 129 | +1. Accepts a _topic_ as input from the user |
| 130 | +2. Generates an embedding vector for the _topic_ by invoking OpenAI API with model `text-embedding-3-large` |
| 131 | +3. Retrieves highly relevant Wikipedia articles/documents using vector similarity search on the `dbpedia` table |
| 132 | +4. Accepts a free-form question in natural language from the user relating to the _topic_ |
| 133 | +5. Uses the OpenAI `gpt-3.5-turbo` Chat API to answer the question based on the knowledge in the documents retrieved in step #3. |
| 134 | +   The documents retrieved in step #3 are passed as _context_ to the Chat API and are the key link in this Generative AI workflow.
| 135 | + |
| 136 | +A couple of example conversations produced by running the Q & A application are listed first, followed by the code
| 137 | +for the Q & A application. Running the application requires an OpenAI API key to be set in the environment
| 138 | +variable `OPENAI_API_KEY`.
| 139 | + |
| 140 | +```shell |
| 141 | +$ python3 QandA.py |
| 142 | + |
| 143 | +Enter a topic : FIFA world cup 1990 |
| 144 | +Generating the embedding for 'FIFA world cup 1990' and collecting 100 articles related to it from ClickHouse... |
| 145 | + |
| 146 | +Enter your question : Who won the golden boot |
| 147 | +Salvatore Schillaci of Italy won the Golden Boot at the 1990 FIFA World Cup. |
| 148 | + |
| 149 | + |
| 150 | +Enter a topic : Cricket world cup |
| 151 | +Generating the embedding for 'Cricket world cup' and collecting 100 articles related to it from ClickHouse... |
| 152 | + |
| 153 | +Enter your question : Which country has hosted the world cup most times |
| 154 | +England and Wales have hosted the Cricket World Cup the most times, with the tournament being held in these countries five times - in 1975, 1979, 1983, 1999, and 2019. |
| 155 | + |
| 156 | +$ |
| 157 | +``` |
| 158 | + |
| 159 | +Code:
| 160 | + |
| 161 | +```python
| 162 | +import sys |
| 163 | +import time |
| 164 | +from openai import OpenAI |
| 165 | +import clickhouse_connect |
| 166 | + |
| 167 | +ch_client = clickhouse_connect.get_client(compress=False) # Pass ClickHouse credentials here |
| 168 | +openai_client = OpenAI() # Set the OPENAI_API_KEY environment variable |
| 169 | + |
| 170 | +def get_embedding(text, model): |
| 171 | + text = text.replace("\n", " ") |
| 172 | + return openai_client.embeddings.create(input = [text], model=model, dimensions=1536).data[0].embedding |
| 173 | + |
| 174 | +while True: |
| 175 | + # Take the topic of interest from user |
| 176 | + print("Enter a topic : ", end="", flush=True) |
| 177 | + input_query = sys.stdin.readline() |
| 178 | + input_query = input_query.rstrip() |
| 179 | + |
| 180 | + # Generate an embedding vector for the search topic and query ClickHouse |
| 181 | +    print("Generating the embedding for '" + input_query + "' and collecting 100 articles related to it from ClickHouse...")
| 182 | + embedding = get_embedding(input_query, |
| 183 | + model='text-embedding-3-large') |
| 184 | + |
| 185 | + params = {'v1':embedding, 'v2':100} |
| 186 | + result = ch_client.query("SELECT id,title,text FROM dbpedia ORDER BY cosineDistance(vector, %(v1)s) LIMIT %(v2)s", parameters=params) |
| 187 | + |
| 188 | + # Collect all the matching articles/documents |
| 189 | + results = "" |
| 190 | + for row in result.result_rows: |
| 191 | +        results = results + row[2] + "\n"
| 192 | + |
| 193 | + print("\nEnter your question : ", end="", flush=True) |
| 194 | +    question = sys.stdin.readline().rstrip()
| 195 | + |
| 196 | + # Prompt for the OpenAI Chat API |
| 197 | + query = f"""Use the below content to answer the subsequent question. If the answer cannot be found, write "I don't know." |
| 198 | +
| 199 | +Content: |
| 200 | +\"\"\" |
| 201 | +{results} |
| 202 | +\"\"\" |
| 203 | +
| 204 | +Question: {question}""" |
| 205 | + |
| 206 | + GPT_MODEL = "gpt-3.5-turbo" |
| 207 | + response = openai_client.chat.completions.create( |
| 208 | + messages=[ |
| 209 | +            {'role': 'system', 'content': f"You answer questions about {input_query}."},
| 210 | + {'role': 'user', 'content': query}, |
| 211 | + ], |
| 212 | + model=GPT_MODEL, |
| 213 | + temperature=0, |
| 214 | + ) |
| 215 | + |
| 216 | + # Print the answer to the question! |
| 217 | + print(response.choices[0].message.content) |
| 218 | + print("\n") |
| 219 | +``` |
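One practical caveat with the script above: concatenating the full text of 100 articles can exceed the chat model's context window. A hedged sketch of capping the concatenated context at a character budget (the function name and the budget value are arbitrary assumptions, not taken from the original script):

```python
def build_context(texts, max_chars=12000):
    # Append whole documents until adding another would exceed the budget
    parts, used = [], 0
    for t in texts:
        if used + len(t) > max_chars:
            break
        parts.append(t)
        used += len(t)
    return "\n".join(parts)

# Three 5000-character documents: only the first two fit the budget
ctx = build_context(["x" * 5000, "y" * 5000, "z" * 5000], max_chars=12000)
print(len(ctx))  # 10001 (two documents plus one newline separator)
```

A more precise variant would count tokens rather than characters, e.g. with a tokenizer matching the chat model.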
| 220 | + |