Commit ce03fd9

Merge pull request #4336 from shankar-iyer/add_laion5b
Add laion 5b dataset to examples
2 parents f9e2262 + 364a7f5 commit ce03fd9

File tree

2 files changed

+204
-0
lines changed

Lines changed: 204 additions & 0 deletions
@@ -0,0 +1,204 @@
1+
---
2+
description: 'Dataset containing 100 million vectors from the LAION 5B dataset'
3+
sidebar_label: 'LAION 5B dataset'
4+
slug: /getting-started/example-datasets/laion-5b-dataset
5+
title: 'LAION 5B dataset'
6+
keywords: ['semantic search', 'vector similarity', 'approximate nearest neighbours', 'embeddings']
7+
---
8+
9+
import search_results_image from '@site/static/images/getting-started/example-datasets/laion5b_visualization_1.png'
10+
import Image from '@theme/IdealImage';
11+
12+
## Introduction {#introduction}
13+
14+
The [LAION 5B dataset](https://laion.ai/blog/laion-5b/) contains 5.85 billion image-text embeddings and
15+
associated image metadata. The embeddings were generated using the `OpenAI CLIP` model [ViT-L/14](https://huggingface.co/sentence-transformers/clip-ViT-L-14). The
16+
dimension of each embedding vector is `768`.
17+
18+
This dataset can be used to model design, sizing and performance aspects for a large scale,
19+
real world vector search application. The dataset can be used for both text to image search and
20+
image to image search.
21+
22+
## Dataset details {#dataset-details}
23+
24+
The complete dataset is available as a mixture of `npy` and `Parquet` files at [the-eye.eu](https://the-eye.eu/public/AI/cah/laion5b/).
25+
26+
ClickHouse has made a subset of 100 million vectors available in an `S3` bucket.
27+
The `S3` bucket contains 10 `Parquet` files, each holding 10 million rows.
28+
29+
We recommend users first run a sizing exercise to estimate the storage and memory requirements for this dataset by referring to the [documentation](../../engines/table-engines/mergetree-family/annindexes.md).
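
As a rough starting point for that sizing exercise, the arithmetic can be sketched in Python. This is a minimal sketch under stated assumptions, not the documented formula itself: the helper `estimate_sizes` is hypothetical, and it assumes `Float32` column storage (4 bytes per element), `bf16` quantization inside the index (2 bytes per element), an HNSW `M` of 64, and a graph overhead of about 4 bytes per node id with a repetition factor of 2 across layers. Check these factors against the linked documentation before relying on the numbers.

```python
def estimate_sizes(num_vectors, dim, column_bytes=4, quantized_bytes=2,
                   m=64, bytes_per_node_id=4, layer_repetition_factor=2):
    """Back-of-the-envelope sizes in bytes. All factors are assumptions."""
    # Uncompressed size of the Array(Float32) column holding the vectors
    raw_column = num_vectors * dim * column_bytes
    # Quantized copies of the vectors kept inside the vector index
    index_vectors = num_vectors * dim * quantized_bytes
    # Neighbour links of the HNSW graph (assumed overhead model)
    index_graph = num_vectors * m * bytes_per_node_id * layer_repetition_factor
    return {
        "raw_column_bytes": raw_column,
        "index_vector_bytes": index_vectors,
        "index_graph_bytes": index_graph,
        "index_total_bytes": index_vectors + index_graph,
    }


sizes = estimate_sizes(num_vectors=100_000_000, dim=768)
for name, value in sizes.items():
    print(f"{name}: {value / 1e9:.1f} GB")
```

Under these assumptions the raw vector column is about 307 GB, which is in line with the data volume a full scan of the 100 million row table has to read.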
30+
31+
## Steps {#steps}
32+
33+
<VerticalStepper headerLevel="h3">
34+
35+
### Create table {#create-table}
36+
37+
Create the `laion_5b_100m` table to store the embeddings and their associated attributes:
38+
39+
```sql
40+
CREATE TABLE laion_5b_100m
41+
(
42+
id UInt32,
43+
image_path String,
44+
caption String,
45+
NSFW Nullable(String) default 'unknown',
46+
similarity Float32,
47+
LICENSE Nullable(String),
48+
url String,
49+
key String,
50+
status LowCardinality(String),
51+
width Int32,
52+
height Int32,
53+
original_width Int32,
54+
original_height Int32,
55+
exif Nullable(String),
56+
md5 String,
57+
vector Array(Float32) CODEC(NONE)
58+
) ENGINE = MergeTree ORDER BY (id)
59+
```
60+
61+
The `id` is just an incrementing integer. The additional attributes can be used in predicates to understand
62+
vector similarity search combined with post-filtering/pre-filtering, as explained in the [documentation](../../engines/table-engines/mergetree-family/annindexes.md).
63+
64+
### Load data {#load-table}
65+
66+
To load the dataset from all `Parquet` files, run the following SQL statement:
67+
68+
```sql
69+
INSERT INTO laion_5b_100m SELECT * FROM s3('https://clickhouse-datasets.s3.amazonaws.com/laion-5b/laion5b_100m_*.parquet');
70+
```
71+
72+
The loading of 100 million rows into the table will take a few minutes.
73+
74+
Alternatively, run individual SQL statements to load a specific number of files (and therefore rows):
75+
76+
```sql
77+
INSERT INTO laion_5b_100m SELECT * FROM s3('https://clickhouse-datasets.s3.amazonaws.com/laion-5b/laion5b_100m_part_1_of_10.parquet');
78+
INSERT INTO laion_5b_100m SELECT * FROM s3('https://clickhouse-datasets.s3.amazonaws.com/laion-5b/laion5b_100m_part_2_of_10.parquet');
79+
80+
```
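
The per-file statements can also be generated rather than typed by hand. The following is a minimal sketch: `insert_statements` is a hypothetical helper, and it assumes that all 10 files follow the `laion5b_100m_part_<n>_of_10.parquet` naming pattern visible in the two statements above.

```python
BASE_URL = "https://clickhouse-datasets.s3.amazonaws.com/laion-5b"


def insert_statements(first_part, last_part, total_parts=10):
    """Build one INSERT statement per Parquet file in the given range.

    Assumes every file is named laion5b_100m_part_<n>_of_<total>.parquet.
    """
    if not 1 <= first_part <= last_part <= total_parts:
        raise ValueError("expected 1 <= first_part <= last_part <= total_parts")
    return [
        "INSERT INTO laion_5b_100m SELECT * FROM "
        f"s3('{BASE_URL}/laion5b_100m_part_{i}_of_{total_parts}.parquet');"
        for i in range(first_part, last_part + 1)
    ]


# Load only the first three files (about 30 million rows)
statements = insert_statements(1, 3)
for statement in statements:
    print(statement)
```

Each printed statement can be pasted into a ClickHouse client, which makes it easy to load a smaller subset first when experimenting.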
81+
82+
### Run a brute-force vector similarity search {#run-a-brute-force-vector-similarity-search}
83+
84+
KNN (k-nearest neighbours) search, or brute-force search, involves calculating the distance of each vector in the dataset
85+
to the search embedding vector and then ordering the distances to get the nearest neighbours. We can use one of the vectors
86+
from the dataset itself as the search vector. For example:
87+
88+
```sql title="Query"
89+
SELECT id, url
90+
FROM laion_5b_100m
91+
ORDER BY cosineDistance( vector, (SELECT vector FROM laion_5b_100m WHERE id = 9999) ) ASC
92+
LIMIT 20
93+
```
94+
95+
The vector in the row with `id = 9999` is the embedding for an image of a deli restaurant.
96+
97+
```response title="Response"
98+
┌───────id─┬─url───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
99+
1. │ 9999 │ https://certapro.com/belleville/wp-content/uploads/sites/1369/2017/01/McAlistersFairviewHgts.jpg │
100+
2. │ 60180509 │ https://certapro.com/belleville/wp-content/uploads/sites/1369/2017/01/McAlistersFairviewHgts-686x353.jpg │
101+
3. │ 1986089 │ https://www.gannett-cdn.com/-mm-/ceefab710d945bb3432c840e61dce6c3712a7c0a/c=30-0-4392-3280/local/-/media/2017/02/14/FortMyers/FortMyers/636226855169587730-McAlister-s-Exterior-Signage.jpg?width=534&amp;height=401&amp;fit=crop │
102+
4. │ 51559839 │ https://img1.mashed.com/img/gallery/how-rich-is-the-mcalisters-deli-ceo-and-whats-the-average-pay-of-its-employees/intro-1619793841.jpg │
103+
5. │ 22104014 │ https://www.restaurantmagazine.com/wp-content/uploads/2016/04/Largest-McAlisters-Deli-Franchisee-to-Expand-into-Nebraska.jpg │
104+
6. │ 54337236 │ http://www.restaurantnews.com/wp-content/uploads/2015/11/McAlisters-Deli-Giving-Away-Gift-Cards-With-Win-One-Gift-One-Holiday-Promotion.jpg │
105+
7. │ 20770867 │ http://www.restaurantnews.com/wp-content/uploads/2016/04/McAlisters-Deli-Aims-to-Attract-New-Franchisees-in-Florida-as-Chain-Enters-New-Markets.jpg │
106+
8. │ 22493966 │ https://www.restaurantmagazine.com/wp-content/uploads/2016/06/McAlisters-Deli-Aims-to-Attract-New-Franchisees-in-Columbus-Ohio-as-Chain-Expands-feature.jpg │
107+
9. │ 2224351 │ https://holttribe.com/wp-content/uploads/2019/10/60880046-879A-49E4-8E13-1EE75FB24980-900x675.jpeg │
108+
10. │ 30779663 │ https://www.gannett-cdn.com/presto/2018/10/29/PMUR/685f3e50-cce5-46fb-9a66-acb93f6ea5e5-IMG_6587.jpg?crop=2166,2166,x663,y0&amp;width=80&amp;height=80&amp;fit=bounds │
109+
11. │ 54939148 │ https://www.priceedwards.com/sites/default/files/styles/staff_property_listing_block/public/for-lease/images/IMG_9674%20%28Custom%29_1.jpg?itok=sa8hrVBT │
110+
12. │ 95371605 │ http://www.restaurantmagazine.com/wp-content/uploads/2015/08/McAlisters-Deli-Signs-Development-Agreement-with-Kingdom-Foods-to-Grow-in-Southern-Mississippi.jpg │
111+
13. │ 79564563 │ https://www.restaurantmagazine.com/wp-content/uploads/2016/05/McAlisters-Deli-Aims-to-Attract-New-Franchisees-in-Denver-as-Chain-Expands.jpg │
112+
14. │ 76429939 │ http://www.restaurantnews.com/wp-content/uploads/2016/08/McAlisters-Deli-Aims-to-Attract-New-Franchisees-in-Pennsylvania-as-Chain-Expands.jpg │
113+
15. │ 96680635 │ https://img.claz.org/tc/400x320/9w3hll-UQNHGB9WFlhSGAVCWhheBQkeWh5SBAkUWh9SBgsJFxRcBUMNSR4cAQENXhJARwgNTRYcBAtDWh5WRQEJXR5SR1xcFkYKR1tYFkYGR1pVFiVyP0ImaTA │
114+
16. │ 48716846 │ http://tse2.mm.bing.net/th?id=OIP.nN2qJqGUJs_fVNdTiFyGnQHaEc │
115+
17. │ 4472333 │ https://sgi.offerscdn.net/i/zdcs-merchants/05lG0FpXPIvsfiHnT3N8FQE.h200.w220.flpad.v22.bffffff.png │
116+
18. │ 82667887 │ https://irs2.4sqi.net/img/general/200x200/11154479_OEGbrkgWB5fEGrrTkktYvCj1gcdyhZn7TSQSAqN2Yqw.jpg │
117+
19. │ 57525607 │ https://knoji.com/images/logo/mcalistersdelicom.jpg │
118+
20. │ 15785896 │ https://www.groupnimb.com/mimg/merimg/mcalister-s-deli_1446088739.jpg │
119+
└──────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
120+
121+
#highlight-next-line
122+
20 rows in set. Elapsed: 3.968 sec. Processed 100.38 million rows, 320.81 GB (25.30 million rows/s., 80.84 GB/s.)
123+
```
124+
125+
Note the query latency so that it can be compared with the latency of ANN search (which uses a vector index).
126+
Without a vector index, the query above can take from a few seconds to a few minutes on 100 million rows, depending on the hardware.
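
The brute-force procedure itself is easy to reproduce outside the database on a handful of vectors. The sketch below uses NumPy and hypothetical helper names (`cosine_distances`, `brute_force_knn`). It assumes non-zero vectors and illustrates the idea only, not how ClickHouse executes the query.

```python
import numpy as np


def cosine_distances(vectors, query):
    """Cosine distance (1 - cosine similarity) from each row to the query."""
    vectors = np.asarray(vectors, dtype=np.float32)
    query = np.asarray(query, dtype=np.float32)
    dots = vectors @ query
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    return 1.0 - dots / norms


def brute_force_knn(vectors, query, k):
    """Compute the distance to every vector, sort, and keep the k nearest."""
    distances = cosine_distances(vectors, query)
    order = np.argsort(distances, kind="stable")[:k]
    return [(int(i), float(distances[i])) for i in order]


# Four toy 2-dimensional vectors; the first one doubles as the search vector
data = np.array([[1, 0], [0, 1], [1, 1], [-1, 0]], dtype=np.float32)
neighbours = brute_force_knn(data, data[0], k=2)
print(neighbours)
```

As in the SQL query, the search vector is its own nearest neighbour with a distance of zero, and every vector in the dataset has to be visited once.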
127+
128+
### Build a vector similarity index {#build-vector-similarity-index}
129+
130+
Run the following SQL to define and build a vector similarity index on the `vector` column of the `laion_5b_100m` table:
131+
132+
```sql
133+
ALTER TABLE laion_5b_100m ADD INDEX vector_index vector TYPE vector_similarity('hnsw', 'cosineDistance', 768, 'bf16', 64, 512);
134+
135+
ALTER TABLE laion_5b_100m MATERIALIZE INDEX vector_index SETTINGS mutations_sync = 2;
136+
```
137+
138+
The parameters and performance considerations for index creation and search are described in the [documentation](../../engines/table-engines/mergetree-family/annindexes.md).
139+
The statement above uses values of 64 and 512 respectively for the HNSW hyperparameters `M` and `ef_construction`.
140+
Users need to carefully select optimal values for these parameters by evaluating index build time and search results quality
141+
corresponding to selected values.
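
One common way to quantify search result quality is recall@k: the fraction of the exact (brute-force) top-k results that the index-backed search also returns. The sketch below is a minimal illustration with a hypothetical helper `recall_at_k`. The exact ids are taken from the brute-force response shown earlier, while the approximate ids are made up for the example.

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the exact top-k ids that also appear in the approximate top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    exact_top = set(exact_ids[:k])
    approx_top = set(approx_ids[:k])
    return len(exact_top & approx_top) / k


# Exact ids: first four rows of the brute-force response above
exact = [9999, 60180509, 1986089, 51559839]
# Approximate ids: hypothetical ANN output, for illustration only
approx = [9999, 1986089, 60180509, 22104014]

recall = recall_at_k(exact, approx, k=4)
print(f"recall@4 = {recall:.2f}")
```

Averaging this value over a set of representative search vectors for each candidate pair of `M` and `ef_construction` values gives a simple basis for weighing quality against index build time.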
142+
143+
Building and saving the index can take a few hours for the full 100 million row dataset, depending on the number of CPU cores available and the storage bandwidth.
144+
145+
### Perform ANN search {#perform-ann-search}
146+
147+
Once the vector similarity index has been built, vector search queries will automatically use the index:
148+
149+
```sql title="Query"
150+
SELECT id, url
151+
FROM laion_5b_100m
152+
ORDER BY cosineDistance( vector, (SELECT vector FROM laion_5b_100m WHERE id = 9999) ) ASC
153+
LIMIT 20
154+
155+
```
156+
157+
Loading the vector index into memory for the first time can take from a few seconds to a few minutes.
158+
159+
### Generate embeddings for search query {#generating-embeddings-for-search-query}
160+
161+
The `LAION 5B` dataset embedding vectors were generated using the `OpenAI CLIP` model `ViT-L/14`.
162+
163+
An example Python script is provided below to demonstrate how to programmatically generate
164+
embedding vectors using the `CLIP` APIs. The search embedding vector
165+
is then passed as an argument to the [`cosineDistance()`](/sql-reference/functions/distance-functions#cosineDistance) function in the `SELECT` query.
166+
167+
To install the `clip` package, please refer to the [OpenAI GitHub repository](https://github.com/openai/clip).
168+
169+
```python
170+
import torch
171+
import clip
172+
import numpy as np
173+
import sys
174+
import clickhouse_connect
175+
176+
device = "cuda" if torch.cuda.is_available() else "cpu"
177+
model, preprocess = clip.load("ViT-L/14", device=device)
178+
179+
# Search for images that contain both a dog and a cat
180+
text = clip.tokenize(["a dog and a cat"]).to(device)
181+
182+
with torch.no_grad():
183+
    text_features = model.encode_text(text)
184+
    np_arr = text_features.detach().cpu().numpy()
185+
186+
# Pass ClickHouse credentials here
187+
chclient = clickhouse_connect.get_client()
188+
189+
# Convert the embedding to a plain list of Python floats before binding it
params = {'v1': np_arr[0].tolist()}
190+
result = chclient.query("SELECT id, url FROM laion_5b_100m ORDER BY cosineDistance(vector, %(v1)s) LIMIT 100",
191+
                        parameters=params)
192+
193+
# Write the results to a simple HTML page that can be opened in the browser. Some URLs may have become obsolete.
194+
print("<html>")
195+
for r in result.result_rows:
196+
    print(f'<img src="{r[1]}" width="200" height="200">')
197+
print("</html>")
198+
```
199+
200+
The result of the above search is shown below:
201+
202+
<Image img={search_results_image} alt="Vector Similarity Search Results" size="md"/>
203+
204+
</VerticalStepper>