Commit b985c72

Add laion 5b dataset
1 parent 762a3c4 commit b985c72

2 files changed

+197
-0
lines changed
@@ -0,0 +1,197 @@
---
description: 'Dataset containing 100 million vectors from the LAION 5b dataset'
sidebar_label: 'LAION 5b dataset'
slug: /getting-started/example-datasets/laion-5b-dataset
title: 'LAION 5b dataset'
keywords: ['semantic search', 'vector similarity', 'approximate nearest neighbours', 'embeddings']
---

import search_results_image from '@site/static/images/getting-started/example-datasets/laion5b_visualization_1.png'

## Introduction {#introduction}

The [LAION 5b dataset](https://laion.ai/blog/laion-5b/) contains 5.85 billion image-text embeddings and
associated image metadata. The embeddings were generated using the `OpenAI CLIP` model `ViT-L/14`. Each
embedding vector has `768` dimensions.

This dataset can be used to model the design, sizing and performance aspects of a large-scale,
real-world vector search application. It supports both text-to-image search and
image-to-image search.

## Dataset details {#dataset-details}

The complete dataset is available as a mixture of `npy` and `Parquet` files at https://the-eye.eu/public/AI/cah/laion5b/

ClickHouse has made a subset of 100 million vectors available in an `S3` bucket. The bucket contains 10 `Parquet` files,
each holding 10 million rows.

We recommend first running a sizing exercise to estimate the storage and memory requirements for this dataset, as described in the [documentation](../../engines/table-engines/mergetree-family/annindexes.md).

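As a starting point for such a sizing exercise, the uncompressed size of the embeddings can be approximated as rows × dimensions × bytes per element. The sketch below assumes the vectors are stored as uncompressed 4-byte `Float32` values and ignores the metadata columns; the function name is illustrative only.

```python
def raw_vector_bytes(num_vectors: int, dimensions: int, bytes_per_element: int = 4) -> int:
    """Approximate uncompressed size of the embedding data in bytes.

    Assumes every element occupies `bytes_per_element` bytes (4 for Float32)
    and that no compression codec is applied.
    """
    return num_vectors * dimensions * bytes_per_element


size = raw_vector_bytes(num_vectors=100_000_000, dimensions=768)
print(f"{size / 1e9:.1f} GB")  # 307.2 GB
```

The estimate covers only the raw vectors; the memory needed by a vector index depends on its quantization and graph parameters and should be estimated separately.
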
## Create table {#create-table}

Create the `laion_5b_100m` table to store the embeddings and their associated attributes:

```sql
CREATE TABLE laion_5b_100m
(
    id UInt32,
    image_path String,
    caption String,
    NSFW Nullable(String) default 'unknown',
    similarity Float32,
    LICENSE Nullable(String),
    url String,
    key String,
    status LowCardinality(String),
    width Int32,
    height Int32,
    original_width Int32,
    original_height Int32,
    exif Nullable(String),
    md5 String,
    vector Array(Float32) CODEC(NONE)
) ENGINE = MergeTree ORDER BY (id)
```

The `id` is just an incrementing integer. The additional attributes can be used in predicates to explore
vector similarity search combined with post-filtering or pre-filtering, as explained in the [documentation](../../engines/table-engines/mergetree-family/annindexes.md).

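To illustrate how an attribute predicate combines with a similarity search, the following sketch builds a filtered query string in Python. The helper name and the example predicate on `width` and `height` are assumptions chosen for illustration; the predicate is inserted verbatim, so it must come from a trusted source.

```python
def filtered_search_sql(ref_id: int, predicate: str, limit: int = 20) -> str:
    """Build a similarity search restricted to rows matching `predicate`.

    The search vector is taken from the row with id = ref_id.
    `predicate` is placed into the WHERE clause as-is (trusted input only).
    """
    return (
        "SELECT id, url\n"
        "FROM laion_5b_100m\n"
        f"WHERE {predicate}\n"
        "ORDER BY cosineDistance(vector, "
        f"(SELECT vector FROM laion_5b_100m WHERE id = {int(ref_id)})) ASC\n"
        f"LIMIT {int(limit)}"
    )


# Hypothetical filter: only consider images at least 512 pixels wide and high
print(filtered_search_sql(9999, "width >= 512 AND height >= 512"))
```
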
## Load table {#load-table}

To load the dataset from all `Parquet` files, run the following SQL statement:

```sql
INSERT INTO laion_5b_100m SELECT * FROM s3('https://clickhouse-datasets.s3.amazonaws.com/laion-5b/laion5b_100m_*.parquet');
```

Loading 100 million rows into the table takes a few minutes.

Alternatively, run individual SQL statements to load a specific number of files or rows:

```sql
INSERT INTO laion_5b_100m SELECT * FROM s3('https://clickhouse-datasets.s3.amazonaws.com/laion-5b/laion5b_100m_part_1_of_10.parquet');
INSERT INTO laion_5b_100m SELECT * FROM s3('https://clickhouse-datasets.s3.amazonaws.com/laion-5b/laion5b_100m_part_2_of_10.parquet');
...
```

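When loading only part of the dataset, the per-file statements can be generated rather than typed by hand. The sketch below assumes the remaining files follow the same `laion5b_100m_part_N_of_10.parquet` naming pattern as the two files shown in this section; verify the file names in the bucket before relying on it. The helper name is illustrative.

```python
BASE_URL = "https://clickhouse-datasets.s3.amazonaws.com/laion-5b"


def insert_statements(first_part: int, last_part: int, total_parts: int = 10) -> list[str]:
    """Return one INSERT statement per Parquet file in the requested range.

    Assumes files are named laion5b_100m_part_<N>_of_<total_parts>.parquet.
    """
    if not 1 <= first_part <= last_part <= total_parts:
        raise ValueError("expected 1 <= first_part <= last_part <= total_parts")
    return [
        "INSERT INTO laion_5b_100m SELECT * FROM "
        f"s3('{BASE_URL}/laion5b_100m_part_{n}_of_{total_parts}.parquet');"
        for n in range(first_part, last_part + 1)
    ]


# Statements for the first three files (about 30 million rows)
for statement in insert_statements(1, 3):
    print(statement)
```
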
## Run a brute-force vector similarity search {#run-a-brute-force-vector-similarity-search}

KNN (k-Nearest Neighbours) search, also called brute-force search, calculates the distance between each vector in the dataset
and the search embedding vector, then orders the distances to find the nearest neighbours. We can use one of the vectors
from the dataset itself as the search vector. For example:

```sql title="Query"
SELECT id, url
FROM laion_5b_100m
ORDER BY cosineDistance( vector, (SELECT vector FROM laion_5b_100m WHERE id = 9999) ) ASC
LIMIT 20
```

The vector in the row with `id = 9999` is the embedding for an image of a deli restaurant.

```response title="Response"
┌───────id─┬─url───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
1. │ 9999 │ https://certapro.com/belleville/wp-content/uploads/sites/1369/2017/01/McAlistersFairviewHgts.jpg │
2. │ 60180509 │ https://certapro.com/belleville/wp-content/uploads/sites/1369/2017/01/McAlistersFairviewHgts-686x353.jpg │
3. │ 1986089 │ https://www.gannett-cdn.com/-mm-/ceefab710d945bb3432c840e61dce6c3712a7c0a/c=30-0-4392-3280/local/-/media/2017/02/14/FortMyers/FortMyers/636226855169587730-McAlister-s-Exterior-Signage.jpg?width=534&height=401&fit=crop │
4. │ 51559839 │ https://img1.mashed.com/img/gallery/how-rich-is-the-mcalisters-deli-ceo-and-whats-the-average-pay-of-its-employees/intro-1619793841.jpg │
5. │ 22104014 │ https://www.restaurantmagazine.com/wp-content/uploads/2016/04/Largest-McAlisters-Deli-Franchisee-to-Expand-into-Nebraska.jpg │
6. │ 54337236 │ http://www.restaurantnews.com/wp-content/uploads/2015/11/McAlisters-Deli-Giving-Away-Gift-Cards-With-Win-One-Gift-One-Holiday-Promotion.jpg │
7. │ 20770867 │ http://www.restaurantnews.com/wp-content/uploads/2016/04/McAlisters-Deli-Aims-to-Attract-New-Franchisees-in-Florida-as-Chain-Enters-New-Markets.jpg │
8. │ 22493966 │ https://www.restaurantmagazine.com/wp-content/uploads/2016/06/McAlisters-Deli-Aims-to-Attract-New-Franchisees-in-Columbus-Ohio-as-Chain-Expands-feature.jpg │
9. │ 2224351 │ https://holttribe.com/wp-content/uploads/2019/10/60880046-879A-49E4-8E13-1EE75FB24980-900x675.jpeg │
10. │ 30779663 │ https://www.gannett-cdn.com/presto/2018/10/29/PMUR/685f3e50-cce5-46fb-9a66-acb93f6ea5e5-IMG_6587.jpg?crop=2166,2166,x663,y0&width=80&height=80&fit=bounds │
11. │ 54939148 │ https://www.priceedwards.com/sites/default/files/styles/staff_property_listing_block/public/for-lease/images/IMG_9674%20%28Custom%29_1.jpg?itok=sa8hrVBT │
12. │ 95371605 │ http://www.restaurantmagazine.com/wp-content/uploads/2015/08/McAlisters-Deli-Signs-Development-Agreement-with-Kingdom-Foods-to-Grow-in-Southern-Mississippi.jpg │
13. │ 79564563 │ https://www.restaurantmagazine.com/wp-content/uploads/2016/05/McAlisters-Deli-Aims-to-Attract-New-Franchisees-in-Denver-as-Chain-Expands.jpg │
14. │ 76429939 │ http://www.restaurantnews.com/wp-content/uploads/2016/08/McAlisters-Deli-Aims-to-Attract-New-Franchisees-in-Pennsylvania-as-Chain-Expands.jpg │
15. │ 96680635 │ https://img.claz.org/tc/400x320/9w3hll-UQNHGB9WFlhSGAVCWhheBQkeWh5SBAkUWh9SBgsJFxRcBUMNSR4cAQENXhJARwgNTRYcBAtDWh5WRQEJXR5SR1xcFkYKR1tYFkYGR1pVFiVyP0ImaTA │
16. │ 48716846 │ http://tse2.mm.bing.net/th?id=OIP.nN2qJqGUJs_fVNdTiFyGnQHaEc │
17. │ 4472333 │ https://sgi.offerscdn.net/i/zdcs-merchants/05lG0FpXPIvsfiHnT3N8FQE.h200.w220.flpad.v22.bffffff.png │
18. │ 82667887 │ https://irs2.4sqi.net/img/general/200x200/11154479_OEGbrkgWB5fEGrrTkktYvCj1gcdyhZn7TSQSAqN2Yqw.jpg │
19. │ 57525607 │ https://knoji.com/images/logo/mcalistersdelicom.jpg │
20. │ 15785896 │ https://www.groupnimb.com/mimg/merimg/mcalister-s-deli_1446088739.jpg │
└──────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

#highlight-next-line
20 rows in set. Elapsed: 3.968 sec. Processed 100.38 million rows, 320.81 GB (25.30 million rows/s., 80.84 GB/s.)
```

Note the query latency so that it can be compared with the latency of ANN search (which uses a vector index).
With 100 million rows, the above query without a vector index can take anywhere from a few seconds to a few minutes to complete.

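The brute-force procedure in this section can also be sketched outside the database. The example below is a minimal NumPy illustration on random data, not the way ClickHouse executes the query: it computes the cosine distance from the search vector to every stored vector and keeps the `k` smallest. The function names and the synthetic data are assumptions for illustration.

```python
import numpy as np


def cosine_distance(vectors: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Cosine distance (1 - cosine similarity) from `query` to every row of `vectors`."""
    vector_norms = np.linalg.norm(vectors, axis=1)
    query_norm = np.linalg.norm(query)
    return 1.0 - (vectors @ query) / (vector_norms * query_norm)


def brute_force_knn(vectors: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Return the row indices of the k vectors nearest to `query`, closest first."""
    distances = cosine_distance(vectors, query)
    return np.argsort(distances, kind="stable")[:k]


# Synthetic stand-in for the dataset: 1,000 random 768-dimensional vectors
rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 768)).astype(np.float32)

# Use a vector from the data itself as the search vector, as in the SQL query
nearest = brute_force_knn(data, data[42], k=5)
print(nearest[0])  # 42, the search vector is its own nearest neighbour
```

Every stored vector is touched once per query, which is why the cost grows linearly with the number of rows.
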
## Build a vector similarity index {#build-vector-similarity-index}

Run the following SQL to define and build a vector similarity index on the `vector` column of the `laion_5b_100m` table:

```sql
ALTER TABLE laion_5b_100m ADD INDEX vector_index vector TYPE vector_similarity('hnsw', 'cosineDistance', 768, 'bf16', 64, 512);

ALTER TABLE laion_5b_100m MATERIALIZE INDEX vector_index SETTINGS mutations_sync = 2;
```

The parameters and performance considerations for index creation and search are described in the [documentation](../../engines/table-engines/mergetree-family/annindexes.md).
The statement above uses 64 and 512 for the HNSW hyperparameters `M` and `ef_construction`, respectively.
Choose values for these parameters carefully by evaluating the index build time and the search result quality
for each candidate setting.

Building and saving the index can take a few hours for the full 100 million row dataset, depending on the number of CPU cores available and the storage bandwidth.

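A rough estimate of the memory needed to hold such an index can help when choosing `M` and the quantization. The sketch below adds the quantized vectors (2 bytes per element for `bf16`) to an approximate HNSW graph overhead. The graph constants used here (4 bytes per node id and a layer repetition factor of 2) are assumptions for illustration; consult the linked documentation for the authoritative sizing guidance.

```python
def hnsw_index_memory_bytes(
    num_vectors: int,
    dimensions: int,
    bytes_per_quantized_element: int,
    max_edges_per_node: int,
    bytes_per_node_id: int = 4,        # assumption: 32-bit node ids
    layer_repetition_factor: int = 2,  # assumption: nodes repeated across graph layers
) -> int:
    """Approximate in-memory size of an HNSW index in bytes."""
    vector_bytes = num_vectors * dimensions * bytes_per_quantized_element
    graph_bytes = (
        num_vectors * max_edges_per_node * bytes_per_node_id * layer_repetition_factor
    )
    return vector_bytes + graph_bytes


# 100 million 768-dimensional vectors, bf16 quantization (2 bytes), M = 64
total = hnsw_index_memory_bytes(100_000_000, 768, 2, 64)
print(f"{total / 1e9:.1f} GB")  # 204.8 GB
```

Under these assumptions, halving `M` reduces only the graph term, while a smaller quantization type reduces the much larger vector term.
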
## Perform ANN search {#perform-ann-search}

Once the vector similarity index has been built, vector search queries automatically use the index:

```sql title="Query"
SELECT id, url
FROM laion_5b_100m
ORDER BY cosineDistance( vector, (SELECT vector FROM laion_5b_100m WHERE id = 9999) ) ASC
LIMIT 20
```

Loading the vector index into memory for the first time can take anywhere from a few seconds to a few minutes.

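Because ANN search is approximate, its results can differ from the brute-force results. One common way to quantify search quality is recall@k: the fraction of the exact top-k ids that also appear in the approximate top-k. The sketch below is a minimal illustration; the function name is an assumption, and the approximate id list is hypothetical rather than actual query output.

```python
def recall_at_k(exact_ids: list[int], approx_ids: list[int], k: int) -> float:
    """Fraction of the exact top-k ids that also appear in the approximate top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    exact_top = set(exact_ids[:k])
    approx_top = set(approx_ids[:k])
    return len(exact_top & approx_top) / k


# First four ids from the brute-force response in this guide
exact = [9999, 60180509, 1986089, 51559839]
# Hypothetical ANN result: three ids match, one (12345) does not
approx = [9999, 1986089, 60180509, 12345]

print(recall_at_k(exact, approx, k=4))  # 0.75
```

Ordering within the top-k does not affect this metric; only membership is compared.
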
## Generating embeddings for search query {#generating-embeddings-for-search-query}

The `LAION 5b` dataset embedding vectors were generated using the `OpenAI CLIP` model `ViT-L/14`.
The example Python script below demonstrates how to programmatically generate
embedding vectors using the `CLIP` APIs. The search embedding vector
is then passed as an argument to the `cosineDistance()` function in the `SELECT` query.

To install the `clip` package, refer to https://github.com/openai/clip.

```python
import torch
import clip
import numpy as np
import clickhouse_connect

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

# Search for images that contain both a dog and a cat
text = clip.tokenize(["a dog and a cat"]).to(device)

with torch.no_grad():
    text_features = model.encode_text(text)
    # The model may run in half precision on a GPU; convert to float32 to match Array(Float32)
    np_arr = text_features.detach().cpu().numpy().astype(np.float32)

# Pass ClickHouse credentials here
chclient = clickhouse_connect.get_client()

# Convert the embedding to a plain list of Python floats for parameter binding
params = {'v1': np_arr[0].tolist()}
result = chclient.query("SELECT id, url FROM laion_5b_100m ORDER BY cosineDistance(vector, %(v1)s) LIMIT 100",
                        parameters=params)

# Write the results to a simple HTML page that can be opened in the browser. Some URLs may have become obsolete.
print("<html>")
for r in result.result_rows:
    # r[1] is the url column; quote it so the src attribute is valid HTML
    print(f'<img src="{r[1]}" width="200" height="200">')
print("</html>")
```

The result of the above search:

<Image img={search_results_image} alt="Vector Similarity Search Results" size="md"/>