import search_results_image from '@site/static/images/getting-started/example-datasets/laion5b_visualization_1.png'
import Image from '@theme/IdealImage';
## Introduction {#introduction}
The [LAION 5b dataset](https://laion.ai/blog/laion-5b/) contains 5.85 billion image-text embeddings and
associated image metadata. The embeddings were generated using the `OpenAI CLIP` model `ViT-L/14`. The
dimension of each embedding vector is `768`.
This dataset can be used to model design, sizing and performance aspects for a large-scale,
real-world vector search application. The dataset can be used for both text-to-image search and
image-to-image search.
## Dataset details {#dataset-details}
The complete dataset is available as a mixture of `npy` and `Parquet` files at [the-eye.eu](https://the-eye.eu/public/AI/cah/laion5b/).
ClickHouse has made available a subset of 100 million vectors in an `S3` bucket.
The `S3` bucket contains 10 `Parquet` files, each containing 10 million rows.
We recommend users first run a sizing exercise to estimate the storage and memory requirements for this dataset by referring to the [documentation](../../engines/table-engines/mergetree-family/annindexes.md).
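
As a rough illustration of the kind of arithmetic such a sizing exercise involves, the raw size of the embedding vectors alone is rows × dimensions × 4 bytes per `Float32`. This is only a back-of-envelope sketch; actual storage and memory requirements depend on codecs, index parameters, and overhead:

```python
rows = 100_000_000        # subset size: 100 million vectors
dims = 768                # CLIP ViT-L/14 embedding dimension
bytes_per_value = 4       # Float32

raw_vector_bytes = rows * dims * bytes_per_value
print(f"{raw_vector_bytes / 1e9:.1f} GB")  # 307.2 GB of raw vector data
```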
## Steps {#steps}
<VerticalStepper headerLevel="h3">
### Create table {#create-table}
Create the `laion_5b_100m` table to store the embeddings and their associated attributes:
```sql
CREATE TABLE laion_5b_100m
⋮
```

The `id` is just an incrementing integer. The additional attributes can be used in predicates to understand
vector similarity search combined with post-filtering/pre-filtering as explained in the [documentation](../../engines/table-engines/mergetree-family/annindexes.md).
### Load data {#load-table}
To load the dataset from all `Parquet` files, run the following SQL statement:
```sql
INSERT INTO laion_5b_100m SELECT * FROM s3('https://clickhouse-datasets.s3.amazonaws.com/laion-5b/laion5b_100m_*.parquet');
```

Alternatively, individual SQL statements can be run to load a specific number of files:

```sql
INSERT INTO laion_5b_100m SELECT * FROM s3('https://clickhouse-datasets.s3.amazonaws.com/laion-5b/laion5b_100m_part_1_of_10.parquet');
INSERT INTO laion_5b_100m SELECT * FROM s3('https://clickhouse-datasets.s3.amazonaws.com/laion-5b/laion5b_100m_part_2_of_10.parquet');
⋮
```
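
To sanity-check the load, the total row count can be compared with the expected 100 million:

```sql
SELECT count() FROM laion_5b_100m
```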
### Run a brute-force vector similarity search {#run-a-brute-force-vector-similarity-search}

KNN (k-Nearest Neighbours) search, or brute-force search, involves calculating the distance of each vector in the dataset
to the search embedding vector and then ordering the distances to get the nearest neighbours. We can use one of the vectors
in the dataset as the search vector. The vector in the row with id = 9999 is the embedding for an image of a Deli restaurant.
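
A brute-force query of this shape could look as follows (a sketch: only `id` and `vector` are confirmed column names, since the full table definition is elided above; id 9999 and `cosineDistance` follow the text):

```sql
SELECT id,
       cosineDistance(vector, (SELECT vector FROM laion_5b_100m WHERE id = 9999)) AS score
FROM laion_5b_100m
ORDER BY score ASC
LIMIT 20
```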

Note down the query latency so that we can compare it with the query latency of ANN (using a vector index).
With 100 million rows, the above query without a vector index could take a few seconds or minutes to complete.

### Build a vector similarity index {#build-vector-similarity-index}
Run the following SQL to define and build a vector similarity index on the `vector` column of the `laion_5b_100m` table:

```sql
ALTER TABLE laion_5b_100m ADD INDEX vector_index vector TYPE vector_similarity('hnsw', 'cosineDistance', 768, 'bf16', 64, 512);
ALTER TABLE laion_5b_100m MATERIALIZE INDEX vector_index SETTINGS mutations_sync = 2;
```

The parameters and performance considerations for index creation and search are described in the [documentation](../../engines/table-engines/mergetree-family/annindexes.md).
The statement above uses values of 64 and 512 respectively for the HNSW hyperparameters `M` and `ef_construction`.
Users need to carefully select optimal values for these parameters by evaluating index build time and search result quality
corresponding to the selected values.
Building and saving the index could even take a few hours for the full 100 million row dataset, depending on the number of CPU cores available and the storage bandwidth.

### Perform ANN search {#perform-ann-search}
Once the vector similarity index has been built, vector search queries will automatically use the index:
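
A query of the same shape as the brute-force search will now be accelerated by the index (a sketch, reusing the row with id = 9999 as the search vector):

```sql
SELECT id,
       cosineDistance(vector, (SELECT vector FROM laion_5b_100m WHERE id = 9999)) AS score
FROM laion_5b_100m
ORDER BY score ASC
LIMIT 20
```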
The first-time load of the vector index into memory could take a few seconds or minutes.
### Generate embeddings for search query {#generating-embeddings-for-search-query}
The `LAION 5b` dataset embedding vectors were generated using the `OpenAI CLIP` model `ViT-L/14`.
An example Python script is provided below to demonstrate how to programmatically generate
embedding vectors using the `CLIP` APIs. The search embedding vector
is then passed as an argument to the [`cosineDistance()`](/sql-reference/functions/distance-functions#cosineDistance) function in the `SELECT` query.
To install the `clip` package, please refer to the [OpenAI GitHub repository](https://github.com/openai/clip).
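
A minimal sketch of such a script, assuming the `clip` package and PyTorch are installed (the prompt text here is illustrative, not from the original):

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the same model that was used to generate the dataset embeddings
model, preprocess = clip.load("ViT-L/14", device=device)

# Encode an illustrative text prompt into an embedding vector
tokens = clip.tokenize(["a sleeping cat"]).to(device)
with torch.no_grad():
    text_features = model.encode_text(tokens)

# A 768-element list for ViT-L/14, ready to substitute into
# the cosineDistance() call in the SELECT query
search_vector = text_features[0].tolist()
print(len(search_vector))
```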