Commit df59651: minor formatting changes

1 parent 75feed1

File tree: 1 file changed (+31, −24 lines)


docs/getting-started/example-datasets/laion5b.md

Lines changed: 31 additions & 24 deletions
````diff
@@ -1,35 +1,40 @@
 ---
-description: 'Dataset containing 100 million vectors from the LAION 5b dataset'
-sidebar_label: 'LAION 5b dataset'
+description: 'Dataset containing 100 million vectors from the LAION 5B dataset'
+sidebar_label: 'LAION 5B dataset'
 slug: /getting-started/example-datasets/laion-5b-dataset
-title: 'LAION 5b dataset'
+title: 'LAION 5B dataset'
 keywords: ['semantic search', 'vector similarity', 'approximate nearest neighbours', 'embeddings']
 ---
 
 import search_results_image from '@site/static/images/getting-started/example-datasets/laion5b_visualization_1.png'
+import Image from '@theme/IdealImage';
 
 ## Introduction {#introduction}
 
 The [LAION 5b dataset](https://laion.ai/blog/laion-5b/) contains 5.85 billion image-text embeddings and
 associated image metadata. The embeddings were generated using `Open AI CLIP` model `ViT-L/14`. The
 dimension of each embedding vector is `768`.
 
-This dataset can be used to model the design, sizing and performance aspects for a large scale,
+This dataset can be used to model design, sizing and performance aspects for a large scale,
 real world vector search application. The dataset can be used for both text to image search and
 image to image search.
 
 ## Dataset details {#dataset-details}
 
-The complete dataset is available as a mixture of `npy` and `Parquet` files at https://the-eye.eu/public/AI/cah/laion5b/
+The complete dataset is available as a mixture of `npy` and `Parquet` files at [the-eye.eu](https://the-eye.eu/public/AI/cah/laion5b/)
 
-ClickHouse has made available a subset of 100 million vectors in a `S3` bucket. The `S3` bucket contains 10 `Parquet` files, each `Parquet` file
-is filled with 10 million rows.
+ClickHouse has made available a subset of 100 million vectors in a `S3` bucket.
+The `S3` bucket contains 10 `Parquet` files, each `Parquet` file is filled with 10 million rows.
 
-We recommend users to first run a sizing exercise to estimate the storage and memory requirements for this dataset by referring to the [documentation](../../engines/table-engines/mergetree-family/annindexes.md).
+We recommend users first run a sizing exercise to estimate the storage and memory requirements for this dataset by referring to the [documentation](../../engines/table-engines/mergetree-family/annindexes.md).
 
-## Create table {#create-table}
+## Steps {#steps}
 
-Create the `laion_5b_100m` table to store the embeddings and their associated attributes :
+<VerticalStepper headerLevel="h3">
+
+### Create table {#create-table}
+
+Create the `laion_5b_100m` table to store the embeddings and their associated attributes:
 
 ```sql
 CREATE TABLE laion_5b_100m
````
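As a back-of-the-envelope check on the sizing exercise recommended in this hunk: 100 million `Float32` vectors of dimension 768 occupy roughly 307 GB before compression. A minimal sketch (the 4 bytes per `Float32` is standard; compression and the metadata columns are deliberately ignored, so real on-disk size will differ):

```python
# Rough, uncompressed storage estimate for the embedding column alone.
# Assumes 4 bytes per Float32 component; ignores compression and the
# other attribute columns, so treat this as an order-of-magnitude figure.
NUM_ROWS = 100_000_000
DIM = 768
BYTES_PER_FLOAT32 = 4

raw_bytes = NUM_ROWS * DIM * BYTES_PER_FLOAT32
raw_gb = raw_bytes / 1e9
print(f"~{raw_gb:.0f} GB uncompressed")  # prints ~307 GB uncompressed
```

The same arithmetic, with the compression ratio observed on a small sample load, gives a usable disk estimate before committing to the full 100 million row ingest.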
````diff
@@ -56,9 +61,9 @@ CREATE TABLE laion_5b_100m
 The `id` is just an incrementing integer. The additional attributes can be used in predicates to understand
 vector similarity search combined with post-filtering/pre-filtering as explained in the [documentation](../../engines/table-engines/mergetree-family/annindexes.md)
 
-## Load table {#load-table}
+### Load data {#load-table}
 
-To load the dataset from all `Parquet` files, run the following SQL statement :
+To load the dataset from all `Parquet` files, run the following SQL statement:
 
 ```sql
 INSERT INTO laion_5b_100m SELECT * FROM s3('https://clickhouse-datasets.s3.amazonaws.com/laion-5b/laion5b_100m_*.parquet');
````
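The per-part `INSERT` statements shown below this hunk follow a fixed URL pattern (`part_1_of_10` through `part_10_of_10`). A small client-side sketch that generates all ten statements from that pattern, for example to script an incremental load instead of the single glob `INSERT` above (the generator itself is our illustration, not part of the doc):

```python
# Generate the ten per-part INSERT statements from the URL pattern
# visible in the diff (laion5b_100m_part_1_of_10.parquet, ..., part_10_of_10).
BASE = "https://clickhouse-datasets.s3.amazonaws.com/laion-5b"

statements = [
    f"INSERT INTO laion_5b_100m SELECT * FROM s3('{BASE}/laion5b_100m_part_{i}_of_10.parquet');"
    for i in range(1, 11)
]

for stmt in statements:
    print(stmt)
```

Running the parts one at a time makes it easier to checkpoint a multi-hour load and to measure ingest rate per 10-million-row file.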
````diff
@@ -71,11 +76,10 @@ Alternatively, individual SQL statements can be run to load a specific number of
 ```sql
 INSERT INTO laion_5b_100m SELECT * FROM s3('https://clickhouse-datasets.s3.amazonaws.com/laion-5b/laion5b_100m_part_1_of_10.parquet');
 INSERT INTO laion_5b_100m SELECT * FROM s3('https://clickhouse-datasets.s3.amazonaws.com/laion-5b/laion5b_100m_part_2_of_10.parquet');
-...
-
+
 ```
 
-## Run a brute-force vector similarity search {#run-a-brute-force-vector-similarity-search}
+### Run a brute-force vector similarity search {#run-a-brute-force-vector-similarity-search}
 
 KNN (k - Nearest Neighbours) search or brute force search involves calculating the distance of each vector in the dataset
 to the search embedding vector and then ordering the distances to get the nearest neighbours. We can use one of the vectors
````
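The brute-force KNN described in this hunk (distance to every vector, then sort, then take the top k) can be mimicked locally. A pure-Python sketch on toy data, using the same `1 - cosine similarity` definition as ClickHouse's `cosineDistance` and, as the doc does, reusing a stored vector as the search vector:

```python
import math
import random

def cosine_distance(a, b):
    # Same definition as ClickHouse's cosineDistance: 1 - cosine similarity.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy stand-in for the table: 1000 random 768-dimensional vectors.
random.seed(0)
vectors = [[random.gauss(0, 1) for _ in range(768)] for _ in range(1000)]
query = vectors[42]  # reuse a stored vector as the search vector

# Brute force: one distance per row, then sort (the ORDER BY ... LIMIT 20 of the SQL query).
ranked = sorted(range(len(vectors)), key=lambda i: cosine_distance(query, vectors[i]))
top20 = ranked[:20]
print(top20[0])  # the query row itself ranks first, at distance ~0
```

This is exactly the O(rows × dims) cost profile that makes the full-table query take "a few seconds/minutes" at 100 million rows, and what the vector index in the next step avoids.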
````diff
@@ -121,7 +125,7 @@ The vector in the row with id = 9999 is the embedding for an image of a Deli res
 Note down the query latency so that we can compare it with the query latency of ANN (using vector index).
 With 100 million rows, the above query without a vector index could take a few seconds/minutes to complete.
 
-## Build a vector similarity index {#build-vector-similarity-index}
+### Build a vector similarity index {#build-vector-similarity-index}
 
 Run the following SQL to define and build a vector similarity index on the `vector` column of the `laion_5b_100m` table :
 
````
````diff
@@ -132,13 +136,13 @@ ALTER TABLE laion_5b_100m MATERIALIZE INDEX vector_index SETTINGS mutations_sync
 ```
 
 The parameters and performance considerations for index creation and search are described in the [documentation](../../engines/table-engines/mergetree-family/annindexes.md).
-The above statement uses the values of 64 and 512 respectively for the HNSW hyperparameters `M` and `ef_construction`.
+The statement above uses values of 64 and 512 respectively for the HNSW hyperparameters `M` and `ef_construction`.
 Users need to carefully select optimal values for these parameters by evaluating index build time and search results quality
 corresponding to selected values.
 
-Building and saving the index could even take a few hours for the full l00 million dataset, depending on number of CPU cores available and the storage bandwidth.
+Building and saving the index could even take a few hours for the full l00 million dataset, depending on the number of CPU cores available and the storage bandwidth.
 
-## Perform ANN search {#perform-ann-search}
+### Perform ANN search {#perform-ann-search}
 
 Once the vector similarity index has been built, vector search queries will automatically use the index:
 
````
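On the `M = 64` choice in this hunk: HNSW memory grows roughly linearly with `M`, since each base-layer node keeps up to `2·M` neighbour links. A rule-of-thumb sketch of the graph's link overhead alone (the 4-byte link ID and the `2·M` base-layer fan-out are common HNSW conventions, not figures from the ClickHouse docs, and real implementations add per-node overhead):

```python
# Order-of-magnitude estimate of the HNSW base-layer link storage.
# Assumes up to 2*M links per node and 4-byte node ids (both common
# HNSW conventions, assumed here, not taken from ClickHouse docs).
NUM_ROWS = 100_000_000
M = 64
BYTES_PER_LINK = 4

link_bytes = NUM_ROWS * 2 * M * BYTES_PER_LINK
print(f"~{link_bytes / 1e9:.0f} GB of links")  # prints ~51 GB of links
```

Halving `M` roughly halves this overhead at some cost in recall, which is the build-time versus search-quality trade-off the paragraph above asks users to evaluate.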
````diff
@@ -152,14 +156,15 @@ LIMIT 20
 
 The first time load of the vector index into memory could take a few seconds/minutes.
 
-## Generating embeddings for search query {#generating-embeddings-for-search-query}
+### Generate embeddings for search query {#generating-embeddings-for-search-query}
 
 The `LAION 5b` dataset embedding vectors were generated using `OpenAI CLIP` model `ViT-L/14`.
-An example Python script is listed below to demonstrate how to programmatically generate
+
+An example Python script is provided below to demonstrate how to programmatically generate
 embedding vectors using the `CLIP` APIs. The search embedding vector
-is then passed as an argument to the `cosineDistance()` function in the `SELECT` query.
+is then passed as an argument to the [`cosineDistance()`](/sql-reference/functions/distance-functions#cosineDistance) function in the `SELECT` query.
 
-To install the `clip` package, please refer to https://github.com/openai/clip.
+To install the `clip` package, please refer to the [OpenAI GitHub repository](https://github.com/openai/clip).
 
 ```python
 import torch
````
````diff
@@ -192,6 +197,8 @@ with torch.no_grad():
 print("</html>")
 ```
 
-Result of above search :
+The result of the above search is shown below:
 
 <Image img={search_results_image} alt="Vector Similarity Search Results" size="md"/>
+
+</VerticalStepper>
````
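Once an embedding has been generated (by the CLIP script referenced in the hunk above or otherwise), it has to be rendered into the `SELECT` as the `cosineDistance()` argument. A hypothetical sketch of that last step, building the query string from a Python list; the `url` and `caption` column names are illustrative only, since the table definition is elided in this diff, and real clients should prefer query parameters over string formatting:

```python
def build_search_query(embedding, limit=20):
    # Render the embedding as a ClickHouse array literal and splice it into
    # the SELECT. Column names other than id/vector are illustrative here.
    vec = "[" + ",".join(f"{x:.6f}" for x in embedding) + "]"
    return (
        "SELECT id, url, caption, "
        f"cosineDistance(vector, {vec}) AS dist "
        "FROM laion_5b_100m "
        f"ORDER BY dist LIMIT {limit}"
    )

query = build_search_query([0.1, 0.2, 0.3])
print(query)
```

In practice the list would be the 768-dimensional vector returned by `model.encode_text(...)`, converted to plain floats before formatting.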
