---
description: 'Dataset containing 28+ million Hacker News postings & their vector embeddings'
sidebar_label: 'Hacker News Vector Search dataset'
slug: /getting-started/example-datasets/hackernews-vector-search-dataset
title: 'Hacker News Vector Search dataset'
keywords: ['semantic search', 'vector similarity', 'approximate nearest neighbours', 'embeddings']
---

## Introduction {#introduction}

The [Hacker News dataset](https://news.ycombinator.com/) contains 28.74 million
postings and their vector embeddings. The embeddings were generated using the [SentenceTransformers](https://sbert.net/) model [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2). The dimension of each embedding vector is `384`.

This dataset can be used to walk through the design, sizing, and performance aspects of a large-scale,
real-world vector search application built on top of user-generated, textual data.

## Dataset details {#dataset-details}

The complete dataset with vector embeddings is made available by ClickHouse as a single `Parquet` file in an [S3 bucket](https://clickhouse-datasets.s3.amazonaws.com/hackernews-miniLM/hackernews_part_1_of_1.parquet).

We recommend first running a sizing exercise to estimate the storage and memory requirements for this dataset by referring to the [documentation](../../engines/table-engines/mergetree-family/annindexes.md).
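
As an optional preliminary step (a sketch, not part of the original walkthrough), you can inspect the file's schema directly from S3 using the `s3` table function with schema inference, before creating the table:

```sql
DESCRIBE TABLE s3('https://clickhouse-datasets.s3.amazonaws.com/hackernews-miniLM/hackernews_part_1_of_1.parquet');
```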

## Steps {#steps}

<VerticalStepper headerLevel="h3">

### Create table {#create-table}

Create the `hackernews` table to store the postings & their embeddings and associated attributes:

```sql
CREATE TABLE hackernews
(
    `id` Int32,
    `doc_id` Int32,
    `text` String,
    `vector` Array(Float32),
    `node_info` Tuple(
        start Nullable(UInt64),
        end Nullable(UInt64)),
    `metadata` String,
    `type` Enum8('story' = 1, 'comment' = 2, 'poll' = 3, 'pollopt' = 4, 'job' = 5),
    `by` LowCardinality(String),
    `time` DateTime,
    `title` String,
    `post_score` Int32,
    `dead` UInt8,
    `deleted` UInt8,
    `length` UInt32
)
ENGINE = MergeTree
ORDER BY id;
```

The `id` is just an incrementing integer. The additional attributes can be used in predicates to explore
vector similarity search combined with post-filtering/pre-filtering, as explained in the [documentation](../../engines/table-engines/mergetree-family/annindexes.md). An example of such a filtered search is sketched below.
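
As a minimal sketch of such a filtered search (the predicate values are illustrative assumptions, not from the original text, and whether the vector index is used alongside the filter depends on the ClickHouse version and settings), a search can be restricted to high-scoring stories; replace `<search vector>` with an actual embedding:

```sql
SELECT id, title
FROM hackernews
WHERE type = 'story' AND post_score > 100
ORDER BY cosineDistance(vector, <search vector>)
LIMIT 10;
```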

### Load data {#load-table}

To load the dataset from the `Parquet` file, run the following SQL statement:

```sql
INSERT INTO hackernews SELECT * FROM s3('https://clickhouse-datasets.s3.amazonaws.com/hackernews-miniLM/hackernews_part_1_of_1.parquet');
```

Inserting 28.74 million rows into the table will take a few minutes.
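
As a quick sanity check (an optional step, not part of the original walkthrough), you can verify the row count once the load completes:

```sql
SELECT count() FROM hackernews;
```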

### Build a vector similarity index {#build-vector-similarity-index}

Run the following SQL to define and build a vector similarity index on the `vector` column of the `hackernews` table:

```sql
ALTER TABLE hackernews ADD INDEX vector_index vector TYPE vector_similarity('hnsw', 'cosineDistance', 384, 'bf16', 64, 512);

ALTER TABLE hackernews MATERIALIZE INDEX vector_index SETTINGS mutations_sync = 2;
```

The parameters and performance considerations for index creation and search are described in the [documentation](../../engines/table-engines/mergetree-family/annindexes.md).
The statement above uses values of 64 and 512 for the HNSW hyperparameters `M` and `ef_construction`, respectively.
Select optimal values for these parameters carefully, by evaluating the index build time and the search result quality obtained for each candidate setting.

Building and saving the index can take anywhere from a few minutes to a few hours for the full 28.74 million row dataset, depending on the number of CPU cores available and the storage bandwidth.
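
To check that the index was created and to see how much space it occupies, one option (a sketch, assuming the table lives in the current database) is to query the `system.data_skipping_indices` system table:

```sql
SELECT name, type, formatReadableSize(data_compressed_bytes) AS size
FROM system.data_skipping_indices
WHERE table = 'hackernews';
```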

### Perform ANN search {#perform-ann-search}

Once the vector similarity index has been built, vector search queries will automatically use the index:

```sql title="Query"
SELECT id, title, text
FROM hackernews
ORDER BY cosineDistance(vector, <search vector>)
LIMIT 10
```

Loading the vector index into memory for the first time can take a few seconds to a few minutes.
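
To confirm that a query actually uses the vector index rather than a brute-force scan, a useful check (the exact plan output varies across ClickHouse versions) is to prefix the query with `EXPLAIN indexes = 1` and look for `vector_index` among the skip indexes in the output:

```sql
EXPLAIN indexes = 1
SELECT id, title, text
FROM hackernews
ORDER BY cosineDistance(vector, <search vector>)
LIMIT 10
```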

### Generate embeddings for search query {#generating-embeddings-for-search-query}

[Sentence Transformers](https://www.sbert.net/) provide local, easy-to-use embedding
models for capturing the semantic meaning of sentences and paragraphs.

This HackerNews dataset contains vector embeddings generated from the
[all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model.

An example Python script is provided below to demonstrate how to programmatically generate
embedding vectors using the `sentence_transformers` Python package. The search embedding vector
is then passed as an argument to the [`cosineDistance()`](/sql-reference/functions/distance-functions#cosineDistance) function in the `SELECT` query.

```python
from sentence_transformers import SentenceTransformer
import sys

import clickhouse_connect

print("Initializing...")

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

chclient = clickhouse_connect.get_client() # ClickHouse credentials here

while True:
    # Read the search query from the user
    print("Enter a search query :")
    input_query = sys.stdin.readline().strip()
    texts = [input_query]

    # Run the model and obtain the search vector
    print("Generating the embedding for ", input_query)
    embeddings = model.encode(texts)

    print("Querying ClickHouse...")
    params = {'v1': list(embeddings[0]), 'v2': 20}
    result = chclient.query(
        "SELECT id, title, text FROM hackernews ORDER BY cosineDistance(vector, %(v1)s) LIMIT %(v2)s",
        parameters=params)
    print("Results :")
    for row in result.result_rows:
        print(row[0], row[2][:100])
        print("---------")
```

An example run of the above Python script, with its similarity search results, is shown below
(only the first 100 characters of each of the top 20 posts are printed):

```text
Initializing...

Enter a search query :
Are OLAP cubes useful

Generating the embedding for "Are OLAP cubes useful"

Querying ClickHouse...

Results :

27742647 smartmic:
slt2021: OLAP Cube is not dead, as long as you use some form of:<p>1. GROUP BY multiple fi
---------
27744260 georgewfraser:A data mart is a logical organization of data to help humans understand the schema. Wh
---------
27761434 mwexler:&quot;We model data according to rigorous frameworks like Kimball or Inmon because we must r
---------
28401230 chotmat:
erosenbe0: OLAP database is just a copy, replica, or archive of data with a schema designe
---------
22198879 Merick:+1 for Apache Kylin, it&#x27;s a great project and awesome open source community. If anyone i
---------
27741776 crazydoggers:I always felt the value of an OLAP cube was uncovering questions you may not know to as
---------
22189480 shadowsun7:
_Codemonkeyism: After maintaining an OLAP cube system for some years, I&#x27;m not that
---------
27742029 smartmic:
gengstrand: My first exposure to OLAP was on a team developing a front end to Essbase that
---------
22364133 irfansharif:
simo7: I&#x27;m wondering how this technology could work for OLAP cubes.<p>An OLAP cube
---------
23292746 scoresmoke:When I was developing my pet project for Web analytics (<a href="https:&#x2F;&#x2F;github
---------
22198891 js8:It seems that the article makes a categorical error, arguing that OLAP cubes were replaced by co
---------
28421602 chotmat:
7thaccount: Is there any advantage to OLAP cube over plain SQL (large historical database r
---------
22195444 shadowsun7:
lkcubing: Thanks for sharing. Interesting write up.<p>While this article accurately capt
---------
22198040 lkcubing:Thanks for sharing. Interesting write up.<p>While this article accurately captures the issu
---------
3973185 stefanu:
sgt: Interesting idea. Ofcourse, OLAP isn't just about the underlying cubes and dimensions,
---------
22190903 shadowsun7:
js8: It seems that the article makes a categorical error, arguing that OLAP cubes were r
---------
28422241 sradman:OLAP Cubes have been disrupted by Column Stores. Unless you are interested in the history of
---------
28421480 chotmat:
sradman: OLAP Cubes have been disrupted by Column Stores. Unless you are interested in the
---------
27742515 BadInformatics:
quantified: OP posts with inverted condition: “OLAP != OLAP Cube” is the actual titl
---------
28422935 chotmat:
rstuart4133: I remember hearing about OLAP cubes donkey&#x27;s years ago (probably not far
---------
```

## Summarization demo application {#summarization-demo-application}

The example above demonstrated semantic search and document retrieval using ClickHouse.

Next, a very simple but high-potential generative AI example application is presented.

The application performs the following steps:

1. Accepts a _topic_ as input from the user
2. Generates an embedding vector for the _topic_ by using `SentenceTransformers` with the model `all-MiniLM-L6-v2`
3. Retrieves highly relevant posts/comments using vector similarity search on the `hackernews` table
4. Uses `LangChain` and the OpenAI `gpt-3.5-turbo` Chat API to **summarize** the content retrieved in step #3.
   The posts/comments retrieved in step #3 are passed as _context_ to the Chat API and are the key link in the generative AI step.

An example run of the summarization application is listed first below, followed by the code
for the application. Running the application requires an OpenAI API key to be set in the environment
variable `OPENAI_API_KEY`. An OpenAI API key can be obtained after registering at https://platform.openai.com.

This application demonstrates a generative AI use case that is applicable to multiple enterprise domains, such as
customer sentiment analysis, technical support automation, mining user conversations, legal documents, medical records,
meeting transcripts, and financial statements.

```shell
$ python3 summarize.py

Enter a search topic :
ClickHouse performance experiences

Generating the embedding for ----> ClickHouse performance experiences

Querying ClickHouse to retrieve relevant articles...

Initializing chatgpt-3.5-turbo model...

Summarizing search results retrieved from ClickHouse...

Summary from chatgpt-3.5:
The discussion focuses on comparing ClickHouse with various databases like TimescaleDB, Apache Spark,
AWS Redshift, and QuestDB, highlighting ClickHouse's cost-efficient high performance and suitability
for analytical applications. Users praise ClickHouse for its simplicity, speed, and resource efficiency
in handling large-scale analytics workloads, although some challenges like DMLs and difficulty in backups
are mentioned. ClickHouse is recognized for its real-time aggregate computation capabilities and solid
engineering, with comparisons made to other databases like Druid and MemSQL. Overall, ClickHouse is seen
as a powerful tool for real-time data processing, analytics, and handling large volumes of data
efficiently, gaining popularity for its impressive performance and cost-effectiveness.
```

Code for the above application:

```python
print("Initializing...")

import sys
from sentence_transformers import SentenceTransformer

import clickhouse_connect

from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
import tiktoken

def num_tokens_from_string(string: str, model_name: str) -> int:
    # Count tokens so we can pick a summarization strategy that fits the context window
    encoding = tiktoken.encoding_for_model(model_name)
    return len(encoding.encode(string))

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

chclient = clickhouse_connect.get_client(compress=False) # ClickHouse credentials here

while True:
    # Read the search topic from the user
    print("Enter a search topic :")
    input_query = sys.stdin.readline().strip()
    texts = [input_query]

    # Run the model and obtain the search (reference) vector
    print("Generating the embedding for ----> ", input_query)
    embeddings = model.encode(texts)

    print("Querying ClickHouse...")
    params = {'v1': list(embeddings[0]), 'v2': 100}
    result = chclient.query(
        "SELECT id, title, text FROM hackernews ORDER BY cosineDistance(vector, %(v1)s) LIMIT %(v2)s",
        parameters=params)

    # Join the text of all search results into a single document
    doc_results = ""
    for row in result.result_rows:
        doc_results = doc_results + "\n" + row[2]

    print("Initializing chatgpt-3.5-turbo model")
    model_name = "gpt-3.5-turbo"

    text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
        model_name=model_name
    )

    texts = text_splitter.split_text(doc_results)

    docs = [Document(page_content=t) for t in texts]

    llm = ChatOpenAI(temperature=0, model_name=model_name)

    prompt_template = """
Write a concise summary of the following in not more than 10 sentences:


{text}


CONCISE SUMMARY :
"""

    prompt = PromptTemplate(template=prompt_template, input_variables=["text"])

    num_tokens = num_tokens_from_string(doc_results, model_name)

    gpt_35_turbo_max_tokens = 4096
    verbose = False

    print("Summarizing search results retrieved from ClickHouse...")

    # "stuff" fits all retrieved text into one prompt; "map_reduce" summarizes
    # chunks individually and then combines the partial summaries
    if num_tokens <= gpt_35_turbo_max_tokens:
        chain = load_summarize_chain(llm, chain_type="stuff", prompt=prompt, verbose=verbose)
    else:
        chain = load_summarize_chain(llm, chain_type="map_reduce", map_prompt=prompt, combine_prompt=prompt, verbose=verbose)

    summary = chain.run(docs)

    print(f"Summary from chatgpt-3.5: {summary}")
```

</VerticalStepper>