---
description: 'Dataset containing 28+ million Hacker News postings & their vector embeddings'
sidebar_label: 'Hacker News Vector Search dataset'
slug: /getting-started/example-datasets/hackernews-vector-search-dataset
title: 'Hacker News Vector Search dataset'
keywords: ['semantic search', 'vector similarity', 'approximate nearest neighbours', 'embeddings']
---

## Introduction {#introduction}

The [Hacker News dataset](https://news.ycombinator.com/) contains 28.74 million
postings and their vector embeddings. The embeddings were generated using the [SentenceTransformers](https://sbert.net/) model [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2). The dimension of each embedding vector is `384`.

This dataset can be used to walk through the design, sizing, and performance aspects of a large-scale,
real-world vector search application built on top of user-generated, textual data.

## Dataset details {#dataset-details}

The complete dataset with vector embeddings is made available by ClickHouse as a single `Parquet` file in an [S3 bucket](https://clickhouse-datasets.s3.amazonaws.com/hackernews-miniLM/hackernews_part_1_of_1.parquet).

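
Before creating the table, you can optionally inspect the schema of the Parquet file directly from S3. The following is a minimal sketch using the `s3` table function; the reported column names and types should line up with the table definition created in the next step.

```sql
-- Inspect the Parquet schema straight from the public S3 bucket (read-only, no data is loaded)
DESCRIBE TABLE s3('https://clickhouse-datasets.s3.amazonaws.com/hackernews-miniLM/hackernews_part_1_of_1.parquet');
```
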
We recommend users first run a sizing exercise to estimate the storage and memory requirements for this dataset by referring to the [documentation](../../engines/table-engines/mergetree-family/annindexes.md).

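
As a rough, illustrative back-of-envelope only (the linked documentation describes the authoritative sizing method): 28.74 million vectors of 384 `Float32` values occupy roughly 44 GB uncompressed, and assuming the `bf16` quantization used later stores about 2 bytes per dimension, the in-memory index holds roughly half that, plus graph overhead. The query below simply performs that arithmetic in SQL.

```sql
-- Back-of-envelope estimate only; see the ANN index documentation for precise sizing guidance
SELECT
    formatReadableSize(28740000 * 384 * 4) AS raw_float32_vectors,
    formatReadableSize(28740000 * 384 * 2) AS bf16_index_vectors_excl_graph_overhead;
```
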
## Steps {#steps}

<VerticalStepper headerLevel="h3">

### Create table {#create-table}

Create the `hackernews` table to store the postings, their embeddings, and the associated attributes:

```sql
CREATE TABLE hackernews
(
    `id` Int32,
    `doc_id` Int32,
    `text` String,
    `vector` Array(Float32),
    `node_info` Tuple(
        start Nullable(UInt64),
        end Nullable(UInt64)),
    `metadata` String,
    `type` Enum8('story' = 1, 'comment' = 2, 'poll' = 3, 'pollopt' = 4, 'job' = 5),
    `by` LowCardinality(String),
    `time` DateTime,
    `title` String,
    `post_score` Int32,
    `dead` UInt8,
    `deleted` UInt8,
    `length` UInt32
)
ENGINE = MergeTree
ORDER BY id;
```

The `id` is just an incrementing integer. The additional attributes can be used in predicates to explore
vector similarity search combined with post-filtering/pre-filtering, as explained in the [documentation](../../engines/table-engines/mergetree-family/annindexes.md); a filtered search example is sketched below.

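
For example, a hypothetical query combining vector search with a filter on the attribute columns might look like the following sketch (the `<search vector>` placeholder stands for a real 384-dimensional embedding):

```sql
-- Hypothetical example: restrict the nearest-neighbour search to well-scored stories
SELECT id, title, post_score
FROM hackernews
WHERE type = 'story' AND post_score > 100
ORDER BY cosineDistance(vector, <search vector>)
LIMIT 10;
```
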
### Load data {#load-table}

To load the dataset from the `Parquet` file, run the following SQL statement:

```sql
INSERT INTO hackernews SELECT * FROM s3('https://clickhouse-datasets.s3.amazonaws.com/hackernews-miniLM/hackernews_part_1_of_1.parquet');
```

Inserting 28.74 million rows into the table will take a few minutes.

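
Once the insert completes, a quick sanity check of the row count and on-disk footprint can be run; the exact compressed size will vary by ClickHouse version and settings.

```sql
-- Verify the row count
SELECT count() FROM hackernews;

-- Get a feel for the compressed and uncompressed size on disk
SELECT
    formatReadableSize(sum(data_compressed_bytes)) AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed
FROM system.parts
WHERE table = 'hackernews' AND active;
```
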
### Build a vector similarity index {#build-vector-similarity-index}

Run the following SQL to define and build a vector similarity index on the `vector` column of the `hackernews` table:

```sql
ALTER TABLE hackernews ADD INDEX vector_index vector TYPE vector_similarity('hnsw', 'cosineDistance', 384, 'bf16', 64, 512);

ALTER TABLE hackernews MATERIALIZE INDEX vector_index SETTINGS mutations_sync = 2;
```

The parameters and performance considerations for index creation and search are described in the [documentation](../../engines/table-engines/mergetree-family/annindexes.md).
The statement above uses values of 64 and 512 respectively for the HNSW hyperparameters `M` and `ef_construction`.
Users should choose these values carefully, by evaluating the index build time and the quality of search results for each candidate setting; a lower-cost variant is sketched below.

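
For instance, a lighter-weight index definition that trades some recall for a faster build might lower both hyperparameters. The values below are illustrative only, not a recommendation.

```sql
-- Illustrative only: a smaller M (32) and ef_construction (128) build faster but may reduce recall
ALTER TABLE hackernews ADD INDEX vector_index vector TYPE vector_similarity('hnsw', 'cosineDistance', 384, 'bf16', 32, 128);
```
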
Building and saving the index can take from a few minutes to a few hours for the full 28.74 million row dataset, depending on the number of available CPU cores and the storage bandwidth.

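
The progress of the `MATERIALIZE INDEX` mutation, and the presence of the index itself, can be checked from system tables, for example:

```sql
-- Check whether the index materialization mutation is still running
SELECT command, parts_to_do, is_done
FROM system.mutations
WHERE table = 'hackernews';

-- Confirm the skipping index exists on the table
SELECT name, type
FROM system.data_skipping_indices
WHERE table = 'hackernews';
```
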
### Perform ANN search {#perform-ann-search}

Once the vector similarity index has been built, vector search queries will automatically use it:

```sql title="Query"
SELECT id, title, text
FROM hackernews
ORDER BY cosineDistance(vector, <search vector>)
LIMIT 10;
```

The first load of the vector index into memory can take a few seconds to a few minutes.

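
To confirm that a query actually uses the vector similarity index rather than a brute-force scan, `EXPLAIN indexes = 1` can be used as a quick check; the exact plan output varies by ClickHouse version.

```sql
-- The plan should show the vector similarity index being used to skip granules
EXPLAIN indexes = 1
SELECT id, title
FROM hackernews
ORDER BY cosineDistance(vector, <search vector>)
LIMIT 10;
```
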
### Generate embeddings for search query {#generating-embeddings-for-search-query}

[Sentence Transformers](https://www.sbert.net/) provides local, easy-to-use embedding
models for capturing the semantic meaning of sentences and paragraphs.

This Hacker News dataset contains vector embeddings generated from the
[all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model.

An example Python script is provided below to demonstrate how to programmatically generate
embedding vectors using the `sentence_transformers` Python package. The search embedding vector
is then passed as an argument to the [`cosineDistance()`](/sql-reference/functions/distance-functions#cosineDistance) function in the `SELECT` query.

```python
from sentence_transformers import SentenceTransformer
import sys

import clickhouse_connect

print("Initializing...")

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

chclient = clickhouse_connect.get_client()  # ClickHouse credentials here

while True:
    # Take the search query from the user
    print("Enter a search query :")
    input_query = sys.stdin.readline()
    texts = [input_query]

    # Run the model and obtain the search vector
    print("Generating the embedding for ", input_query)
    embeddings = model.encode(texts)

    print("Querying ClickHouse...")
    params = {'v1': list(embeddings[0]), 'v2': 20}
    result = chclient.query("SELECT id, title, text FROM hackernews ORDER BY cosineDistance(vector, %(v1)s) LIMIT %(v2)s", parameters=params)
    print("Results :")
    for row in result.result_rows:
        print(row[0], row[2][:100])
        print("---------")
```

An example run of the above Python script and its similarity search results are shown below
(only the first 100 characters of each of the top 20 posts are printed):

```text
Initializing...

Enter a search query :
Are OLAP cubes useful

Generating the embedding for "Are OLAP cubes useful"

Querying ClickHouse...

Results :

27742647 smartmic:
slt2021: OLAP Cube is not dead, as long as you use some form of:<p>1. GROUP BY multiple fi
---------
27744260 georgewfraser:A data mart is a logical organization of data to help humans understand the schema. Wh
---------
27761434 mwexler:"We model data according to rigorous frameworks like Kimball or Inmon because we must r
---------
28401230 chotmat:
erosenbe0: OLAP database is just a copy, replica, or archive of data with a schema designe
---------
22198879 Merick:+1 for Apache Kylin, it's a great project and awesome open source community. If anyone i
---------
27741776 crazydoggers:I always felt the value of an OLAP cube was uncovering questions you may not know to as
---------
22189480 shadowsun7:
_Codemonkeyism: After maintaining an OLAP cube system for some years, I'm not that
---------
27742029 smartmic:
gengstrand: My first exposure to OLAP was on a team developing a front end to Essbase that
---------
22364133 irfansharif:
simo7: I'm wondering how this technology could work for OLAP cubes.<p>An OLAP cube
---------
23292746 scoresmoke:When I was developing my pet project for Web analytics (<a href="https://github
---------
22198891 js8:It seems that the article makes a categorical error, arguing that OLAP cubes were replaced by co
---------
28421602 chotmat:
7thaccount: Is there any advantage to OLAP cube over plain SQL (large historical database r
---------
22195444 shadowsun7:
lkcubing: Thanks for sharing. Interesting write up.<p>While this article accurately capt
---------
22198040 lkcubing:Thanks for sharing. Interesting write up.<p>While this article accurately captures the issu
---------
3973185 stefanu:
sgt: Interesting idea. Ofcourse, OLAP isn't just about the underlying cubes and dimensions,
---------
22190903 shadowsun7:
js8: It seems that the article makes a categorical error, arguing that OLAP cubes were r
---------
28422241 sradman:OLAP Cubes have been disrupted by Column Stores. Unless you are interested in the history of
---------
28421480 chotmat:
sradman: OLAP Cubes have been disrupted by Column Stores. Unless you are interested in the
---------
27742515 BadInformatics:
quantified: OP posts with inverted condition: “OLAP != OLAP Cube” is the actual titl
---------
28422935 chotmat:
rstuart4133: I remember hearing about OLAP cubes donkey's years ago (probably not far
---------
```

## Summarization demo application {#summarization-demo-application}

The example above demonstrated semantic search and document retrieval using ClickHouse.

Next, a very simple but high-potential generative AI example application is presented.

The application performs the following steps:

1. Accepts a _topic_ as input from the user
2. Generates an embedding vector for the _topic_ using the `SentenceTransformers` model `all-MiniLM-L6-v2`
3. Retrieves highly relevant posts/comments using vector similarity search on the `hackernews` table
4. Uses `LangChain` and the OpenAI `gpt-3.5-turbo` Chat API to **summarize** the content retrieved in step #3.
   The posts/comments retrieved in step #3 are passed as _context_ to the Chat API and are the key link between retrieval and Generative AI.

An example run of the summarization application is listed first, followed by the code
for the application. Running the application requires an OpenAI API key to be set in the environment
variable `OPENAI_API_KEY`. The OpenAI API key can be obtained after registering at https://platform.openai.com.

This application demonstrates a Generative AI use case that is applicable to multiple enterprise domains, such as
customer sentiment analysis, technical support automation, mining user conversations, legal documents, medical records,
meeting transcripts, and financial statements.


```shell
$ python3 summarize.py

Enter a search topic :
ClickHouse performance experiences

Generating the embedding for ----> ClickHouse performance experiences

Querying ClickHouse to retrieve relevant articles...

Initializing chatgpt-3.5-turbo model...

Summarizing search results retrieved from ClickHouse...

Summary from chatgpt-3.5:
The discussion focuses on comparing ClickHouse with various databases like TimescaleDB, Apache Spark,
AWS Redshift, and QuestDB, highlighting ClickHouse's cost-efficient high performance and suitability
for analytical applications. Users praise ClickHouse for its simplicity, speed, and resource efficiency
in handling large-scale analytics workloads, although some challenges like DMLs and difficulty in backups
are mentioned. ClickHouse is recognized for its real-time aggregate computation capabilities and solid
engineering, with comparisons made to other databases like Druid and MemSQL. Overall, ClickHouse is seen
as a powerful tool for real-time data processing, analytics, and handling large volumes of data
efficiently, gaining popularity for its impressive performance and cost-effectiveness.
```

Code for the above application:

```python
print("Initializing...")

import sys
import json
import time
from sentence_transformers import SentenceTransformer

import clickhouse_connect

from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
import textwrap
import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    encoding = tiktoken.encoding_for_model(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

chclient = clickhouse_connect.get_client(compress=False)  # ClickHouse credentials here

while True:
    # Take the search query from the user
    print("Enter a search topic :")
    input_query = sys.stdin.readline()
    texts = [input_query]

    # Run the model and obtain the search or reference vector
    print("Generating the embedding for ----> ", input_query)
    embeddings = model.encode(texts)

    print("Querying ClickHouse...")
    params = {'v1': list(embeddings[0]), 'v2': 100}
    result = chclient.query("SELECT id,title,text FROM hackernews ORDER BY cosineDistance(vector, %(v1)s) LIMIT %(v2)s", parameters=params)

    # Just join all the search results
    doc_results = ""
    for row in result.result_rows:
        doc_results = doc_results + "\n" + row[2]

    print("Initializing chatgpt-3.5-turbo model")
    model_name = "gpt-3.5-turbo"

    text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
        model_name=model_name
    )

    texts = text_splitter.split_text(doc_results)

    docs = [Document(page_content=t) for t in texts]

    llm = ChatOpenAI(temperature=0, model_name=model_name)

    prompt_template = """
Write a concise summary of the following in not more than 10 sentences:


{text}


CONCISE SUMMARY :
"""

    prompt = PromptTemplate(template=prompt_template, input_variables=["text"])

    num_tokens = num_tokens_from_string(doc_results, model_name)

    gpt_35_turbo_max_tokens = 4096
    verbose = False

    print("Summarizing search results retrieved from ClickHouse...")

    if num_tokens <= gpt_35_turbo_max_tokens:
        chain = load_summarize_chain(llm, chain_type="stuff", prompt=prompt, verbose=verbose)
    else:
        chain = load_summarize_chain(llm, chain_type="map_reduce", map_prompt=prompt, combine_prompt=prompt, verbose=verbose)

    summary = chain.run(docs)

    print(f"Summary from chatgpt-3.5: {summary}")
```

</VerticalStepper>