Commit 2837bd0

init random data guide
1 parent 0a87c87 commit 2837bd0

1 file changed: 143 additions & 0 deletions
@@ -0,0 +1,143 @@
---
sidebar_label: 'Generating random test data'
title: 'Generating random test data in ClickHouse'
slug: /guides/generating-test-data
description: 'Learn about Generating Random Test Data in ClickHouse'
show_related_blogs: true
---

# Generating Random Test Data in ClickHouse

Generating random data is useful when testing new use cases or benchmarking your implementation. ClickHouse has a [wide range of functions for generating random data](/sql-reference/functions/random-functions) that, in many cases, avoid the need for an external data generator.

This guide provides several examples of how to generate random datasets in ClickHouse with different randomness requirements.
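
As a quick orientation, the query below samples a few of these functions side by side. It is only a sketch: the column aliases and the parameters passed to each function are arbitrary choices made for illustration.

```sql
-- Draw three rows from several built-in generators.
SELECT
    rand()              AS uniform_uint32,  -- uniform UInt32
    randUniform(0, 1)   AS uniform_float,   -- uniform Float64 in [0, 1]
    randNormal(0, 1)    AS normal,          -- normal distribution
    randExponential(1)  AS exponential,     -- exponential, rate 1
    randPoisson(4)      AS poisson,         -- Poisson, mean 4
    generateUUIDv4()    AS uuid             -- random UUID
FROM numbers(3);
```

Every run returns different values, so treat the output as a way to see the shape and type of each column rather than as a reference result.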

## Simple Uniform Dataset

**Use-case**: Generate a quick dataset of user events with random timestamps and event types.

```sql
CREATE TABLE user_events (
    event_id UUID,
    user_id UInt32,
    event_type LowCardinality(String),
    event_time DateTime
) ENGINE = MergeTree
ORDER BY event_time;

-- Each rand(N) call gets a distinct argument: identical rand() calls in one
-- query are merged by common subexpression elimination and return the same value.
INSERT INTO user_events
SELECT
    generateUUIDv4() AS event_id,
    rand(1) % 10000 AS user_id,
    -- Index into the array to pick one type per row; arrayJoin would emit all three.
    ['click','view','purchase'][1 + rand(2) % 3] AS event_type,
    -- Parenthesized so the offset covers a full day (0..86399 seconds).
    now() - INTERVAL (rand(3) % (3600*24)) SECOND AS event_time
FROM numbers(1000000);
```

* `rand(1) % 10000`: user IDs drawn uniformly from 0 to 9,999
* `[...][1 + rand(2) % 3]`: randomly selects one of the three event types for each row
* Timestamps spread over the previous 24 hours
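
One way to sanity-check the generated rows is to aggregate them by event type. The query below is a sketch that assumes the `user_events` table above has been populated; the output aliases are illustrative.

```sql
-- Per event type: row count, distinct users, and the covered time range.
SELECT
    event_type,
    count()             AS events,
    uniqExact(user_id)  AS distinct_users,
    min(event_time)     AS earliest,
    max(event_time)     AS latest
FROM user_events
GROUP BY event_type
ORDER BY event_type;
```

With uniformly generated data, each event type should show a similar row count, and the time range should span roughly one day.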

---

## Exponential Distribution

**Use-case**: Simulate purchase amounts where most values are low, but a few are high.

```sql
CREATE TABLE purchases (
    dt DateTime,
    customer_id UInt32,
    total_spent Float32
) ENGINE = MergeTree
ORDER BY dt;

INSERT INTO purchases
SELECT
    -- randUniform returns Float64; cast to a whole number of seconds for INTERVAL.
    now() - INTERVAL toUInt32(randUniform(1, 1_000_000)) SECOND AS dt,
    number AS customer_id,
    -- Rate 1/10 gives a mean of 10; adding 15 sets the minimum purchase amount.
    15 + round(randExponential(1/10), 2) AS total_spent
FROM numbers(500000);
```

* Timestamps are spread uniformly over the last 1,000,000 seconds (about 11.6 days)
* `randExponential(1/10)`: most draws are small (mean 10) with a long tail of larger values; the offset of 15 sets the minimum total
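
To confirm the shape of the distribution, summary statistics can be compared against what theory predicts. An exponential distribution with rate λ has mean 1/λ, so with rate 1/10 and an offset of 15 the average should land near 25 and the minimum near 15. The query below is a sketch that assumes the `purchases` table above has been populated.

```sql
-- Summary statistics for the generated purchase amounts.
SELECT
    min(total_spent)                            AS min_spent,
    round(avg(total_spent), 2)                  AS avg_spent,
    quantiles(0.5, 0.9, 0.99)(total_spent)      AS p50_p90_p99,
    max(total_spent)                            AS max_spent
FROM purchases;
```

A median well below the mean, together with a much larger maximum, indicates the expected long right tail.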

---

## Time-Distributed Events (Poisson)

**Use-case**: Simulate event arrivals that cluster around a specific period (e.g., peak hour).

```sql
CREATE TABLE events (
    dt DateTime,
    event_type String
) ENGINE = MergeTree
ORDER BY dt;

INSERT INTO events
SELECT
    -- Subtract (12 + k) hours from the anchor time, where k ~ Poisson(12).
    -- Offsets are whole hours, so every timestamp falls exactly on an hour.
    toDateTime('2022-12-12 12:00:00')
        - ((12 + randPoisson(12)) * 3600) AS dt,
    'click' AS event_type
FROM numbers(200000);
```

* Events peak around noon on 2022-12-11, one day before the anchor time, because the offset is 12 hours plus a Poisson-distributed number of hours with mean 12
* All timestamps fall on whole hours
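
A text histogram makes the clustering visible. The sketch below assumes the `events` table above has been populated; the upper bound of 25000 and the width of 40 passed to `bar` are arbitrary values chosen to fit 200,000 rows.

```sql
-- Count events per hour and draw a simple bar chart in the result set.
SELECT
    toStartOfHour(dt)                   AS hour,
    count()                             AS event_count,
    bar(event_count, 0, 25000, 40)      AS chart
FROM events
GROUP BY hour
ORDER BY hour;
```

The bars should rise to a single peak and fall off on both sides, with a slightly longer tail toward earlier hours.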

---

## Time-Varying Normal Distribution

**Use-case**: Emulate system metrics (e.g., CPU usage) that vary over time.

```sql
CREATE TABLE cpu_metrics (
    host String,
    ts DateTime,
    usage Float32
) ENGINE = MergeTree
ORDER BY (host, ts);

INSERT INTO cpu_metrics
SELECT
    arrayJoin(['host1','host2','host3']) AS host,
    now() - INTERVAL number SECOND AS ts,
    -- Daily sine wave plus zero-mean noise, clamped to [0, 100].
    -- Adding randNormal(0, 10) to the wave keeps the distribution parameters constant.
    greatest(0.0, least(100.0,
        50 + 30*sin(toUInt32(ts)%86400/86400*2*pi())
        + randNormal(0, 10)
    )) AS usage
FROM numbers(10000);
```

* `usage` follows a diurnal sine wave (centered on 50, amplitude 30) plus `randNormal(0, 10)` noise
* Values are clamped to \[0,100]
* `arrayJoin` emits one row per host for every timestamp, so 10,000 seconds produce 30,000 rows
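
Averaging per host and hour smooths out the noise and shows the underlying trend. This is a sketch that assumes the `cpu_metrics` table above has been populated; the aliases are illustrative.

```sql
-- Hourly usage summary per host.
SELECT
    host,
    toStartOfHour(ts)           AS hour,
    round(avg(usage), 1)        AS avg_usage,
    round(min(usage), 1)        AS min_usage,
    round(max(usage), 1)        AS max_usage
FROM cpu_metrics
GROUP BY host, hour
ORDER BY host, hour;
```

Because only 10,000 seconds (under three hours) are generated, the result covers a small slice of the daily cycle; increase the argument to `numbers` to see a full wave.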

---

## Categorical & Nested Data

**Use-case**: Create user profiles with multi-valued interests.

```sql
CREATE TABLE user_profiles (
    user_id UInt32,
    interests Array(String),
    scores Array(UInt8)
) ENGINE = MergeTree
ORDER BY user_id;

INSERT INTO user_profiles
SELECT
    number AS user_id,
    -- Shuffle, then keep the first 1 to 3 elements. materialize() turns the
    -- literal into a per-row column so each row can be shuffled independently.
    arraySlice(arrayShuffle(materialize(['sports','music','tech'])), 1, 1 + rand(1) % 3) AS interests,
    -- Distinct rand(N) arguments keep the three scores independent of each other.
    [rand(2) % 100, rand(3) % 100, rand(4) % 100] AS scores
FROM numbers(20000);
```

* `interests` holds a random subset of 1 to 3 values in random order
* `scores` always holds three values between 0 and 99 per user, regardless of how many interests were kept
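
Array columns can be inspected either by length or by unfolding them into rows. The two queries below are a sketch that assumes the `user_profiles` table above has been populated; the aliases are illustrative.

```sql
-- How many profiles have 1, 2, or 3 interests.
SELECT
    length(interests) AS interest_count,
    count()           AS profiles
FROM user_profiles
GROUP BY interest_count
ORDER BY interest_count;

-- How often each interest appears across all profiles.
SELECT
    interest,
    count() AS profiles
FROM user_profiles
ARRAY JOIN interests AS interest
GROUP BY interest
ORDER BY profiles DESC;
```

The second query uses `ARRAY JOIN` to produce one row per array element, which is the usual way to aggregate over values stored inside arrays.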

:::tip
Read the [Generating Random Data in ClickHouse](https://clickhouse.com/blog/generating-random-test-distribution-data-for-clickhouse) blog for even more examples.
:::
