Skip to content

Commit e2f15e9

Browse files
authored
Merge pull request #4075 from ClickHouse/datagen_docs
Generating random data guide
2 parents 8ab041b + a63fd33 commit e2f15e9

File tree

2 files changed

+257
-0
lines changed

2 files changed

+257
-0
lines changed
Lines changed: 256 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,256 @@
1+
---
2+
sidebar_label: 'Generating random test data'
3+
title: 'Generating random test data in ClickHouse'
4+
slug: /guides/generating-test-data
5+
description: 'Learn about Generating Random Test Data in ClickHouse'
6+
show_related_blogs: true
7+
doc_type: 'guide'
8+
keywords: ['random data', 'test data']
9+
---
10+
11+
# Generating random test data in ClickHouse
12+
13+
Generating random data is useful when testing new use cases or benchmarking your implementation.
14+
ClickHouse has a [wide range of functions for generating random data](/sql-reference/functions/random-functions) that, in many cases, avoid the need for an external data generator.
15+
16+
This guide provides several examples of how to generate random datasets in ClickHouse with different randomness requirements.
17+
18+
## Simple uniform dataset {#simple-uniform-dataset}
19+
20+
**Use-case**: Generate a quick dataset of user events with random timestamps and event types.
21+
22+
```sql
23+
CREATE TABLE user_events (
24+
event_id UUID,
25+
user_id UInt32,
26+
event_type LowCardinality(String),
27+
event_time DateTime
28+
) ENGINE = MergeTree
29+
ORDER BY event_time;
30+
31+
INSERT INTO user_events
32+
SELECT
33+
generateUUIDv4() AS event_id,
34+
rand() % 10000 AS user_id,
35+
arrayJoin(['click','view','purchase']) AS event_type,
36+
now() - INTERVAL rand() % 3600*24 SECOND AS event_time
37+
FROM numbers(1000000);
38+
```
39+
40+
* `rand() % 10000`: uniform distribution of 10k users
41+
* `arrayJoin(...)`: randomly selects one of three event types
42+
* Timestamps spread over the previous 24 hours
43+
44+
---
45+
46+
## Exponential distribution {#exponential-distribution}
47+
48+
**Use-case**: Simulate purchase amounts where most values are low, but a few are high.
49+
50+
```sql
51+
CREATE TABLE purchases (
52+
dt DateTime,
53+
customer_id UInt32,
54+
total_spent Float32
55+
) ENGINE = MergeTree
56+
ORDER BY dt;
57+
58+
INSERT INTO purchases
59+
SELECT
60+
now() - INTERVAL randUniform(1,1_000_000) SECOND AS dt,
61+
number AS customer_id,
62+
15 + round(randExponential(1/10), 2) AS total_spent
63+
FROM numbers(500000);
64+
```
65+
66+
* Uniform timestamps over recent period
67+
* `randExponential(1/10)` — most totals near 0, offset by 15 as a minimum ([ClickHouse][1], [ClickHouse][2], [Atlantic.Net][3], [GitHub][4])
68+
69+
---
70+
71+
## Time-distributed events (Poisson) {#poisson-distribution}
72+
73+
**Use-case**: Simulate event arrivals that cluster around a specific period (e.g., peak hour).
74+
75+
```sql
76+
CREATE TABLE events (
77+
dt DateTime,
78+
event_type String
79+
) ENGINE = MergeTree
80+
ORDER BY dt;
81+
82+
INSERT INTO events
83+
SELECT
84+
toDateTime('2022-12-12 12:00:00')
85+
- ((12 + randPoisson(12)) * 3600) AS dt,
86+
'click' AS event_type
87+
FROM numbers(200000);
88+
```
89+
90+
* Events peak around noon, with Poisson-distributed deviation
91+
92+
---
93+
94+
## Time-varying normal distribution {#time-varying-normal-distribution}
95+
96+
**Use-case**: Emulate system metrics (e.g., CPU usage) that vary over time.
97+
98+
```sql
99+
CREATE TABLE cpu_metrics (
100+
host String,
101+
ts DateTime,
102+
usage Float32
103+
) ENGINE = MergeTree
104+
ORDER BY (host, ts);
105+
106+
INSERT INTO cpu_metrics
107+
SELECT
108+
arrayJoin(['host1','host2','host3']) AS host,
109+
now() - INTERVAL number SECOND AS ts,
110+
greatest(0.0, least(100.0,
111+
randNormal(50 + 30*sin(toUInt32(ts)%86400/86400*2*pi()), 10)
112+
)) AS usage
113+
FROM numbers(10000);
114+
```
115+
116+
* `usage` follows a diurnal sine wave + randomness
117+
* Values bounded to \[0,100]
118+
119+
---
120+
121+
## Categorical and nested data {#categorical-and-nested-data}
122+
123+
**Use-case**: Create user profiles with multi-valued interests.
124+
125+
```sql
126+
CREATE TABLE user_profiles (
127+
user_id UInt32,
128+
interests Array(String),
129+
scores Array(UInt8)
130+
) ENGINE = MergeTree
131+
ORDER BY user_id;
132+
133+
INSERT INTO user_profiles
134+
SELECT
135+
number AS user_id,
136+
arrayShuffle(['sports','music','tech'])[1 + rand() % 3 : 1 + rand() % 3] AS interests,
137+
[rand() % 100, rand() % 100, rand() % 100] AS scores
138+
FROM numbers(20000);
139+
```
140+
141+
* Random array length between 1–3
142+
* Three per-user scores for each interest
143+
144+
:::tip
145+
Read the [Generating Random Data in ClickHouse](https://clickhouse.com/blog/generating-random-test-distribution-data-for-clickhouse) blog for even more examples.
146+
:::
147+
148+
## Generating random tables {#generating-random-tables}
149+
150+
The [`generateRandomStructure`](/sql-reference/functions/other-functions#generaterandomstructure) function is particularly useful when combined with the [`generateRandom`](/sql-reference/table-functions/generate) table engine for testing, benchmarking, or creating mock data with arbitrary schemas.
151+
152+
Let's start by just seeing what a random structure looks like using the `generateRandomStructure` function:
153+
154+
```sql
155+
SELECT generateRandomStructure(5);
156+
```
157+
158+
You might see something like:
159+
160+
```response
161+
c1 UInt32, c2 Array(String), c3 DateTime, c4 Nullable(Float64), c5 Map(String, Int16)
162+
```
163+
164+
You can also use a seed to get the same structure every time:
165+
166+
```sql
167+
SELECT generateRandomStructure(3, 42);
168+
```
169+
170+
```response
171+
c1 String, c2 Array(Nullable(Int32)), c3 Tuple(UInt8, Date)
172+
```
173+
174+
Now let's create an actual table and fill it with random data:
175+
176+
```sql
177+
CREATE TABLE my_test_table
178+
ENGINE = MergeTree
179+
ORDER BY tuple()
180+
AS SELECT *
181+
FROM generateRandom(
182+
'col1 UInt32, col2 String, col3 Float64, col4 DateTime',
183+
1, -- seed for data generation
184+
10 -- number of different random values
185+
)
186+
LIMIT 100; -- 100 rows
187+
188+
-- Step 2: Query your new table
189+
SELECT * FROM my_test_table LIMIT 5;
190+
```
191+
192+
```response
193+
┌───────col1─┬─col2──────┬─────────────────────col3─┬────────────────col4─┐
194+
│ 4107652264 │ &b!M-e;7 │ 1.0013455832230728e-158 │ 2059-08-14 19:03:26 │
195+
│ 652895061 │ Dj7peUH{T │ -1.032074207667996e112 │ 2079-10-06 04:18:16 │
196+
│ 2319105779 │ =D[ │ -2.066555415720528e88 │ 2015-04-26 11:44:13 │
197+
│ 1835960063 │ _@}a │ -1.4998020545039013e110 │ 2063-03-03 20:36:55 │
198+
│ 730412674 │ _}! │ -1.3578492992094465e-275 │ 2098-08-23 18:23:37 │
199+
└────────────┴───────────┴──────────────────────────┴─────────────────────┘
200+
```
201+
202+
Let's combine both functions for a completely random table.
203+
First, see what structure we'll get:
204+
205+
```sql
206+
SELECT generateRandomStructure(7, 123) AS structure FORMAT vertical;
207+
```
208+
209+
```response
210+
┌─structure──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
211+
│ c1 Decimal64(7), c2 Enum16('c2V0' = -21744, 'c2V1' = 5380), c3 Int8, c4 UUID, c5 UUID, c6 FixedString(190), c7 Map(Enum16('c7V0' = -19581, 'c7V1' = -10024, 'c7V2' = 27615, 'c7V3' = -10177, 'c7V4' = -19644, 'c7V5' = 3554, 'c7V6' = 29073, 'c7V7' = 28800, 'c7V8' = -11512), Float64) │
212+
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
213+
```
214+
215+
Now create the table with that structure and use the `DESCRIBE` statement to see what we created:
216+
217+
```sql
218+
CREATE TABLE fully_random_table
219+
ENGINE = MergeTree
220+
ORDER BY tuple()
221+
AS SELECT *
222+
FROM generateRandom(generateRandomStructure(7, 123), 1, 10)
223+
LIMIT 1000;
224+
225+
DESCRIBE TABLE fully_random_table;
226+
```
227+
228+
```response
229+
┌─name─┬─type─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
230+
1. │ c1 │ Decimal(18, 7) │ │ │ │ │ │
231+
2. │ c2 │ Enum16('c2V0' = -21744, 'c2V1' = 5380) │ │ │ │ │ │
232+
3. │ c3 │ Int8 │ │ │ │ │ │
233+
4. │ c4 │ UUID │ │ │ │ │ │
234+
5. │ c5 │ UUID │ │ │ │ │ │
235+
6. │ c6 │ FixedString(190) │ │ │ │ │ │
236+
7. │ c7 │ Map(Enum16('c7V4' = -19644, 'c7V0' = -19581, 'c7V8' = -11512, 'c7V3' = -10177, 'c7V1' = -10024, 'c7V5' = 3554, 'c7V2' = 27615, 'c7V7' = 28800, 'c7V6' = 29073), Float64) │ │ │ │ │ │
237+
└──────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
238+
```
239+
240+
Inspect the first row for a sample of the generated data:
241+
242+
```sql
243+
SELECT * FROM fully_random_table LIMIT 1 FORMAT vertical;
244+
```
245+
246+
```response
247+
Row 1:
248+
──────
249+
c1: 80416293882.257732 -- 80.42 billion
250+
c2: c2V1
251+
c3: -84
252+
c4: 1a9429b3-fd8b-1d72-502f-c051aeb7018e
253+
c5: 7407421a-031f-eb3b-8571-44ff279ddd36
254+
c6: g̅b�&��rҵ���5C�\�|��H�>���l'V3��R�[��=3�G�LwVMR*s緾/2�J.���6#��(�h>�lە��L^�M�:�R�9%d�ž�zv��W����Y�S��_no��BP+��u��.0��UZ!x�@7:�nj%3�Λd�S�k>���w��|�&��~
255+
c7: {'c7V8':-1.160941256852442}
256+
```

sidebars.js

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -84,6 +84,7 @@ const sidebars = {
8484
"guides/inserting-data",
8585
"guides/writing-queries",
8686
"guides/developer/mutations",
87+
"guides/generating-test-data"
8788
],
8889
},
8990
{

0 commit comments

Comments
 (0)