|
| 1 | +--- |
| 2 | +sidebar_label: 'Generating random test data' |
| 3 | +title: 'Generating random test data in ClickHouse' |
| 4 | +slug: /guides/generating-test-data |
| 5 | +description: 'Learn about Generating Random Test Data in ClickHouse' |
| 6 | +show_related_blogs: true |
| 7 | +--- |
| 8 | + |
| 9 | +# Generating Random Test Data in ClickHouse |
| 10 | + |
| 11 | +Generating random data is useful when testing new use cases or benchmarking your implementation. ClickHouse has a [wide range of functions for generating random data](/sql-reference/functions/random-functions) that, in many cases, avoid the need for an external data generator. |
| 12 | + |
| 13 | +This guide provides several examples of how to generate random datasets in ClickHouse with different randomness requirements. |
| 14 | + |
| 15 | +## Simple Uniform Dataset |
| 16 | + |
| 17 | +**Use-case**: Generate a quick dataset of user events with random timestamps and event types. |
| 18 | + |
| 19 | +```sql |
| 20 | +CREATE TABLE user_events ( |
| 21 | + event_id UUID, |
| 22 | + user_id UInt32, |
| 23 | + event_type LowCardinality(String), |
| 24 | + event_time DateTime |
| 25 | +) ENGINE = MergeTree |
| 26 | +ORDER BY event_time; |
| 27 | + |
| 28 | +INSERT INTO user_events |
| 29 | +SELECT |
| 30 | + generateUUIDv4() AS event_id, |
| 31 | + rand() % 10000 AS user_id, |
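| 32 | + ['click','view','purchase'][1 + rand(1) % 3] AS event_type, |
| 33 | + now() - INTERVAL rand(2) % (3600*24) SECOND AS event_time |
| 34 | +FROM numbers(1000000); |
| 35 | +``` |
| 36 | + |
| 37 | +* `rand() % 10000`: user IDs drawn uniformly from 0 to 9,999 |
| 38 | +* `[...][1 + rand(1) % 3]`: array indexing picks one of three event types at random, keeping one output row per input row |
| 39 | +* `rand(2) % (3600*24)`: timestamps spread over the previous 24 hours; the distinct arguments to `rand` stop ClickHouse from reusing a single random value across the three columns |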
| 40 | + |
| 41 | +--- |
| 42 | + |
| 43 | +## Exponential Distribution |
| 44 | + |
| 45 | +**Use-case**: Simulate purchase amounts where most values are low, but a few are high. |
| 46 | + |
| 47 | +```sql |
| 48 | +CREATE TABLE purchases ( |
| 49 | + dt DateTime, |
| 50 | + customer_id UInt32, |
| 51 | + total_spent Float32 |
| 52 | +) ENGINE = MergeTree |
| 53 | +ORDER BY dt; |
| 54 | + |
| 55 | +INSERT INTO purchases |
| 56 | +SELECT |
| 57 | + now() - INTERVAL randUniform(1,1_000_000) SECOND AS dt, |
| 58 | + number AS customer_id, |
| 59 | + 15 + round(randExponential(1/10), 2) AS total_spent |
| 60 | +FROM numbers(500000); |
| 61 | +``` |
| 62 | + |
| 63 | +* Timestamps spread uniformly over the last 1,000,000 seconds (about 11.6 days) |
| 64 | +* `randExponential(1/10)`: rate 0.1, so the random part averages 10; most totals sit just above the 15 minimum, with a long tail of larger values |
| 65 | + |
| 66 | +--- |
| 67 | + |
| 68 | +## Time-Distributed Events (Poisson) |
| 69 | + |
| 70 | +**Use-case**: Simulate event arrivals that cluster around a specific period (e.g., peak hour). |
| 71 | + |
| 72 | +```sql |
| 73 | +CREATE TABLE events ( |
| 74 | + dt DateTime, |
| 75 | + event_type String |
| 76 | +) ENGINE = MergeTree |
| 77 | +ORDER BY dt; |
| 78 | + |
| 79 | +INSERT INTO events |
| 80 | +SELECT |
| 81 | + toDateTime('2022-12-12 12:00:00') |
| 82 | + - ((12 + randPoisson(12)) * 3600) AS dt, |
| 83 | + 'click' AS event_type |
| 84 | +FROM numbers(200000); |
| 85 | +``` |
| 86 | + |
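| 87 | +* Timestamps fall on whole hours before 2022-12-12 12:00:00; because `randPoisson(12)` has a mean of 12, the offset averages 24 hours, so events peak around noon on 2022-12-11 with Poisson-distributed deviation |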
| 88 | + |
| 89 | +--- |
| 90 | + |
| 91 | +## Time-Varying Normal Distribution |
| 92 | + |
| 93 | +**Use-case**: Emulate system metrics (e.g., CPU usage) that vary over time. |
| 94 | + |
| 95 | +```sql |
| 96 | +CREATE TABLE cpu_metrics ( |
| 97 | + host String, |
| 98 | + ts DateTime, |
| 99 | + usage Float32 |
| 100 | +) ENGINE = MergeTree |
| 101 | +ORDER BY (host, ts); |
| 102 | + |
| 103 | +INSERT INTO cpu_metrics |
| 104 | +SELECT |
| 105 | + arrayJoin(['host1','host2','host3']) AS host, |
| 106 | + now() - INTERVAL number SECOND AS ts, |
| 107 | + greatest(0.0, least(100.0, |
| 108 | + 50 + 30*sin(toUInt32(ts)%86400/86400*2*pi()) + randNormal(0, 10) |
| 109 | + )) AS usage |
| 110 | +FROM numbers(10000); |
| 111 | +``` |
| 112 | + |
| 113 | +* `usage` follows a diurnal sine wave + randomness |
| 114 | +* Values bounded to \[0,100] |
| 115 | + |
| 116 | +--- |
| 117 | + |
| 118 | +## Categorical & Nested Data |
| 119 | + |
| 120 | +**Use-case**: Create user profiles with multi-valued interests. |
| 121 | + |
| 122 | +```sql |
| 123 | +CREATE TABLE user_profiles ( |
| 124 | + user_id UInt32, |
| 125 | + interests Array(String), |
| 126 | + scores Array(UInt8) |
| 127 | +) ENGINE = MergeTree |
| 128 | +ORDER BY user_id; |
| 129 | + |
| 130 | +INSERT INTO user_profiles |
| 131 | +SELECT |
| 132 | + number AS user_id, |
| 133 | + arraySlice(arrayShuffle(['sports','music','tech']), 1, 1 + rand() % 3) AS interests, |
| 134 | + [rand(1) % 100, rand(2) % 100, rand(3) % 100] AS scores |
| 135 | +FROM numbers(20000); |
| 136 | +``` |
| 137 | + |
| 138 | +* `arraySlice(..., 1, 1 + rand() % 3)`: keeps the first 1 to 3 interests of the shuffled array |
| 139 | +* `scores`: three values from 0 to 99 per user; the distinct `rand` arguments keep the three draws from collapsing into one repeated value |
| 140 | + |
| 141 | +:::tip |
| 142 | +Read the [Generating Random Data in ClickHouse](https://clickhouse.com/blog/generating-random-test-distribution-data-for-clickhouse) blog for even more examples. |
| 143 | +::: |
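
One way to sanity-check the generated data is to aggregate each table and compare the result against the intended distribution. The queries below are only a sketch: they assume the `user_events`, `purchases`, and `events` tables from the examples above have been created and populated, and the column aliases (`n`, `hour_start`, `arrivals`) are illustrative names. Because the data is random, the numbers will differ from run to run.

```sql
-- Event types: the three values should appear in roughly equal proportions.
SELECT event_type, count() AS n
FROM user_events
GROUP BY event_type
ORDER BY event_type;

-- Exponential totals: the minimum should be at least 15, and the mean
-- should land near 25 (offset 15 plus the exponential mean 1 / 0.1 = 10).
SELECT
    min(total_spent) AS min_spent,
    round(avg(total_spent), 2) AS avg_spent,
    max(total_spent) AS max_spent
FROM purchases;

-- Poisson arrivals: hourly counts should rise to a single peak and fall off.
SELECT toStartOfHour(dt) AS hour_start, count() AS arrivals
FROM events
GROUP BY hour_start
ORDER BY hour_start;
```

If a result looks far from the expected shape, the generating expression is the first place to look, for example a repeated `rand()` call or an operator-precedence slip in a modulo.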