|
| 1 | +--- |
| 2 | +sidebar_label: 'Generating random test data' |
| 3 | +title: 'Generating random test data in ClickHouse' |
| 4 | +slug: /guides/generating-test-data |
| 5 | +description: 'Learn about Generating Random Test Data in ClickHouse' |
| 6 | +show_related_blogs: true |
| 7 | +doc_type: 'guide' |
| 8 | +keywords: ['random data', 'test data'] |
| 9 | +--- |
| 10 | + |
| 11 | +# Generating random test data in ClickHouse |
| 12 | + |
| 13 | +Generating random data is useful when testing new use cases or benchmarking your implementation. |
| 14 | +ClickHouse has a [wide range of functions for generating random data](/sql-reference/functions/random-functions) that, in many cases, avoid the need for an external data generator. |
| 15 | + |
| 16 | +This guide provides several examples of how to generate random datasets in ClickHouse with different randomness requirements. |
| 17 | + |
| 18 | +## Simple uniform dataset {#simple-uniform-dataset} |
| 19 | + |
| 20 | +**Use-case**: Generate a quick dataset of user events with random timestamps and event types. |
| 21 | + |
| 22 | +```sql |
| 23 | +CREATE TABLE user_events ( |
| 24 | + event_id UUID, |
| 25 | + user_id UInt32, |
| 26 | + event_type LowCardinality(String), |
| 27 | + event_time DateTime |
| 28 | +) ENGINE = MergeTree |
| 29 | +ORDER BY event_time; |
| 30 | + |
| 31 | +INSERT INTO user_events |
| 32 | +SELECT |
| 33 | + generateUUIDv4() AS event_id, |
| 34 | + rand() % 10000 AS user_id, |
| 35 | + arrayJoin(['click','view','purchase']) AS event_type, |
| 36 | + now() - INTERVAL rand() % 3600*24 SECOND AS event_time |
| 37 | +FROM numbers(1000000); |
| 38 | +``` |
| 39 | + |
| 40 | +* `rand() % 10000`: uniform distribution of 10k users |
| 41 | +* `arrayJoin(...)`: randomly selects one of three event types |
| 42 | +* Timestamps spread over the previous 24 hours |
| 43 | + |
| 44 | +--- |
| 45 | + |
| 46 | +## Exponential distribution {#exponential-distribution} |
| 47 | + |
| 48 | +**Use-case**: Simulate purchase amounts where most values are low, but a few are high. |
| 49 | + |
| 50 | +```sql |
| 51 | +CREATE TABLE purchases ( |
| 52 | + dt DateTime, |
| 53 | + customer_id UInt32, |
| 54 | + total_spent Float32 |
| 55 | +) ENGINE = MergeTree |
| 56 | +ORDER BY dt; |
| 57 | + |
| 58 | +INSERT INTO purchases |
| 59 | +SELECT |
| 60 | + now() - INTERVAL randUniform(1,1_000_000) SECOND AS dt, |
| 61 | + number AS customer_id, |
| 62 | + 15 + round(randExponential(1/10), 2) AS total_spent |
| 63 | +FROM numbers(500000); |
| 64 | +``` |
| 65 | + |
| 66 | +* Uniform timestamps over recent period |
| 67 | +* `randExponential(1/10)` — most totals near 0, offset by 15 as a minimum ([ClickHouse][1], [ClickHouse][2], [Atlantic.Net][3], [GitHub][4]) |
| 68 | + |
| 69 | +--- |
| 70 | + |
| 71 | +## Time-distributed events (Poisson) {#poisson-distribution} |
| 72 | + |
| 73 | +**Use-case**: Simulate event arrivals that cluster around a specific period (e.g., peak hour). |
| 74 | + |
| 75 | +```sql |
| 76 | +CREATE TABLE events ( |
| 77 | + dt DateTime, |
| 78 | + event_type String |
| 79 | +) ENGINE = MergeTree |
| 80 | +ORDER BY dt; |
| 81 | + |
| 82 | +INSERT INTO events |
| 83 | +SELECT |
| 84 | + toDateTime('2022-12-12 12:00:00') |
| 85 | + - ((12 + randPoisson(12)) * 3600) AS dt, |
| 86 | + 'click' AS event_type |
| 87 | +FROM numbers(200000); |
| 88 | +``` |
| 89 | + |
| 90 | +* Events peak around noon, with Poisson-distributed deviation |
| 91 | + |
| 92 | +--- |
| 93 | + |
| 94 | +## Time-varying normal distribution {#time-varying-normal-distribution} |
| 95 | + |
| 96 | +**Use-case**: Emulate system metrics (e.g., CPU usage) that vary over time. |
| 97 | + |
| 98 | +```sql |
| 99 | +CREATE TABLE cpu_metrics ( |
| 100 | + host String, |
| 101 | + ts DateTime, |
| 102 | + usage Float32 |
| 103 | +) ENGINE = MergeTree |
| 104 | +ORDER BY (host, ts); |
| 105 | + |
| 106 | +INSERT INTO cpu_metrics |
| 107 | +SELECT |
| 108 | + arrayJoin(['host1','host2','host3']) AS host, |
| 109 | + now() - INTERVAL number SECOND AS ts, |
| 110 | + greatest(0.0, least(100.0, |
| 111 | + randNormal(50 + 30*sin(toUInt32(ts)%86400/86400*2*pi()), 10) |
| 112 | + )) AS usage |
| 113 | +FROM numbers(10000); |
| 114 | +``` |
| 115 | + |
| 116 | +* `usage` follows a diurnal sine wave + randomness |
| 117 | +* Values bounded to \[0,100] |
| 118 | + |
| 119 | +--- |
| 120 | + |
| 121 | +## Categorical and nested data {#categorical-and-nested-data} |
| 122 | + |
| 123 | +**Use-case**: Create user profiles with multi-valued interests. |
| 124 | + |
| 125 | +```sql |
| 126 | +CREATE TABLE user_profiles ( |
| 127 | + user_id UInt32, |
| 128 | + interests Array(String), |
| 129 | + scores Array(UInt8) |
| 130 | +) ENGINE = MergeTree |
| 131 | +ORDER BY user_id; |
| 132 | + |
| 133 | +INSERT INTO user_profiles |
| 134 | +SELECT |
| 135 | + number AS user_id, |
| 136 | + arrayShuffle(['sports','music','tech'])[1 + rand() % 3 : 1 + rand() % 3] AS interests, |
| 137 | + [rand() % 100, rand() % 100, rand() % 100] AS scores |
| 138 | +FROM numbers(20000); |
| 139 | +``` |
| 140 | + |
| 141 | +* Random array length between 1–3 |
| 142 | +* Three per-user scores for each interest |
| 143 | + |
| 144 | +:::tip |
| 145 | +Read the [Generating Random Data in ClickHouse](https://clickhouse.com/blog/generating-random-test-distribution-data-for-clickhouse) blog for even more examples. |
| 146 | +::: |
| 147 | + |
| 148 | +## Generating random tables {#generating-random-tables} |
| 149 | + |
| 150 | +The [`generateRandomStructure`](/sql-reference/functions/other-functions#generaterandomstructure) function is particularly useful when combined with the [`generateRandom`](/sql-reference/table-functions/generate) table engine for testing, benchmarking, or creating mock data with arbitrary schemas. |
| 151 | + |
| 152 | +Let's start by just seeing what a random structure looks like using the `generateRandomStructure` function: |
| 153 | + |
| 154 | +```sql |
| 155 | +SELECT generateRandomStructure(5); |
| 156 | +``` |
| 157 | + |
| 158 | +You might see something like: |
| 159 | + |
| 160 | +```response |
| 161 | +c1 UInt32, c2 Array(String), c3 DateTime, c4 Nullable(Float64), c5 Map(String, Int16) |
| 162 | +``` |
| 163 | + |
| 164 | +You can also use a seed to get the same structure every time: |
| 165 | + |
| 166 | +```sql |
| 167 | +SELECT generateRandomStructure(3, 42); |
| 168 | +``` |
| 169 | + |
| 170 | +```response |
| 171 | +c1 String, c2 Array(Nullable(Int32)), c3 Tuple(UInt8, Date) |
| 172 | +``` |
| 173 | + |
| 174 | +Now let's create an actual table and fill it with random data: |
| 175 | + |
| 176 | +```sql |
| 177 | +CREATE TABLE my_test_table |
| 178 | +ENGINE = MergeTree |
| 179 | +ORDER BY tuple() |
| 180 | +AS SELECT * |
| 181 | +FROM generateRandom( |
| 182 | + 'col1 UInt32, col2 String, col3 Float64, col4 DateTime', |
| 183 | + 1, -- seed for data generation |
| 184 | + 10 -- number of different random values |
| 185 | +) |
| 186 | +LIMIT 100; -- 100 rows |
| 187 | + |
| 188 | +-- Step 2: Query your new table |
| 189 | +SELECT * FROM my_test_table LIMIT 5; |
| 190 | +``` |
| 191 | + |
| 192 | +```response |
| 193 | +┌───────col1─┬─col2──────┬─────────────────────col3─┬────────────────col4─┐ |
| 194 | +│ 4107652264 │ &b!M-e;7 │ 1.0013455832230728e-158 │ 2059-08-14 19:03:26 │ |
| 195 | +│ 652895061 │ Dj7peUH{T │ -1.032074207667996e112 │ 2079-10-06 04:18:16 │ |
| 196 | +│ 2319105779 │ =D[ │ -2.066555415720528e88 │ 2015-04-26 11:44:13 │ |
| 197 | +│ 1835960063 │ _@}a │ -1.4998020545039013e110 │ 2063-03-03 20:36:55 │ |
| 198 | +│ 730412674 │ _}! │ -1.3578492992094465e-275 │ 2098-08-23 18:23:37 │ |
| 199 | +└────────────┴───────────┴──────────────────────────┴─────────────────────┘ |
| 200 | +``` |
| 201 | + |
| 202 | +Let's combine both functions for a completely random table. |
| 203 | +First, see what structure we'll get: |
| 204 | + |
| 205 | +```sql |
| 206 | +SELECT generateRandomStructure(7, 123) AS structure FORMAT vertical; |
| 207 | +``` |
| 208 | + |
| 209 | +```response |
| 210 | +┌─structure──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐ |
| 211 | +│ c1 Decimal64(7), c2 Enum16('c2V0' = -21744, 'c2V1' = 5380), c3 Int8, c4 UUID, c5 UUID, c6 FixedString(190), c7 Map(Enum16('c7V0' = -19581, 'c7V1' = -10024, 'c7V2' = 27615, 'c7V3' = -10177, 'c7V4' = -19644, 'c7V5' = 3554, 'c7V6' = 29073, 'c7V7' = 28800, 'c7V8' = -11512), Float64) │ |
| 212 | +└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘ |
| 213 | +``` |
| 214 | + |
| 215 | +Now create the table with that structure and use the `DESCRIBE` statement to see what we created: |
| 216 | + |
| 217 | +```sql |
| 218 | +CREATE TABLE fully_random_table |
| 219 | +ENGINE = MergeTree |
| 220 | +ORDER BY tuple() |
| 221 | +AS SELECT * |
| 222 | +FROM generateRandom(generateRandomStructure(7, 123), 1, 10) |
| 223 | +LIMIT 1000; |
| 224 | + |
| 225 | +DESCRIBE TABLE fully_random_table; |
| 226 | +``` |
| 227 | + |
| 228 | +```response |
| 229 | + ┌─name─┬─type─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐ |
| 230 | +1. │ c1 │ Decimal(18, 7) │ │ │ │ │ │ |
| 231 | +2. │ c2 │ Enum16('c2V0' = -21744, 'c2V1' = 5380) │ │ │ │ │ │ |
| 232 | +3. │ c3 │ Int8 │ │ │ │ │ │ |
| 233 | +4. │ c4 │ UUID │ │ │ │ │ │ |
| 234 | +5. │ c5 │ UUID │ │ │ │ │ │ |
| 235 | +6. │ c6 │ FixedString(190) │ │ │ │ │ │ |
| 236 | +7. │ c7 │ Map(Enum16('c7V4' = -19644, 'c7V0' = -19581, 'c7V8' = -11512, 'c7V3' = -10177, 'c7V1' = -10024, 'c7V5' = 3554, 'c7V2' = 27615, 'c7V7' = 28800, 'c7V6' = 29073), Float64) │ │ │ │ │ │ |
| 237 | + └──────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘ |
| 238 | +``` |
| 239 | + |
| 240 | +Inspect the first row for a sample of the generated data: |
| 241 | + |
| 242 | +```sql |
| 243 | +SELECT * FROM fully_random_table LIMIT 1 FORMAT vertical; |
| 244 | +``` |
| 245 | + |
| 246 | +```response |
| 247 | +Row 1: |
| 248 | +────── |
| 249 | +c1: 80416293882.257732 -- 80.42 billion |
| 250 | +c2: c2V1 |
| 251 | +c3: -84 |
| 252 | +c4: 1a9429b3-fd8b-1d72-502f-c051aeb7018e |
| 253 | +c5: 7407421a-031f-eb3b-8571-44ff279ddd36 |
| 254 | +c6: g̅b�&��rҵ���5C�\�|��H�>���l'V3��R�[��=3�G�LwVMR*s緾/2�J.���6#��(�h>�lە��L^�M�:�R�9%d�ž�zv��W����Y�S��_no��BP+��u��.0��UZ!x�@7:�nj%3�Λd�S�k>���w��|�&��~ |
| 255 | +c7: {'c7V8':-1.160941256852442} |
| 256 | +``` |
0 commit comments