|
| 1 | +--- |
| 2 | +slug: /cloud/guides/data-masking |
| 3 | +sidebar_label: 'Data masking' |
| 4 | +title: 'Data masking in ClickHouse' |
| 5 | +description: 'A guide to data masking in ClickHouse' |
| 6 | +keywords: ['data masking'] |
| 7 | +--- |
| 8 | + |
| 9 | +# Data masking in ClickHouse |
| 10 | + |
| 11 | +Data masking is a technique used for data protection, in which the original data is replaced with a version of the data which maintains its format and structure while removing any personally identifiable information (PII) or sensitive information. |
| 12 | +This guide shows you how you can mask data in ClickHouse. |
| 13 | + |
| 14 | +## Use string replacement functions {#using-string-functions} |
| 15 | + |
| 16 | +For basic data masking use cases, the `replace` family of functions offers a convenient way to mask data: |
| 17 | + |
| 18 | +| Function | Description | |
| 19 | +|------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------| |
| 20 | +| [`replaceOne`](/sql-reference/functions/string-replace-functions#replaceone) | Replaces the first occurrence of a pattern in a haystack string with the provided replacement string. | |
| 21 | +| [`replaceAll`](/sql-reference/functions/string-replace-functions#replaceall) | Replaces all occurrences of a pattern in a haystack string with the provided replacement string. | |
| 22 | +| [`replaceRegexpOne`](/sql-reference/functions/string-replace-functions#replaceregexpone) | Replaces the first occurrence of a substring matching a regular expression pattern (in re2 syntax) in a haystack with the provided replacement string. | |
| 23 | +| [`replaceRegexpAll`](/sql-reference/functions/string-replace-functions#replaceregexpall) | Replaces all occurrences of a substring matching a regular expression pattern (in re2 syntax) in a haystack with the provided replacement string. | |
| 24 | + |
| 25 | +For example, you can replace the name "John Smith" with a placeholder `[CUSTOMER_NAME]` using the `replaceOne` function: |
| 26 | + |
| 27 | +```sql title="Query" |
| 28 | +SELECT replaceOne( |
| 29 | + 'Customer John Smith called about his account', |
| 30 | + 'John Smith', |
| 31 | + '[CUSTOMER_NAME]' |
| 32 | +) AS anonymized_text; |
| 33 | +``` |
| 34 | + |
| 35 | +```response title="Response" |
| 36 | +┌─anonymized_text───────────────────────────────────┐ |
| 37 | +│ Customer [CUSTOMER_NAME] called about his account │ |
| 38 | +└───────────────────────────────────────────────────┘ |
| 39 | +``` |
| 40 | + |
| 41 | +More generically, you can use the `replaceRegexpOne` to replace any customer name: |
| 42 | + |
| 43 | +```sql title="Query" |
| 44 | +SELECT |
| 45 | + replaceRegexpAll( |
| 46 | + 'Customer John Smith called. Later, Mary Johnson and Bob Wilson also called.', |
| 47 | + '\\b[A-Z][a-z]+ [A-Z][a-z]+\\b', |
| 48 | + '[CUSTOMER_NAME]' |
| 49 | + ) AS anonymized_text; |
| 50 | +``` |
| 51 | + |
| 52 | +```response title="Response" |
| 53 | +┌─anonymized_text───────────────────────────────────────────────────────────────────────┐ |
| 54 | +│ [CUSTOMER_NAME] Smith called. Later, [CUSTOMER_NAME] and [CUSTOMER_NAME] also called. │ |
| 55 | +└───────────────────────────────────────────────────────────────────────────────────────┘ |
| 56 | +``` |
| 57 | + |
| 58 | +Or you could mask a social security number, leaving only the last 4 digits using the `replaceRegexpAll` function. |
| 59 | + |
| 60 | +```sql title="Query" |
| 61 | +SELECT replaceRegexpAll( |
| 62 | + 'SSN: 123-45-6789', |
| 63 | + '(\d{3})-(\d{2})-(\d{4})', |
| 64 | + 'XXX-XX-\3' |
| 65 | +) AS masked_ssn; |
| 66 | +``` |
| 67 | + |
| 68 | +In the query above `\3` is used to substitute the third capture group into the resulting string, which produces: |
| 69 | + |
| 70 | +```response title="Response" |
| 71 | +┌─masked_ssn───────┐ |
| 72 | +│ SSN: XXX-XX-6789 │ |
| 73 | +└──────────────────┘ |
| 74 | +``` |
| 75 | + |
| 76 | +## Create masked `VIEW`s {#masked-views} |
| 77 | + |
| 78 | +A [`VIEW`](/sql-reference/statements/create/view) can be used in conjunction with the aforementioned string functions to apply transformations to columns containing sensitive data, before they are presented to the user. |
| 79 | +In this way, the original data remains unchanged, and users querying the view see only the masked data. |
| 80 | + |
| 81 | +To demonstrate, let's imagine that we have a table which stores records of customer orders. |
| 82 | +We want to make sure that a group of employees can view the information, but we don't want them to see the full information of the customers. |
| 83 | + |
| 84 | +Run the query below to create an example table `orders` and insert some fictional customer order records into it: |
| 85 | + |
| 86 | +```sql |
| 87 | +CREATE TABLE orders ( |
| 88 | + user_id UInt32, |
| 89 | + name String, |
| 90 | + email String, |
| 91 | + phone String, |
| 92 | + total_amount Decimal(10,2), |
| 93 | + order_date Date, |
| 94 | + shipping_address String |
| 95 | +) |
| 96 | +ENGINE = MergeTree() |
| 97 | +ORDER BY user_id; |
| 98 | + |
| 99 | +INSERT INTO orders VALUES |
| 100 | + (1001, 'John Smith', 'john.smith@gmail.com', '555-123-4567', 299.99, '2024-01-15', '123 Main St, New York, NY 10001'), |
| 101 | + (1002, 'Sarah Johnson', 'sarah.johnson@outlook.com', '555-987-6543', 149.50, '2024-01-16', '456 Oak Ave, Los Angeles, CA 90210'), |
| 102 | + (1003, 'Michael Brown', 'mbrown@company.com', '555-456-7890', 599.00, '2024-01-17', '789 Pine Rd, Chicago, IL 60601'), |
| 103 | + (1004, 'Emily Rogers', 'emily.rogers@yahoo.com', '555-321-0987', 89.99, '2024-01-18', '321 Elm St, Houston, TX 77001'), |
| 104 | + (1005, 'David Wilson', 'dwilson@email.net', '555-654-3210', 449.75, '2024-01-19', '654 Cedar Blvd, Phoenix, AZ 85001'); |
| 105 | +``` |
| 106 | + |
| 107 | +Create a view called `masked_orders`: |
| 108 | + |
| 109 | +```sql |
| 110 | +CREATE VIEW masked_orders AS |
| 111 | +SELECT |
| 112 | + user_id, |
| 113 | + replaceRegexpOne(name, '^([A-Za-z]+)\\s+(.*)$', '\\1 ****') AS name, |
| 114 | + replaceRegexpOne(email, '^(.{0})[^@]*(@.*)$', '\\1****\\2') AS email, |
| 115 | + replaceRegexpOne(phone, '^(\\d{3})-(\\d{3})-(\\d{4})$', '\\1-***-\\3') AS phone, |
| 116 | + total_amount, |
| 117 | + order_date, |
| 118 | + replaceRegexpOne(shipping_address, '^[^,]+,\\s*(.*)$', '*** \\1') AS shipping_address |
| 119 | +FROM orders; |
| 120 | +``` |
| 121 | + |
| 122 | +In the `SELECT` clause of the view creation query above, we define transformations using the `replaceRegexpOne` on the `name`, `email`, `phone` and `shipping_address` fields, which are the fields containing sensitive information that we wish to partially mask. |
| 123 | + |
| 124 | +Select the data from the view: |
| 125 | + |
| 126 | +```sql title="Query" |
| 127 | +SELECT * FROM masked_orders |
| 128 | +``` |
| 129 | + |
| 130 | +```response title="Response" |
| 131 | +┌─user_id─┬─name─────────┬─email──────────────┬─phone────────┬─total_amount─┬─order_date─┬─shipping_address──────────┐ |
| 132 | +│ 1001 │ John **** │ jo****@gmail.com │ 555-***-4567 │ 299.99 │ 2024-01-15 │ *** New York, NY 10001 │ |
| 133 | +│ 1002 │ Sarah **** │ sa****@outlook.com │ 555-***-6543 │ 149.5 │ 2024-01-16 │ *** Los Angeles, CA 90210 │ |
| 134 | +│ 1003 │ Michael **** │ mb****@company.com │ 555-***-7890 │ 599 │ 2024-01-17 │ *** Chicago, IL 60601 │ |
| 135 | +│ 1004 │ Emily **** │ em****@yahoo.com │ 555-***-0987 │ 89.99 │ 2024-01-18 │ *** Houston, TX 77001 │ |
| 136 | +│ 1005 │ David **** │ dw****@email.net │ 555-***-3210 │ 449.75 │ 2024-01-19 │ *** Phoenix, AZ 85001 │ |
| 137 | +└─────────┴──────────────┴────────────────────┴──────────────┴──────────────┴────────────┴───────────────────────────┘ |
| 138 | +``` |
| 139 | + |
| 140 | +Notice that the data returned from the view is partially masked, obfuscating the sensitive information. |
| 141 | +You can also create multiple views, with differing levels of obfuscation depending on the level of privileged access to information the viewer has. |
| 142 | + |
| 143 | +To ensure that users are only able to access the view returning the masked data, and not the table with the original unmasked data, you should use [Role Based Access Control](/cloud/security/cloud-access-management/overview) to ensure that specific roles only have grants to select from the view. |
| 144 | + |
| 145 | +First create the role: |
| 146 | + |
| 147 | +```sql |
| 148 | +CREATE ROLE masked_orders_viewer; |
| 149 | +``` |
| 150 | + |
| 151 | +Next grant `SELECT` privileges on the view to the role: |
| 152 | + |
| 153 | +```sql |
| 154 | +GRANT SELECT ON masked_orders TO masked_orders_viewer; |
| 155 | +``` |
| 156 | + |
| 157 | +Because ClickHouse roles are additive, you must ensure that users who should only see the masked view do not have any `SELECT` privilege on the base table via any role. |
| 158 | + |
| 159 | +As such, you should explicitly revoke base-table access to be safe: |
| 160 | + |
| 161 | +```sql |
| 162 | +REVOKE SELECT ON orders FROM masked_orders_viewer; |
| 163 | +``` |
| 164 | + |
| 165 | +Finally, assign the role to the appropriate users: |
| 166 | + |
| 167 | +```sql |
| 168 | +GRANT masked_orders_viewer TO your_user; |
| 169 | +``` |
| 170 | + |
| 171 | +This ensures that users with the `masked_orders_viewer` role are only able to see |
| 172 | +the masked data from the view and not the original unmasked data from the table. |
| 173 | + |
| 174 | +## Use `MATERIALIZED` columns and column-level access restrictions {#materialized-ephemeral-column-restrictions} |
| 175 | + |
| 176 | +In cases where you don't want to create a separate view, you can store masked versions of your data alongside the original data. |
| 177 | +To do so, you can use [materialized columns](/sql-reference/statements/create/table#materialized). |
| 178 | +Values of such columns are automatically calculated according to the specified materialized expression when rows are inserted, |
| 179 | +and we can use them to create new columns with masked versions of the data. |
| 180 | + |
| 181 | +Taking the example before, instead of creating a separate `VIEW` for the masked data, we'll now create masked columns using `MATERIALIZED`: |
| 182 | + |
| 183 | +```sql |
| 184 | +DROP TABLE IF EXISTS orders; |
| 185 | +CREATE TABLE orders ( |
| 186 | + user_id UInt32, |
| 187 | + name String, |
| 188 | + name_masked String MATERIALIZED replaceRegexpOne(name, '^([A-Za-z]+)\\s+(.*)$', '\\1 ****'), |
| 189 | + email String, |
| 190 | + email_masked String MATERIALIZED replaceRegexpOne(email, '^(.{0})[^@]*(@.*)$', '\\1****\\2'), |
| 191 | + phone String, |
| 192 | + phone_masked String MATERIALIZED replaceRegexpOne(phone, '^(\\d{3})-(\\d{3})-(\\d{4})$', '\\1-***-\\3'), |
| 193 | + total_amount Decimal(10,2), |
| 194 | + order_date Date, |
| 195 | + shipping_address String, |
| 196 | + shipping_address_masked String MATERIALIZED replaceRegexpOne(shipping_address, '^[^,]+,\\s*(.*)$', '*** \\1') |
| 197 | +) |
| 198 | +ENGINE = MergeTree() |
| 199 | +ORDER BY user_id; |
| 200 | + |
| 201 | +INSERT INTO orders VALUES |
| 202 | + (1001, 'John Smith', 'john.smith@gmail.com', '555-123-4567', 299.99, '2024-01-15', '123 Main St, New York, NY 10001'), |
| 203 | + (1002, 'Sarah Johnson', 'sarah.johnson@outlook.com', '555-987-6543', 149.50, '2024-01-16', '456 Oak Ave, Los Angeles, CA 90210'), |
| 204 | + (1003, 'Michael Brown', 'mbrown@company.com', '555-456-7890', 599.00, '2024-01-17', '789 Pine Rd, Chicago, IL 60601'), |
| 205 | + (1004, 'Emily Rogers', 'emily.rogers@yahoo.com', '555-321-0987', 89.99, '2024-01-18', '321 Elm St, Houston, TX 77001'), |
| 206 | + (1005, 'David Wilson', 'dwilson@email.net', '555-654-3210', 449.75, '2024-01-19', '654 Cedar Blvd, Phoenix, AZ 85001'); |
| 207 | +``` |
| 208 | + |
| 209 | +If you now run the following select query, you will see that the masked data is 'materialized' at insert time and stored alongside the original, unmasked data. |
| 210 | +It is necessary to explicitly select the masked columns as ClickHouse doesn't automatically include materialized columns in `SELECT *` queries by default. |
| 211 | + |
| 212 | +```sql title="Query" |
| 213 | +SELECT |
| 214 | + *, |
| 215 | + name_masked, |
| 216 | + email_masked, |
| 217 | + phone_masked, |
| 218 | + shipping_address_masked |
| 219 | +FROM orders |
| 220 | +ORDER BY user_id ASC |
| 221 | +``` |
| 222 | + |
| 223 | +```response title="Response" |
| 224 | + ┌─user_id─┬─name──────────┬─email─────────────────────┬─phone────────┬─total_amount─┬─order_date─┬─shipping_address───────────────────┬─name_masked──┬─email_masked───────┬─phone_masked─┬─shipping_address_masked────┐ |
| 225 | +1. │ 1001 │ John Smith │ john.smith@gmail.com │ 555-123-4567 │ 299.99 │ 2024-01-15 │ 123 Main St, New York, NY 10001 │ John **** │ jo****@gmail.com │ 555-***-4567 │ **** New York, NY 10001 │ |
| 226 | +2. │ 1002 │ Sarah Johnson │ sarah.johnson@outlook.com │ 555-987-6543 │ 149.5 │ 2024-01-16 │ 456 Oak Ave, Los Angeles, CA 90210 │ Sarah **** │ sa****@outlook.com │ 555-***-6543 │ **** Los Angeles, CA 90210 │ |
| 227 | +3. │ 1003 │ Michael Brown │ mbrown@company.com │ 555-456-7890 │ 599 │ 2024-01-17 │ 789 Pine Rd, Chicago, IL 60601 │ Michael **** │ mb****@company.com │ 555-***-7890 │ **** Chicago, IL 60601 │ |
| 228 | +4. │ 1004 │ Emily Rogers │ emily.rogers@yahoo.com │ 555-321-0987 │ 89.99 │ 2024-01-18 │ 321 Elm St, Houston, TX 77001 │ Emily **** │ em****@yahoo.com │ 555-***-0987 │ **** Houston, TX 77001 │ |
| 229 | +5. │ 1005 │ David Wilson │ dwilson@email.net │ 555-654-3210 │ 449.75 │ 2024-01-19 │ 654 Cedar Blvd, Phoenix, AZ 85001 │ David **** │ dw****@email.net │ 555-***-3210 │ **** Phoenix, AZ 85001 │ |
| 230 | + └─────────┴───────────────┴───────────────────────────┴──────────────┴──────────────┴────────────┴────────────────────────────────────┴──────────────┴────────────────────┴──────────────┴────────────────────────────┘ |
| 231 | +``` |
| 232 | + |
| 233 | +To ensure that users are only able to access columns containing the masked data, you can again use [Role Based Access Control](/cloud/security/cloud-access-management/overview) to ensure that specific roles only have grants to select on masked columns from `orders`. |
| 234 | + |
| 235 | +Recreate the role that we made previously: |
| 236 | + |
| 237 | +```sql |
| 238 | +DROP ROLE IF EXISTS masked_order_viewer; |
| 239 | +CREATE ROLE masked_order_viewer; |
| 240 | +``` |
| 241 | + |
| 242 | +Next, grant `SELECT` permission to the `orders` table: |
| 243 | + |
| 244 | +```sql |
| 245 | +GRANT SELECT ON orders TO masked_data_reader; |
| 246 | +``` |
| 247 | + |
| 248 | +Revoke access to any sensitive columns: |
| 249 | + |
| 250 | +```sql |
| 251 | +REVOKE SELECT(name) ON orders FROM masked_data_reader; |
| 252 | +REVOKE SELECT(email) ON orders FROM masked_data_reader; |
| 253 | +REVOKE SELECT(phone) ON orders FROM masked_data_reader; |
| 254 | +REVOKE SELECT(shipping_address) ON orders FROM masked_data_reader; |
| 255 | +``` |
| 256 | + |
| 257 | +Finally, assign the role to the appropriate users: |
| 258 | + |
| 259 | +```sql |
| 260 | +GRANT masked_orders_viewer TO your_user; |
| 261 | +``` |
| 262 | + |
| 263 | +In the case where you want to store only the masked data in the `orders` table, |
| 264 | +you can mark the sensitive unmasked columns as [`EPHEMERAL`](/sql-reference/statements/create/table#ephemeral), |
| 265 | +which will ensure that columns of this type are not stored in the table. |
| 266 | + |
| 267 | +```sql |
| 268 | +DROP TABLE IF EXISTS orders; |
| 269 | +CREATE TABLE orders ( |
| 270 | + user_id UInt32, |
| 271 | + name String EPHEMERAL, |
| 272 | + name_masked String MATERIALIZED replaceRegexpOne(name, '^([A-Za-z]+)\\s+(.*)$', '\\1 ****'), |
| 273 | + email String EPHEMERAL, |
| 274 | + email_masked String MATERIALIZED replaceRegexpOne(email, '^(.{2})[^@]*(@.*)$', '\\1****\\2'), |
| 275 | + phone String EPHEMERAL, |
| 276 | + phone_masked String MATERIALIZED replaceRegexpOne(phone, '^(\\d{3})-(\\d{3})-(\\d{4})$', '\\1-***-\\3'), |
| 277 | + total_amount Decimal(10,2), |
| 278 | + order_date Date, |
| 279 | + shipping_address String EPHEMERAL, |
| 280 | + shipping_address_masked String MATERIALIZED replaceRegexpOne(shipping_address, '^([^,]+),\\s*(.*)$', '*** \\2') |
| 281 | +) |
| 282 | +ENGINE = MergeTree() |
| 283 | +ORDER BY user_id; |
| 284 | + |
| 285 | +INSERT INTO orders (user_id, name, email, phone, total_amount, order_date, shipping_address) VALUES |
| 286 | + (1001, 'John Smith', 'john.smith@gmail.com', '555-123-4567', 299.99, '2024-01-15', '123 Main St, New York, NY 10001'), |
| 287 | + (1002, 'Sarah Johnson', 'sarah.johnson@outlook.com', '555-987-6543', 149.50, '2024-01-16', '456 Oak Ave, Los Angeles, CA 90210'), |
| 288 | + (1003, 'Michael Brown', 'mbrown@company.com', '555-456-7890', 599.00, '2024-01-17', '789 Pine Rd, Chicago, IL 60601'), |
| 289 | + (1004, 'Emily Rogers', 'emily.rogers@yahoo.com', '555-321-0987', 89.99, '2024-01-18', '321 Elm St, Houston, TX 77001'), |
| 290 | + (1005, 'David Wilson', 'dwilson@email.net', '555-654-3210', 449.75, '2024-01-19', '654 Cedar Blvd, Phoenix, AZ 85001'); |
| 291 | +``` |
| 292 | + |
| 293 | +If we run the same query as before, you'll now see that only the materialized masked data was inserted into the table: |
| 294 | + |
| 295 | +```sql title="Query" |
| 296 | +SELECT |
| 297 | + *, |
| 298 | + name_masked, |
| 299 | + email_masked, |
| 300 | + phone_masked, |
| 301 | + shipping_address_masked |
| 302 | +FROM orders |
| 303 | +ORDER BY user_id ASC |
| 304 | +``` |
| 305 | + |
| 306 | +```response title="Response" |
| 307 | + ┌─user_id─┬─total_amount─┬─order_date─┬─name_masked──┬─email_masked───────┬─phone_masked─┬─shipping_address_masked───┐ |
| 308 | +1. │ 1001 │ 299.99 │ 2024-01-15 │ John **** │ jo****@gmail.com │ 555-***-4567 │ *** New York, NY 10001 │ |
| 309 | +2. │ 1002 │ 149.5 │ 2024-01-16 │ Sarah **** │ sa****@outlook.com │ 555-***-6543 │ *** Los Angeles, CA 90210 │ |
| 310 | +3. │ 1003 │ 599 │ 2024-01-17 │ Michael **** │ mb****@company.com │ 555-***-7890 │ *** Chicago, IL 60601 │ |
| 311 | +4. │ 1004 │ 89.99 │ 2024-01-18 │ Emily **** │ em****@yahoo.com │ 555-***-0987 │ *** Houston, TX 77001 │ |
| 312 | +5. │ 1005 │ 449.75 │ 2024-01-19 │ David **** │ dw****@email.net │ 555-***-3210 │ *** Phoenix, AZ 85001 │ |
| 313 | + └─────────┴──────────────┴────────────┴──────────────┴────────────────────┴──────────────┴───────────────────────────┘ |
| 314 | +``` |
| 315 | + |
| 316 | +## Use query masking rules for log data {#use-query-masking-rules} |
| 317 | + |
| 318 | +For users of ClickHouse OSS wishing to mask log data specifically, you can make use of [query masking rules](/operations/server-configuration-parameters/settings#query_masking_rules) (log masking) to mask data. |
| 319 | + |
| 320 | +To do so, you can define regular expression-based masking rules in the server configuration. |
| 321 | +These rules are applied to queries and all log messages before they are stored in server logs or system tables (such as `system.query_log`, `system.text_log`, and `system.processes`). |
| 322 | + |
| 323 | +This helps prevent sensitive data from leaking into **logs** only. |
| 324 | +Note that it does not mask data in query results. |
| 325 | + |
| 326 | +For example, to mask a social security number, you could add the following rule to your [server configuration](/operations/configuration-files): |
| 327 | + |
| 328 | +```yaml |
| 329 | +<query_masking_rules> |
| 330 | + <rule> |
| 331 | + <name>hide SSN</name> |
| 332 | + <regexp>(^|\D)\d{3}-\d{2}-\d{4}($|\D)</regexp> |
| 333 | + <replace>000-00-0000</replace> |
| 334 | + </rule> |
| 335 | +</query_masking_rules> |
| 336 | +``` |
0 commit comments