Skip to content

Commit cf30842

Browse files
authored
Merge pull request #4444 from Blargian/data_Masking_guide
Add guide on data masking
2 parents e20d37d + 287cb65 commit cf30842

File tree

1 file changed

+336
-0
lines changed

1 file changed

+336
-0
lines changed
Lines changed: 336 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,336 @@
1+
---
2+
slug: /cloud/guides/data-masking
3+
sidebar_label: 'Data masking'
4+
title: 'Data masking in ClickHouse'
5+
description: 'A guide to data masking in ClickHouse'
6+
keywords: ['data masking']
7+
---
8+
9+
# Data masking in ClickHouse
10+
11+
Data masking is a technique used for data protection, in which the original data is replaced with a version of the data which maintains its format and structure while removing any personally identifiable information (PII) or sensitive information.
12+
This guide shows you how you can mask data in ClickHouse.
13+
14+
## Use string replacement functions {#using-string-functions}
15+
16+
For basic data masking use cases, the `replace` family of functions offers a convenient way to mask data:
17+
18+
| Function | Description |
19+
|------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------|
20+
| [`replaceOne`](/sql-reference/functions/string-replace-functions#replaceone) | Replaces the first occurrence of a pattern in a haystack string with the provided replacement string. |
21+
| [`replaceAll`](/sql-reference/functions/string-replace-functions#replaceall) | Replaces all occurrences of a pattern in a haystack string with the provided replacement string. |
22+
| [`replaceRegexpOne`](/sql-reference/functions/string-replace-functions#replaceregexpone) | Replaces the first occurrence of a substring matching a regular expression pattern (in re2 syntax) in a haystack with the provided replacement string. |
23+
| [`replaceRegexpAll`](/sql-reference/functions/string-replace-functions#replaceregexpall) | Replaces all occurrences of a substring matching a regular expression pattern (in re2 syntax) in a haystack with the provided replacement string. |
24+
25+
For example, you can replace the name "John Smith" with a placeholder `[CUSTOMER_NAME]` using the `replaceOne` function:
26+
27+
```sql title="Query"
28+
SELECT replaceOne(
29+
'Customer John Smith called about his account',
30+
'John Smith',
31+
'[CUSTOMER_NAME]'
32+
) AS anonymized_text;
33+
```
34+
35+
```response title="Response"
36+
┌─anonymized_text───────────────────────────────────┐
37+
│ Customer [CUSTOMER_NAME] called about his account │
38+
└───────────────────────────────────────────────────┘
39+
```
40+
41+
More generically, you can use the `replaceRegexpOne` to replace any customer name:
42+
43+
```sql title="Query"
44+
SELECT
45+
replaceRegexpAll(
46+
'Customer John Smith called. Later, Mary Johnson and Bob Wilson also called.',
47+
'\\b[A-Z][a-z]+ [A-Z][a-z]+\\b',
48+
'[CUSTOMER_NAME]'
49+
) AS anonymized_text;
50+
```
51+
52+
```response title="Response"
53+
┌─anonymized_text───────────────────────────────────────────────────────────────────────┐
54+
│ [CUSTOMER_NAME] Smith called. Later, [CUSTOMER_NAME] and [CUSTOMER_NAME] also called. │
55+
└───────────────────────────────────────────────────────────────────────────────────────┘
56+
```
57+
58+
Or you could mask a social security number, leaving only the last 4 digits using the `replaceRegexpAll` function.
59+
60+
```sql title="Query"
61+
SELECT replaceRegexpAll(
62+
'SSN: 123-45-6789',
63+
'(\d{3})-(\d{2})-(\d{4})',
64+
'XXX-XX-\3'
65+
) AS masked_ssn;
66+
```
67+
68+
In the query above `\3` is used to substitute the third capture group into the resulting string, which produces:
69+
70+
```response title="Response"
71+
┌─masked_ssn───────┐
72+
│ SSN: XXX-XX-6789 │
73+
└──────────────────┘
74+
```
75+
76+
## Create masked `VIEW`s {#masked-views}
77+
78+
A [`VIEW`](/sql-reference/statements/create/view) can be used in conjunction with the aforementioned string functions to apply transformations to columns containing sensitive data, before they are presented to the user.
79+
In this way, the original data remains unchanged, and users querying the view see only the masked data.
80+
81+
To demonstrate, let's imagine that we have a table which stores records of customer orders.
82+
We want to make sure that a group of employees can view the information, but we don't want them to see the full information of the customers.
83+
84+
Run the query below to create an example table `orders` and insert some fictional customer order records into it:
85+
86+
```sql
87+
CREATE TABLE orders (
88+
user_id UInt32,
89+
name String,
90+
email String,
91+
phone String,
92+
total_amount Decimal(10,2),
93+
order_date Date,
94+
shipping_address String
95+
)
96+
ENGINE = MergeTree()
97+
ORDER BY user_id;
98+
99+
INSERT INTO orders VALUES
100+
(1001, 'John Smith', 'john.smith@gmail.com', '555-123-4567', 299.99, '2024-01-15', '123 Main St, New York, NY 10001'),
101+
(1002, 'Sarah Johnson', 'sarah.johnson@outlook.com', '555-987-6543', 149.50, '2024-01-16', '456 Oak Ave, Los Angeles, CA 90210'),
102+
(1003, 'Michael Brown', 'mbrown@company.com', '555-456-7890', 599.00, '2024-01-17', '789 Pine Rd, Chicago, IL 60601'),
103+
(1004, 'Emily Rogers', 'emily.rogers@yahoo.com', '555-321-0987', 89.99, '2024-01-18', '321 Elm St, Houston, TX 77001'),
104+
(1005, 'David Wilson', 'dwilson@email.net', '555-654-3210', 449.75, '2024-01-19', '654 Cedar Blvd, Phoenix, AZ 85001');
105+
```
106+
107+
Create a view called `masked_orders`:
108+
109+
```sql
110+
CREATE VIEW masked_orders AS
111+
SELECT
112+
user_id,
113+
replaceRegexpOne(name, '^([A-Za-z]+)\\s+(.*)$', '\\1 ****') AS name,
114+
replaceRegexpOne(email, '^(.{0})[^@]*(@.*)$', '\\1****\\2') AS email,
115+
replaceRegexpOne(phone, '^(\\d{3})-(\\d{3})-(\\d{4})$', '\\1-***-\\3') AS phone,
116+
total_amount,
117+
order_date,
118+
replaceRegexpOne(shipping_address, '^[^,]+,\\s*(.*)$', '*** \\1') AS shipping_address
119+
FROM orders;
120+
```
121+
122+
In the `SELECT` clause of the view creation query above, we define transformations using the `replaceRegexpOne` on the `name`, `email`, `phone` and `shipping_address` fields, which are the fields containing sensitive information that we wish to partially mask.
123+
124+
Select the data from the view:
125+
126+
```sql title="Query"
127+
SELECT * FROM masked_orders
128+
```
129+
130+
```response title="Response"
131+
┌─user_id─┬─name─────────┬─email──────────────┬─phone────────┬─total_amount─┬─order_date─┬─shipping_address──────────┐
132+
│ 1001 │ John **** │ jo****@gmail.com │ 555-***-4567 │ 299.99 │ 2024-01-15 │ *** New York, NY 10001 │
133+
│ 1002 │ Sarah **** │ sa****@outlook.com │ 555-***-6543 │ 149.5 │ 2024-01-16 │ *** Los Angeles, CA 90210 │
134+
│ 1003 │ Michael **** │ mb****@company.com │ 555-***-7890 │ 599 │ 2024-01-17 │ *** Chicago, IL 60601 │
135+
│ 1004 │ Emily **** │ em****@yahoo.com │ 555-***-0987 │ 89.99 │ 2024-01-18 │ *** Houston, TX 77001 │
136+
│ 1005 │ David **** │ dw****@email.net │ 555-***-3210 │ 449.75 │ 2024-01-19 │ *** Phoenix, AZ 85001 │
137+
└─────────┴──────────────┴────────────────────┴──────────────┴──────────────┴────────────┴───────────────────────────┘
138+
```
139+
140+
Notice that the data returned from the view is partially masked, obfuscating the sensitive information.
141+
You can also create multiple views, with differing levels of obfuscation depending on the level of privileged access to information the viewer has.
142+
143+
To ensure that users are only able to access the view returning the masked data, and not the table with the original unmasked data, you should use [Role Based Access Control](/cloud/security/cloud-access-management/overview) to ensure that specific roles only have grants to select from the view.
144+
145+
First create the role:
146+
147+
```sql
148+
CREATE ROLE masked_orders_viewer;
149+
```
150+
151+
Next grant `SELECT` privileges on the view to the role:
152+
153+
```sql
154+
GRANT SELECT ON masked_orders TO masked_orders_viewer;
155+
```
156+
157+
Because ClickHouse roles are additive, you must ensure that users who should only see the masked view do not have any `SELECT` privilege on the base table via any role.
158+
159+
As such, you should explicitly revoke base-table access to be safe:
160+
161+
```sql
162+
REVOKE SELECT ON orders FROM masked_orders_viewer;
163+
```
164+
165+
Finally, assign the role to the appropriate users:
166+
167+
```sql
168+
GRANT masked_orders_viewer TO your_user;
169+
```
170+
171+
This ensures that users with the `masked_orders_viewer` role are only able to see
172+
the masked data from the view and not the original unmasked data from the table.
173+
174+
## Use `MATERIALIZED` columns and column-level access restrictions {#materialized-ephemeral-column-restrictions}
175+
176+
In cases where you don't want to create a separate view, you can store masked versions of your data alongside the original data.
177+
To do so, you can use [materialized columns](/sql-reference/statements/create/table#materialized).
178+
Values of such columns are automatically calculated according to the specified materialized expression when rows are inserted,
179+
and we can use them to create new columns with masked versions of the data.
180+
181+
Taking the example before, instead of creating a separate `VIEW` for the masked data, we'll now create masked columns using `MATERIALIZED`:
182+
183+
```sql
184+
DROP TABLE IF EXISTS orders;
185+
CREATE TABLE orders (
186+
user_id UInt32,
187+
name String,
188+
name_masked String MATERIALIZED replaceRegexpOne(name, '^([A-Za-z]+)\\s+(.*)$', '\\1 ****'),
189+
email String,
190+
email_masked String MATERIALIZED replaceRegexpOne(email, '^(.{0})[^@]*(@.*)$', '\\1****\\2'),
191+
phone String,
192+
phone_masked String MATERIALIZED replaceRegexpOne(phone, '^(\\d{3})-(\\d{3})-(\\d{4})$', '\\1-***-\\3'),
193+
total_amount Decimal(10,2),
194+
order_date Date,
195+
shipping_address String,
196+
shipping_address_masked String MATERIALIZED replaceRegexpOne(shipping_address, '^[^,]+,\\s*(.*)$', '*** \\1')
197+
)
198+
ENGINE = MergeTree()
199+
ORDER BY user_id;
200+
201+
INSERT INTO orders VALUES
202+
(1001, 'John Smith', 'john.smith@gmail.com', '555-123-4567', 299.99, '2024-01-15', '123 Main St, New York, NY 10001'),
203+
(1002, 'Sarah Johnson', 'sarah.johnson@outlook.com', '555-987-6543', 149.50, '2024-01-16', '456 Oak Ave, Los Angeles, CA 90210'),
204+
(1003, 'Michael Brown', 'mbrown@company.com', '555-456-7890', 599.00, '2024-01-17', '789 Pine Rd, Chicago, IL 60601'),
205+
(1004, 'Emily Rogers', 'emily.rogers@yahoo.com', '555-321-0987', 89.99, '2024-01-18', '321 Elm St, Houston, TX 77001'),
206+
(1005, 'David Wilson', 'dwilson@email.net', '555-654-3210', 449.75, '2024-01-19', '654 Cedar Blvd, Phoenix, AZ 85001');
207+
```
208+
209+
If you now run the following select query, you will see that the masked data is 'materialized' at insert time and stored alongside the original, unmasked data.
210+
It is necessary to explicitly select the masked columns as ClickHouse doesn't automatically include materialized columns in `SELECT *` queries by default.
211+
212+
```sql title="Query"
213+
SELECT
214+
*,
215+
name_masked,
216+
email_masked,
217+
phone_masked,
218+
shipping_address_masked
219+
FROM orders
220+
ORDER BY user_id ASC
221+
```
222+
223+
```response title="Response"
224+
┌─user_id─┬─name──────────┬─email─────────────────────┬─phone────────┬─total_amount─┬─order_date─┬─shipping_address───────────────────┬─name_masked──┬─email_masked───────┬─phone_masked─┬─shipping_address_masked────┐
225+
1. │ 1001 │ John Smith │ john.smith@gmail.com │ 555-123-4567 │ 299.99 │ 2024-01-15 │ 123 Main St, New York, NY 10001 │ John **** │ jo****@gmail.com │ 555-***-4567 │ **** New York, NY 10001 │
226+
2. │ 1002 │ Sarah Johnson │ sarah.johnson@outlook.com │ 555-987-6543 │ 149.5 │ 2024-01-16 │ 456 Oak Ave, Los Angeles, CA 90210 │ Sarah **** │ sa****@outlook.com │ 555-***-6543 │ **** Los Angeles, CA 90210 │
227+
3. │ 1003 │ Michael Brown │ mbrown@company.com │ 555-456-7890 │ 599 │ 2024-01-17 │ 789 Pine Rd, Chicago, IL 60601 │ Michael **** │ mb****@company.com │ 555-***-7890 │ **** Chicago, IL 60601 │
228+
4. │ 1004 │ Emily Rogers │ emily.rogers@yahoo.com │ 555-321-0987 │ 89.99 │ 2024-01-18 │ 321 Elm St, Houston, TX 77001 │ Emily **** │ em****@yahoo.com │ 555-***-0987 │ **** Houston, TX 77001 │
229+
5. │ 1005 │ David Wilson │ dwilson@email.net │ 555-654-3210 │ 449.75 │ 2024-01-19 │ 654 Cedar Blvd, Phoenix, AZ 85001 │ David **** │ dw****@email.net │ 555-***-3210 │ **** Phoenix, AZ 85001 │
230+
└─────────┴───────────────┴───────────────────────────┴──────────────┴──────────────┴────────────┴────────────────────────────────────┴──────────────┴────────────────────┴──────────────┴────────────────────────────┘
231+
```
232+
233+
To ensure that users are only able to access columns containing the masked data, you can again use [Role Based Access Control](/cloud/security/cloud-access-management/overview) to ensure that specific roles only have grants to select on masked columns from `orders`.
234+
235+
Recreate the role that we made previously:
236+
237+
```sql
238+
DROP ROLE IF EXISTS masked_order_viewer;
239+
CREATE ROLE masked_order_viewer;
240+
```
241+
242+
Next, grant `SELECT` permission to the `orders` table:
243+
244+
```sql
245+
GRANT SELECT ON orders TO masked_data_reader;
246+
```
247+
248+
Revoke access to any sensitive columns:
249+
250+
```sql
251+
REVOKE SELECT(name) ON orders FROM masked_data_reader;
252+
REVOKE SELECT(email) ON orders FROM masked_data_reader;
253+
REVOKE SELECT(phone) ON orders FROM masked_data_reader;
254+
REVOKE SELECT(shipping_address) ON orders FROM masked_data_reader;
255+
```
256+
257+
Finally, assign the role to the appropriate users:
258+
259+
```sql
260+
GRANT masked_orders_viewer TO your_user;
261+
```
262+
263+
In the case where you want to store only the masked data in the `orders` table,
264+
you can mark the sensitive unmasked columns as [`EPHEMERAL`](/sql-reference/statements/create/table#ephemeral),
265+
which will ensure that columns of this type are not stored in the table.
266+
267+
```sql
268+
DROP TABLE IF EXISTS orders;
269+
CREATE TABLE orders (
270+
user_id UInt32,
271+
name String EPHEMERAL,
272+
name_masked String MATERIALIZED replaceRegexpOne(name, '^([A-Za-z]+)\\s+(.*)$', '\\1 ****'),
273+
email String EPHEMERAL,
274+
email_masked String MATERIALIZED replaceRegexpOne(email, '^(.{2})[^@]*(@.*)$', '\\1****\\2'),
275+
phone String EPHEMERAL,
276+
phone_masked String MATERIALIZED replaceRegexpOne(phone, '^(\\d{3})-(\\d{3})-(\\d{4})$', '\\1-***-\\3'),
277+
total_amount Decimal(10,2),
278+
order_date Date,
279+
shipping_address String EPHEMERAL,
280+
shipping_address_masked String MATERIALIZED replaceRegexpOne(shipping_address, '^([^,]+),\\s*(.*)$', '*** \\2')
281+
)
282+
ENGINE = MergeTree()
283+
ORDER BY user_id;
284+
285+
INSERT INTO orders (user_id, name, email, phone, total_amount, order_date, shipping_address) VALUES
286+
(1001, 'John Smith', 'john.smith@gmail.com', '555-123-4567', 299.99, '2024-01-15', '123 Main St, New York, NY 10001'),
287+
(1002, 'Sarah Johnson', 'sarah.johnson@outlook.com', '555-987-6543', 149.50, '2024-01-16', '456 Oak Ave, Los Angeles, CA 90210'),
288+
(1003, 'Michael Brown', 'mbrown@company.com', '555-456-7890', 599.00, '2024-01-17', '789 Pine Rd, Chicago, IL 60601'),
289+
(1004, 'Emily Rogers', 'emily.rogers@yahoo.com', '555-321-0987', 89.99, '2024-01-18', '321 Elm St, Houston, TX 77001'),
290+
(1005, 'David Wilson', 'dwilson@email.net', '555-654-3210', 449.75, '2024-01-19', '654 Cedar Blvd, Phoenix, AZ 85001');
291+
```
292+
293+
If we run the same query as before, you'll now see that only the materialized masked data was inserted into the table:
294+
295+
```sql title="Query"
296+
SELECT
297+
*,
298+
name_masked,
299+
email_masked,
300+
phone_masked,
301+
shipping_address_masked
302+
FROM orders
303+
ORDER BY user_id ASC
304+
```
305+
306+
```response title="Response"
307+
┌─user_id─┬─total_amount─┬─order_date─┬─name_masked──┬─email_masked───────┬─phone_masked─┬─shipping_address_masked───┐
308+
1. │ 1001 │ 299.99 │ 2024-01-15 │ John **** │ jo****@gmail.com │ 555-***-4567 │ *** New York, NY 10001 │
309+
2. │ 1002 │ 149.5 │ 2024-01-16 │ Sarah **** │ sa****@outlook.com │ 555-***-6543 │ *** Los Angeles, CA 90210 │
310+
3. │ 1003 │ 599 │ 2024-01-17 │ Michael **** │ mb****@company.com │ 555-***-7890 │ *** Chicago, IL 60601 │
311+
4. │ 1004 │ 89.99 │ 2024-01-18 │ Emily **** │ em****@yahoo.com │ 555-***-0987 │ *** Houston, TX 77001 │
312+
5. │ 1005 │ 449.75 │ 2024-01-19 │ David **** │ dw****@email.net │ 555-***-3210 │ *** Phoenix, AZ 85001 │
313+
└─────────┴──────────────┴────────────┴──────────────┴────────────────────┴──────────────┴───────────────────────────┘
314+
```
315+
316+
## Use query masking rules for log data {#use-query-masking-rules}
317+
318+
For users of ClickHouse OSS wishing to mask log data specifically, you can make use of [query masking rules](/operations/server-configuration-parameters/settings#query_masking_rules) (log masking) to mask data.
319+
320+
To do so, you can define regular expression-based masking rules in the server configuration.
321+
These rules are applied to queries and all log messages before they are stored in server logs or system tables (such as `system.query_log`, `system.text_log`, and `system.processes`).
322+
323+
This helps prevent sensitive data from leaking into **logs** only.
324+
Note that it does not mask data in query results.
325+
326+
For example, to mask a social security number, you could add the following rule to your [server configuration](/operations/configuration-files):
327+
328+
```yaml
329+
<query_masking_rules>
330+
<rule>
331+
<name>hide SSN</name>
332+
<regexp>(^|\D)\d{3}-\d{2}-\d{4}($|\D)</regexp>
333+
<replace>000-00-0000</replace>
334+
</rule>
335+
</query_masking_rules>
336+
```

0 commit comments

Comments
 (0)