Redact Sensitive PII

In this tutorial, we will create a transform policy that identifies PII and either replaces it with fake values or redacts it. We will then use the CLI to transform a dataset and examine the results.

Sample configuration

Save the configuration below to a local file named redact_pii.yaml. The policy searches for sensitive PII values as defined by Experian, plus a custom regular expression for user IDs. It replaces each match with a fake value when possible, and otherwise redacts it with a user-defined character. The full list of supported info types is available in the documentation.
schema_version: "1.0"
name: "Redact PII"
models:
  - transforms:
      data_source: "_"
      policies:
        - name: remove_pii
          rules:
            - name: fake_or_redact_pii
              conditions:
                value_label:
                  - person_name
                  - credit_card_number
                  - phone_number
                  - us_social_security_number
                  - email_address
                  - custom/*
              transforms:
                - type: fake
                - type: redact_with_char
                  attrs:
                    char: X
label_predictors:
  namespace: custom
  regex:
    user_id:
      patterns:
        - score: high
          regex: "user_[\\d]{5}"
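To make the custom user ID rule concrete, here is a minimal Python sketch of what the pattern matches and what a character redaction produces. It is an illustration only, not Gretel's implementation: the helper names `redact_with_char` and `redact_user_ids` are hypothetical, and keeping punctuation such as the underscore is an assumption based on the `XXXX_XXXXX` values in the transformed output. Note that the doubled backslash in the YAML string is an escape, so the effective pattern is `user_[\d]{5}`.

```python
import re

# Effective pattern after YAML unescapes "user_[\\d]{5}".
USER_ID_PATTERN = re.compile(r"user_[\d]{5}")


def redact_with_char(value, char="X"):
    """Replace every letter and digit with `char`, keeping punctuation.

    Assumption: punctuation is preserved, as the sample output suggests.
    """
    return "".join(char if c.isalnum() else c for c in value)


def redact_user_ids(text, char="X"):
    """Redact every substring of `text` that matches the user ID pattern."""
    return USER_ID_PATTERN.sub(lambda m: redact_with_char(m.group(0), char), text)


print(redact_user_ids("ticket opened by user_93952"))
# prints "ticket opened by XXXX_XXXXX"
```

Passing `char="*"` to the same helper shows the effect of switching the redaction character.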
Save the sample dataset below to pii.csv.
id,name,email,phone,visa,ssn,user_id
1,Kimberli Goodman,[email protected],228-229-2479,5108758325678962,108-08-9132,user_93952
2,Anna Jackson,[email protected],611-570-4635,5048377302905174,256-28-0041,user_23539
3,Sammy Bartkiewicz,[email protected],799-160-2165,5108758273775281,849-46-5175,user_35232
4,Matt Parnell,[email protected],985-733-6433,5048376551569087,774-83-5725,user_23529
5,Meredith Myers,[email protected],545-861-4923,5108752255128478,180-65-6855,user_92359

Create a transformation model

First, create a project to host your transformation models and artifacts.
gretel projects create --display-name redact-pii --set-default
Next, train your transformation model on your dataset or one with an identical schema.
Currently, only plain text and CSV formats are supported by the Transform API. JSON support is coming soon.
gretel models create --config redact_pii.yaml --in-data pii.csv --runner cloud > model-data.json

Redact sensitive data

Your model can now be used to redact sensitive data from any dataset with a similar structure or schema.
gretel records transform --model-id model-data.json --in-data pii.csv --runner cloud --output .
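The two CLI steps above can also be chained from a script. The following is a minimal sketch, assuming the `gretel` CLI is installed and configured. It uses only the subcommands and flags shown in this tutorial, and the function names are hypothetical.

```python
import subprocess
from typing import List

CONFIG = "redact_pii.yaml"
DATASET = "pii.csv"
MODEL_FILE = "model-data.json"


def create_model_command(config: str, in_data: str) -> List[str]:
    """Build the `gretel models create` command used in this tutorial."""
    return ["gretel", "models", "create",
            "--config", config, "--in-data", in_data, "--runner", "cloud"]


def transform_command(model_file: str, in_data: str, output_dir: str = ".") -> List[str]:
    """Build the `gretel records transform` command used in this tutorial."""
    return ["gretel", "records", "transform",
            "--model-id", model_file, "--in-data", in_data,
            "--runner", "cloud", "--output", output_dir]


def run_pipeline() -> None:
    """Train the model, then transform the dataset with it."""
    # Equivalent of the shell redirect `> model-data.json`.
    with open(MODEL_FILE, "w") as fh:
        subprocess.run(create_model_command(CONFIG, DATASET), stdout=fh, check=True)
    subprocess.run(transform_command(MODEL_FILE, DATASET), check=True)
```

Calling `run_pipeline()` runs both commands in order and stops at the first failure, because `check=True` raises on a non-zero exit code.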

Examine the results

Transform results are downloaded to the current directory as a gzip-compressed CSV file named data.gz. Our policy replaces names, email addresses, phone numbers, credit card numbers, and Social Security numbers with fake values, and redacts values matching the user ID regular expression with the character X.
Let's examine the transformed results from the command line.
zcat data.gz | column -s, -t
id  name               email              phone                   visa              ssn          user_id
1   Samantha Sandoval  [email protected]  986.089.1149            344661707423210   102-40-4854  XXXX_XXXXX
2   Shannon Holmes     [email protected]  (686)646-3171           3519277724227055  554-61-8106  XXXX_XXXXX
3   David Chapman      [email protected]  001-946-130-7514x76773  213182470523001   008-06-5773  XXXX_XXXXX
4   Crystal Russo      [email protected]  027-327-7306x07952      6011379376191328  628-27-4071  XXXX_XXXXX
5   John Allen         [email protected]  (365)502-6954           4047982390743587  740-42-9239  XXXX_XXXXX
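The same check can be scripted instead of read by eye. Below is a minimal sketch using only the Python standard library. It assumes data.gz is a gzip-compressed CSV with the header row shown above, and the function names are hypothetical.

```python
import csv
import gzip
from pathlib import Path


def load_transformed(path):
    """Read a gzip-compressed CSV with a header row into a list of dicts."""
    with gzip.open(path, mode="rt", newline="") as fh:
        return list(csv.DictReader(fh))


def unredacted_user_ids(rows, column="user_id", char="X"):
    """Return values in `column` that still contain a letter or digit other than `char`."""
    leftovers = []
    for row in rows:
        value = row[column]
        if any(c.isalnum() and c != char for c in value):
            leftovers.append(value)
    return leftovers


# Only runs when the transformed file is present in the current directory.
if Path("data.gz").exists():
    rows = load_transformed("data.gz")
    print(f"{len(rows)} rows, unredacted user IDs: {unredacted_user_ids(rows)}")
```

An empty list from `unredacted_user_ids` means every user ID was fully replaced by the redaction character.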

Next steps

For use cases such as training machine learning models on customer support logs, replacing PII with fake values is often preferable because it preserves the semantics of the original data. In other cases, plain redaction is the better choice. Try updating the transformation policy to simply redact all sensitive values with an "*" character.
