Define a policy to discover and label sensitive data, including personally identifiable information (PII), credentials, and even custom regular expression matches, inside text, logs, and other structured data.
The `classify` API policy structure has two notable sections. First, the `models` array will have one item, keyed by `classify`.
- This parameter can be overridden via the command line interface (CLI).
- At this time, `csv` and plain-text data formats are supported.
Second, the `labels` array is required to specify the named entities to search for.
Within the config, you may optionally specify a `label_predictors` object, where you can define custom predictors that create custom entity labels.
This example creates a custom regular expression for a custom user id format:
```yaml
# ... classify model defined here ...

label_predictors:
  namespace: acme
  regex:
    user_id:
      patterns:
        - score: high
          regex: "user_[a-z]{8}"  # illustrative pattern for the user id format
```
If you wish to create custom predictors, you must provide a namespace, which will be used when constructing the resulting labels.
regex: Create your own regular expressions to match and yield custom labels. The value of this property is an object keyed by the labels you wish to create. For each label, provide an array of patterns. Each pattern is an object consisting of:
score: One of high, med, or low. These map to the floating point values 0.8, 0.5, and 0.2, respectively. If omitted, the default is high.
regex: The actual regular expression that will be used to match. When crafting and testing your regex, ensure that it is compatible with Python 3.
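Since patterns must be Python 3 compatible, it can help to sanity-check a candidate regex with Python's built-in `re` module before adding it to a policy. The pattern and sample values below are illustrative, not taken from a real dataset:

```python
import re

# Illustrative pattern for a hypothetical user id format such as "user_04523".
# Confirm it compiles and matches as expected under Python 3 before adding it
# to a label_predictors entry.
pattern = re.compile(r"user_\d{5}")

samples = ["user_04523", "uid-99", "user_1"]
matches = [s for s in samples if pattern.fullmatch(s)]
print(matches)  # only the first sample conforms to the format
```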
In the example above, the namespace and the keys of the regex object are combined to create your custom labels. Here, the label `acme/user_id` will be created when a match occurs.
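The label construction and score mapping described above can be sketched in a few lines of Python. This is a minimal illustration of the documented behavior, not Gretel's internal implementation:

```python
# Documented mapping of score names to floating point values.
SCORE_VALUES = {"high": 0.8, "med": 0.5, "low": 0.2}

def custom_label(namespace: str, key: str) -> str:
    """Combine the predictor namespace with a regex key to form the label."""
    return f"{namespace}/{key}"

label = custom_label("acme", "user_id")
print(label)  # acme/user_id
print(SCORE_VALUES["high"])  # 0.8
```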
You can now combine the `label_predictors` with your classify policy. For example:

```yaml
models:
  - classify:
      labels:
        - person_name
        - acme/user_id

label_predictors:
  namespace: acme
  regex:
    user_id:
      patterns:
        - score: high
          regex: "user_[a-z]{8}"  # illustrative pattern
```
Adding `use_nlp: true` to a classification model will enable entity predictions using natural language models.
Enabling this feature may be useful if you work with unstructured data and need to label names or locations such as addresses, states, or countries.
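As a sketch of where the flag sits in the model config described above (field names other than `use_nlp` follow the earlier examples, not a verified policy):

```yaml
models:
  - classify:
      use_nlp: true
      labels:
        - person_name
        - location
```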
Enabling NLP predictions may decrease model prediction throughput by up to 70%.
Gretel currently uses spaCy for making NLP predictions. The following entity types are supported by the model:
Predictions produced by the spaCy model will be tagged with the source