Balance a Dataset
In this deep dive, we will walk through using the Gretel CLI to create a synthetic model and generate synthetic records using a Gretel Worker in your own environment.
If you have not already gone through our environment setup tutorial, please do so, as this will enable you to run a Gretel Worker with GPU support on your own machine. If you do not have access to a GPU for training, the training data size and model complexity in this tutorial are small enough to run on a CPU; it will just take a little longer.
For this deep dive, we will use a reduced version of the US Census Income Data Set that is often used to predict if income is above $50k/year for adults in the United States.
In partnership with Snowflake, Gretel has created a publicly available balanced version of this dataset. Balancing reduces representation bias, where algorithms trained on the task inherently favor groups with greater representation, and yields a fairer, less biased dataset.
In this tutorial, we will show you how to use the Gretel CLI and Smart Seeding to create your own synthetic records. Smart Seeding enables you to provide partial record values to the record generation process and our Gretel model will do the heavy lifting of creating the remainder of the record for you.
Let’s dive in!
First, we will create a synthetic model using a Gretel Worker that is local to the machine running the CLI. In local mode the training data will not be sent to Gretel Cloud and will reside only on the machine you are running the CLI from.
When running Gretel Workers in local mode, you can provide HTTPS or S3 links as direct input to the in-data param.
First, we will download and modify one of Gretel’s configuration templates. We’ll use our default synthetic template and modify it to support smart seeding. When using smart seeding, you must provide the field names that you wish to use as seeds for generating records. We will use the following fields as seed fields: race, gender, and income_bracket.
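Since seed fields name columns of the training data, it can help to confirm they appear in the CSV header before training. The following is a minimal sketch, not part of the Gretel CLI: `has_columns` is a hypothetical helper, and it assumes a local copy of the training CSV with plain, unquoted header names.

```shell
# Hypothetical helper: succeed only if every named column appears in the
# header (first line) of a CSV file. Assumes plain, unquoted header names.
has_columns() {
  csv_file="${1:-}"
  shift
  # Split the header into one column name per line, dropping any CR.
  header="$(head -n 1 "$csv_file" | tr -d '\r' | tr ',' '\n')"
  for col in "$@"; do
    if ! printf '%s\n' "$header" | grep -Fxq -- "$col"; then
      echo "missing column: $col" >&2
      return 1
    fi
  done
  return 0
}

# Example usage once the training data has been downloaded locally:
#   has_columns USAdultIncome5k.csv race gender income_bracket
```

The function returns a non-zero status and names the first missing column, so it can gate later steps with `&&`.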
Download and modify the default config template:
wget https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/config_templates/gretel/synthetics/default.yml
Now edit the configuration to enable Smart Seeding:
# Default configuration for Synthetic model creation.
# The parameter settings below match the default settings
# in Gretel's open source synthetic package
schema_version: 1.0
models:
  - synthetics:
      data_source: "__tmp__"
      params:
        epochs: 100
      # NOTE: A synthetic task of type "seed" needs to be added to enable
      # smart seeding during record generation
      task:
        type: seed
        attrs:
          fields:
            - race
            - gender
            - income_bracket
Save this configuration locally. We’ll save it as seed-config.yml for this tutorial.
Next, we’ll request a model creation job to be run in local mode. This will submit the configuration to the Gretel Cloud API and automatically trigger a download of the Gretel Synthetics container, load the config and training data, and start creating the synthetic model.
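Because local mode downloads and runs a container, a quick preflight check for the required tools can save a failed run. This is only a sketch: `require_cmds` is a hypothetical helper, and Docker as the container runtime is an assumption here, not something this tutorial states.

```shell
# Hypothetical helper: report any required command missing from PATH.
# Returns 0 when all commands are found, 1 otherwise.
require_cmds() {
  missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "not found on PATH: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example usage before creating the model (docker is an assumption):
#   require_cmds gretel docker && echo "ready to run in local mode"
```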
The synthetic model, sample data, and synthetic report will be saved to the local machine in the directory specified by the output parameter.
gretel models create --config seed-config.yml --runner local --in-data https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/USAdultIncome5k.csv --output model-data
When this command is run, a model creation request will be sent to Gretel Cloud and a local Gretel Worker will be launched. The Gretel Worker will download the configuration from Gretel Cloud, load the training data, and begin training the synthetic model.
When model creation finishes, be sure to note the Model ID that is logged at the end. You will need it to serve the model and generate new records!
When the model completes, several artifacts will be available in the output directory:
data_preview.gz contains sample synthetic data. This data was created by the synthetic model and is used to create the Synthetic Quality Score (SQS) report.
model.tar.gz is the actual machine learning model. It should not have to be used directly.
report.html.gz is a human readable HTML page of the Synthetic Quality Score report.
report_json.json.gz is the same data from the SQS but in JSON format.
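The two report artifacts are gzip-compressed, so they need to be decompressed before opening them in a browser or other tool. A minimal sketch, assuming the file names listed above; `unpack_reports` is a hypothetical helper name.

```shell
# Hypothetical helper: decompress the gzipped reports in an output directory,
# leaving the original .gz files in place.
unpack_reports() {
  out_dir="${1:-}"
  gzip -dc "$out_dir/report.html.gz" > "$out_dir/report.html" || return 1
  gzip -dc "$out_dir/report_json.json.gz" > "$out_dir/report_json.json" || return 1
}

# Example usage after model creation:
#   unpack_reports model-data
```

The decompressed report.html can then be opened directly in a browser.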
Next, we can serve our model to generate new synthetic records!
We'll use our newly created synthetic model to generate some new records. As discussed before, Gretel has already created a more balanced version of this dataset. We will now walk through generating full records from partially complete records that we provide.
When providing partial records for generation, you may only provide values for the seed fields that you defined in your configuration.
For demonstration, we'll assume we want to generate 100 new synthetic records where the values for race, gender, and income_bracket are "Black", "Female" and ">50K".
In order to provide this seed data to our served model, we can create a CSV with just this data. We have already created a 3-column CSV for you.
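If you would rather build a seed file of your own, a sketch like the following writes one. It assumes the seed CSV needs a header row with the seed field names followed by one row per record to generate. `make_seed_csv` and the output file name are hypothetical.

```shell
# Hypothetical helper: write a header row plus COUNT identical seed rows.
# The column names and values are the ones used in this tutorial.
make_seed_csv() {
  out_file="${1:-}"
  count="${2:-0}"
  printf 'race,gender,income_bracket\n' > "$out_file"
  i=0
  while [ "$i" -lt "$count" ]; do
    printf 'Black,Female,>50K\n' >> "$out_file"
    i=$((i + 1))
  done
}

# Example usage: 100 seed rows, one per record to generate.
#   make_seed_csv my-seeds.csv 100
```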
Now, with our seed data, we can serve the model and generate new records with the following command. Be sure to replace the Model ID with your own!
gretel records generate --runner local --model-id 60ba8dfb0ba87c111336cd9e --model-path model-data/model.tar.gz --output syn-records --in-data https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/USAdultIncome-SeedFields.csv
After running this command, your previously created synthetic model will be loaded and the seed data CSV will be sent into the model server handler. These seed values will be used to generate new records.
When providing seed values, the number of records generated will be equal to the number of seed records provided.
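The generate step can also be wrapped in a small shell function that takes the Model ID as an explicit argument, so a stale ID is not reused by accident. This is only a sketch: `generate_from_seeds` and the `DRY_RUN` variable are hypothetical names, and the flags are the ones from the command shown above.

```shell
# Hypothetical wrapper around the generate command used in this tutorial.
# With DRY_RUN=1 it prints the command instead of running it.
generate_from_seeds() {
  model_id="${1:-}"
  seed_data="${2:-}"
  if [ -z "$model_id" ] || [ -z "$seed_data" ]; then
    echo "usage: generate_from_seeds MODEL_ID SEED_DATA" >&2
    return 2
  fi
  # Rebuild the positional parameters as the full command line.
  set -- gretel records generate --runner local \
    --model-id "$model_id" \
    --model-path model-data/model.tar.gz \
    --output syn-records \
    --in-data "$seed_data"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Example usage (prints the command without running it):
#   DRY_RUN=1 generate_from_seeds <your-model-id> my-seeds.csv
```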
Now let's examine our newly synthesized records:
gunzip -c syn-records/data.gz | head
You will see that every generated record has the three values that were requested in our seed data!
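Rather than eyeballing the first few rows, the seed columns can be tallied across the whole output. The sketch below assumes data.gz is a gzip-compressed CSV with a header row and no quoted fields containing commas. `seed_value_counts` is a hypothetical helper.

```shell
# Hypothetical helper: count each distinct combination of the named columns
# in a gzipped CSV. Assumes a header row and no quoted fields with commas.
seed_value_counts() {
  gz_file="${1:-}"
  shift
  gzip -dc "$gz_file" | awk -F',' -v cols="$*" '
    { sub(/\r$/, "") }                      # tolerate CRLF line endings
    NR == 1 {
      n = split(cols, want, " ")
      for (i = 1; i <= NF; i++) idx[$i] = i # map column name to position
      for (j = 1; j <= n; j++) {
        if (!(want[j] in idx)) {
          print "missing column: " want[j] > "/dev/stderr"
          exit 1
        }
      }
      next
    }
    {
      key = $(idx[want[1]])
      for (j = 2; j <= n; j++) key = key "," $(idx[want[j]])
      count[key]++
    }
    END { for (k in count) print count[k], k }
  '
}

# Example usage on the generated records:
#   seed_value_counts syn-records/data.gz race gender income_bracket
```

If generation honored the seeds, this should print a single line whose count matches the number of seed records provided.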