Create Synthetic Data

In this tutorial, we will walk through using the Gretel CLI to create a synthetic hospital electronic health record (EHR) dataset.

Set up your project

If you haven’t already, install the Gretel CLI and SDK. Next, we will create a project to host your model and artifacts.
gretel projects create --display-name healthcare --set-default
Download and preview the dataset that we will be training a synthetic model on.
When specifying your own datasets, you will need enough variation in the data for the neural network to learn the structure and semantics of the data. An ideal dataset for synthetics is 1 to 50 columns of data and 1,000 or more rows of data.
head -n 10 hospital_ehr_data.csv
The head command above previews the first 10 lines of the dataset we will synthesize: the header row followed by nine records.
case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay
1,8,c,3,Z,3,radiotherapy,R,F,2.0,31397,7.0,Emergency,Extreme,2,51-60,4911.0,0-10
2,2,c,5,Z,2,radiotherapy,S,F,2.0,31397,7.0,Trauma,Extreme,2,51-60,5954.0,41-50
3,10,e,1,X,2,anesthesia,S,E,2.0,31397,7.0,Trauma,Extreme,2,51-60,4745.0,31-40
4,26,b,2,Y,2,radiotherapy,R,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,7272.0,41-50
5,26,b,2,Y,2,radiotherapy,S,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,5558.0,41-50
6,23,a,6,X,2,anesthesia,S,F,2.0,31397,7.0,Trauma,Extreme,2,51-60,4449.0,11-20
7,32,f,9,Y,1,radiotherapy,S,B,3.0,31397,7.0,Emergency,Extreme,2,51-60,6167.0,0-10
8,23,a,6,X,4,radiotherapy,Q,F,3.0,31397,7.0,Trauma,Extreme,2,51-60,5571.0,41-50
9,1,d,10,Y,2,gynecology,R,B,4.0,31397,7.0,Trauma,Extreme,2,51-60,7223.0,51-60
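The sizing guideline above (1 to 50 columns, 1,000 or more rows) can be checked before training. The following is a minimal Python sketch using only the standard library. The function name check_dataset_shape and its parameters are hypothetical helpers for illustration, not part of the Gretel CLI or SDK, and it assumes the CSV has a header row.

```python
import csv


def check_dataset_shape(path, min_rows=1000, max_columns=50):
    """Return (n_rows, n_columns, ok) for a CSV file with a header row.

    The thresholds mirror the guideline in this tutorial; they are
    recommendations, not hard limits enforced by any tool.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, [])          # empty file -> no columns
        n_rows = sum(1 for _ in reader)    # count data rows after the header
    n_columns = len(header)
    ok = 1 <= n_columns <= max_columns and n_rows >= min_rows
    return n_rows, n_columns, ok


# Example usage (assumes the file is in the current directory):
# rows, cols, ok = check_dataset_shape("hospital_ehr_data.csv")
```

A result with ok set to False does not mean training will fail, only that the dataset falls outside the range described as ideal.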

Train a synthetic model

Select a configuration template, or download a template from our GitHub and modify it for your use case. We recommend the default template for most datasets. You will need the model ID that is output once training completes.
gretel models create --runner cloud --config synthetics/high-field-count \
--in-data hospital_ehr_data.csv --output . > model-data.json
The models command outputs a JSON object to standard output that can be used by downstream commands in place of the model ID. In the example above, the shell redirect (>) saves that output to model-data.json.
If the --output parameter is specified, the above command will create several files in your local directory:
  • A preview of your synthetic dataset in CSV format.
  • An HTML report that offers deep insight into the quality of the synthetic model.
  • A JSON version of the synthetic quality report that is useful for validating synthetic data model quality programmatically.
  • Log output from the synthetic worker that is helpful for debugging.
For models trained in the Gretel Cloud, model artifacts can be downloaded at any time with the following command:
gretel models get --model-id [model id] --output .
Example synthetic data quality report
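To locate the model ID inside the saved model-data.json, it can help to look at the file's structure first. The sketch below lists the top-level keys of a JSON file using only the Python standard library. The function name list_top_level_keys is a hypothetical helper, and no particular key names are assumed, since the exact layout of the JSON object is not described in this tutorial.

```python
import json


def list_top_level_keys(path):
    """Return the sorted top-level keys of a JSON file.

    Returns an empty list if the top-level value is not a JSON object.
    """
    with open(path) as f:
        data = json.load(f)
    if isinstance(data, dict):
        return sorted(data)
    return []


# Example usage (assumes model-data.json was saved by the command above):
# print(list_top_level_keys("model-data.json"))
```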

Generate a synthetic dataset

Now we will use our synthetic model to create a synthetic dataset. Copy the model ID returned by your gretel models create command, or pass the saved model-data.json file in its place, as in the command below.
gretel records generate --model-id model-data.json --runner cloud \
--num-records 5000 --max-invalid 5000 --output .
If the --output parameter is specified, the above command will create several files in your local directory:
  • data.gz - Your synthetic dataset in CSV format.
  • logs.json.gz - Log output from the synthetic worker that is helpful for debugging.
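Because data.gz is compressed, it needs to be decompressed before the records can be read. A minimal sketch of previewing it in Python with the standard library follows. The function name read_synthetic_records is a hypothetical helper, and the code assumes the file is gzip-compressed CSV text whose first row is a header.

```python
import csv
import gzip


def read_synthetic_records(path, limit=5):
    """Return (header, rows) with at most `limit` data rows.

    Assumes `path` is a gzip-compressed CSV with a header row.
    """
    with gzip.open(path, mode="rt", newline="") as f:
        reader = csv.reader(f)
        header = next(reader, [])
        # zip stops after `limit` rows without reading the rest of the file
        rows = [row for _, row in zip(range(limit), reader)]
    return header, rows


# Example usage (assumes data.gz is in the current directory):
# header, rows = read_synthetic_records("data.gz", limit=10)
```

Reading only the first few rows keeps the preview fast even when many records were generated.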

Video walkthrough