Create Synthetic Data
In this tutorial, we will walk through using the Gretel CLI to create a synthetic hospital electronic health record (EHR) dataset.
If you haven’t already, install the Gretel CLI and SDK. Next, we will create a project to host your model and artifacts.
gretel projects create --display-name healthcare --set-default
Download and preview the dataset that we will be training a synthetic model on.
When specifying your own datasets, you will need enough variation in the data for the neural network to learn the structure and semantics of the data. An ideal dataset for synthetics is 1 to 50 columns of data and 1,000 or more rows of data.
wget https://gretel-public-website.s3.amazonaws.com/datasets/healthcare-analytics/hospital_ehr_data.csv
head -n 10 hospital_ehr_data.csv
The commands above download the dataset and print its header row plus the first nine records:
case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay
1,8,c,3,Z,3,radiotherapy,R,F,2.0,31397,7.0,Emergency,Extreme,2,51-60,4911.0,0-10
2,2,c,5,Z,2,radiotherapy,S,F,2.0,31397,7.0,Trauma,Extreme,2,51-60,5954.0,41-50
3,10,e,1,X,2,anesthesia,S,E,2.0,31397,7.0,Trauma,Extreme,2,51-60,4745.0,31-40
4,26,b,2,Y,2,radiotherapy,R,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,7272.0,41-50
5,26,b,2,Y,2,radiotherapy,S,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,5558.0,41-50
6,23,a,6,X,2,anesthesia,S,F,2.0,31397,7.0,Trauma,Extreme,2,51-60,4449.0,11-20
7,32,f,9,Y,1,radiotherapy,S,B,3.0,31397,7.0,Emergency,Extreme,2,51-60,6167.0,0-10
8,23,a,6,X,4,radiotherapy,Q,F,3.0,31397,7.0,Trauma,Extreme,2,51-60,5571.0,41-50
9,1,d,10,Y,2,gynecology,R,B,4.0,31397,7.0,Trauma,Extreme,2,51-60,7223.0,51-60
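The sizing guidance given earlier (1 to 50 columns and 1,000 or more rows) can be checked before training. The following is a minimal Python sketch using only the standard library; the function names `dataset_shape` and `meets_guidance` are hypothetical helpers written for this illustration and are not part of the Gretel CLI or SDK. It assumes the CSV has a single header row.

```python
import csv
from pathlib import Path


def dataset_shape(csv_path):
    """Return (data_rows, columns) for a CSV file that has one header row."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])
        # Skip blank lines so a trailing empty line is not counted as a record.
        rows = sum(1 for row in reader if row)
    return rows, len(header)


def meets_guidance(rows, columns, min_rows=1000, max_columns=50):
    """Check the tutorial's rule of thumb: 1 to 50 columns, 1,000+ rows."""
    return rows >= min_rows and 1 <= columns <= max_columns


if __name__ == "__main__":
    # Assumes the file downloaded earlier is in the current directory.
    path = Path("hospital_ehr_data.csv")
    if path.exists():
        rows, columns = dataset_shape(path)
        print(f"{rows} rows, {columns} columns")
        print("Within guidance:", meets_guidance(rows, columns))
```

The thresholds are passed as parameters because the guidance describes an ideal range rather than a hard limit, so a dataset outside it may still be usable.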
Select a configuration template, or download a template from our GitHub and make any modifications you’d like for your use case. We recommend the default template for most datasets; the command below uses the synthetics/high-field-count template. You will need the model ID that is output after training completes.
gretel models create --runner cloud --config synthetics/high-field-count \
--in-data hospital_ehr_data.csv --output . > model-data.json
The models command outputs a JSON object to standard output that can be used by downstream commands in place of the model ID. In the example above, the output is saved to model-data.json.
If the --output parameter is specified, the above command will create several files in your local directory. For models trained in the Gretel Cloud, model artifacts can be downloaded at any time with the following command:
gretel models get --model-id [model id] --output .
Filename | Description |
data_preview.gz | A preview of your synthetic dataset in CSV format. |
report.html.gz | HTML report that offers deep insight into the quality of the synthetic model. |
report-json.json.gz | A JSON version of the synthetic quality report that is useful to validate synthetic data model quality programmatically. |
logs.json.gz | Log output from the synthetic worker that is helpful for debugging. |
Attachment: report.html (1MB), an example synthetic data quality report.
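Because report-json.json.gz is gzip-compressed JSON, it can be opened programmatically without unpacking it first. The sketch below uses only the Python standard library and deliberately avoids assuming any field names in the report: it assumes only that the top level is a JSON object and lists its keys so you can see what is available to validate. The helper names `load_gzipped_json` and `summarize_top_level` are hypothetical and written for this illustration.

```python
import gzip
import json
from pathlib import Path


def load_gzipped_json(path):
    """Read a gzip-compressed JSON file and return the parsed object."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def summarize_top_level(report):
    """Map each top-level key to the type name of its value.

    Assumes the report is a JSON object (a Python dict once parsed).
    """
    return {key: type(value).__name__ for key, value in report.items()}


if __name__ == "__main__":
    # Assumes the artifact was written to the current directory by --output .
    path = Path("report-json.json.gz")
    if path.exists():
        report = load_gzipped_json(path)
        for key, type_name in summarize_top_level(report).items():
            print(f"{key}: {type_name}")
```

Once you know which keys the report contains, a quality check can compare the relevant values against thresholds of your choosing.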
Now we will use our synthetic model to create a synthetic dataset. Copy the model ID returned by your gretel models create command, or pass the saved model-data.json file in its place, as in the command below.
gretel records generate --model-id model-data.json --runner cloud \
--num-records 5000 --max-invalid 5000 --output .
If the --output parameter is specified, the above command will create several files in your local directory:
- data.gz - Your synthetic dataset in CSV format.
- logs.json.gz - Log output from the synthetic worker that is helpful for debugging.
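To confirm that generation produced what you asked for, you can read data.gz directly as gzip-compressed CSV and compare it with the training file. This is a minimal sketch using only the Python standard library; `read_gzipped_csv` is a hypothetical helper, and the sketch assumes that data.gz begins with a header row, which is not stated in this tutorial.

```python
import csv
import gzip
from pathlib import Path


def read_gzipped_csv(path):
    """Return (header, rows) from a gzip-compressed CSV file.

    Assumes the first line is a header row; blank lines are skipped.
    """
    with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])
        rows = [row for row in reader if row]
    return header, rows


if __name__ == "__main__":
    synthetic = Path("data.gz")
    source = Path("hospital_ehr_data.csv")
    if synthetic.exists() and source.exists():
        header, rows = read_gzipped_csv(synthetic)
        with open(source, newline="", encoding="utf-8") as f:
            source_header = next(csv.reader(f), [])
        print("Synthetic records:", len(rows))
        print("Columns match source:", header == source_header)
```

With the command above, the record count can be compared against the value passed to --num-records.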