This document provides a general overview of using AWS S3 as both the source and the sink for a Gretel Connector.
See Deploying On-Prem for step-by-step instructions on deploying a Gretel S3 Connector with Local Workers in your AWS environment.
The Gretel S3 Connector can be configured to continuously watch for new objects in a source S3 bucket, call Gretel Workers to transform records within those objects (for example to replace or encrypt PII), and write the results to a destination S3 bucket.
Below is an example S3 Connector config. In this pipeline, all CSVs in the `my-connector-source` bucket prefixed with `sandbox` will be transformed so that any PII is removed. After an S3 object has been de-identified, it will be written into the `my-connector-destination` bucket under the configured destination path prefix.
The config defines three named components: an S3 source (`my_s3_source`), an S3 sink (`my_s3_sink`), and a pipeline (`default`) that connects them.
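A minimal sketch of such a config, assuming a top-level layout with `sources`, `sinks`, and `pipelines` sections (the exact schema may differ); the field names follow the option descriptions below, and the SQS endpoint and destination prefix are placeholders:

```yaml
sources:
  - name: my_s3_source
    type: s3
    config:
      bucket: my-connector-source       # source bucket to watch
      path_prefix: sandbox              # only objects under this prefix are processed
      glob_filter: "*.csv"              # restrict processing to CSV objects
      trigger:
        type: sqs                       # poll an SQS queue for S3 change events
        endpoint: https://sqs.us-east-1.amazonaws.com/123456789012/my-queue  # placeholder queue URL

sinks:
  - name: my_s3_sink
    type: s3
    config:
      bucket: my-connector-destination  # destination bucket for de-identified objects
      path_prefix: processed            # hypothetical destination prefix

pipelines:
  - name: default
    source: my_s3_source
    sink: my_s3_sink
```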
Source options:

- `bucket` - The name of the source bucket to ingest data from.
- `path_prefix` - Objects matching this prefix will be processed through the connector; objects that do not match the prefix are skipped.
- `glob_filter` - Filters for objects matching a specific glob pattern. This is useful for filtering objects by file type; an object that does not match the filter is omitted from processing. Glob filters follow standard Unix-style pathname pattern expansion.
- `trigger` - The S3 connector is built to continuously poll for new objects arriving in a bucket. A trigger config defines where to poll for new objects. Currently, only SQS triggers are supported.
  - `type: sqs` - Configures the connector to continuously poll an SQS queue for new S3 change events.
  - `endpoint` - Specifies the SQS endpoint to poll for new events.

Sink options:

- `bucket` - The destination bucket to write objects back to.
- `path_prefix` - Rewrites the source object path to the specified path prefix.
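The `trigger` options described above rely on S3 Event Notifications: the source bucket must be configured to publish `s3:ObjectCreated:*` events to the SQS queue that the connector polls, and the queue's access policy must allow S3 to send messages to it. A hypothetical trigger fragment (the queue URL is a placeholder):

```yaml
trigger:
  type: sqs   # poll an SQS queue for S3 change events
  endpoint: https://sqs.us-east-1.amazonaws.com/123456789012/my-queue  # placeholder queue URL
```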
Connectors support the same file types as the Gretel CLI, with a few limitations:
- Compressed CSV and JSON files are supported, but will arrive in the destination bucket uncompressed.
- If file types or schemas are inconsistent within a single pipeline (for example, the source S3 bucket contains both CSV and Parquet files), you must train a separate model per file type or schema. If data sources are consistent, the same model can be re-used. Please see Specifying a Model for more information on how to configure the connector model.
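When a source bucket mixes file types, one hypothetical way to keep each pipeline's data consistent is to define a separate source per file type using the `glob_filter` option described above, so each model only ever sees one schema (the source names here are illustrative):

```yaml
sources:
  - name: csv_source
    type: s3
    config:
      bucket: my-connector-source
      glob_filter: "*.csv"      # this source only sees CSV objects
  - name: parquet_source
    type: s3
    config:
      bucket: my-connector-source
      glob_filter: "*.parquet"  # this source only sees Parquet objects
```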