S3 Connector
This document provides a general overview of using AWS S3 as both the source and the sink for a Gretel Connector.
See Deploying On-Prem for step-by-step instructions for how to deploy a Gretel S3 Connector with Local Workers in your AWS environment.
The Gretel S3 Connector can be configured to continuously watch for new objects in a source S3 bucket, call Gretel Workers to transform records within those objects (for example to replace or encrypt PII), and write the results to a destination S3 bucket.
Below is an example S3 Connector config. In this pipeline, all CSVs in the my-connector-source bucket prefixed with sandbox will be transformed so any PII is removed. After the S3 object has been de-identified, the object will be written into the my-connector-destination bucket and prefixed with output/sandbox.

    version: 1
    sources:
      - name: my_s3_source
        type: s3
        config:
          bucket: my-connector-source
          path_prefix: sandbox
          glob_filter: "*.csv"
          trigger:
            type: sqs
            endpoint: https://sqs.us-east-2.amazonaws.com/123456789012/s3-connector-inbound
    sinks:
      - name: my_s3_sink
        type: s3
        config:
          bucket: my-connector-destination
          path_prefix: output/sandbox
    connectors:
      - name: default
        version: dev
        max_active: 1
        source: my_s3_source
        sink: my_s3_sink
        model: transform/default
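In the config above, each connector refers to its source and sink by name. As a rough illustration of that wiring, the sketch below checks that every connector's source and sink names resolve to a defined entry. It is a hypothetical pre-flight check, not part of the Gretel tooling: the function name unresolved_references and the trimmed-down dictionary mirroring the config are assumptions made for this example.

```python
# Hypothetical pre-flight check; not part of the Gretel tooling.
# The dictionary mirrors only the name fields of the example config.
config = {
    "sources": [{"name": "my_s3_source", "type": "s3"}],
    "sinks": [{"name": "my_s3_sink", "type": "s3"}],
    "connectors": [
        {
            "name": "default",
            "source": "my_s3_source",
            "sink": "my_s3_sink",
            "model": "transform/default",
        },
    ],
}


def unresolved_references(cfg):
    """Return (connector, field, name) for each source/sink name that is not defined."""
    defined = {
        "source": {entry["name"] for entry in cfg.get("sources", [])},
        "sink": {entry["name"] for entry in cfg.get("sinks", [])},
    }
    problems = []
    for connector in cfg.get("connectors", []):
        for field, names in defined.items():
            if connector.get(field) not in names:
                problems.append((connector.get("name"), field, connector.get(field)))
    return problems


print(unresolved_references(config))  # prints []
```

An empty list means every connector points at a source and sink that exist in the same file.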
- bucket: The name of the source bucket to ingest data from.
- path_prefix: Objects matching this prefix will be processed through the connector. If an object does not match the prefix, it will be skipped.
- glob_filter: Filters for objects matching a specific glob pattern. This is useful for filtering objects by file type. If an object does not match the filter, it will be omitted from processing. Glob filters follow standard Unix-style pathname pattern expansion.
- trigger: The S3 connector is built to continuously poll for new objects arriving in a bucket. A trigger config defines where to poll for new objects. Currently, only SQS triggers are supported.
  - type: sqs configures the connector to continuously poll an SQS queue for new S3 change events.
  - endpoint: Specifies the SQS endpoint to poll for new events.
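The source settings above can be read as a filter over incoming S3 change events. The following is a minimal sketch of that idea, not the connector's actual implementation. It assumes the SQS message body is a standard S3 event notification (a JSON document with a Records list), and it assumes the glob filter is tested against the file name only; the source does not say whether the connector matches the full key instead. The function names object_keys and is_selected are hypothetical.

```python
import json
from fnmatch import fnmatchcase
from posixpath import basename
from urllib.parse import unquote_plus


def object_keys(message_body):
    """Yield (bucket, key) for each record in an S3 event notification body."""
    event = json.loads(message_body)
    # Messages without "Records" (such as S3 test events) yield nothing.
    for record in event.get("Records", []):
        s3 = record["s3"]
        # S3 URL-encodes object keys in event notifications.
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


def is_selected(bucket, key, *, source_bucket, path_prefix, glob_filter):
    """Apply the bucket, path_prefix, and glob_filter settings to one object."""
    # Assumption: the glob is matched against the file name, not the full key.
    return (
        bucket == source_bucket
        and key.startswith(path_prefix)
        and fnmatchcase(basename(key), glob_filter)
    )


# Example message body with three objects; only the first passes both filters.
body = json.dumps({
    "Records": [
        {"s3": {"bucket": {"name": "my-connector-source"},
                "object": {"key": "sandbox/users+2024.csv"}}},
        {"s3": {"bucket": {"name": "my-connector-source"},
                "object": {"key": "sandbox/users.parquet"}}},
        {"s3": {"bucket": {"name": "my-connector-source"},
                "object": {"key": "prod/users.csv"}}},
    ]
})

selected = [
    key
    for bucket, key in object_keys(body)
    if is_selected(bucket, key, source_bucket="my-connector-source",
                   path_prefix="sandbox", glob_filter="*.csv")
]
print(selected)  # prints ['sandbox/users 2024.csv']
```

The Parquet object fails the glob filter and the object under prod fails the prefix check, so both would be skipped.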
- bucket: The destination bucket to write objects back to.
- path_prefix: Rewrites the source object path to the specified path prefix.
Connectors support all of the same file types that are supported by the Gretel CLI, with a few limitations:
- Compressed CSV and JSON files are supported, but will arrive in the destination bucket uncompressed.
- If file types or schemas are inconsistent within a single pipeline (for example, the source S3 bucket contains both CSV and Parquet files), you must train a new model per file type or schema. If data sources are consistent, the same model can be reused. Please see Specifying a Model for more information on how to configure the connector model.