Amazon Redshift

This connector materializes Flow collections into tables in an Amazon Redshift database.

The connector uses your AWS account to materialize to Redshift tables by way of files in an S3 bucket. The files in the bucket act as a temporary staging area for data storage and retrieval.

ghcr.io/estuary/materialize-redshift:dev provides the latest connector image. You can also follow the link in your browser to see past image versions.

Prerequisites

To use this connector, you'll need:

  • A Redshift cluster accessible either directly or using an SSH tunnel. The user configured to connect to Redshift must have at least "create table" permissions for the configured schema. The connector will create new tables in the database per your specification. Tables created manually in advance are not supported. See setup for more information.
  • An S3 bucket for staging temporary files. For best performance the bucket should be in the same region as your Redshift cluster. See this guide for instructions on setting up a new S3 bucket.
  • An AWS root or IAM user with read and write access to the S3 bucket. For this user, you'll need the access key and secret access key. See the AWS blog for help finding these credentials.

Configuration

Use the properties below to configure an Amazon Redshift materialization, which will direct one or more of your Flow collections to your desired tables in the database.

Properties

Endpoint

| Property | Title | Description | Type | Required/Default |
|---|---|---|---|---|
| /address | Address | Host and port of the database. Example: red-shift-cluster-name.account.us-east-2.redshift.amazonaws.com:5439 | string | Required |
| /user | User | Database user to connect as. | string | Required |
| /password | Password | Password for the specified database user. | string | Required |
| /database | Database | Name of the logical database to materialize to. If omitted, the materialization will attempt to connect to the default database for the provided user. | string | |
| /schema | Database Schema | Database schema for bound collection tables (unless overridden within the binding resource configuration) as well as associated materialization metadata tables. | string | "public" |
| /awsAccessKeyId | Access Key ID | AWS Access Key ID for reading and writing data to the S3 staging bucket. | string | Required |
| /awsSecretAccessKey | Secret Access Key | AWS Secret Access Key for reading and writing data to the S3 staging bucket. | string | Required |
| /bucket | S3 Staging Bucket | Name of the S3 bucket to use for staging data loads. | string | Required |
| /region | Region | Region of the S3 staging bucket. For optimal performance this should be in the same region as the Redshift database cluster. | string | Required |
| /bucketPath | Bucket Path | A prefix that will be used to store objects in S3. | string | |
| /advanced | Advanced Options | Options for advanced users. You should not typically need to modify these. | object | |
| /advanced/updateDelay | Update Delay | Potentially reduce active cluster time by increasing the delay between updates. Defaults to 30 minutes if unset. | string | |
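
As an illustrative sketch, the optional staging properties slot into the config stanza as follows. The prefix and delay values here are assumptions, not recommendations, and this assumes updateDelay accepts a duration string:

config:
  # ...required properties as shown in the sample below...
  bucketPath: flow-staging/   # illustrative S3 key prefix for staged objects
  advanced:
    updateDelay: 1h           # wait longer between updates to reduce active cluster time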

Bindings

| Property | Title | Description | Type | Required/Default |
|---|---|---|---|---|
| /table | Table | Name of the database table. | string | Required |
| /delta_updates | Delta Update | Should updates to this table be done via delta updates. Default is false. | boolean | false |
| /schema | Alternative Schema | Alternative schema for this table (optional). | string | |
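
For example, a binding that routes one table into a different schema might look like the following sketch (the staging schema name is illustrative):

bindings:
  - resource:
      table: ${TABLE_NAME}
      schema: staging   # overrides the endpoint-level /schema for this table only
    source: ${PREFIX}/${COLLECTION_NAME}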

Sample

materializations:
  ${PREFIX}/${mat_name}:
    endpoint:
      connector:
        image: "ghcr.io/estuary/materialize-redshift:dev"
        config:
          address: "redshift-cluster.account.us-east-2.redshift.amazonaws.com:5439"
          user: user
          password: password
          database: db
          awsAccessKeyId: access_key_id
          awsSecretAccessKey: secret_access_key
          bucket: my-bucket
          region: us-east-2
    bindings:
      - resource:
          table: ${TABLE_NAME}
        source: ${PREFIX}/${COLLECTION_NAME}

Setup

You must configure your cluster to allow connections from Estuary. This can be accomplished by making your cluster accessible over the internet to Estuary Flow's IP address 34.121.207.128, or by using an SSH tunnel. Connecting to the S3 staging bucket does not use the network tunnel and connects over HTTPS only.

Instructions for making a cluster accessible over the internet can be found here. When using this option, database connections are made over SSL only.

To allow secure connections via SSH tunneling:

  1. Refer to the guide to configure an SSH server on an AWS EC2 instance.

  2. Configure your connector as described in the configuration section above, adding the networkTunnel stanza to enable the SSH tunnel (see the sketch after this list). See Connecting to endpoints on secure networks for additional details and a sample.
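
As a sketch, the networkTunnel stanza sits within the endpoint configuration; the SSH endpoint and key below are placeholders:

config:
  # ...other endpoint properties...
  networkTunnel:
    sshForwarding:
      sshEndpoint: ssh://ec2-user@ec2-198-51-100-1.compute-1.amazonaws.com:22
      privateKey: |
        -----BEGIN RSA PRIVATE KEY-----
        ...
        -----END RSA PRIVATE KEY-----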

Naming Conventions

Redshift has requirements for names and identifiers, and this connector automatically applies quoting when needed. All table identifiers and column identifiers (corresponding to Flow collection fields) are treated as lowercase. Table names for bindings, as well as field names of the source collection, must be unique on a case-insensitive basis. If any names are not unique on a case-insensitive basis (for example, myField vs. MyField), the materialization will fail to apply.

If necessary, you can add projections to your collection specification to change field names.
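
For instance, if a collection has both myField and MyField, a projection can expose one of them under a distinct name; a minimal sketch (the field names are hypothetical):

collections:
  ${PREFIX}/${COLLECTION_NAME}:
    projections:
      # materialize the JSON location /MyField under a lowercase-unique field name
      my_field_alt: /MyField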

Performance considerations

For best performance there should be at most one Redshift materialization active per Redshift schema. Rather than creating a separate materialization for each collection, add additional collections as bindings to this single materialization.
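
Concretely, this means one materialization with several bindings, along these lines (the table and collection names are illustrative):

bindings:
  - resource:
      table: orders
    source: ${PREFIX}/orders
  - resource:
      table: customers
    source: ${PREFIX}/customers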

In order to achieve exactly-once processing of collection documents, the materialization creates and uses metadata tables located in the schema configured by the endpoint schema property. To commit a transaction, a table-level lock is acquired on these metadata tables. If there are multiple materializations using the same metadata tables, they will need to take turns acquiring these locks. This locking behavior prevents "serializable isolation violation" errors in the case of multiple materializations sharing the same metadata tables at the expense of allowing only a single materialization to be actively committing a transaction.

Maximum record size

The maximum size of a single input document is 4 MB. Attempting to materialize collections with documents larger than 4 MB will result in an error. To materialize this data you can use a derivation to create a derived collection with smaller documents, or exclude fields containing excessive amounts of data by customizing the materialized fields.
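
As a sketch, an oversized field can be excluded through the binding's field selection (the field name is hypothetical):

bindings:
  - resource:
      table: ${TABLE_NAME}
    source: ${PREFIX}/${COLLECTION_NAME}
    fields:
      recommended: true
      exclude:
        - oversized_blob   # omit this field from the materialized table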

Delta updates

This connector supports both standard (merge) and delta updates. The default is to use standard updates.
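
To opt a binding into delta updates, set delta_updates in its resource configuration, as in this sketch:

bindings:
  - resource:
      table: ${TABLE_NAME}
      delta_updates: true   # apply documents as appends rather than merged updates
    source: ${PREFIX}/${COLLECTION_NAME}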