Apache Iceberg Tables in Amazon S3

This connector materializes delta updates of Flow collections into Apache Iceberg tables, using Amazon S3 for object storage and either AWS Glue or an Iceberg REST server as the Iceberg catalog.

The delta updates are batched within Flow, converted to Parquet files, and then appended to Iceberg tables at a time interval that you set.

ghcr.io/estuary/materialize-s3-iceberg:dev provides the latest connector image. You can also follow the link in your browser to see past image versions.

Prerequisites

To use this connector, you'll need:

  • An S3 bucket to write files to. See this guide for instructions on setting up a new S3 bucket.
  • An AWS root or IAM user with read and write access to the S3 bucket. For this user, you'll need the access key and secret access key. See the AWS blog for help finding these credentials.

If using the AWS Glue Catalog:

  • The AWS root or IAM user must have access to AWS Glue. See this guide for instructions on setting up IAM permissions for a user to access AWS Glue.

If using the REST Catalog:

  • The URI for connecting to the catalog.
  • The name of the warehouse to connect to.
  • Credentials for connecting to the catalog.

Configuration

Use the properties below to configure the materialization, which will direct one or more of your Flow collections to your tables.

Properties

Endpoint

| Property | Title | Description | Type | Required/Default |
|---|---|---|---|---|
| /aws_access_key_id | AWS Access Key ID | Access Key ID for accessing AWS services. | string | Required |
| /aws_secret_access_key | AWS Secret Access Key | Secret Access Key for accessing AWS services. | string | Required |
| /bucket | Bucket | The S3 bucket to write data files to. | string | Required |
| /prefix | Prefix | Optional prefix that will be used to store objects. | string | |
| /region | Region | AWS Region. | string | Required |
| /namespace | Namespace | Namespace for bound collection tables (unless overridden within the binding resource configuration). | string | Required |
| /upload_interval | Upload Interval | Frequency at which files will be uploaded. Must be a valid ISO8601 duration string no greater than 4 hours. | string | PT5M |
| /catalog/catalog_type | Catalog Type | Either "Iceberg REST Server" or "AWS Glue". | string | Required |
| /catalog/uri | URI | URI identifying the REST catalog, in the format of 'https://yourserver.com/catalog'. | string | Required |
| /catalog/credential | Credential | Credential for connecting to the REST catalog. | string | |
| /catalog/token | Token | Token for connecting to the REST catalog. | string | |
| /catalog/warehouse | Warehouse | Warehouse to connect to in the REST catalog. | string | Required |
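
The /upload_interval constraint (a valid ISO8601 duration of no more than 4 hours) can be checked before deploying a configuration. The following is a minimal sketch, not the connector's own validation: the helper names `parse_duration` and `is_valid_upload_interval` are hypothetical, and the parser assumes only the day, hour, minute, and second designators are used.

```python
import re
from datetime import timedelta

# Assumption: only the D/H/M/S designators are supported (no years, months, or weeks).
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?$"
)

# Upper bound stated in the /upload_interval description.
MAX_UPLOAD_INTERVAL = timedelta(hours=4)


def parse_duration(value: str) -> timedelta:
    """Parse a duration such as 'PT5M' or 'PT1H30M' into a timedelta."""
    match = _DURATION_RE.match(value)
    if not match:
        raise ValueError(f"not a supported ISO8601 duration: {value!r}")
    parts = {k: float(v) for k, v in match.groupdict().items() if v is not None}
    # Reject strings with no components ('P', 'PT') or a dangling 'T' ('P1DT').
    if not parts or value.endswith("T"):
        raise ValueError(f"not a supported ISO8601 duration: {value!r}")
    return timedelta(**parts)


def is_valid_upload_interval(value: str) -> bool:
    """Return True if value parses and does not exceed 4 hours."""
    try:
        return parse_duration(value) <= MAX_UPLOAD_INTERVAL
    except ValueError:
        return False
```

For example, `is_valid_upload_interval("PT5M")` accepts the default value, while `is_valid_upload_interval("P1D")` rejects a one-day interval because it exceeds the 4-hour limit.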

Bindings

| Property | Title | Description | Type | Required/Default |
|---|---|---|---|---|
| /table | Table | Name of the database table. | string | Required |
| /namespace | Alternative Namespace | Alternative namespace for this table (optional). | string | |
| /delta_updates | Delta Updates | Should updates to this table be done via delta updates. Currently this connector only supports delta updates. | bool | true |

Sample

materializations:
  ${PREFIX}/${mat_name}:
    endpoint:
      connector:
        image: "ghcr.io/estuary/materialize-s3-iceberg:dev"
        config:
          aws_access_key_id: <access_key_id>
          aws_secret_access_key: <secret_access_key>
          bucket: bucket
          region: us-east-2
          namespace: namespace
          upload_interval: PT5M
    bindings:
      - resource:
          table: ${COLLECTION_NAME}
          delta_updates: true
        source: ${PREFIX}/${COLLECTION_NAME}

Iceberg Column Types

Flow collection fields are written to Iceberg table columns based on the data type of the field. Iceberg V2 primitive type columns are created for these Flow collection fields:

| Collection Field Data Type | Iceberg Column Type |
|---|---|
| array | string |
| object | string |
| boolean | boolean |
| integer | long |
| number | double |
| string with {contentEncoding: base64} | binary |
| string with {format: date-time} | timestamptz (with microsecond precision) |
| string with {format: date} | date |
| string with {format: integer} | long |
| string with {format: number} | double |
| string (all others) | string |

Flow collection fields with {type: string, format: time} and {type: string, format: uuid} are materialized as string columns rather than time and uuid columns for compatibility with Apache Spark. Nested types are not currently supported.
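
The mapping above can be expressed as a small lookup, which may help when predicting the column type a field will receive. This is an illustrative sketch and not the connector's implementation: the function name `iceberg_column_type` is hypothetical, the field is assumed to be a dict with a single `type`, and checking `contentEncoding` before `format` is an assumption, since the documentation does not state a precedence.

```python
# Non-string field types and their Iceberg column types, per the table above.
_SIMPLE_TYPES = {
    "array": "string",
    "object": "string",
    "boolean": "boolean",
    "integer": "long",
    "number": "double",
}

# String formats that map to a non-string column. Other formats,
# including time and uuid, fall back to string.
_STRING_FORMATS = {
    "date-time": "timestamptz",
    "date": "date",
    "integer": "long",
    "number": "double",
}


def iceberg_column_type(field: dict) -> str:
    """Return the Iceberg column type for a single-typed field schema."""
    field_type = field.get("type")
    if field_type == "string":
        # Assumption: contentEncoding takes precedence over format.
        if field.get("contentEncoding") == "base64":
            return "binary"
        return _STRING_FORMATS.get(field.get("format"), "string")
    try:
        return _SIMPLE_TYPES[field_type]
    except (KeyError, TypeError):
        # Multi-typed fields (e.g. a list of types) are outside this sketch.
        raise ValueError(f"unsupported field type: {field_type!r}")
```

For example, `iceberg_column_type({"type": "string", "format": "uuid"})` returns `"string"`, matching the Apache Spark compatibility note above.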

Table Maintenance

To keep query performance optimal, perform regular maintenance on your materialized tables. The connector does not perform this maintenance automatically, although support for automatic table maintenance is planned.

If you're using the AWS Glue catalog, you can enable automatic data file compaction by following this guide.

At-Least-Once Semantics

In rare cases, it may be possible for documents from a source collection to be appended to a target table more than once. Users of materialized tables should take this possibility into consideration when querying these tables.
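
One way a consumer might account for repeated appends is to drop exact duplicate rows after reading them. The sketch below assumes rows have already been loaded as Python dicts and that two fully identical rows represent a repeated append. That assumption may not hold for your data, because legitimately identical documents would also be collapsed. The helper name `drop_exact_duplicates` is hypothetical.

```python
import json


def drop_exact_duplicates(rows):
    """Return rows in original order with exact duplicates removed.

    Assumption: identical rows are treated as repeated appends.
    """
    seen = set()
    unique = []
    for row in rows:
        # Serialize with sorted keys so key order does not affect equality;
        # default=str covers values such as dates that JSON cannot encode.
        fingerprint = json.dumps(row, sort_keys=True, default=str)
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        unique.append(row)
    return unique
```

If your documents carry a unique identifier, deduplicating on that identifier instead of the whole row is likely a safer choice.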