Amazon S3
This connector captures data from an Amazon S3 bucket.
It is available for use in the Flow web application. For local development or open-source workflows, ghcr.io/estuary/source-s3:dev
provides the latest version of the connector as a Docker image. You can also follow the link in your browser to see past image versions.
Prerequisites
You can use this connector to capture data from an entire S3 bucket or from a prefix within a bucket. The bucket or prefix must be either:
- Publicly accessible and allowing anonymous reads.
- Accessible via a root or IAM user.
In either case, you'll need an access policy. Policies in AWS are JSON objects that define permissions. You attach them to resources, which include both IAM users and S3 buckets.
See the steps below to set up access.
Setup: Public buckets
For a public bucket, the bucket access policy must allow anonymous reads on the whole bucket or a specific prefix.
- Create a bucket policy using the templates below.

Anonymous reads policy - Full bucket:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "BucketAnonymousRead",
"Effect": "Allow",
"Principal": "*",
"Action": [
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::EXAMPLE_BUCKET"
]
},
{
"Effect": "Allow",
"Principal": "*",
"Action": [
"s3:GetObject"
],
"Resource": [
"arn:aws:s3:::EXAMPLE_BUCKET/*"
]
}
]
}
Anonymous reads policy - Specific prefix:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "BucketPrefixAnonymousRead",
"Effect": "Allow",
"Principal": "*",
"Action": [
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::EXAMPLE_BUCKET"
],
"Condition": {
"StringLike": {
"s3:prefix": [
"EXAMPLE_PREFIX",
"EXAMPLE_PREFIX/*"
]
}
}
},
{
"Effect": "Allow",
"Principal": "*",
"Action": [
"s3:GetObject"
],
"Resource": [
"arn:aws:s3:::EXAMPLE_BUCKET/EXAMPLE_PREFIX/*"
]
}
]
}
- Add the policy to your bucket. Paste over the existing policy and resolve any errors or warnings before saving.
- Confirm that the Block public access setting on the bucket is disabled.
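Once anonymous reads are allowed, no AWS credentials are needed in the connector configuration. Below is a minimal sketch of the endpoint configuration for a public bucket, using placeholder names (bindings are omitted for brevity):

```yaml
captures:
  ${PREFIX}/${CAPTURE_NAME}:
    endpoint:
      connector:
        image: ghcr.io/estuary/source-s3:dev
        config:
          # No awsAccessKeyId or awsSecretAccessKey: the bucket policy allows anonymous reads.
          bucket: "example-public-bucket"   # placeholder bucket name
          prefix: "EXAMPLE_PREFIX/"         # optional: capture only objects under this prefix
          region: "us-east-1"
```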
Setup: Accessing with a user account
For buckets accessed by a user account, you'll need the AWS access key and secret access key for the user. You'll also need to apply an access policy to the user to grant access to the specific bucket or prefix.
- Create an IAM user if you don't yet have one to use with Flow.
- Note the user's access key and secret access key. See the AWS blog for help finding these credentials.
- Create an IAM policy using the templates below.

IAM user access policy - Full bucket:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "UserAccessFullBucket",
"Effect": "Allow",
"Action": [
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::EXAMPLE_BUCKET"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject"
],
"Resource": [
"arn:aws:s3:::EXAMPLE_BUCKET/*"
]
}
]
}
IAM user access policy - Specific prefix:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "UserAccessBucketPrefix",
"Effect": "Allow",
"Action": [
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::EXAMPLE_BUCKET"
],
"Condition": {
"StringLike": {
"s3:prefix": [
"EXAMPLE_PREFIX",
"EXAMPLE_PREFIX/*"
]
}
}
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject"
],
"Resource": [
"arn:aws:s3:::EXAMPLE_BUCKET/EXAMPLE_PREFIX/*"
]
}
]
}
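With the policy in place, the IAM user's key pair is supplied to the connector through the awsAccessKeyId and awsSecretAccessKey properties described under Configuration below. A minimal sketch of that portion of the endpoint configuration, using placeholder values:

```yaml
config:
  awsAccessKeyId: "AKIA_EXAMPLE_KEY_ID"       # placeholder: the IAM user's access key ID
  awsSecretAccessKey: "EXAMPLE_SECRET_KEY"    # placeholder: the IAM user's secret access key
  bucket: "EXAMPLE_BUCKET"
  prefix: "EXAMPLE_PREFIX/"                   # optional: limit the capture to this prefix
  region: "us-east-1"
```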
Configuration
You configure connectors either in the Flow web app, or by directly editing the catalog specification file. See connectors to learn more about using connectors. The values and specification sample below provide configuration details specific to the S3 source connector.
Properties
Endpoint
Property | Title | Description | Type | Required/Default |
---|---|---|---|---|
/advanced | | Options for advanced users. You should not typically need to modify these. | object | |
/advanced/ascendingKeys | Ascending Keys | Improve sync speeds by listing files from the end of the last sync, rather than listing the entire bucket prefix. This requires that you write objects in ascending lexicographic order, such as an RFC-3339 timestamp, so that key ordering matches modification time ordering. If data is not ordered correctly, using ascending keys could cause errors. | boolean | false |
/advanced/endpoint | AWS Endpoint | The AWS endpoint URI to connect to. Use if you're capturing from an S3-compatible API that isn't provided by AWS. | string | |
/awsAccessKeyId | AWS Access Key ID | Part of the AWS credentials that will be used to connect to S3. Required unless the bucket is public and allows anonymous listings and reads. | string | |
/awsSecretAccessKey | AWS Secret Access Key | Part of the AWS credentials that will be used to connect to S3. Required unless the bucket is public and allows anonymous listings and reads. | string | |
/bucket | Bucket | Name of the S3 bucket | string | Required |
/matchKeys | Match Keys | Filter applied to all object keys under the prefix. If provided, only objects whose absolute path matches this regex will be read. For example, you can use ".*\.json" to only capture JSON files. | string | |
/parser | Parser Configuration | Configures how files are parsed (optional, see below) | object | |
/parser/compression | Compression | Determines how to decompress the contents. The default, 'Auto', will try to determine the compression automatically. | null, string | null |
/parser/format | Format | Determines how to parse the contents. The default, 'Auto', will try to determine the format automatically based on the file extension or MIME type, if available. | object | {"type":"auto"} |
/parser/format/type | Type | | string | |
/prefix | Prefix | Prefix within the bucket to capture from. Use this to limit the data in your capture. | string | |
/region | AWS Region | The name of the AWS region where the S3 bucket is located. "us-east-1" is a popular default you can try if you're unsure what to put here. | string | Required, "us-east-1" |
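For illustration, the sketch below combines several of the optional properties above: it filters object keys with matchKeys, points the connector at a non-AWS, S3-compatible endpoint, and opts into ascending-key listing. The bucket name and endpoint URL are placeholders.

```yaml
config:
  bucket: "example-bucket"
  region: "us-east-1"
  matchKeys: '.*\.json'                      # capture only object keys ending in .json
  advanced:
    endpoint: "https://minio.example.com"    # placeholder URL for an S3-compatible API
    ascendingKeys: true                      # only safe if keys are written in ascending lexicographic order
```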
Bindings
Property | Title | Description | Type | Required/Default |
---|---|---|---|---|
/stream | Prefix | Path to the dataset in the bucket, formatted as bucket-name/prefix-name. | string | Required |
Sample
captures:
  ${PREFIX}/${CAPTURE_NAME}:
    endpoint:
      connector:
        image: ghcr.io/estuary/source-s3:dev
        config:
          bucket: "my-bucket"
          parser:
            compression: zip
            format:
              type: csv
              config:
                delimiter: ","
                encoding: UTF-8
                errorThreshold: 5
                headers: [ID, username, first_name, last_name]
                lineEnding: "\\r"
                quote: "\""
          region: "us-east-1"
    bindings:
      - resource:
          stream: my-bucket/${PREFIX}
        target: ${PREFIX}/${COLLECTION_NAME}
Your capture definition may be more complex, with additional bindings for different S3 prefixes within the same bucket.
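For example, a capture covering two prefixes in the same bucket might declare one binding per prefix, along these lines (the prefix and collection names are placeholders):

```yaml
bindings:
  - resource:
      stream: my-bucket/invoices        # hypothetical prefix within the bucket
    target: ${PREFIX}/invoices
  - resource:
      stream: my-bucket/customers       # hypothetical prefix within the bucket
    target: ${PREFIX}/customers
```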
Learn more about capture definitions.
Advanced: Parsing cloud storage data
Cloud storage platforms like S3 can support a wider variety of file types than other data source systems. For each of these file types, Flow must parse and translate data into collections with defined fields and JSON schemas.
By default, the parser will automatically detect the type and shape of the data in your bucket, so you won't need to change the parser configuration for most captures.
However, the automatic detection may be incorrect in some cases. To fix or prevent this, you can provide explicit information in the parser configuration, which is part of the endpoint configuration for this connector.
The parser configuration includes:
- Compression: Specify how the bucket contents are compressed. If no compression type is specified, the connector will try to determine the compression type automatically. Options are:
  - zip
  - gzip
  - zstd
  - none
- Format: Specify the data format, which determines how it will be parsed. Options are:
  - Auto: If no format is specified, the connector will try to determine it automatically.
  - Avro
  - CSV
  - JSON
  - Protobuf
  - W3C Extended Log
Info: At this time, Flow only supports S3 captures with data of a single file type. Support for multiple file types, which can be configured on a per-binding basis, will be added in the future.
For now, use a prefix in the endpoint configuration to limit the scope of each capture to data of a single file type.
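If automatic detection guesses wrong, you can pin the compression and format explicitly. As a sketch, a hypothetical bucket of gzip-compressed JSON files might use a parser configuration like this:

```yaml
config:
  bucket: "example-bucket"
  region: "us-east-1"
  parser:
    compression: gzip    # skip auto-detection; always decompress with gzip
    format:
      type: json         # parse every object as JSON
```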
CSV configuration
CSV files include several additional properties that are important to the parser. In most cases, Flow is able to automatically determine the correct values, but you may need to specify them for unusual datasets. These properties are:
- Delimiter. Options are:
  - Comma (",")
  - Pipe ("|")
  - Space ("0x20")
  - Semicolon (";")
  - Tab ("0x09")
  - Vertical tab ("0x0B")
  - Unit separator ("0x1F")
  - SOH ("0x01")
  - Auto
- Encoding type, specified by its WHATWG label.
- Optionally, an Error threshold, as an acceptable percentage of errors. If set to a number greater than zero, malformed rows that fall within the threshold will be excluded from the capture.
- Escape characters. Options are:
  - Backslash ("\\")
  - Disable escapes ("")
  - Auto
- Optionally, a list of column Headers, if not already included in the first row of the CSV file.
  If any headers are provided, it is assumed that the provided list of headers is complete and authoritative. The first row of your CSV file will be assumed to be data (not headers), and you must provide a header value for every column in the file.
- Line ending values. Options are:
  - CRLF ("\\r\\n") (Windows)
  - CR ("\\r")
  - LF ("\\n")
  - Record Separator ("0x1E")
  - Auto
- Quote character. Options are:
  - Double Quote ("\"")
  - Single Quote ("'")
  - Disable Quoting ("")
  - Auto
The sample specification above includes these fields.
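As a further sketch, a hypothetical tab-delimited file without a header row might be described like this; the column names are placeholders and must cover every column in the file:

```yaml
parser:
  compression: none
  format:
    type: csv
    config:
      delimiter: "0x09"                    # tab, using the hex notation listed above
      encoding: UTF-8
      headers: [id, email, created_at]     # hypothetical columns; the first row is treated as data
      lineEnding: "\\n"                    # LF
      quote: "\""
```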