Amazon S3 to Snowflake

This guide walks you through the process of creating an end-to-end real-time Data Flow from Amazon S3 to Snowflake using Estuary Flow.

Prerequisites

You'll need:

  • (Recommended) An understanding of the basic Flow concepts.

  • Access to the Flow web application through an Estuary account. If you don't have one, visit the web app to register for free.

  • An S3 bucket that contains the data you'd like to move to Snowflake. (The sketch after this list shows a quick way to check bucket access.)

    • For public buckets, verify that the access policy allows anonymous reads.

    • For buckets accessed by a user account, you'll need the AWS access key and secret access key for the user. See the AWS blog for help finding these credentials.

  • A Snowflake account with a target database and schema, a user whose credentials Flow can use, and, optionally, a warehouse and role for that user. You'll supply these details when you configure the materialization.
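
If you'd like to confirm S3 access before configuring anything in Flow, a quick check outside of Flow can help. The sketch below uses the boto3 package; the bucket name, region, prefix, and credentials are placeholders you'd replace with your own.

    # Sanity-check S3 access before configuring the capture.
    # Placeholders: my-bucket, us-east-1, my/prefix/, and the credentials.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    BUCKET, REGION, PREFIX = "my-bucket", "us-east-1", "my/prefix/"

    # Public bucket: anonymous reads must be allowed, so send unsigned requests.
    client = boto3.client("s3", region_name=REGION, config=Config(signature_version=UNSIGNED))

    # Private bucket: use the access key pair for the capturing user instead.
    # client = boto3.client(
    #     "s3",
    #     region_name=REGION,
    #     aws_access_key_id="AKIA...",
    #     aws_secret_access_key="...",
    # )

    resp = client.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX, MaxKeys=5)
    for obj in resp.get("Contents", []):
        print(obj["Key"])  # seeing a few object keys confirms the bucket is readable

If this listing fails, fix the bucket policy or credentials before setting up the capture.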

Introduction

In Estuary Flow, you create Data Flows to transfer data from source systems to destination systems in real time. In this use case, your source is an Amazon S3 bucket and your destination is a Snowflake data warehouse.

After following this guide, you'll have a Data Flow that comprises:

  • A capture, which ingests data from S3
  • A collection, a cloud-backed copy of that data in the Flow system
  • A materialization, which pushes the data to Snowflake

The capture and materialization rely on plug-in components called connectors. We'll walk through how to configure the S3 and Snowflake connectors to integrate these systems with Flow.

Capture from S3

You'll first create a capture to connect to your S3 bucket, which will yield one or more Flow collections.

  1. Go to the Flow web application at dashboard.estuary.dev and sign in with your Estuary account.

  2. Click the Sources tab and choose New Capture.

  3. Find the Amazon S3 tile and click Capture.

A form appears with the properties required for an S3 capture.

  4. Type a name for your capture.

    Your capture name must begin with a prefix to which you have access.

    In the Name field, use the drop-down to select your prefix. Append a unique capture name after the / to create the full name, for example, acmeCo/myS3Capture.

  5. Fill out the required properties for S3.

    • AWS Access Key ID and AWS Secret Access Key: Required for private buckets.

    • AWS Region and Bucket: These are listed in your S3 console.

    • Prefix: You might organize your S3 bucket using prefixes, which emulate a directory structure. To capture only from a specific prefix, add it here.

    • Match Keys: Filters to apply to the objects in the S3 bucket. If provided, only data whose absolute path matches the filter will be captured. For example, .*\.json will only capture JSON files. (See the sketch after these steps for an illustration of the filtering logic.)

    See the S3 connector documentation for information on advanced fields and parser settings. (You're unlikely to need these for most use cases.)

  6. Click Next.
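
As an aside, here is a rough illustration of how the Prefix and Match Keys settings narrow down which objects are captured. This is plain Python standing in for the connector's filtering logic, with hypothetical object keys and the .*\.json pattern from the example above; it is not the connector's actual implementation.

    # Illustration only: how a prefix plus a match-key regex narrows the capture.
    import re

    prefix = "landing/"          # hypothetical Prefix setting
    match_keys = r".*\.json"     # hypothetical Match Keys setting

    keys = [
        "landing/orders/2024-01-01.json",
        "landing/orders/2024-01-01.csv",
        "archive/orders/2023-12-31.json",
    ]

    captured = [
        k for k in keys
        if k.startswith(prefix) and re.search(match_keys, k)
    ]
    print(captured)  # ['landing/orders/2024-01-01.json']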

Flow uses the provided configuration to initiate a connection to S3.

It generates a permissive schema and the details of the Flow collection that will store the data from S3.

You'll have the chance to tighten up each collection's JSON schema later, when you materialize to Snowflake.
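
To make "permissive" concrete, the sketch below contrasts a schema that accepts essentially any JSON object with the kind of stricter read schema you'll apply during materialization. It uses the jsonschema Python package, and the field names are hypothetical; Schema Inference derives the real ones from the data in your collection.

    # Contrast a permissive schema with a stricter inferred read schema.
    from jsonschema import validate, ValidationError

    permissive_schema = {"type": "object"}  # accepts any JSON object

    inferred_read_schema = {
        "type": "object",
        "properties": {
            "order_id": {"type": "string"},
            "amount": {"type": "number"},
        },
        "required": ["order_id", "amount"],
    }

    doc = {"order_id": "A-100", "amount": "not a number"}

    validate(doc, permissive_schema)  # passes: any object is fine

    try:
        validate(doc, inferred_read_schema)
    except ValidationError as err:
        print("rejected by the stricter schema:", err.message)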

  7. Click Save and publish.

You'll see a notification when the capture publishes successfully.

The data currently in your S3 bucket has been captured, and future updates to it will be captured continuously.

  8. Click Materialize Collections to continue.

Materialize to Snowflake

Next, you'll add a Snowflake materialization to connect the captured data to its destination: your data warehouse.

  1. Locate the Snowflake tile and click Materialization.

A form appears with the properties required for a Snowflake materialization.

  2. Choose a unique name for your materialization, as you did when naming your capture; for example, acmeCo/mySnowflakeMaterialization.

  3. Fill out the required properties for Snowflake (you should have most of these handy from the prerequisites).

    • Host URL
    • Account
    • User
    • Password
    • Database
    • Schema
    • Warehouse (optional)
    • Role (optional)
  4. Click Next.

Flow uses the provided configuration to initiate a connection to Snowflake.

You'll be notified if there's an error. In that case, fix the configuration form or Snowflake setup as needed and click Next to try again.
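
If it's unclear whether the problem lies in the form or in Snowflake itself, you can test the same values outside of Flow. A minimal sketch using the snowflake-connector-python package; every value shown is a placeholder for the one you entered in the form.

    # Verify the Snowflake account, user, and object names independently of Flow.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="myorg-myaccount",
        user="ESTUARY_USER",
        password="********",
        database="MY_DATABASE",
        schema="MY_SCHEMA",
        warehouse="MY_WAREHOUSE",  # optional, as in the form
        role="MY_ROLE",            # optional, as in the form
    )

    cur = conn.cursor()
    cur.execute("SELECT CURRENT_DATABASE(), CURRENT_SCHEMA(), CURRENT_WAREHOUSE()")
    print(cur.fetchone())  # confirms the session resolved the objects you expect
    cur.close()
    conn.close()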

Once the connection is successful, the Endpoint Config collapses and the Source Collections browser becomes prominent. It shows the collection you captured previously, which will be mapped to a Snowflake table.

  5. In the Collection Selector, optionally change the name in the Table field.

    This will be the name of the output table in Snowflake.

  6. Choose whether to enable delta updates. (See the Snowflake connector documentation for guidance on when delta updates are appropriate.)

  7. Apply a stricter schema to the collection for the materialization.

Data captured from S3 is often loosely structured. To materialize it effectively to Snowflake, you should apply a schema that translates cleanly to a table structure. Flow's Schema Inference tool can help.

  8. In the Source Collections browser, click the collection's Collection tab.

  9. Click Schema Inference.

    The Schema Inference window appears. Flow scans the data in your collection and infers a new schema, called the readSchema, to use for the materialization.

  10. Review the new schema and click Apply Inferred Schema.

  11. Click Next to apply the changes you made to the collection.

  12. Click Save and publish. You'll see a notification when the full Data Flow publishes successfully.

What's next?

Your Data Flow has been deployed, and will run continuously until it's stopped. Updates in your S3 bucket will be reflected in your Snowflake table as they occur.
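
One quick way to confirm that data is arriving is to query the destination table directly. The sketch below reuses the snowflake-connector-python setup from earlier; the connection values and table name are placeholders for your own.

    # Count rows in the materialized table to confirm data is arriving.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="myorg-myaccount",
        user="ESTUARY_USER",
        password="********",
        database="MY_DATABASE",
        schema="MY_SCHEMA",
    )

    cur = conn.cursor()
    cur.execute('SELECT COUNT(*) FROM "MY_TABLE"')
    print("rows materialized so far:", cur.fetchone()[0])
    cur.close()
    conn.close()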

You can advance your Data Flow by adding a derivation. Derivations are real-time data transformations. See the guide to create a derivation.