Create Your First Dataflow with Amazon S3 and Snowflake
In this tutorial, you'll create your first complete Data Flow with Estuary using publicly available data.
The dataset you'll use is composed of zipped CSV files in an Amazon S3 cloud storage bucket. You'll transport this data to a table in your own Snowflake data warehouse.
Prerequisites
You'll need:
-
An Estuary account. If you don't have one, visit Estuary's dashboard to register for free.
-
A Snowflake free trial account (or a full account). Snowflake trials are valid for 30 days.
Introduction
The data
New York City hosts the United States' largest bike share program, Citi Bike. Citi Bike shares ride data in CSV format with the public, including the starting and ending times and locations for every ride. They upload new data monthly to their Amazon S3 bucket as a zipped CSV file.
In this scenario, let's imagine you're interested in urban bike safety, or perhaps you plan to open a bike store and entice Citi Bike renters to buy their own bikes. You'd like to access the Citi Bike data in your Snowflake data warehouse. From there, you plan to use your data analytics platform of choice to explore the data, and perhaps integrate it with business intelligence apps.
You can use Estuary to build a real-time Data Flow that will capture all the new data from Citi Bike as soon as it appears, convert it to Snowflake's format, and land the data in your warehouse.
Estuary
In Estuary, you create Data Flows to connect data source and destination systems.
The simplest Data Flow comprises three types of entities:
-
A data capture, which ingests data from the source. In this case, you'll capture from Amazon S3.
-
One or more collections, which Estuary uses to store that data inside a cloud-backed data lake.
-
A materialization, to push the data to an external destination. In this case, you'll materialize to a Snowflake data warehouse.
For the capture and materialization to work, they need to integrate with outside systems: in this case, S3 and Snowflake, but many other systems can be used. To accomplish this, Estuary uses connectors. Connectors are plug-in components that interface between Estuary and an outside system. Today, you'll use Estuary's S3 capture connector and Snowflake materialization connector.
You'll start by creating your capture.
Capture Citi Bike data from S3
-
Go to Estuary's web app at dashboard.estuary.dev and sign in.
-
Click the Sources tab and choose New Capture
All of the available capture connectors — representing the possible data sources — appear as tiles.
-
Find the Amazon S3 tile and click Capture.
A form appears with the properties required for an S3 capture. Every connector requires different properties to configure.
First, you'll name your capture.
-
Click inside the Name box.
Names of entities in Estuary must be unique. They're organized by prefixes, similar to paths in a file system.
You'll see one or more prefixes pertaining to your organization. These prefixes represent the namespaces of Estuary to which you have access.
-
Click your prefix from the dropdown and append a unique name after it. For example,
myOrg/yourname/citibiketutorial. -
Next, fill out the required properties for S3.
-
AWS Region:
us-east-1 -
Bucket:
tripdata -
Prefix: The storage bucket isn't organized by prefixes, so leave this blank.
-
Match Keys:
2025.*The Citi Bike storage bucket has been around for a while. Some of the older datasets have incorrect file extensions or contain data in different formats. By selecting a subset of files from the year 2025, you'll make things easier to manage for the purposes of this tutorial. (In a real-world use case, you'd likely reconcile the different schemas of the various data formats using a derivation. Derivations are a more advanced Estuary skill.)
-
Authentication: The bucket is public, so select Anonymous authentication.
-
Parser Configuration: Estuary is often able to automatically determine how to parse incoming data. For this capture, however, you can specify Zip Archive as the compression type and CSV as the file type.
-
-
Click Next.
Estuary uses the configuration you provided to initiate a connection with S3. It generates a list of collections that will store the data inside Estuary. In this case, there's just one collection from the bucket.
Once this process completes, you can move on to the next step. If there's an error, go back and check your configuration.
-
Click Save and Publish.
Estuary deploys, or publishes, your capture. You'll see a notification when the this is complete.
A subset of data from the Citi Bike tripdata bucket has been captured to a collection. Now, you can materialize that data to Snowflake.
-
Click Materialize.
Prepare Snowflake to use with Estuary
Before you can materialize from Estuary to Snowflake, you need to complete some setup steps.
-
Leave the Estuary dashboard open. In a new window or tab, go to your Snowflake console.
If you're a new trial user, you should have received instructions by email. For additional help in this section, see the Snowflake documentation.
-
Create a new SQL worksheet if you don't have one open.
This provides an interface where you can run queries.
-
Paste the following script into the console:
set database_name = 'ESTUARY_DB';
set warehouse_name = 'ESTUARY_WH';
set estuary_role = 'ESTUARY_ROLE';
set estuary_user = 'ESTUARY_USER';
set estuary_schema = 'ESTUARY_SCHEMA';
-- create role and schema for Estuary
create role if not exists identifier($estuary_role);
grant role identifier($estuary_role) to role SYSADMIN;
-- Create snowflake DB
create database if not exists identifier($database_name);
use database identifier($database_name);
create schema if not exists identifier($estuary_schema);
-- create a user for Estuary
create user if not exists identifier($estuary_user)
default_role = $estuary_role
default_warehouse = $warehouse_name;
grant role identifier($estuary_role) to user identifier($estuary_user);
-- Estuary requires case-sensitive quoted identifiers (e.g. "_meta/op").
alter user identifier($estuary_user) set QUOTED_IDENTIFIERS_IGNORE_CASE = FALSE;
grant all on schema identifier($estuary_schema) to identifier($estuary_role);
-- create a warehouse for estuary
create warehouse if not exists identifier($warehouse_name)
warehouse_size = xsmall
warehouse_type = standard
auto_suspend = 60
auto_resume = true
initially_suspended = true;
-- grant Estuary role access to warehouse
grant USAGE
on warehouse identifier($warehouse_name)
to role identifier($estuary_role);
-- grant Estuary access to database
grant CREATE SCHEMA, MONITOR, USAGE on database identifier($database_name) to role identifier($estuary_role);
-- change role to ACCOUNTADMIN for STORAGE INTEGRATION support to Estuary (only needed for Snowflake on GCP)
use role ACCOUNTADMIN;
grant CREATE INTEGRATION on account to role identifier($estuary_role);
use role sysadmin;
COMMIT;
-
Click the drop-down arrow next to the Run button and click Run All.
-
Set up JWT authentication for the
estuary_user.You can generate a public/private key-pair in your terminal using:
openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out rsa_key.p8 -nocrypt
openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pubAssign the public key to
estuary_userin Snowflake:ALTER USER identifier($estuary_user) SET RSA_PUBLIC_KEY='MIIBIjANBgkqh...' -
Return to the Estuary dashboard.
Materialize your Estuary collection to Snowflake
When you clicked Materialize from your S3 capture, you were directed to the Materializations page. All of the available materialization connectors — representing the possible data destinations — are shown as tiles.
-
Find the Snowflake tile and click Materialization.
A new form appears with the properties required to materialize to Snowflake.
-
You will be prompted to select a default naming strategy for your materialization.
For this demo, select Set a default schema and enter the schema name you created using the SQL script in Snowflake. By default, this will be
ESTUARY_SCHEMA.Click Continue to set your selection.
-
In the Name box, select your prefix from the dropdown and append a unique name after it. For example,
myOrg/yourname/citibiketutorial. -
Next, fill out the required properties for Snowflake (most of these come from the script you just ran).
-
Host URL: This is the URL you use to log into Snowflake. If you recently signed up for a trial, it should be in your email. Omit the protocol from the beginning. For example,
ACCOUNTID.region.cloudprovider.snowflakecomputing.comororgname-accountname.snowflakecomputing.com. -
Database:
ESTUARY_DB -
Schema:
ESTUARY_SCHEMA -
Warehouse:
ESTUARY_WH -
Role:
ESTUARY_ROLE -
Timestamp Type: Choose how timestamp columns should be stored in Snowflake. Unless you have already explicitly set your
TIMESTAMP_TYPE_MAPPINGin Snowflake, you can default toTIMESTAMP_LTZfor this field. -
Authentication: Using Private Key (JWT) auth, enter your user (
ESTUARY_USER) and the private key you generated.
-
-
Scroll down to view the Source Collections section.
Estuary automatically infers the schema of incoming data. To check how your data will materialize in Snowflake, select your
tripdatacollection and view its Collection tab. You should see fields related to the start and end of bike trips and their inferred data types.You can also control which fields get materialized in the Resource Configuration tab under Field Selection.
-
Click Next.
Estuary uses the configuration you provided to initiate a connection with Snowflake and generate a specification with details of the materialization.
Once this process completes, you can move on to the next step. If there's an error, go back and check your configuration.
-
Click Save and Publish.
Estuary publishes the materialization.
-
Return to the Snowflake console and expand ESTUARY_DB and ESTUARY_SCHEMA. You'll find the materialized table there.
Conclusion
You've created a complete Data Flow that ingests the Citi Bike CSV files from an Amazon S3 bucket and materializes them into your Snowflake database.
When Citi Bike uploads new data, it'll be reflected in Snowflake in near-real-time, so long as you don't disable your capture or materialization.
Data warehouses like Snowflake are designed to power data analytics. From here, you can begin any number of analytical workflows.
Want to learn more?
-
For more information on the connectors you used today, see the pages on S3 and Snowflake.
-
You can create a Data Flow using any combination of supported connectors with a similar process to the one you followed in this tutorial. For a more generalized procedure, see the guide to create a Data Flow.