Capture Multiple Paths with File Source Connectors
File source connectors like Amazon S3, Google Cloud Storage, SFTP, Google Drive, Azure Blob Storage, Dropbox, and HTTP File all support capturing from multiple paths within a single capture task. However, the Estuary web app only creates a single binding during initial setup. To add additional paths, you can use flowctl or the Advanced Specification Editor to manually configure extra bindings.
This is useful when you need to:
- Capture files from multiple directories or prefixes into separate collections (e.g.,
/invoices/and/receipts/into their own collections). - Capture files from multiple directories or prefixes into the same collection (e.g.,
/region-us/and/region-eu/merged together).
Prerequisites
- An existing file source capture created through the Estuary web app.
- flowctl installed and authenticated (for Options A and B via CLI).
Option A: Multiple paths into the same collection
Add extra bindings that capture from different paths but write to the same target collection.
You can also make this change directly in the web app using the Advanced Specification Editor instead of flowctl. Edit your capture, open the Advanced Specification Editor, and add new entries to the bindings array with a different stream value but the same target collection. This only works for Option A — Option B requires creating new collections via flowctl.
1. Pull the capture specification
flowctl catalog pull-specs --name your-org/your-capture
This creates a local flow.yaml file and subdirectories with your capture's specification.
2. Disable auto-discover
Open the capture YAML file. Auto-discover periodically re-runs discovery and rewrites the capture's bindings to match what the connector reports. File source connectors only ever discover a single binding (the bucket/prefix root), so on its next run auto-discover will remove the extra bindings you add below.
Setting addNewBindings to false does not prevent this — it only stops new bindings from being added, not existing ones from being removed. Remove the autoDiscover block entirely (or set it to null) so auto-discover never runs:
captures:
acmeCo/sftp-capture:
autoDiscover: null
endpoint: {...}
Disabling auto-discover does not affect schema inference, which is driven by your collection's read schema and continues regardless.
In the web app, this corresponds to unchecking Automatically keep schemas up to date (under Schema Evolution) when editing the capture.
3. Add a new binding
In the bindings array, add a new entry with a different stream value but the same target collection:
bindings:
# Existing binding
- resource:
stream: "invoices/2024"
target: your-org/your-collection
# New binding — different path, same target
- resource:
stream: "invoices/2025"
target: your-org/your-collection
4. Publish
flowctl catalog publish --source flow.yaml
Both paths now feed into the same collection.
Option B: Multiple paths into separate collections
Use this when you want each path captured into its own distinct collection.
1–2. Pull and configure auto-discover
Same as Option A above.
3. Add a new binding with a new target
Add a binding that points to a new collection name:
bindings:
- resource:
stream: "invoices/"
target: your-org/invoices
- resource:
stream: "receipts/"
target: your-org/receipts
4. Define the new collection
The new target collection must exist before you publish. Add its definition to your flow.yaml (or a file imported by it). The easiest approach is to copy the schema and key from your existing collection:
Copy the schema and key from your existing collection (found in the YAML files that pull-specs created):
collections:
your-org/receipts:
schema:
# Copy from your existing collection's schema
type: object
properties:
_meta:
type: object
properties:
file:
type: string
offset:
type: integer
required: [file, offset]
required: [_meta]
key: [/_meta/file, /_meta/offset]
If you're unsure what schema to use, pull the existing collection's spec and copy it:
flowctl catalog pull-specs --name your-org/your-existing-collection
5. Publish
flowctl catalog publish --source flow.yaml
Worked example: SFTP with two directories
This example captures CSV files from two directories on an SFTP server into separate collections.
captures:
acmeCo/sftp-capture:
endpoint:
connector:
image: "ghcr.io/estuary/source-sftp:dev"
config:
address: sftp.example.com:22
username: estuary
password: <SECRET>
directory: /data
parser:
format:
type: csv
config:
delimiter: ","
encoding: UTF-8
bindings:
- resource:
stream: /data/invoices
target: acmeCo/invoices
- resource:
stream: /data/receipts
target: acmeCo/receipts
collections:
acmeCo/invoices:
schema:
type: object
properties:
_meta:
type: object
properties:
file:
type: string
offset:
type: integer
required: [file, offset]
required: [_meta]
key: [/_meta/file, /_meta/offset]
acmeCo/receipts:
schema:
type: object
properties:
_meta:
type: object
properties:
file:
type: string
offset:
type: integer
required: [file, offset]
required: [_meta]
key: [/_meta/file, /_meta/offset]
Worked example: S3 with two prefixes
This example captures JSON files from two S3 prefixes into the same collection.
captures:
acmeCo/s3-capture:
endpoint:
connector:
image: "ghcr.io/estuary/source-s3:dev"
config:
bucket: acme-data-lake
region: us-east-1
credentials:
auth_type: AWSAccessKey
aws_access_key_id: <SECRET>
aws_secret_access_key: <SECRET>
parser:
format:
type: json
bindings:
- resource:
stream: acme-data-lake/events/region-us
target: acmeCo/all-events
- resource:
stream: acme-data-lake/events/region-eu
target: acmeCo/all-events
Connector-specific notes
Google Drive
For Google Drive, the stream value in each binding is the Google Drive folder ID — the long string at the end of the folder's URL (e.g., 1aBcDeFgHiJkLmNoPqRsTuV from https://drive.google.com/drive/folders/1aBcDeFgHiJkLmNoPqRsTuV).
Each binding can point to a different folder. The folderUrl in the endpoint config is the folder used during initial discovery, but bindings are not limited to that folder.
The folderUrl must use the format https://drive.google.com/drive/folders/FOLDER_ID. URLs with /u/0/ or /u/1/ (from Google's multi-account switcher) will be rejected — remove the /u/N segment if present.
SFTP
The stream value is the full path to the directory on the SFTP server (relative to the SFTP chroot). For example, if your SFTP directory config is /data, a binding might use stream: /data/invoices.
Amazon S3 and Google Cloud Storage
The stream value is formatted as bucket-name/prefix. For example: my-bucket/events/region-us.
Important caveats
-
Parser config is shared. All bindings in a capture share the same endpoint-level parser configuration (compression, format, CSV options, etc.). You cannot mix file formats within a single capture — for example, capturing CSV from one path and JSON from another requires two separate captures.
-
Disable auto-discover. File source discovery returns a single binding (the bucket/prefix root), so if auto-discover runs it will remove your manually-added bindings on the next discovery cycle. Setting
addNewBindings: falsedoes not prevent this — remove theautoDiscoverblock (or setautoDiscover: null). This does not affect schema inference, which is driven by the collection's read schema. -
Target collections must exist. When using Option B (separate collections), the target collection must be defined and published. If you reference a collection that doesn't exist, the publish will fail.