Captures

A capture is how Estuary ingests data from an external source. Every Data Flow starts with a capture.

In Estuary, captures are a type of task. They connect to an external data source, or endpoint, and bind one or more of its resources, such as database tables. Each binding adds documents to a corresponding Estuary collection.

Captures run continuously: as soon as new documents are made available at the endpoint resources, Estuary validates their schema and adds them to the appropriate collection. Captures can process documents up to 16 MB in size.

[Figure: Estuary capture diagram]

You can define and configure captures in Data Flow specifications.

See the guide to create a capture.

Connectors

Captures extract data from an endpoint using a connector. Estuary builds and maintains many real-time connectors for various technology systems, such as database change data capture (CDC) connectors.

See the source connector reference documentation.

Batch sources

In addition to its natively written real-time connectors, Estuary supports running both first-party and third-party connectors to batch sources. These connectors tend to focus on SaaS APIs and do not offer real-time streaming integrations. Instead, Estuary runs the connector at regular intervals to capture updated documents.

Third-party source connectors are independently reviewed and sometimes updated for compatibility with Estuary. Estuary's source connectors documentation includes all actively supported connectors. If there is a connector you'd like prioritized for access in the web app, contact us.

Discovery

To help you configure new pull captures, Estuary offers the guided discovery workflow in the web application.

To begin discovery, you tell Estuary the connector you'd like to use and basic information about the endpoint. Estuary automatically generates a capture configuration for you. It identifies one or more resources — tables, data streams, or the equivalent — and generates bindings so that each will be mapped to a data collection.

You may then modify the generated configuration as needed before publishing the capture.

Note: You can also run discovery when editing an existing capture. This is commonly done to add new bindings or to update the collection specs and schemas associated with existing bindings.

Discovery through `flowctl`

You may also use the flowctl CLI to discover all currently available bindings for a specific capture:

flowctl discover --source flow.yaml

Here, flow.yaml is a file that includes your capture specification. If the file includes more than one capture specification, you must select one using the command's --capture parameter.

This command will invoke the capture's connector from your data plane in order to discover the capture's available bindings. Bindings and associated collections are then written back to your local source file.

If you are creating a new capture, you can simply leave the bindings stanza blank (bindings: []), and then fill in the details using the flowctl discover command.
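As a rough sketch, a starting flow.yaml for a new capture might look like the following before discovery is run. The capture name, connector image, and config path are placeholders borrowed from the specification example on this page; substitute the values for your own connector.

```yaml
# Minimal starting point for a new capture (placeholder names, not a real deployment).
captures:
  acmeCo/example/source-s3:
    endpoint:
      connector:
        # Placeholder image; use the image for your chosen connector.
        image: ghcr.io/estuary/source-s3:v2
        # Placeholder path to the connector's configuration file.
        config: path/to/connector-config.yaml
    # Left empty on purpose. Running
    #   flowctl discover --source flow.yaml
    # writes the discovered bindings back into this file.
    bindings: []
```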

Automatically update captures

You can choose to run periodic discovers in the background by adding the autoDiscover property to the capture. Estuary will periodically check for changes to the source and re-publish the capture to reflect those changes.

There are several options for controlling the behavior of autoDiscover:

The addNewBindings option determines whether to add newly discovered resources, such as database tables, to the capture as bindings. If set to false, autoDiscover will only update the collection specs for existing bindings.
The evolveIncompatibleCollections option determines how to respond when the discovered updates would cause a breaking change to the collection. If true, it will trigger an evolution of the incompatible collection(s) to prevent failures. If false, incompatible changes result in failed publications and are effectively ignored.
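To illustrate how the two options combine, the fragment below sketches a capture that keeps the schemas of its existing bindings up to date but does not pick up new resources on its own. The capture name is a placeholder, and the endpoint and bindings stanzas are omitted for brevity.

```yaml
captures:
  # Placeholder capture name.
  acmeCo/example/source-s3:
    autoDiscover:
      # Do not add newly discovered resources as bindings;
      # only update collection specs for existing bindings.
      addNewBindings: false
      # Evolve incompatible collections instead of letting
      # the publication fail on a breaking change.
      evolveIncompatibleCollections: true
    # endpoint and bindings omitted from this sketch.
```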

In Estuary's web app, you can set these properties when you create or edit a capture.

[Figure: Capture auto-discovery in the UI]

The toggles in the web app correspond directly to the properties above:

"Automatically keep schemas up to date" enables autoDiscover
"Automatically add new collections" corresponds to addNewBindings
"Breaking changes re-versions collections" corresponds to evolveIncompatibleCollections

Specification

Captures are defined in Data Flow specification files per the following format:

# A set of captures to include in the catalog.
# Optional, type: object
captures:
  # The name of the capture.
  acmeCo/example/source-s3:
    # Automatically performs periodic discover operations, which update the bindings
    # to reflect what's in the source and also update collection schemas.
    # To disable autoDiscover, either omit this property or set it to `null`.
    autoDiscover:
      # Also add any newly discovered bindings automatically
      addNewBindings: true
      # How to handle breaking changes to discovered collections. If true, then existing
      # materialization bindings will be re-created with new names, as necessary. Or if
      # collection keys have changed, then new Estuary collections will be created. If false,
      # then incompatible changes will simply result in failed publications, and will
      # effectively be ignored.
      evolveIncompatibleCollections: true

    # Endpoint defines how to connect to the source of the capture.
    # Required, type: object
    endpoint:
      # This endpoint uses a connector provided as a Docker image.
      connector:
        # Docker image that implements the capture connector.
        image: ghcr.io/estuary/source-s3:v2
        # File that provides the connector's required configuration.
        # Configuration may also be presented inline.
        config: path/to/connector-config.yaml

    # Bindings define how collections are populated from the data source.
    # A capture may bind multiple resources to different collections.
    # Required, type: array
    bindings:
      - # The target collection to capture into.
        # This may be defined in a separate, imported specification file.
        # Required, type: string
        target: acmeCo/example/collection

        # The resource is additional configuration required by the endpoint
        # connector to identify and capture a specific endpoint resource.
        # The structure and meaning of this configuration is defined by
        # the specific connector.
        # Required, type: object
        resource:
          stream: a-bucket/and-prefix
          # syncMode should be set to incremental for all Estuary connectors
          syncMode: incremental

      - target: acmeCo/example/another-collection
        resource:
          stream: a-bucket/another-prefix
          syncMode: incremental

    # Interval of time between invocations of non-streaming connectors.
    # If a connector runs to completion and then exits, the capture task will
    # restart the connector after this interval of time has elapsed.
    #
    # Intervals are relative to the start of an invocation and not its completion.
    # For example, if the interval is five minutes, and an invocation of the
    # capture finishes after two minutes, then the next invocation will be started
    # after three additional minutes.
    #
    # Optional. Default: Five minutes.
    interval: 5m
