
Google Firestore

This connector captures data from your Google Firestore collections into Flow collections.

ghcr.io/estuary/source-firestore:dev provides the latest connector image. You can also follow the link in your browser to see past image versions.

Data model

Firestore is a NoSQL database. Its data model consists of documents (lightweight records that contain mappings of fields and values) organized in collections.

Collections are organized hierarchically. A given document in a collection can, in turn, be associated with a subcollection.

For example, you might have a collection called users, which contains two documents, alice and bob. Each document has a subcollection called messages (for example, users/alice/messages), which contains more documents (for example, users/alice/messages/1).

users
├── alice
│   └── messages
│       ├── 1
│       └── 2
└── bob
    └── messages
        └── 1

The connector works by identifying documents associated with a particular sequence of Firestore collection names, regardless of documents that split the hierarchy. These document groupings are mapped to Flow collections using a path in the pattern collection/*/subcollection.

In this example, we'd end up with two Flow collections, users and users/*/messages, with the latter containing messages from both users. The /_meta/path property of each document contains its full, original path, so we'd still know which messages were Alice's and which were Bob's.
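The grouping described above can be sketched in a few lines of Python. This is only an illustration of the path pattern, not the connector's implementation. The function name `binding_path` is hypothetical, and the sketch assumes document paths are written relative to the database root, as in the example above.

```python
from collections import defaultdict


def binding_path(document_path: str) -> str:
    """Map a Firestore document path to its collection/*/subcollection pattern."""
    segments = document_path.strip("/").split("/")
    # A document path alternates collection name, document ID, collection name, ...
    # so it always has an even number of segments.
    if len(segments) % 2 != 0:
        raise ValueError(f"not a document path: {document_path!r}")
    # Keep the collection names (even positions) and wildcard the parent document IDs.
    return "/*/".join(segments[0::2])


# Group the example documents by the pattern they would be captured under.
groups = defaultdict(list)
for path in [
    "users/alice",
    "users/bob",
    "users/alice/messages/1",
    "users/alice/messages/2",
    "users/bob/messages/1",
]:
    groups[binding_path(path)].append(path)
```

Here `groups` has two keys, `users` and `users/*/messages`, and the second one holds messages from both alice and bob. The original path of each document is still available, which mirrors the role of /_meta/path.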

Prerequisites

You'll need:

Configuration

You configure connectors either in the Flow web app, or by directly editing the Flow specification file. See connectors to learn more about using connectors. The values and specification sample below provide configuration details specific to the Firestore source connector.

Properties

Endpoint

| Property | Title | Description | Type | Required/Default |
|---|---|---|---|---|
| /googleCredentials | Credentials | Google Cloud Service Account JSON credentials. | string | Required |
| /database | Database | Optional name of the database to capture from. Leave blank to autodetect. Typically "projects/$PROJECTID/databases/(default)". | string | |

Bindings

| Property | Title | Description | Type | Required/Default |
|---|---|---|---|---|
| /backfillMode | Backfill Mode | Configures the handling of data already in the collection. See below for details, or just stick with 'async'. | string | Required |
| /path | Path to Collection | Supports parent/*/nested to capture all nested collections of parent's children. | string | Required |

Sample

captures:
  ${PREFIX}/${CAPTURE_NAME}:
    endpoint:
      connector:
        image: ghcr.io/estuary/source-firestore:dev
        config:
          # The property is typed as a string (see Properties above), so the
          # service account JSON is given here as a YAML block scalar.
          googleCredentials: |
            {
              "type": "service_account",
              "project_id": "project-id",
              "private_key_id": "key-id",
              "private_key": "-----BEGIN PRIVATE KEY-----\nprivate-key\n-----END PRIVATE KEY-----\n",
              "client_email": "service-account-email",
              "client_id": "client-id",
              "auth_uri": "https://accounts.google.com/o/oauth2/auth",
              "token_uri": "https://accounts.google.com/o/oauth2/token",
              "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
              "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/service-account-email"
            }
    bindings:
      - resource:
          # The `path` below captures all Firestore documents that match the pattern
          # `orgs/<orgID>/runs/<runID>/runResults/<runResultID>/queryResults`.
          # See the Data model section above for details.
          path: orgs/*/runs/*/runResults/*/queryResults
          backfillMode: async
        target: ${PREFIX}/orgs_runs_runResults_queryResults
      - resource:
          path: orgs/*/runs/*/runResults
          backfillMode: async
        target: ${PREFIX}/orgs_runs_runResults
      - resource:
          path: orgs/*/runs
          backfillMode: async
        target: ${PREFIX}/orgs_runs
      - resource:
          path: orgs
          backfillMode: async
        target: ${PREFIX}/orgs
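The sample lists one binding per level of the hierarchy, from the deepest subcollection up to the root collection. If you write the specification by hand, a small helper can generate that list. The following Python sketch is one way to do it. The function name `bindings_for` and the prefix `acmeCo` are hypothetical, and the target naming (collection names joined with underscores) simply follows the sample above and is not a requirement of the connector.

```python
from typing import Dict, List


def bindings_for(deepest_path: str, prefix: str, backfill_mode: str = "async") -> List[Dict]:
    """Build one binding per hierarchy level, deepest first, as in the sample."""
    # "orgs/*/runs" -> ["orgs", "runs"]: keep collection names, skip the "*" wildcards.
    names = deepest_path.split("/")[0::2]
    bindings = []
    for depth in range(len(names), 0, -1):
        level = names[:depth]
        bindings.append(
            {
                "resource": {
                    "path": "/*/".join(level),
                    "backfillMode": backfill_mode,
                },
                # Target naming mirrors the sample: collection names joined by "_".
                "target": f"{prefix}/{'_'.join(level)}",
            }
        )
    return bindings


# "acmeCo" is a placeholder for your own ${PREFIX}.
sample_bindings = bindings_for("orgs/*/runs/*/runResults/*/queryResults", "acmeCo")
```

The resulting list has four entries, matching the four bindings in the sample. You would still need to serialize it to YAML and place it under `bindings:` yourself.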

Backfill mode

In each captured collection's binding configuration, you can choose whether and how to backfill historical data. There are three options:

  • none: Skip preexisting data in the Firestore collection. Capture only new documents and changes to existing documents that occur after the capture is published.

  • async: Use two threads to capture data. The first captures new documents, as with none. The second progressively ingests historical data in chunks. This mode is most reliable for Firestore collections of all sizes but provides slightly weaker guarantees against data duplication.

    The connector uses a reduction to reconcile changes to the same document found on the parallel threads. The version with the most recent timestamp in the document metadata is preserved ({"strategy": "maximize", "key": "/_meta/mtime"}). For most collections, this produces an accurate copy of your Firestore collections in Flow.

  • sync: Request that Firestore stream all changes to the collection since its creation, in order.

    This mode provides the strongest guarantee against duplicated data, but can cause errors for large datasets. Firestore may terminate the process if the backfill of historical data has not completed within about ten minutes, forcing the capture to restart from the beginning. If this happens once, it is likely to recur continuously. If left unattended for an extended time, this can result in a massive number of read operations and a correspondingly large bill from Firestore.

    Use this mode only on relatively small collections, and only when somebody can keep an eye on the backfill and shut it down if it has not completed within half an hour at most. Collections of 100,000 documents or fewer should generally be safe, although this can vary depending on the average document size in the collection.

If you're unsure which backfill mode to use, choose async.
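To see why async mode still converges on an accurate copy, it can help to model the maximize reduction on /_meta/mtime in plain Python. This is a rough sketch of the idea, not Flow's reduction engine. The function names `parse_mtime` and `reconcile` are hypothetical, documents are keyed here by /_meta/path for illustration, and the sketch assumes /_meta/mtime is an RFC 3339 timestamp string in UTC.

```python
from datetime import datetime


def parse_mtime(doc: dict) -> datetime:
    # Assumption: mtime looks like "2023-01-02T00:00:00Z".
    return datetime.fromisoformat(doc["_meta"]["mtime"].replace("Z", "+00:00"))


def reconcile(*streams):
    """Keep, for each document path, the version with the greatest mtime."""
    latest = {}
    for stream in streams:
        for doc in stream:
            key = doc["_meta"]["path"]
            current = latest.get(key)
            if current is None or parse_mtime(doc) > parse_mtime(current):
                latest[key] = doc
    return latest


# One stream carries new changes; the other carries the historical backfill.
change_stream = [
    {"_meta": {"path": "users/alice", "mtime": "2023-01-02T00:00:00Z"}, "name": "Alice v2"},
]
backfill = [
    {"_meta": {"path": "users/alice", "mtime": "2023-01-01T00:00:00Z"}, "name": "Alice v1"},
    {"_meta": {"path": "users/bob", "mtime": "2023-01-01T00:00:00Z"}, "name": "Bob"},
]
result = reconcile(change_stream, backfill)
```

Although the backfill delivers an older version of users/alice after the change stream has already delivered the newer one, `result` keeps the newer version, and the order in which the two streams are read does not change the outcome.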