BigQuery Batch Query Connector

This connector captures data from BigQuery into Estuary collections by periodically executing queries and translating the results into JSON documents.

Prerequisites

To use this connector, you will need the following prerequisites:

A Google Cloud Project with BigQuery enabled
A Google Cloud Service Account with the "BigQuery User" and "BigQuery Data Viewer" roles in your GCP project
A Service Account Key to authenticate into your Service Account

See the setup guide for more information about how to create the required resources.

Setup

Follow the steps below to set up the BigQuery connector.

Service Account

To sync data from BigQuery, you need credentials for a Service Account with the "BigQuery User" and "BigQuery Data Viewer" roles. These roles grant the necessary permissions to run BigQuery jobs, discover tables within the dataset, and read the contents of those tables. It is recommended to create a dedicated Service Account to facilitate permission management and auditing. However, if you already have a Service Account with the correct permissions, you can use it.

Here's how to provision a suitable service account:

Follow Google Cloud Platform's instructions for Creating a Service Account.
Note down the ID of the service account you just created. Service Account IDs typically follow the format <account-name>@<project-name>.iam.gserviceaccount.com.
Follow Google Cloud Platform's instructions for Granting IAM Roles to the new service account. The "principal" email address should be the ID of the service account you just created, and the roles granted should be "BigQuery User" and "BigQuery Data Viewer".

Once you have a Service Account created, you can authenticate with either a Service Account Key or using Google Cloud IAM.

Service Account Key

Service Account Keys are used to authenticate as Google Service Accounts. To be able to utilize the permissions granted to the Service Account in the previous step, you'll need to provide its Service Account Key when creating the capture. It is a good practice, though not required, to create a new key for Estuary even if you're reusing a preexisting account.

To create a new key for a service account, follow Google Cloud Platform's instructions for Creating a Service Account Key. Be sure to create the key in JSON format. Once the linked instructions have been followed you should have a key file, which will need to be uploaded to Estuary when setting up your capture.

Google Cloud IAM

Google Cloud IAM authentication uses a Workflow Identity Pool which allows the connector to be configured without storing any secrets. To use this method, follow the steps in the GCP IAM guide and make note of the audience for the provider and the service account name.

Set up the BigQuery connector in Estuary

Log into your Estuary account.
In the left navigation bar, click on "Sources". In the top-left corner, click "New Capture".
Locate and select the "BigQuery" connector.
Enter a name and optional description for the capture task.
Enter the Project ID and Dataset name that you intend to capture from, and paste or upload the service account key in the appropriate field.
Click the "Next" button and wait while the connector automatically discovers the available tables in the specified project and dataset.
Select the tables you wish to capture from the bindings list.
For each binding you selected, you will likely wish to specify cursor columns and a shorter "Poll Interval" setting. Otherwise the default behavior will be to recapture the entire contents of the table, once per day.
Once you are satisfied with your binding selection, click the "Save and Publish" button.

Specifying Cursor Columns

This connector operates by periodically executing a SELECT * FROM table query and outputting the resulting rows as JSON documents into an Estuary collection. In some cases doing this once or twice a day is entirely sufficient, but when working with larger tables (or if a faster update rate is desired) it pays to manually configure cursor columns.

The cursor must be a column (or ordered tuple of columns) which is expected to strictly increase for newly added or updated rows. Common examples of suitable cursors include:

Update timestamps, which are often the best choice if available since they can often be used to identify changed rows as well as new insertions.
Creation timestamps, which can be used to identify newly added rows in append-only datasets but won't help to identify changes to preexisting rows.
Monotonically increasing IDs, which are another way of identifying newly added rows but often don't help with update detection.

When a cursor is specified, the update query will take the form SELECT * FROM $table WHERE $cursorName > $lastCursorValue ORDER BY $cursorName and the capture connector will keep track of the highest observed cursor value between polling intervals. If multiple cursor columns are specified, they will be treated as an ordered tuple of columns which collectively form the cursor, and the obvious lexicographic tuple ordering will apply.

Once you have specified a suitable cursor for a table, you will likely want to lower the polling interval for that binding. The default polling interval is "24h" to keep data volumes low, but once a cursor is specified there is usually no downside to frequent polling, so you may wish to lower the interval to "5m" or even "5s" for that table.

Configuration

You configure connectors either in the Estuary web app, or by directly editing the catalog specification file. See connectors to learn more about using connectors. The values and specification sample below provide configuration details specific to the BigQuery source connector.

Properties

Endpoint

Property	Title	Description	Type	Required/Default
`/project_id`	Project ID	The GCP project ID for the project containing the source BigQuery dataset	string	Required
`/dataset`	Dataset	The BigQuery dataset to discover tables within	string	Required
`/credentials`	Credentials	Credentials for authentication.	Credentials	Required
`/advanced/poll`	Poll Interval	How often to poll bindings (may be overridden for a specific binding)	string	`"24h"`

Credentials

Credentials for authenticating with GCP. Use one of the following sets of options:

Property	Title	Description	Type	Required/Default
`/auth_type`	Auth Type	Method to use for authentication.	string	Required: `CredentialsJSON`
`/credentials_json`	Service Account JSON	The service account JSON credentials to use for authorization.	string	Required

Property	Title	Description	Type	Required/Default
`/auth_type`	Auth Type	Method to use for authentication.	string	Required: `GCPIAM`
`/gcp_service_account_to_impersonate`	Service Account	GCP Service Account email for Cloud SQL IAM authentication	string	Required
`/gcp_workload_identity_pool_audience`	Workload Identity Pool Audience	GCP Workload Identity Pool Audience in the format `https://iam.googleapis.com/projects/123/locations/global/workloadIdentityPools/test-pool/providers/test-provider`	string	Required

Bindings

Property	Title	Description	Type	Required/Default
`/name`	Name	A name which uniquely identifies this binding.	string	Required
`/cursor`	Cursor	The column name(s) which should be used as the incremental capture cursor	array	`[]`
`/template`	Template	The query (template) which will be executed every polling interval	string	Required
`/poll`	Poll Interval	Override the global polling interval for this binding.	string	`""`

Sample

captures:
  ${PREFIX}/${CAPTURE_NAME}:
    endpoint:
      connector:
        image: ghcr.io/estuary/source-bigquery-batch:v1
        config:
          project_id: your-project
          dataset: analytics
          credentials:
            auth_type: CredentialsJSON
            credentials_json: <service account JSON credentials>
    bindings:
      - resource:
          name: events
          template: "SELECT * FROM analytics.events;"
        target: ${PREFIX}/${COLLECTION_NAME}

Query Templates

The query template property of a binding defines what query will be executed against the database, given inputs describing the configured cursor columns and whether any prior cursor state exists. The default template implements the behavior described in specifying cursor columns.

In principle you are free to modify this template to implement whatever query you need. You could for instance create a new binding which queries a view, or which performs a more complex analytics query. However this should not be combined with table auto-discovery in a single capture, as this can produce some counterintuitive results. Instead create two separate capture tasks from the same database, one for autodiscovered tables and a separate one with the setting "Automatically Add New Collections" disabled for your custom bindings.

Prerequisites​

Setup​

Service Account​

Service Account Key​

Google Cloud IAM​

Set up the BigQuery connector in Estuary​

Specifying Cursor Columns​

Configuration​

Properties​

Endpoint​

Credentials​

Bindings​

Sample​

Query Templates​