Catalog spec
The catalog spec is the central YAML file that defines and describes the components of a catalog. Flow builds the catalog from multiple source files, however, and these can be of various types, including YAML, JSON, and TypeScript. A single file can contain many different sections, and a directory can hold files of multiple types, which makes Flow catalogs highly customizable.
The most common catalog spec sections you'll see and use are:
  • import
  • collections
  • materializations
  • captures
  • tests
Other catalog elements that you'll work with directly but are not defined in the main catalog spec YAML are:
  • Schemas: stored in a separate YAML file as a best practice, to allow reuse
  • Lambdas: stored in a separate TypeScript file

Organizing the Flow catalog spec

You don't need to store the entire catalog spec in one YAML file. A catalog spec can reference other files, which can be managed independently. You may want to split the spec up if:
  • You want to ensure shared collections remain easy to find
  • You use data that's managed by different teams
  • You could benefit from DRY factoring of configuration that differs per environment
  • You need to manage sensitive credentials separately from materialization definitions

import

Flow's import directive can help you handle all of these scenarios while keeping your catalogs well organized. Each catalog spec file may import any number of other files, and each import may refer to either a relative or an absolute URL.
When you use import in a catalog spec, you're conceptually bringing the entirety of another catalog, along with the schemas and TypeScript files it uses, into your catalog. Imports are also transitive: when you import another catalog, you also import everything that catalog has imported. This keeps your catalogs organized and is flexible enough to support collaboration between separate teams and organizations.
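As a minimal sketch of transitivity, consider three files whose names are hypothetical and chosen only for illustration. They are shown in one block, separated by `---`, but each would be a separate file on disk:

```yaml
# top.flow.yaml (hypothetical file name)
import:
  - middle.flow.yaml
---
# middle.flow.yaml (hypothetical file name)
import:
  - base.flow.yaml
---
# base.flow.yaml (hypothetical file name)
# Defines collections, schemas, and so on. Because imports are transitive,
# a catalog built from top.flow.yaml also includes everything defined here,
# even though top.flow.yaml never names this file directly.
```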
Perhaps the best way of explaining this is with some examples.

Example: Organizing collections

Let's look at a relatively simple case in which you want to organize your collections into multiple catalog files. Say you work for Acme Corp on the team that's introducing Flow. You might start with the collections and directory structure below:
    acme/customers/customerInfo
    acme/products/info/manufacturers
    acme/products/info/skus
    acme/products/inventory
    acme/sales/pending
    acme/sales/complete
    acme
    ├── flow.yaml
    ├── customers
    │   ├── flow.ts
    │   ├── flow.yaml
    │   └── schemas.yaml
    ├── products
    │   ├── flow.yaml
    │   ├── info
    │   │   ├── flow.ts
    │   │   ├── flow.yaml
    │   │   └── schemas.yaml
    │   └── inventory
    │       ├── flow.ts
    │       ├── flow.yaml
    │       └── schemas.yaml
    ├── schemas.yaml
    └── sales
        ├── flow.ts
        ├── flow.yaml
        └── schemas.yaml
It's immediately clear where each of the given collections is defined, since the directory names match the path segments in the collection names. This is not required by flowctl, but is strongly recommended, since it makes your catalogs more readable and maintainable. Each directory contains a flow.yaml file that will import all of the catalogs from child directories.
So, the top-level catalog spec, acme/flow.yaml, might look something like this:
    import:
      - customers/flow.yaml
      - products/flow.yaml
      - sales/flow.yaml
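By the same pattern, an intermediate file such as acme/products/flow.yaml would import the catalogs of its child directories. The source does not show this file, so the following is only a sketch, assuming import paths are resolved relative to the importing file:

```yaml
# acme/products/flow.yaml (sketch; not shown in the original example)
import:
  - info/flow.yaml
  - inventory/flow.yaml
```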
This type of layout has a number of other advantages. During development, you can easily work with a subset of collections using, for example, flowctl test --source acme/products/flow.yaml to run only the tests for product-related collections. It also allows other imports to be more granular. For example, you might want a derivation under sales to read from acme/products/info. Since info has a separate catalog spec, acme/sales/flow.yaml can import acme/products/info/flow.yaml without creating a dependency on the inventory collection.
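The granular import described above might be sketched as follows. This assumes that relative import paths, including `..` segments, are resolved against the location of the importing file; the other contents of acme/sales/flow.yaml are omitted:

```yaml
# acme/sales/flow.yaml (sketch)
import:
  # Brings in only the product info catalog. Nothing here refers to
  # acme/products/inventory, so no dependency on it is created.
  - ../products/info/flow.yaml
```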

Example: Separate environments

It's common to use separate environments for tiers like development, staging, and production. Flow catalog specs often need to include endpoint configuration for the external systems that hold materialized views. Let's say you want your production environment to materialize views to Snowflake, but you want to develop locally on SQLite. We can modify the Acme example slightly to account for this.
    acme
    ├── dev.flow.yaml
    ├── prod.flow.yaml
    ... the remainder is the same as above
Each of the top-level catalog specs might import all of the collections and define an endpoint called ourMaterializationEndpoint that points to the desired system. The import block might be the same for each system, but each file may use a different configuration for the endpoint, which is used by any materializations that reference it.
Our configuration for our development environment will look like:
dev.flow.yaml
    import:
      - customers/flow.yaml
      - products/flow.yaml
      - sales/flow.yaml

    endpoints:
      ourMaterializationEndpoint:
        # dev.flow.yaml
        sqlite:
          path: dev-materializations.db
While production will look like:
prod.flow.yaml
    import:
      - customers/flow.yaml
      - products/flow.yaml
      - sales/flow.yaml

    endpoints:
      ourMaterializationEndpoint:
        snowflake:
          account: acme_production
          role: admin
          schema: snowflake.com/acmeProd
          user: importantAdmin
          password: abc123
          warehouse: acme_production
When we want to test locally, we simply run flowctl test --source dev.flow.yaml, and when we push to production, we'll likely run flowctl apply --source prod.flow.yaml.
Everything continues to work in both cases: in the development environment, collections are bound to the local SQLite database, and in production they're bound to Snowflake.
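Since the import list is identical in both files and imports are transitive, it could be factored into a single shared file that each environment imports. The file name collections.flow.yaml below is hypothetical, and this is only a sketch of the idea:

```yaml
# collections.flow.yaml (hypothetical shared file)
import:
  - customers/flow.yaml
  - products/flow.yaml
  - sales/flow.yaml
---
# dev.flow.yaml and prod.flow.yaml would then each begin with this single
# import, followed by their own environment-specific endpoint configuration.
import:
  - collections.flow.yaml
```

The `---` separator is only for display; the two parts belong to separate files.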

Example: Cross-team collaboration

When working across teams, it's common for one team to provide a data product for another to reference and use. Flow is designed for cross-team collaboration, allowing teams and users to reference each other's full catalog or schema.
Again using the Acme example, let's imagine we have two teams. Team Web is responsible for Acme's website, and Team User is responsible for providing a view of Acme customers that's always up to date. Since Acme wants a responsive site that provides a good customer experience, Team Web needs to pull the most up-to-date information from Team User at any point. Let's look at Team User's collections:
teamUser.flow.yaml
    import:
      - userProfile.flow.yaml
Which references:
userProfile.flow.yaml
    collections:
      userProfile:
        schema: /userProfile/schema
        key: [/id]
In both the import and schema sections, Team User references files in its own directory, which the team actively manages. If Team Web wants to access user data (and has permission to do so), it can import Team User's catalog by relative path or, provided Team User publishes its catalog at a URL, by that URL:
teamWeb.flow.yaml
    import:
      - http://www.acme.com/teamUser#userProfile.flow.yaml
      - webStuff.flow.yaml
Team Web now has direct access to Team User's collections, referenced by name, and can build derived collections on top of them. Team Web can also import schemas directly:
webStuff.flow.yaml
    collections:
      webStuff:
        schema: http://acme.com/teamUser#userProfile/#schema
        key: [/id]

Global namespace

Every Flow collection has a name, and that name must be unique within a running Flow system. Flow collections should be thought of as existing within a global namespace. Keeping names globally unique makes it easy to import catalogs from other teams, or even other organizations, without having naming conflicts or ambiguities.
For example, imagine your catalog for the inside sales team has a collection just named customers. If you later try to import a catalog from the outside sales team that also contains a customers collection, 💥 there's a collision. A better collection name would be acme/inside-sales/customers. This allows a catalog to include customer data from separate teams, and also separate organizations.
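As a sketch of this naming scheme, the two teams' catalogs might declare their collections as shown here. The outside-sales name follows the same pattern by assumption, and the schema file names and the /id key are hypothetical placeholders. Each part, separated by `---`, would be a separate file:

```yaml
# Inside sales team's catalog (sketch)
collections:
  acme/inside-sales/customers:
    schema: inside-customers.schema.yaml   # hypothetical schema file
    key: [/id]                             # hypothetical key
---
# Outside sales team's catalog (sketch)
collections:
  acme/outside-sales/customers:
    schema: outside-customers.schema.yaml  # hypothetical schema file
    key: [/id]                             # hypothetical key
```

Because the full names differ, a third catalog can import both without a collision.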