Collections
How to define collections in the catalog spec
A collection is a set of related documents, each of which adheres to a common schema and grouping key. Collections are append-only: once a document is added to a collection, it is never removed. However, it may be replaced or updated, either as a whole or in part, by a later document that shares its key.
Each new document with a given key is reduced into the existing documents that share that key. By default, Flow performs this reduction by completely replacing the previous document, but you can specify much richer reduction behaviors by annotating the collection schema with reduction strategies.
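The reduction described above can be pictured with a small Python sketch. This is only an illustration of the idea, not Flow's implementation: the function names, the `merge_fields` strategy, and the sample documents are all hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Document = Dict[str, Any]
Reducer = Callable[[Document, Document], Document]


def last_write_wins(previous: Document, incoming: Document) -> Document:
    """Default behaviour described above: the new document replaces the old one."""
    return incoming


def merge_fields(previous: Document, incoming: Document) -> Document:
    """Hypothetical richer strategy: the new document updates the old one in part."""
    return {**previous, **incoming}


def reduce_collection(
    documents: Iterable[Document],
    key_fields: List[str],
    reducer: Reducer = last_write_wins,
) -> Dict[Tuple[Any, ...], Document]:
    """Fold documents into one reduced document per key, in arrival order."""
    reduced: Dict[Tuple[Any, ...], Document] = {}
    for doc in documents:
        key = tuple(doc[field] for field in key_fields)
        reduced[key] = reducer(reduced[key], doc) if key in reduced else doc
    return reduced


# Illustrative documents; the third shares its key with the first.
docs = [
    {"id": 1, "name": "a", "count": 1},
    {"id": 2, "name": "b"},
    {"id": 1, "count": 5},
]
replaced = reduce_collection(docs, ["id"])             # whole-document replacement
merged = reduce_collection(docs, ["id"], merge_fields)  # partial update
```

With the default reducer, the document for key `1` ends up as `{"id": 1, "count": 5}`; with the merging reducer it keeps the earlier `name` field as well.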

collections section

The collections section is a list of collection definitions within a catalog spec file. A collection must be defined before it can be used as a source or destination for a capture or materialization.
Derived collections may reference collections defined in other catalog sources, but must first import them, either directly or indirectly. Each collection is defined as an object with the following fields:
# An object of collections to include in the catalog, where the key is the name of the collection,
# and the value is an object with its definition. You may define any number of collections.
collections:
  # The user-defined name of the collection. Flow collections exist conceptually in a global
  # namespace, so every collection must have a unique name. By convention, slashes are used to
  # fully qualify collection names using path components. Collection names may not be changed.
  # Names may include only unicode letters, numbers and symbols - no spaces or other special characters.
  myOrg/myDomain/collectionName:

    # The key of the collection defines the fields (that must exist) within each document that
    # uniquely identify the entity to which it pertains. Each field that is part of the key must
    # be guaranteed by the schema to always exist, and to have a single possible scalar type.
    # The fields are specified each as a JSON Pointer.
    # Required, type: array
    key: [/json/ptr]

    # Schema against which collection documents are validated and reduced.
    # This should be a URI that points to a YAML or JSON file with the schema;
    # defining the schema inline is discouraged. See below for more details.
    # Required, type: string | object
    schema: mySchemas.yaml#/$defs/myCollectionSchema

    # Projections and logical partitions for this collection.
    # See below for details.
    # Optional, type: object
    projections:

    # Derivation that builds this collection from others through transformations. This defines
    # how documents are derived from other collections. A collection without a derivation is
    # referred to as an "ingested collection".
    # See below for details.
    # Optional, type: object
    derivation:
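The requirement that every key field is guaranteed by the schema to exist and to have a single scalar type can be checked mechanically. The following Python sketch shows one way to do so, under stated assumptions: it only follows the `properties` and `required` keywords, the set of accepted scalar types is a guess, and `key_problems` is a hypothetical helper rather than anything Flow provides.

```python
from typing import Any, Dict, List

# Assumption: which JSON Schema types count as "scalar" for a key.
SCALAR_TYPES = {"string", "integer", "number", "boolean"}


def pointer_tokens(pointer: str) -> List[str]:
    """Split a JSON Pointer (RFC 6901) into unescaped reference tokens."""
    if pointer == "":
        return []
    if not pointer.startswith("/"):
        raise ValueError(f"not a JSON Pointer: {pointer!r}")
    return [t.replace("~1", "/").replace("~0", "~") for t in pointer[1:].split("/")]


def key_problems(schema: Dict[str, Any], key: List[str]) -> List[str]:
    """Return the problems found; an empty list means every key field passed.

    Simplification: only `properties` and `required` are followed, so
    schemas that use `$ref`, `allOf`, and similar keywords are not handled.
    """
    problems: List[str] = []
    for pointer in key:
        node = schema
        for token in pointer_tokens(pointer):
            if token not in node.get("required", []):
                problems.append(f"{pointer}: '{token}' is not required")
            node = node.get("properties", {}).get(token, {})
        declared = node.get("type")
        if not (isinstance(declared, str) and declared in SCALAR_TYPES):
            problems.append(f"{pointer}: type must be a single scalar type")
    return problems


# Illustrative schema: `bike_id` is a valid key field, `last` is not.
schema = {
    "type": "object",
    "properties": {
        "bike_id": {"type": "integer"},
        "last": {"type": ["string", "null"]},
    },
    "required": ["bike_id"],
}
good = key_problems(schema, ["/bike_id"])
bad = key_problems(schema, ["/last"])
```

Here `good` is empty, while `bad` reports both that `last` is not required and that it has more than one possible type.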

Projections

Projections are named locations within a collection document that may be used for logical partitioning or directly exposed to databases into which collections are materialized. Projections are objects that use the following entity structure:
# JSON Pointer that identifies a location in a document.
# string, pattern: ^(/[^/]+)*
a_field: "/json/ptr"

# Entity that defines a partition.
# type: object
a_partition:

  # Location of this projection
  # type: string, pattern: ^(/[^/]+)*
  location: "/json/ptr"

  # Is this projection a logical partition?
  # type: boolean
  partition: true
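To make the two projection forms concrete, the Python sketch below resolves each projection's JSON Pointer against a document and lists which projections are logical partitions. It is a minimal illustration rather than Flow's behaviour: `resolve`, `project`, `partition_fields`, and the sample station data are all hypothetical.

```python
from typing import Any, Dict, List, Union

Projection = Union[str, Dict[str, Any]]


def resolve(document: Any, pointer: str) -> Any:
    """Look up a JSON Pointer (RFC 6901) in a parsed JSON document."""
    node = document
    for raw in pointer.split("/")[1:]:
        token = raw.replace("~1", "/").replace("~0", "~")
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


def project(document: Any, projections: Dict[str, Projection]) -> Dict[str, Any]:
    """Flatten a document into {projection name: value at its location}."""
    row = {}
    for name, spec in projections.items():
        # A projection is either a bare pointer string or an object with `location`.
        location = spec if isinstance(spec, str) else spec["location"]
        row[name] = resolve(document, location)
    return row


def partition_fields(projections: Dict[str, Projection]) -> List[str]:
    """Names of the projections flagged with `partition: true`."""
    return [
        name
        for name, spec in projections.items()
        if isinstance(spec, dict) and spec.get("partition") is True
    ]


# Illustrative projections in both forms shown above.
projections = {
    "station": "/station/name",
    "region": {"location": "/station/region", "partition": True},
}
doc = {"station": {"name": "Main St", "region": "east"}}
row = project(doc, projections)
partitions = partition_fields(projections)
```

The flattened `row` is the kind of shape a materialization into a database table could expose, and `partitions` names the fields by which documents would be logically partitioned.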
You can learn more about projections in their conceptual documentation.
Details on the following sub-entities can be found on their pages:
Below is a simple example of a collection that can be defined in Flow. The schema is shown inline to keep the example complete, although in practice it is recommended to store schemas in a separate file and reference them by URI.
collections:
  examples/citi-bike/last-seen:
    key: [/bike_id]
    schema:
      type: object
      properties:
        bike_id:
          type: integer
        last:
          type: string
      required: [bike_id, last]
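As a rough illustration of what the example schema accepts, the Python sketch below hand-checks documents against it. It covers only the keywords used in this example (`type`, `properties`, `required`) and is not a general JSON Schema validator; the `errors` helper and the sample documents are hypothetical.

```python
from typing import Any, Dict, List

# The inline schema from the example above, as a Python dictionary.
SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "bike_id": {"type": "integer"},
        "last": {"type": "string"},
    },
    "required": ["bike_id", "last"],
}

PYTHON_TYPES = {"integer": int, "string": str}


def errors(document: Any, schema: Dict[str, Any] = SCHEMA) -> List[str]:
    """Return the validation errors for a document; empty means it is valid."""
    if not isinstance(document, dict):
        return ["document must be an object"]
    found = [
        f"missing required property '{name}'"
        for name in schema["required"]
        if name not in document
    ]
    for name, sub in schema["properties"].items():
        if name not in document:
            continue
        value = document[name]
        expected = PYTHON_TYPES[sub["type"]]
        # bool is a subclass of int in Python, but JSON true/false are not integers.
        if isinstance(value, bool) or not isinstance(value, expected):
            found.append(f"property '{name}' must be of type {sub['type']}")
    return found


valid = errors({"bike_id": 7, "last": "2020-01-01T00:00:00Z"})
invalid = errors({"bike_id": "7"})
```

The first document passes, while the second is rejected because `last` is missing and `bike_id` is a string rather than an integer.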