# Create a real-time materialized view in PostgreSQL
PostgreSQL supports materialized views: database objects that contain the result of a query, usually a focused subset of a large dataset.
In this tutorial, you'll use Flow and your Postgres instance to create something that's not possible in Postgres alone: a materialized view that updates continuously based on a real-time data feed.
## Prerequisites

* An Estuary Flow account. If you don't have one, visit the Flow web app to register for free.
* The `flowctl` CLI installed (for the optional derivation section).
* A Postgres database for testing, set up to allow connections from Flow. Amazon RDS, Amazon Aurora, Google Cloud SQL, Azure Database for PostgreSQL, and self-hosted databases are supported.
## Introduction
Materialized views in Postgres give you a powerful way to narrow down a huge dataset into a compact one that you can easily monitor. But if your data is updating in real time, traditional materialized views introduce latency. They're batch workflows — the query is run at a set interval.
To get around this, you'll need to perform a real-time transformation elsewhere. Flow derivations are a great way to do this.
For this example, you'll use Estuary's public data collection of recent changes to Wikipedia, captured from the Wikimedia Foundation's event stream.
The raw dataset is quite large. It captures every change to the platform — about 30 per second — and includes various properties. Written to a Postgres table, it quickly grows to a size that's very expensive to query.
First, you'll scope the raw data down to a small fact table with a derivation.
You'll then materialize both the raw and transformed datasets to your Postgres instance and compare performance.
## Add a derivation to transform data
Derivations are currently available in the flowctl CLI. If you'd prefer to only work in the web app today, you can skip to the next section. Estuary provides a pre-computed copy of the derivation that you can use to compare performance.
On the CLI-API tab in the Flow web app, copy the access token.
Authorize Flow in your local development environment:

```shell
flowctl auth token --token <your-token-here>
```
Next, pull the raw Wikipedia collection:

```shell
flowctl catalog pull-specs --name estuary/public/wikipedia/recentchange
```
Source files are written to your current working directory.
Open Estuary > Public > Wikipedia and examine the contents of `flow.yaml` and `recentchange.schema.yaml`.

The collection is keyed on its metadata, so every new change event is seen as unique. Its schema has many fields. This would yield a large, unwieldy table in Postgres.
Learn more about Flow collections and schemas.
Next, you'll add the derivation. Technically, a derivation is a new collection that contains a transformation within it. First, you'll define the collection. Then, you'll flesh out the transformation.
Create a new file called `fact-table.flow.yaml` and add a new collection called `<your-prefix>/wikipedia/user-fact-table`.
Your prefix is likely your organization name. You can find it in the web app's admin tab. You must have write or admin access to create a collection in the prefix.
Copy the sample below:
```yaml
---
collections:
  yourprefix/wikipedia/user-fact-table:
    schema:
      properties:
        count:
          reduce:
            strategy: sum
          type:
            - integer
        last_updated:
          format: date
          type: string
        userid:
          type:
            - integer
      reduce:
        strategy: merge
      required:
        - last_updated
        - userid
      type: object
    key:
      - /userid
      - /last_updated
    derivation:
      register:
        schema:
          allOf:
            - true
        initial: ~
      transform:
        new_fact_table:
          source:
            name: estuary/public/wikipedia/recentchange
          publish:
            lambda: typescript
      typescript:
        module: user-fact-table.ts
```
The new collection's schema contains reduction annotations: documents that share a `userid` and `last_updated` date are merged, and their `count` fields are summed.
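To build intuition for those annotations, here's a minimal sketch in plain TypeScript (not Flow's actual merge machinery; the `FactDoc` type and `reduceSum` helper are illustrative names, not part of Flow) of what the `sum` strategy does when two documents share the same key:

```typescript
// Illustrative sketch only: Flow applies these reductions internally.
// Documents sharing the collection key (userid, last_updated) are merged,
// and the `sum` strategy adds their `count` fields together.
interface FactDoc {
    userid: number;
    last_updated: string;
    count: number;
}

function reduceSum(left: FactDoc, right: FactDoc): FactDoc {
    // Key fields are identical by definition; only `count` accumulates.
    return { ...left, count: left.count + right.count };
}

const a: FactDoc = { userid: 42, last_updated: "2023-01-15", count: 1 };
const b: FactDoc = { userid: 42, last_updated: "2023-01-15", count: 1 };
console.log(reduceSum(a, b)); // { userid: 42, last_updated: "2023-01-15", count: 2 }
```

Because the reduction is associative, Flow can combine any number of change events for the same user and day into a single row, which is what keeps the fact table compact.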
Generate a TypeScript file for the derivation's transformation function:

```shell
flowctl typescript generate --source flow.yaml
```
Open `user-fact-table.ts`. It contains a stubbed-out transformation. You'll populate it with a function that counts the number of changes associated with each user on a given date and converts the timestamp in the source data to a familiar date format.

Copy and paste from the sample below (beginning at line 4):
```typescript
import { IDerivation, Document, Register, NewFactTableSource } from 'flow/yourprefix/wikipedia/user-fact-table';

// Implementation for derivation estuary/public/wikipedia/flow.yaml#/collections/yourprefix~1wikipedia~1user-fact-table/derivation.
export class Derivation implements IDerivation {
    newFactTablePublish(
        source: NewFactTableSource,
        _register: Register,
        _previous: Register,
    ): Document[] {
        // Default to 0 when the event has no log_params.userid.
        let user_id = 0;
        if (typeof source.log_params == "object" && !Array.isArray(source.log_params) && source.log_params.userid != undefined) {
            user_id = source.log_params.userid;
        }
        // Truncate the source timestamp to a plain yyyy-mm-dd date.
        const [yyyy, mm, dd] = source.meta.dt.split('-');
        const date = yyyy + '-' + mm + '-' + dd.substring(0, 2);
        return [
            {
                userid: user_id,
                count: 1,
                last_updated: date,
            },
        ];
    }
}
```

Publish the derivation:

```shell
flowctl catalog publish --source path/to/your/fact-table.flow.yaml
```
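If you want to sanity-check the timestamp handling before publishing, the split-and-truncate logic can be exercised on its own. This standalone sketch (the `toDateString` helper is a hypothetical name, not part of the generated module) mirrors it:

```typescript
// Standalone version of the timestamp-to-date logic used in the derivation.
// "2023-01-15T12:34:56Z" splits on '-' into ["2023", "01", "15T12:34:56Z"];
// keeping the first two characters of the last piece yields the day.
function toDateString(dt: string): string {
    const [yyyy, mm, dd] = dt.split('-');
    return yyyy + '-' + mm + '-' + dd.substring(0, 2);
}

console.log(toDateString("2023-01-15T12:34:56Z")); // 2023-01-15
```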
Your transformation runs continuously: as the raw dataset updates in real time, so does your derived fact table.
## Create the continuous materialized view
Now, you'll materialize your new fact table to Postgres. You'll also materialize the source dataset to compare performance.
Go to the Materializations page in the Flow web app.
Click New Materialization.
For Connector, choose PostgreSQL. Add a unique name for the materialization, for example, `yourprefix/yourname-materialized-views-demo`.

Fill out the Basic Config with:

* A username and password for the Postgres instance.
* Your database host and port.
* The database name (if in doubt, use the default, `postgres`).
See the connector documentation if you need help finding these properties.
In the Collection Selector, search for and add the collection `estuary/public/wikipedia/recentchange` and name the corresponding Postgres table `wikipedia_raw`.

Also search for and add the collection you just derived (for example, `yourprefix/wikipedia/user-fact-table`). If you skipped the derivation, use the provided version, `estuary/public/wikipedia/user-fact-table`. Name the corresponding Postgres table `wikipedia_data_by_user`.

Click Next to test the connection.
Click Save and Publish.
## Explore the results
In your Postgres client of choice, note the size of each table and how quickly they change. Try running some basic queries against both and compare performance. See the blog post for ideas.
Once you're satisfied, and to prevent continual resource use, disable or delete your materialization from the Materializations page.