Create a real-time materialized view in PostgreSQL
PostgreSQL supports materialized views: database objects that contain the result of a query, usually a focused subset of a large dataset.
In this tutorial, you'll use Flow and your Postgres instance to create something that's not possible in Postgres alone: a materialized view that updates continuously based on a real-time data feed.
Prerequisites
- An Estuary Flow account. If you don't have one, visit the Flow web app to register for free.
- A GitLab, GitHub, or BitBucket account. You'll use this to log into GitPod, the cloud development environment integrated with Flow.
  - Alternatively, you can complete this tutorial using a local development environment. In that case, you'll need to install `flowctl` locally. Note that the steps you'll need to take will be different. Refer to this guide for help.
- A Postgres database set up to allow connections from Flow. Amazon RDS, Amazon Aurora, Google Cloud SQL, Azure Database for PostgreSQL, and self-hosted databases are supported.
Introduction
Materialized views in Postgres give you a powerful way to narrow down a huge dataset into a compact one that you can easily monitor. But if your data is updating in real time, traditional materialized views introduce latency: they're batch workflows, and the query is only re-run at a set interval.
To get around this, you'll need to perform a real-time transformation elsewhere. Flow derivations are a great way to do this.
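As a point of comparison, here's roughly what the batch approach looks like in plain Postgres. The table and column names (`raw_changes`, `changed_at`) are hypothetical and only serve to illustrate the refresh-based workflow this tutorial replaces.

```sql
-- Hypothetical batch workflow: the view is a snapshot of the query result.
CREATE MATERIALIZED VIEW daily_edits_by_user AS
SELECT "user", changed_at::date AS date, count(*) AS edits_this_day
FROM raw_changes
GROUP BY "user", changed_at::date;

-- Rows added to raw_changes afterward aren't visible in the view until it
-- is refreshed, typically on a fixed schedule.
REFRESH MATERIALIZED VIEW daily_edits_by_user;
```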
For this example, you'll use Estuary's public data collection of recent changes to Wikipedia, captured from the Wikimedia Foundation's event stream.
The raw dataset is quite large. It captures every change to the platform, about 30 per second, and includes various properties. Written to a Postgres table, it quickly grows to a size that's very expensive to query.
First, you'll scope the raw data down to a small fact table with a derivation.
You'll then materialize both the raw and transformed datasets to your Postgres instance and compare performance.
Loading the Wikipedia Demo
- Navigate to the Live Demo page and click on See the capture.
- After accepting the pop-up, Estuary will populate your Sources, Collections, and Destinations with the Wikipedia Demo tasks.
Check out the source data
- Go to the collections page of the Flow web app.
- Search for `demo/wikipedia/recentchange` and click on its name.
- On the Collection Details page, click the Spec tab.

The collection schema has many fields. Because Wikipedia sees a lot of edits, this would yield a large, unwieldy table in Postgres.

Tip: To save on performance, you can also perform this tutorial using the smaller `demo/wikipedia/recentchange-sampled` collection. Apart from the collection name, all other steps are the same.

Learn more about Flow collections and schemas.
Now you'll create the derivation. A derivation is a new collection that's defined by a transformation. First, you'll define the collection's schema. Then, you'll write the transformation to shape the data to that schema.
Add a derivation to transform data
You'll write your derivation using a cloud development environment integrated in the Flow web app.
- Go back to the collections page and click the New Transformation button.
- Set the source collection to the raw Wikipedia data. Search for and select `demo/wikipedia/recentchange`.
- Set the transformation language to SQL.
- Give the derivation a name. From the dropdown, choose the name of your catalog prefix and append a unique name, for example `yourprefix/wikipedia/user-fact-table`.
- Click Proceed to GitPod to create your development environment. Sign in with one of the available account types.
- On the New Workspace screen, keep the Context URL option selected and click Continue.
A GitPod development environment opens. A stubbed-out derivation with a SQL transformation has already been created for you. Next, you'll locate and open the source files.
- Each slash-delimited prefix of your derivation name has become a folder. Open the nested folders to locate a `flow.yaml` file.

  Following the example above, you'd open the folders called `yourprefix`, then `wikipedia`, to find the correct `flow.yaml` file. Its contents look like this:

  ```yaml
  collections:
    yourprefix/wikipedia/user-fact-table:
      schema:
        properties:
          your_key:
            type: string
        required:
          - your_key
        type: object
      key:
        - /your_key
      derive:
        using:
          sqlite:
            migrations:
              - user-fact-table.migration.0.sql
        transforms:
          - name: recentchange
            source: demo/wikipedia/recentchange
            shuffle: any
            lambda: user-fact-table.lambda.recentchange.sql
  ```

Your first order of business is to replace the placeholder schema and collection key. As we saw earlier, the source collection's schema and key caused every Wikipedia event to generate a new document. You'll fix that here.
- Replace the existing `schema` and `key` stanzas with the following:

  ```yaml
  schema:
    properties:
      edits_this_day:
        reduce:
          strategy: sum
        type: integer
      date:
        format: date
        type: string
      user:
        type: string
    reduce:
      strategy: merge
    required:
      - user
      - date
      - edits_this_day
    type: object
  key:
    - /user
    - /date
  ```
The new schema contains reduction annotations. These sum the changes made by a given user on a given date. The collection is now keyed on each unique combination of user ID and date. It has just three fields: the user, date, and the number of changes made by that user on that date.
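If it helps to think in SQL terms, the effect is loosely analogous to an incrementing upsert: each source document arrives carrying a count of 1, and documents that share the same user and date are merged with their counts summed. The snippet below is purely illustrative (the `user_edit_counts` table is made up); Flow performs this reduction internally rather than issuing SQL like this.

```sql
-- Loose analogy only: one incoming document for user 'WMrapids' on 2023-07-18.
-- The sum/merge reduction behaves like an upsert that adds the new count to
-- whatever is already stored for that (user, date) key.
INSERT INTO user_edit_counts ("user", date, edits_this_day)
VALUES ('WMrapids', '2023-07-18', 1)
ON CONFLICT ("user", date)
DO UPDATE SET edits_this_day = user_edit_counts.edits_this_day + EXCLUDED.edits_this_day;
```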
Next, you'll add the transformation.
- In the `transforms` stanza, give the transformation a new name to differentiate it from the name of the source collection. For example:

  ```yaml
  transforms:
    - name: dailychangesbyuser
  ```
- Update the shuffle key. Since you're working with a large dataset, this ensures that all documents for a given user are processed by the same task shard, which prevents Flow from creating multiple counts for a given user and date combination. Learn more about shuffles.

  ```yaml
  shuffle: { key: [ /user ] }
  ```
Now, the transform needs its SQL lambda function: the function that will shape the source data to fit the new schema. Flow has created another file to contain it.
- Open the file called `user-fact-table.lambda.recentchange.sql`.

- Replace its contents with:

  ```sql
  select $user, 1 as edits_this_day, date($meta$dt) as date where $user is not null;
  ```

  This creates the `edits_this_day` field we referenced earlier and starts the counter at 1. It converts the timestamp into a simplified date format. Finally, it filters out `null` users (which occasionally occur in the Wikipedia data stream and would violate your schema).
- All pieces of the derivation are in place. Double check your files against these samples.

  `flow.yaml`:

  ```yaml
  ---
  collections:
    yourprefix/wikipedia/user-fact-table:
      schema:
        properties:
          edits_this_day:
            reduce:
              strategy: sum
            type: integer
          date:
            format: date
            type: string
          user:
            type: string
        reduce:
          strategy: merge
        required:
          - user
          - date
          - edits_this_day
        type: object
      key:
        - /user
        - /date
      derive:
        using:
          sqlite:
            migrations:
              - user-fact-table.migration.0.sql
        transforms:
          - name: dailychangesbyuser
            source: demo/wikipedia/recentchange
            shuffle: { key: [ /user ] }
            lambda: user-fact-table.lambda.recentchange.sql
  ```

  `user-fact-table.lambda.recentchange.sql`:

  ```sql
  select $user, 1 as edits_this_day, date($meta$dt) as date where $user is not null;
  ```
- Run the derivation locally and preview its output:

  ```bash
  flowctl preview --source flow.yaml
  ```

  In your terminal, you'll see JSON documents that look like:

  ```json
  {"date":"2023-07-18","edits_this_day":3,"user":"WMrapids"}
  ```

  This looks right: it includes the correctly formatted date, the number of edits, and the username. You're ready to publish.

- Stop the local derivation with Ctrl-C.

- Publish the derivation:

  ```bash
  flowctl catalog publish --source flow.yaml
  ```

  The message `Publish successful` means you're all set.
Your transformation will continue to run against the raw dataset in real time, keeping your derived collection up to date as new changes arrive. You're free to close your GitPod.
Create the continuous materialized view
Now, you'll materialize your new fact table to Postgres. You'll also materialize the source dataset to compare performance.
- Go to the Destinations page in the Flow web app.
- Click New Materialization.
- Find the PostgreSQL tile and click Materialization.
- Add a unique name for the materialization, for example, `yourprefix/yourname-materialized-views-demo`.
- Fill out the Basic Config with:
  - A username and password for the Postgres instance.
  - Your database host and port.
  - The database name (if in doubt, use the default, `postgres`).

  See the connector documentation if you need help finding these properties.

- In the Source Collections browser, search for and add the collection `demo/wikipedia/recentchange` and name the corresponding Postgres table `wikipedia_raw`.
- Also search for and add the collection you just derived (for example, `yourprefix/wikipedia/user-fact-table`). Name the corresponding Postgres table `wikipedia_data_by_user`.
- Click Next to test the connection.
- Click Save and Publish.
Explore the results
In your Postgres client of choice, note the size of each table and how quickly they change. Try running some basic queries against both and compare performance. See the blog post for ideas.
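For instance, you might start with something like the sketch below. The fact table's columns follow the derivation schema (`user`, `date`, `edits_this_day`); the raw table's columns depend on which fields the connector materialized, so adjust names as needed.

```sql
-- Row counts give a quick sense of how much smaller the fact table is.
SELECT count(*) FROM wikipedia_raw;
SELECT count(*) FROM wikipedia_data_by_user;

-- Top editors for a given day, read directly from the pre-reduced fact table.
SELECT "user", edits_this_day
FROM wikipedia_data_by_user
WHERE date = '2023-07-18'
ORDER BY edits_this_day DESC
LIMIT 10;
```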
Once you're satisfied, and to prevent continual resource use, disable or delete your materialization from the Destinations page.