Create a real-time materialized view in PostgreSQL
PostgreSQL supports materialized views: database objects that contain the result of a query, usually a focused subset of a large dataset.
In this tutorial, you'll use Flow and your Postgres instance to create something that's not possible in Postgres alone: a materialized view that updates continuously based on a real-time data feed.
Prerequisites
- An Estuary Flow account. If you don't have one, visit the Flow web app to register for free.
- A GitLab, GitHub, or BitBucket account. You'll use this to log into GitPod, the cloud development environment integrated with Flow.
  - Alternatively, you can complete this tutorial using a local development environment. In that case, you'll need to install `flowctl` locally. Note that the steps you'll need to take will be different. Refer to this guide for help.
- A Postgres database set up to allow connections from Flow. Amazon RDS, Amazon Aurora, Google Cloud SQL, Azure Database for PostgreSQL, and self-hosted databases are supported.
Introduction
Materialized views in Postgres give you a powerful way to narrow down a huge dataset into a compact one that you can easily monitor. But if your data is updating in real time, traditional materialized views introduce latency: they're batch workflows, and the query is only re-run at a set interval.
To get around this, you'll need to perform a real-time transformation elsewhere. Flow derivations are a great way to do this.
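As a point of comparison, here's roughly what the batch approach looks like in plain Postgres. The table and column names (`raw_changes`, `changed_at`) are hypothetical and only serve to illustrate the refresh-based workflow this tutorial replaces.

```sql
-- Hypothetical batch workflow: the view is a snapshot of the query result.
CREATE MATERIALIZED VIEW daily_edits_by_user AS
SELECT "user", changed_at::date AS date, count(*) AS edits_this_day
FROM raw_changes
GROUP BY "user", changed_at::date;

-- Rows added to raw_changes afterward aren't visible in the view until it
-- is refreshed, typically on a fixed schedule.
REFRESH MATERIALIZED VIEW daily_edits_by_user;
```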
For this example, you'll use Estuary's public data collection of recent changes to Wikipedia, captured from the Wikimedia Foundation's event stream.
The raw dataset is quite large. It captures every change to the platform, about 30 per second, and includes various properties. Written to a Postgres table, it quickly grows to a size that's very expensive to query.
First, you'll scope the raw data down to a small fact table with a derivation.
You'll then materialize both the raw and transformed datasets to your Postgres instance and compare performance.
Loading the Wikipedia Demo
- Navigate to the Live Demo page and click on See the capture.
- After accepting the pop-up, Estuary will populate your Sources, Collections, and Destinations with the Wikipedia Demo tasks.
Check out the source data
- Go to the collections page of the Flow web app.
- Search for `demo/wikipedia/recentchange` and click on its name.
- On the Collection Details page, click the Spec tab.

The collection schema has many fields. Because Wikipedia sees a lot of edits, this would yield a large, unwieldy table in Postgres.

Tip: To save on performance, you can also perform this tutorial using the smaller `demo/wikipedia/recentchange-sampled` collection. Apart from the collection name, all other steps are the same.

Learn more about Flow collections and schemas.
Now you'll create the derivation. A derivation is a new collection that's defined by a transformation. First, you'll define the collection's schema. Then, you'll write the transformation to shape the data to that schema.
Add a derivation to transform data
You'll write your derivation using a cloud development environment integrated in the Flow web app.
- Go back to the collections page and click the New Transformation button.
- Set the source collection to the raw Wikipedia data. Search for and select `demo/wikipedia/recentchange`.
- Set the transformation language to SQL.
- Give the derivation a name. From the dropdown, choose the name of your catalog prefix and append a unique name, for example `yourprefix/wikipedia/user-fact-table`.
- Click Proceed to GitPod to create your development environment. Sign in with one of the available account types.
- On the New Workspace screen, keep the Context URL option selected and click Continue.
A GitPod development environment opens. A stubbed-out derivation with a SQL transformation has already been created for you. Next, you'll locate and open the source files.
- Each slash-delimited prefix of your derivation name has become a folder. Open the nested folders to locate a `flow.yaml` file.

  Following the example above, you'd open the folders called `yourprefix`, then `wikipedia`, to find the correct `flow.yaml` file. Its contents look like this:

  ```yaml
  collections:
    yourprefix/wikipedia/user-fact-table:
      schema:
        properties:
          your_key:
            type: string
        required:
          - your_key
        type: object
      key:
        - /your_key
      derive:
        using:
          sqlite:
            migrations:
              - user-fact-table.migration.0.sql
        transforms:
          - name: recentchange
            source: demo/wikipedia/recentchange
            shuffle: any
            lambda: user-fact-table.lambda.recentchange.sql
  ```

Your first order of business is to replace the placeholder schema and collection key. As we saw earlier, the source collection's schema and key caused every Wikipedia event to generate a new document. You'll fix that here.
- Replace the existing `schema` and `key` stanzas with the following:

  ```yaml
  schema:
    properties:
      edits_this_day:
        reduce:
          strategy: sum
        type: integer
      date:
        format: date
        type: string
      user:
        type: string
    reduce:
      strategy: merge
    required:
      - user
      - date
      - edits_this_day
    type: object
  key:
    - /user
    - /date
  ```
The new schema contains reduction annotations. These sum the changes made by a given user on a given date. The collection is now keyed on each unique combination of user ID and date. It has just three fields: the user, date, and the number of changes made by that user on that date.
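If it helps to think in SQL terms, the effect is loosely analogous to an incrementing upsert: each source document arrives carrying a count of 1, and documents that share the same user and date are merged with their counts summed. The snippet below is purely illustrative (the `user_edit_counts` table is made up); Flow performs this reduction internally rather than issuing SQL like this.

```sql
-- Loose analogy only: one incoming document for user 'WMrapids' on 2023-07-18.
-- The sum/merge reduction behaves like an upsert that adds the new count to
-- whatever is already stored for that (user, date) key.
INSERT INTO user_edit_counts ("user", date, edits_this_day)
VALUES ('WMrapids', '2023-07-18', 1)
ON CONFLICT ("user", date)
DO UPDATE SET edits_this_day = user_edit_counts.edits_this_day + EXCLUDED.edits_this_day;
```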
Next, you'll add the transformation.
- In the `transforms` stanza, give the transformation a new name to differentiate it from the name of the source collection. For example:

  ```yaml
  transforms:
    - name: dailychangesbyuser
  ```
- Update the shuffle key. Since you're working with a large dataset, this ensures that all documents for a given user are processed by the same task shard, which prevents Flow from creating multiple counts for a given user and date combination. Learn more about shuffles.

  ```yaml
  shuffle: { key: [ /user ] }
  ```
Now, the transform needs its SQL lambda function: the function that will shape the source data to fit the new schema. Flow has created another file to contain it.
- Open the file called `user-fact-table.lambda.recentchange.sql`.

- Replace its contents with:

  ```sql
  select $user, 1 as edits_this_day, date($meta$dt) as date where $user is not null;
  ```

  This creates the `edits_this_day` field we referenced earlier and starts the counter at 1. It converts the timestamp into a simplified date format. Finally, it filters out `null` users (which occasionally occur in the Wikipedia data stream and would violate your schema).
- All pieces of the derivation are in place. Double check your files against these samples.

  `flow.yaml`:

  ```yaml
  ---
  collections:
    yourprefix/wikipedia/user-fact-table:
      schema:
        properties:
          edits_this_day:
            reduce:
              strategy: sum
            type: integer
          date:
            format: date
            type: string
          user:
            type: string
        reduce:
          strategy: merge
        required:
          - user
          - date
          - edits_this_day
        type: object
      key:
        - /user
        - /date
      derive:
        using:
          sqlite:
            migrations:
              - user-fact-table.migration.0.sql
        transforms:
          - name: dailychangesbyuser
            source: demo/wikipedia/recentchange
            shuffle: { key: [ /user ] }
            lambda: user-fact-table.lambda.recentchange.sql
  ```

  `user-fact-table.lambda.recentchange.sql`:

  ```sql
  select $user, 1 as edits_this_day, date($meta$dt) as date where $user is not null;
  ```
- Run the derivation locally and preview its output:

  ```bash
  flowctl preview --source flow.yaml
  ```

  In your terminal, you'll see JSON documents that look like:

  ```json
  {"date":"2023-07-18","edits_this_day":3,"user":"WMrapids"}
  ```

  This looks right: it includes the correctly formatted date, the number of edits, and the username. You're ready to publish.

- Stop the local derivation with Ctrl-C.

- Publish the derivation:

  ```bash
  flowctl catalog publish --source flow.yaml
  ```

  The message `Publish successful` means you're all set.
Your transformation will continue to run against the raw dataset in real time, keeping your derived collection up to date as new changes arrive. You're free to close your GitPod.
Create the continuous materialized view
Now, you'll materialize your new fact table to Postgres. You'll also materialize the source dataset to compare performance.
- Go to the Destinations page in the Flow web app.
- Click New Materialization.
- Find the PostgreSQL tile and click Materialization.
- Add a unique name for the materialization, for example, `yourprefix/yourname-materialized-views-demo`.
- Fill out the Basic Config with:
  - A username and password for the Postgres instance.
  - Your database host and port.
  - The database name (if in doubt, use the default, `postgres`).

  See the connector documentation if you need help finding these properties.

- In the Source Collections browser, search for and add the collection `demo/wikipedia/recentchange` and name the corresponding Postgres table `wikipedia_raw`.
- Also search for and add the collection you just derived (for example, `yourprefix/wikipedia/user-fact-table`). Name the corresponding Postgres table `wikipedia_data_by_user`.
- Click Next to test the connection.
- Click Save and Publish.
Explore the results
In your Postgres client of choice, note the size of each table and how quickly they change. Try running some basic queries against both and compare performance. See the blog post for ideas.
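For instance, you might start with something like the sketch below. The fact table's columns follow the derivation schema (`user`, `date`, `edits_this_day`); the raw table's columns depend on which fields the connector materialized, so adjust names as needed.

```sql
-- Row counts give a quick sense of how much smaller the fact table is.
SELECT count(*) FROM wikipedia_raw;
SELECT count(*) FROM wikipedia_data_by_user;

-- Top editors for a given day, read directly from the pre-reduced fact table.
SELECT "user", edits_this_day
FROM wikipedia_data_by_user
WHERE date = '2023-07-18'
ORDER BY edits_this_day DESC
LIMIT 10;
```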
Once you're satisfied, and to prevent continual resource use, disable or delete your materialization from the Destinations page.