How Flow compares to other storage and data processing technologies
Flow offers integrated ingestion, storage, transformation, and materialization layers, so it can be compared to several different types of technologies depending on the use case.
For simplicity, we’ve broken these down into storage technologies and data processing technologies. The charts below summarize the similarities and differences within each group, and are followed by detailed comparisons to the most popular systems.
Because it stores data, Flow can be compared to data storage layers:
(1) Kafka and Pulsar offer cloud storage offload. This improves storage costs, but the offloaded files use custom formats, and reads must still proxy through brokers, incurring extra transfer costs and limiting scale. Effectively, data still lives in a "walled garden."
(2) Confluent offers automatic scaling of Kafka brokers, but this is not available natively in Kafka itself. Management and scaling of streaming applications are not available.
(3) While technically "unlimited," these systems incur storage and access costs with every query (as compared to cloud storage, which is far cheaper to store in and read from).
(4) These systems support views materialized from other tables, meaning that source data must be stored in the system and re-processed with each refresh of the view.
Since Flow offers transformations, it can also be compared to data processing tools:
(1) These systems offer separate stream and batch modes, which must be manually reconciled.
(2) Materialize offers the ability to pull historical data from S3, but doesn't support general-purpose batch data.
(3) OLAP technologies are routinely used to create and persist materialized views in other databases, but doing so is clunky and requires coordination using other tools such as Airflow, DBT, or scripting.
(4) Queries are constrained by their partitioned internal states and can't be dynamically scaled. ksqlDB imposes windowing constraints on queries.
(5) Google Cloud Dataflow requires that you write reducers yourself.
Flow is unique in the continuous processing space. It is similar to, and at the same time wholly unlike, a number of other systems and tools.
Google Cloud Dataflow
Flow’s most apt comparison is to Google Cloud Dataflow (built on Apache Beam), the system with which it shares the greatest conceptual overlap.
Like Beam, Flow’s primary primitive is a collection. You build a processing graph by relating multiple collections together through procedural transformations, or lambdas. As with Beam, Flow’s runtime performs automatic data shuffles and is designed to allow fully automatic scaling. Also like Beam, collections have associated schemas.
Unlike Beam, Flow doesn’t distinguish between batch and streaming contexts. Flow unifies these paradigms under a single collection concept, allowing users to seamlessly work with both data types.
Also, while Beam allows for optional user-defined combine operators, Flow’s runtime always applies combine operators. These are built using the declared semantics of the document’s schema, which makes it much more efficient and cost-effective to work with streaming data.
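To make the combine concept concrete, here is a small Python sketch (an illustration of the idea only, not Flow's actual API or schema-annotation syntax): documents sharing a key are reduced into one document per key, using per-field strategies that a schema might declare.

```python
# Hypothetical reduction strategies, as a document schema might declare them:
# "count" values are summed; "last_seen" keeps the latest timestamp.
STRATEGIES = {"count": "sum", "last_seen": "max"}

def combine(docs, key="user"):
    """Reduce documents sharing a key into a single document per key."""
    out = {}
    for doc in docs:
        if doc[key] not in out:
            out[doc[key]] = dict(doc)  # first document seen for this key
            continue
        acc = out[doc[key]]
        for field, strategy in STRATEGIES.items():
            if strategy == "sum":
                acc[field] += doc[field]
            elif strategy == "max":
                acc[field] = max(acc[field], doc[field])
    return list(out.values())
```

Combining early like this is what makes high-volume streams cheap to process: far fewer documents survive to be shuffled, stored, or materialized downstream.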
Finally, Flow allows stateful stream-to-stream joins without the “window-ization” semantics imposed by Beam. Notably, Flow’s modeling of state – via its per-key register concept – is substantially more powerful than Beam's “per-key-and-window” model. For example, registers can trivially model the cumulative lifetime value of a customer.
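The register concept can be sketched in a few lines of Python (again, an illustration of the concept rather than Flow's API): a register is durable state indexed by shuffle key, which each arriving document may read and update, so an unwindowed aggregate like lifetime value falls out naturally.

```python
registers = {}  # durable state, keyed by the shuffle key (here, customer ID)

def on_purchase(doc):
    """Fold a purchase event into the customer's register and emit the running total."""
    ltv = registers.get(doc["customer"], 0.0) + doc["amount"]
    registers[doc["customer"]] = ltv
    return {"customer": doc["customer"], "lifetime_value": ltv}
```

Because the register persists across all events for a key, there is no window to configure and no expiry to reason about: the "current lifetime value" is simply the register's contents.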
PipelineDB / ksqlDB / Materialize
ksqlDB and Materialize are new SQL databases that focus on streaming updates of materialized views. PipelineDB was a PostgreSQL extension that pioneered this space.
Flow is not – nor does it want to be – a database. It aims to enable all of your other databases to serve continuously materialized views. Flow materializations use the storage provided by the target database to persist the view’s aggregate output, and Flow focuses on mapping, combining, and reducing updates of that aggregate as source collections change.
While Flow tightly integrates with the SQL table model (via projections), Flow can also power document stores like Elastic and CosmosDB, which deal in Flow’s native JSON object model.
BigQuery / Snowflake / Presto
Flow is designed to integrate with Snowflake and BigQuery, both by adding Flow collections as external, Hive-partitioned tables within these systems and by materializing directly into them.
First, Flow is used to capture and “lake” data drawn from a pub/sub topic or file system, for which Flow produces an organized file layout of compressed JSON in cloud storage. Files are named to allow Hive predicate push-down (for example, “SELECT count(*) WHERE utc_date = ‘2020-11-12’ AND region = ‘US’”), enabling substantially faster queries.
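For instance (using a hypothetical bucket and collection name), a Hive-partitioned layout encodes partition fields directly into file paths, so a query engine can prune whole directories that cannot match a query's predicate:

```text
s3://example-bucket/my/collection/utc_date=2020-11-12/region=US/<fragment>.json.gz
s3://example-bucket/my/collection/utc_date=2020-11-12/region=EU/<fragment>.json.gz
s3://example-bucket/my/collection/utc_date=2020-11-13/region=US/<fragment>.json.gz
```

A query filtering on `utc_date = '2020-11-12'` and `region = 'US'` needs to read only the files under the first prefix above.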
For data that is read infrequently, this can be cheaper than directly ingesting data into Snowflake or BigQuery; you consume no storage or compute credits until you actually query data.
For frequently read data, Flow can materialize directly into these warehouses so that data is stored in their native, efficient-to-query formats.
Finally, Flow can help you avoid the expense of queries you frequently run against a data warehouse, by keeping an up-to-date view of their results wherever you want it.
dbt
dbt is a tool that enables data analysts and engineers to transform data in their warehouses more effectively.
In addition to – and perhaps more important than – its transform capability, dbt brought an entirely new workflow for working with data: one that prioritizes version control, testing, local development, documentation, composition, and re-use.
Like dbt, Flow uses a declarative model and tooling, but the similarities end there. dbt is a tool for defining transformations, which are executed within your analytics warehouse. Flow is a tool for delivering data to that warehouse, as well as continuous operational transforms that are applied everywhere else.
They can also make lots of sense to use together: Flow is ideally suited for “productionizing” insights first learned in the analytics warehouse.
Spark / Flink
Spark and Flink are generalized execution engines for batch and stream data processing. They’re well known – particularly Spark – and both are actually available “runners” within Apache Beam. Spark can be described as a batch engine with stream processing add-ons, while Flink could be described as a stream processing engine with batch add-ons.
Flow best inter-operates with Spark through its “lake” capability. Spark can view Flow data collections as Hive-partitioned tables, or just directly process bucket files.
For stream-to-stream joins, both Spark and Flink roughly share the execution model and constraints of Apache Beam. In particular, they impose complex “window-ization” requirements that preclude continuous materializations of generalized stream-to-stream joins (for example, the current lifetime value of a customer).
Kafka / Pulsar
Flow is built on Gazette, which is most similar to log-oriented pub/sub systems like Kafka or Pulsar. Flow also uses Gazette’s consumer framework, which has similarities to Kafka Streams. Both manage scale-out execution contexts for consumer tasks, offer durable local task stores, and provide exactly-once semantics, though there are key differences.
Unlike those systems, Flow and Gazette use regular files with no special formatting, such as compressed JSON, as the primary data representation. This powers Flow's ability to integrate with other analytic tools. During replays, Flow reads historical data directly from cloud storage, which is strictly faster and more scalable and reduces the load on brokers.
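Because fragments are plain gzip-compressed, newline-delimited JSON, any tool can read them straight from the bucket without a broker in the path. A minimal sketch in Python, with a local file standing in for a cloud storage object:

```python
import gzip
import json

def read_fragment(path):
    """Read a fragment file: newline-delimited JSON records, gzip-compressed."""
    with gzip.open(path, mode="rt") as f:
        return [json.loads(line) for line in f if line.strip()]
```

The same property is what lets analytic tools treat these files as ordinary lake data: there is no broker-specific format to decode first.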
Gazette’s implementation of durable task stores also enables Flow’s novel, zero-downtime task-splitting technique for turnkey scaling.