Who should use Flow?
Common pain points you might have, and how Flow addresses them.
Flow is designed to give backend engineers data integration superpowers. It seeks to be approachable for data analysts and allow other user cohorts to meaningfully contribute and participate.
If you answer "yes" to any of the following questions, Flow can help:
    Do you work with multiple databases and struggle to keep them in sync with one another?
    Do you issue repeated OLAP queries to your warehouse that are expensive to execute?
      Or do you need instant metrics for specific events like Black Friday?
    Do you operate separate batch and streaming systems, and grapple with reconciling them?
    Do you manage continuous processing workflows with tools like Spark, Flink, or Google Cloud Dataflow, and want a faster, easier-to-evolve alternative?
    Do you have a distributed data mesh, and are seeking a tool to help with orchestration?


The following characteristics set Flow apart from other data integration tools and address the pain points listed above.

Fully integrated pipelines

With Flow, you can build, test, and evolve pipelines that continuously capture, transform, and materialize data across all of your systems. With one tool, you can power workflows that have historically required you to first piece together services, then integrate and operate them in-house to meet your needs.
To achieve capabilities comparable to Flow's, you would need:
    A low-latency streaming system, such as AWS Kinesis
    Data lake build-out, such as Kinesis Firehose to S3
    Custom ETL application development, such as Spark, Flink, or AWS Lambda
    Supplemental data stores for intermediate transformation states
    ETL job management and execution, such as self-hosting or Google Cloud Dataflow
    Custom reconciliation of historical vs streaming datasets, including onerous "backfills" of new streaming applications from historical data
Flow's declarative GitOps workflow dramatically simplifies this complexity. It saves you time and money, catches mistakes before they hit production, and keeps your data fresh across all the places you use it.

Efficient architecture

Flow combines several architectural techniques to deliver high throughput, keep latency low, and minimize operating costs. These include:
    Using reductions to shrink the amount of data that must be ingested, stored, and processed, often dramatically
    Executing transformations predominantly in-memory
    Optimistic pipelining and vectorization of internal remote procedure calls (RPCs) and operations
    A cloud-native design that optimizes for public cloud pricing models
Flow also makes it easy to materialize small fact tables and roll-ups that you might otherwise query repeatedly from much larger source datasets in a warehouse like Snowflake. This can dramatically lower warehouse costs.
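To illustrate why reductions and roll-ups cut the volume of data a warehouse has to scan, here is a minimal Python sketch. It is not Flow's API: the `reduce_by_key` function and the `page` and `views` fields are hypothetical names chosen for the example, which assumes a simple sum reduction keyed on a single field.

```python
from collections import defaultdict


def reduce_by_key(events):
    """Combine events that share a key by summing their counts.

    Hypothetical helper for illustration only; not part of Flow.
    """
    totals = defaultdict(int)
    for event in events:
        totals[event["page"]] += event["views"]
    return dict(totals)


# Four raw events collapse into two roll-up rows.
events = [
    {"page": "/home", "views": 1},
    {"page": "/pricing", "views": 1},
    {"page": "/home", "views": 1},
    {"page": "/home", "views": 1},
]
rollup = reduce_by_key(events)
print(rollup)  # {'/home': 3, '/pricing': 1}
```

A dashboard that reads the roll-up touches one row per page rather than one row per event, which is the kind of saving described above.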

Powerful transformations

With Flow, you can build pipelines that join a current event with an event that happened days, weeks, even years in the past. Flow can model arbitrary stream-to-stream joins without the windowing constraints imposed by other systems, which limit how far back in time you can join.
Flow transforms data in durable micro-transactions, meaning that an outcome, once committed, won't be silently re-ordered or changed due to a crash or machine failure. This makes Flow uniquely suited for operational workflows, like assigning a dynamic amount of available inventory to a stream of requests – decisions that, once made, should not be forgotten. Engineers can also evolve transformations as business requirements change, enriching them with new datasets or behaviors without needing to re-compute from scratch.
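The idea of a join with no window can be sketched in a few lines of Python. This is a conceptual illustration only, under the assumption that join state is kept per key for as long as needed. The `UnwindowedJoin` class, its method names, and the event fields are all hypothetical, and the in-memory dictionary stands in for the durable state a real system would use.

```python
SECONDS_PER_DAY = 86_400


class UnwindowedJoin:
    """Joins purchases to signups, however long ago the signup occurred.

    Hypothetical sketch: state is never expired, so no window limits
    how far back in time a match can be found.
    """

    def __init__(self):
        self.signups = {}  # user -> signup event, retained indefinitely

    def on_signup(self, event):
        self.signups[event["user"]] = event

    def on_purchase(self, event):
        signup = self.signups.get(event["user"])
        if signup is None:
            return None  # no matching signup has been seen yet
        return {
            "user": event["user"],
            "days_since_signup": (event["ts"] - signup["ts"]) // SECONDS_PER_DAY,
        }


join = UnwindowedJoin()
join.on_signup({"user": "ada", "ts": 0})
# The purchase arrives 400 days later and still finds its signup.
joined = join.on_purchase({"user": "ada", "ts": 400 * SECONDS_PER_DAY})
print(joined)  # {'user': 'ada', 'days_since_signup': 400}
```

A windowed system with, say, a 7-day retention would have discarded the signup long before the purchase arrived, and the join would produce nothing.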

Data integrity

Flow supports strong schematization, durable transactions with exactly-once semantics, and easy end-to-end testing to ensure that your data is accurate and that changes don't break pipelines.
    JSON schemas are verified with every document read or written. If a document violates its schema, Flow pauses the pipeline, giving you a chance to fix the error.
    Schemas can encode constraints, such as requiring that a latitude value fall between -90 and +90 degrees, or that a field be a valid email address.
    Flow projects JSON schema into other flavors, like TypeScript types or SQL tables. Strong type checking catches bugs before they're applied to production.
    Flow's declarative tests verify the integrated, end-to-end behavior of processing pipelines.
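The latitude constraint mentioned above could be written as a JSON schema like the one below, shown here as a Python dictionary. The `violations` function is a toy checker written for this example that understands only the `required`, `type: number`, `minimum`, and `maximum` keywords. It is an assumption for illustration, not Flow's validator or a complete JSON Schema implementation.

```python
# A JSON schema constraining latitude to the range [-90, 90].
schema = {
    "type": "object",
    "properties": {
        "latitude": {"type": "number", "minimum": -90, "maximum": 90},
    },
    "required": ["latitude"],
}


def violations(doc, schema):
    """Return a list of human-readable schema violations for doc.

    Toy checker supporting a handful of keywords; illustration only.
    """
    errors = []
    for name in schema.get("required", []):
        if name not in doc:
            errors.append(f"{name}: required property is missing")
    for name, rules in schema.get("properties", {}).items():
        if name not in doc:
            continue
        value = doc[name]
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if rules.get("type") == "number" and not is_number:
            errors.append(f"{name}: expected a number")
            continue
        if is_number and "minimum" in rules and value < rules["minimum"]:
            errors.append(f"{name}: {value} is below minimum {rules['minimum']}")
        if is_number and "maximum" in rules and value > rules["maximum"]:
            errors.append(f"{name}: {value} is above maximum {rules['maximum']}")
    return errors


print(violations({"latitude": 45.0}, schema))  # []
print(violations({"latitude": 123}, schema))   # ['latitude: 123 is above maximum 90']
```

Checking every document against a schema like this is what lets an invalid value be caught at the point it enters the pipeline rather than after it has spread to downstream systems.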

Dynamic scaling

The Flow runtime scales from a single process for local development up to a large Kubernetes cluster for high-volume production deployments. Processing tasks are quickly reassigned upon any machine failure for high availability.
Each process can also be scaled independently, at any time and without downtime. This is unique to Flow. Comparable systems require that an arbitrary data partitioning be decided upfront, a crucial performance knob that's awkward and expensive to change. Instead, Flow can repeatedly split a running task into two new tasks, each half the size, without stopping it or impacting its downstream uses.
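One way to picture splitting a task into two halves is to treat each task as the owner of a range of key hashes. The sketch below assumes that model. The `Shard` class, its fields, and the use of CRC-32 as the hash are hypothetical choices for illustration and do not describe Flow's actual implementation.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    """A task responsible for key hashes in [key_begin, key_end], inclusive."""

    key_begin: int
    key_end: int

    def owns(self, key: str) -> bool:
        # zlib.crc32 returns an unsigned 32-bit integer in Python 3.
        return self.key_begin <= zlib.crc32(key.encode()) <= self.key_end

    def size(self) -> int:
        return self.key_end - self.key_begin + 1

    def split(self):
        """Return two shards that each cover half of this shard's range."""
        mid = (self.key_begin + self.key_end) // 2
        return Shard(self.key_begin, mid), Shard(mid + 1, self.key_end)


full = Shard(0, 2**32 - 1)
left, right = full.split()
print(left.size() == right.size() == full.size() // 2)  # True

# Every key is owned by exactly one of the two halves.
keys = ["user-1", "user-2", "user-3", "order-99"]
print(all(left.owns(k) != right.owns(k) for k in keys))  # True
```

Because each half can be split again in the same way, no partition count has to be fixed in advance under this model.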