Skip to main content

Introduction to MetricFlow Concepts

This is a short guide to MetricFlow concepts. This guide is for new users and assumes that you have MetricFlow set up and deployed. Check out our deployment docs if you have not completed this step.

The objective of using MetricFlow is to define and manage the logic used to build a company's metrics. MetricFlow is an opinionated set of abstractions that enables the data consumer to retrieve metric datasets efficiently from a data warehouse. There are a few key principles:

  • Flexible, but complete - Define logic in abstractions that are flexible enough to construct any metric on any data model
  • Don't Repeat Yourself (DRY)- Whenever possible enable the definition of metrics
  • Simple with progressive complexity - Rely on known concepts and structures in data modeling to ensure that MetricFlow is approachable. Create escape hatches that enable folks to pursue more advanced and unsupported features.
  • Performant and Efficient- Enable performance optimizations of centralized data engineering while enabling distributed definition and ownership of logic.

Overview of MetricFlow Framework Concepts

MetricFlow is Transform’s semantic metric layer that aims to capture metric logic by constructing appropriate queries to the many granularities and dimensions that may be useful for various business applications.

MetricFlow is also available to the public today as open source, and we happily encourage data practitioners and enthusiasts to contribute. The objective of using MetricFlow is to define and manage the logic used to build a company's metrics via abstractions defined in YAML, as summarized in the following sections.

The Transform Model

A “model” in the Transform product refers to the data landscape made available as a corpus of YAML configurations to build metrics. You can think of the model as a graph, or a set of graphs–with tables being the nodes and joins as the edges.

When asked to generate a metric, Transform’s SQL engine will infer the appropriate traversal paths between tables via the MetricFlow framework, which is defined in YAML files as data sources and metrics. Furthermore, once the appropriate data sources and metrics are properly defined, metrics can be consumed in downstream integrations as well as in the Transform App.

For further details, please see What is a Transform Model?.

Data Sources

Data sources are the starting points, or nodes, of data, and typically reflect the tables within your data warehouse. You may define as many data sources out of each data table as you like.

Similar to a data table, a data source has specific attributes, or metadata, associated with it in order to properly define relevant information (such as the table name and primary keys) for the graph to be traversed properly.

Further comprehensive reading is available (see the References section in the left menu), but the basic pieces of metadata involved in a data source are:

  • Identifiers: The join keys of your data source (think of these as the traversal paths, or edges between data sources)
  • Dimensions: The slices and dices you’d like your metric to be aggregated by
  • Measures: The aggregation functions resulting in a numerical value as desired, which can also become the building blocks of your metrics.

Metrics

Once data sources are defined, specified aggregations or functions can be defined as measures (examples include but are not limited to average, sum, and count distinct). Think of measures as the building blocks of your metrics -- a measure itself can be a metric, but a metric can consist of 1 or more measures (i.e. a ratio).

A key concept in MetricFlow includes metrics, which are functions that can take in various parameters (such as measures, constraints, or further mathematical functions) to define new quantiative indicators.

Furthermore, a metric is enriched in MetricFlow using dimensions. Without dimensions, a metric is simply a number for all time and lacks the necessary context to be useful.

Lastly, in order to be a single source of truth for business teams and organizations, MetricFlow must be able to take in any type of modeled data, construct a wide variety of metric types, and serve the metrics to the appropriate place. If MetricFlow cannot achieve these goals, you, as our user, would have to define your metrics elsewhere and we at Transform would not be accomplishing our product philosophy. Feature requests for new metric types, data modeling approaches, and APIs are always welcome!

Example Data Model

The example data schema image below shows a number of different types of data tables:

transactions is a production DB export that has been cleaned up and organized for analytical consumption

visits is a raw event log

stores is a cleaned up and fully normalized dimensional table from a daily production database export

products is a dimensional table that came from an external source such as a wholesale vendor of the good this store sells.

customers is a partially denormalized table in this case with a column derived from the transactions table through some upstream process

MetricFlow-SchemaExample

In order to make this more concrete, let's consider the metric revenue defined as the SQL expression select sum(price * quantity) as revenue from transactions

Metric Queries

In most business contexts, the metric revenue is often requested to be split amongst several slices, such as:

  • time, i.e. date_trunc(created_at, 'day')
  • product, i.e. product_category from the product table.

Data analysts typically would calculate the revenue metric aggregated as such:

select
date_trunc(transactions.created_at, 'day') as day
, products.category as product_category
, sum(transactions.price * transactions.quantity) as revenue
from
transactions
left join
products
on
transactions.product_id = products.product_id
group by 1, 2

It is also very likely that an analyst may be requested for further breakdowns of the metric, such as identifying how much of the revenue was driven by bulk purchases. Extra metrics can also be thrown in as part of the analysis, such as the count of active customers. This process often results in multiple queries written by different analysts, leading to a nightmare of data maintenance, governance, and integrity.

MetricFlow simplifies and streamlines metrics for you. Given the sample schema above and revenue as our metric of interest, we may define a data source in MetricFlow as such:

data_source:
name: transactions
description: A record for every transaction that takes place. Carts are considered multiple transactions for each SKU.
owners: support@transform.co
sql_table: schema.transactions

# --- IDENTIFIERS ---
identifiers:
- name: transaction_id
type: primary
- name: customer_id
type: foreign
- name: store_id
type: foreign
- name: product_id
type: foreign

# --- MEASURES ---
measures:
- name: revenue
description:
expr: price * quantity
agg: sum
- name: quantity
description: Quantity of products sold
expr: quantity
agg: sum
- name: active_customers
description: A count of distinct customers completing transactions
expr: customer_id
agg: count_distinct

# --- DIMENSIONS ---
dimensions:
- name: ds
type: time
expr: date_trunc('day', ts)
type_params:
is_primary: true
time_granularity: day
- name: is_bulk_transaction
type: categorical
expr: case when quantity > 10 then true else false end

Notice how the data source above easily allows you to define a measure named revenue as a transparent function of two underlying columns in the data table schema.transactions. There is also a time dimension ds, which allows for daily granularity and can be further aggregated to weekly, monthly, etc chunks. Furthermore, notice how a categorical dimension is_bulk_transaction is specified via a case statement to capture transactions of bulk purchases.

Similarly we could then add a products data source on top of the schema.products data table to incorporate even more dimensions to slice and dice our revenue metric:

data_source:
name: products
description: A record for every product available through our retail stores.
owners: support@transform.co
sql_table: schema.products

# --- IDENTIFIERS ---
identifiers:
- name: product_id
type: primary

# --- DIMENSIONS ---
dimensions:
- name: category
type: categorical
- name: brand
type: categorical
- name: is_perishable
type: categorical
expr: |
category in ("vegetables", "fruits", "dairy", "deli")

Notice the identifiers present in our data sources products and transactions: MetricFlow does the heavy-lifting for you by traversing the appropriate join keys to identify the available dimensions to slice and dice your revenue metric.

Suppose an even more complicated metric is requested, such as the daily revenue of perishable goods per active customer. Without MetricFlow, the original SQL by a data analyst would perhaps look something like this:

select
date_trunc(transactions.created_at, 'day') as day
, products.category as product_category
, sum(transactions.price * transactions.quantity) as revenue
, count(distinct customer_id) as active_customers
, sum(transactions.price * transactions.quantity)/count(distinct customer_id) as perishable_revenues_per_active_customer
from
transactions
left join
products
on
transactions.product_id = products.product_id
where
products.category in ("vegetables", "fruits", "dairy", "deli")
group by 1, 2

MetricFlow simplifies the SQL process via metric YAML configurations as seen below. Furthermore, these YAML configurations can be committed to Github, ensuring transparency and approval amongst data and business teams as the single source of truth.

metric:
name: perishables_revenue_per_active_customer
description: Revenue from perishable goods (vegetables, fruits, dairy, deli) for each active store.
type: ratio
type_params:
numerator: revenue
denominator: active_customers
constraints: |
product__category in ("vegetables", "fruits", "dairy", "deli")

Accessing Metrics

MetricFlow exposes a GraphQL API with a number of clients built on top of that including:

  • Command Line Interface (CLI) to pull data locally and improve the dev workflow
  • SQL over JDBC to integrate with BI Tools and other SQL Clients
  • Python Library to pull metrics into Jupyter or other analytical interfaces
  • React Components to build embedded analytics
  • Airflow Operators to schedule API requests and pre-construction of metrics
  • GraphQL interface underlies all of these is also exposed for the end-user to build their own interfaces

Each of the above APIs follows a common format of pulling metrics in the format of "metrics by dimensions." As a simple example to pull revenue by day in the CLI, the user would write the following request

With MetricFlow:

mf query --metrics revenue --dimensions ds

With Transform:

mql query --metrics revenue --dimensions ds

More can be found in the various APIs docs on more complicated expressions.

Further Reading

For further reading on the concepts of MetricFlow please see documentation on:

  • All details regarding configuration files for data sources and metrics can be found in the section Reference
  • Priming: A guide to how MetricFlow takes a SQL Query data source and builds a candidate data source similar to a SQL Table data source.
  • Query Construction: A guide to how MetricFlow takes various candidate Data Sources and resolves an API request into a SQL Query.
  • Caching: A guide into the various caching mechanisms that allow MetricFlow to resolve queries more efficiently.

FAQ

  • Do my data sets need to be normalized?
    • Not at all! While a cleaned and well-modeled data set can be extraordinarily powerful and is the ideal input, you can use any dataset from raw to fully denormalized datasets.
    • It's recommended that you apply data consistency and quality transformations such as filtering bad data, normalizing common objects, and data modeling of keys and tables in upstream applications. Transform is most efficient at doing data denormalization, rather than normalization
    • If you have not invested in data consistency, that is okay. MetricFlow can take SQL queries or expressions to define consistent datasets.
  • Why is normalized data the ideal input?
    • MetricFlow is built to do denormalization efficiently. There are better tools to take raw datasets and accomplish the various tasks required to build data consistency and organized data models. On the other end, by putting in denormalized data you are potentially creating redundancy which is technically challenging to manage and you are reducing the potential granularity that Transform can use to aggregate metrics.
  • Why not just make metrics the same as measures?
    • One principle of our MetricFlow is to reduce the duplication of logic sometimes referred to as Don't Repeat Yourself(DRY).
    • Many metrics are constructed from reused measures and in some cases constructed from measures from different data sources. This allows for metrics to be built breadth-first (metrics that can stand alone) instead of depth-first (where you have multiple metrics acting as functions of each other).
    • Additionally, not all metrics are constructed off of measures. As an example, a conversion metric is likely defined as the presence or absence of an event record after some other event record.
  • How does MetricFlow handle joins?
    • MetricFlow builds joins based on the types of keys and parameters that are passed to identifiers. To better understand how joins are constructed see our documentations on join types.
    • Rather than capturing arbitrary join logic, Transform captures the types of each identifier and then helps the user to navigate to appropriate joins. This allows us to avoid the construction of fan out and chasm joins as well as generate legible SQL.
  • Are identifiers and join keys the same thing?
    • If it helps you to think of identifiers as join keys, that is very reasonable. Identifiers in MetricFlow have applications beyond joining two tables, such as acting as a dimension.
  • Can a table without a primary or unique identifier have dimensions?
    • Yes, but because a dimension is considered an attribute of the primary or unique identifier of the table, they are only useable by the metrics that are defined in that table. They cannot be joined to metrics from other tables. This is common in event logs.