What are materializations in MetricFlow?
Materializations enable you to build denormalized tables in a data warehouse. You can define a set of metrics and dimensions that are commonly used together, which could then be scheduled and written to the data warehouse on a periodic basis.
What are materializations in Transform?
In Transform, Materializations are pre-computed and used to "warm-up" the cache for the most common metrics and dimensions. Materialization Tables are never evicted from the cache, and they can be queried directly as data marts in downstream tools from the data warehouse.
The primary benefit of materializations is that you don't need to manually build data marts and those tables are then used to make incremental aggregations more performant through the caching mechanism.
- Configuration & Definition: The Materialization definition occurs inside MetricFlow, which is similar to how you currently define metrics.
- Calling a Materialization: You call a Materialization via
mf materialize(MetricFlow) OR
- Writing the Materialized table: Materialize will write data to your data warehouse or in a specified location. You can optionally write it to a fast cache.
- Scheduling a Materialization: To get the most value out of materializations, we recommend you schedule the materialization on a given cadence to ensure your data is ready all the time.
- The metrics you define in the materialization configuration must already be defined in Transform.
- There must be a primary time dimension that is consistently named across your data sources. This time dimension will be used to set the date parameters of the materialization.
Defining a Materialization
Here is an example of a materialization definition that reflects data from a booking service company.
name: user_bookings_summary # name your materialization. this will be the table name written in your data warehouse.
description: # add an optional description.
metrics: # list all the metrics you want to materialize in your materialization. all metrics must be defined in your metrics.yaml.
dimensions: # list all corresponding dimensions you want to include for these metrics. these must all be defined in your data sources.
# Optionally, specify the name of the schema and the table for the materialization.
# MQL will try to create / drop this table as needed. It's recommended that the
# permissions of the user associated with the MQL server be configured to have
# limited permissions.
# Optional destinations:
# Materializations always write to the data warehouse.
# Optionally provide additional destinations for the materialization and rollup
- location: DW
- location: FAST_CACHE
# format: WIDE # WIDE is the only supported option currently
# dimension_sets: # rollups allow you to create separate tables that group by a subset of the dimensions in your materialization
# [metric_time] # add a dimension you want to a rollup by. this should be a subset of the dimensions you list in your materialization. These rollups will be stored in separate tables than the core materializations
Materializations have a few inputs
- Name: The name of the materialization
- Owners: The owners of the materialization
- Description: An optional description
- Metrics: The important metrics that you want to be pre-computed in your materialization. These must already be defined in Transform to be referenced by the materialization.
- Dimensions: The corresponding dimensions that are important to you and that associate with your metrics.
- Destinations: Data is constructed in the warehouse and written to a table as
CREATE TABLE <TFD_MQL_MATERIALIZATIONS_SCHEMA>.<NAME> AS .... You can optionally write to more destinations, which includes destinations of your rollup tables. Options for destinations are
- Rollups: Rollups are an additional argument to each destination which allow you to create separate tables for a subset of the dimensions for all the metrics listed in the materialization.
- Where: Optional constraint on the materialization.
Calling a Materialization
Once you define a materialization, you can invoke it by using MetricFlow. While not required, it is best practice to pass in
end-time arguments. If you do not, the entire table will be materialized. Depending on the size of your datasets, this materialization could be expensive to build. The materialization command is idempotent and will incrementally add any gaps in the time range specified.
mf materialize --materialization-name my_materialization --start-time <date> --end-time <date>
mql materialize --materialization-name my_materialization --start-time <date> --end-time <date>
We recommend you use an existing scheduling tool to schedule bash commands from the CLI or python interface. You can also use the
materialize command to run a manual backfill, or anytime you want to re-run a materialization, even if you already have the command scheduled.
--materialization-name TEXT Name of materialization to materialize
--start-time TEXT iso8601 timestamp to materialize from
--end-time TEXT iso8601 timestamp to materialize to
-t, --timeout INTEGER Sets the timeout to wait for query completion. Not applicable in --detach mode.
--detach BOOLEAN Returns the created query ID to allow for
--help Show this message and exit.
When materialize is run, a table is written to your data warehouse.
- The table name is the same name as the materialization name.
- Unless there is a warehouse destination specified, the materialization will be written to the schema specified in the project configs with the name of the materialization.
- To write to an arbitrary schema, pass in the schema and table name as a location.
- There is a write-lock when writing data. You can only materialize a single time range at a time for each materialization.
By default, materializations are stored in the cache schema, but you can also write them to an arbitrary schema.
We recommend you specify the environment variable which indicates which schema we should write the data in, and it should be a schema that Transform is aware of. You can use the
TFD_MQL_MATERIALIZATIONS_SCHEMA environment variable to set this. For example, if you're launching the MQL server container with the Docker run command and the
-env-file flag, the environment variable file should contain a line similar to:
TFD_MQL_MATERIALIZATIONS_SCHEMA = tfd_materializations;
You can write to a different schema, but if you don't write to a Transform-managed schema, Transform does not have a way to manage or update this table. It is preferred that you write to Transform managed schema. In order to materialize to a different schema, the schema must be provided as the
output_table parameter in the materialize call.
Scheduling a Materialization
If you are using Airflow, you can use Transform's Airflow operator to schedule materializations. If you do not use Airflow, we recommend you use your existing scheduling tool.
If your underlying metric configuration changes and you re-materialize the dataset, we will invalidate any previous information stored that associates with the old configuration.
When a configuration change occurs, all materialized data created from the old definition will be ignored, until you manually run
materialize or the scheduled jobs kick in. When the materialize command is run and detects out-of-date data in the table, it will drop those rows or the entire table as necessary.
If your underlying data changes (e.g., updates to the data warehouse), Transform will not be aware of these changes, so you should re-materialize your data for the affected time periods by running the
You can optionally create additional rollup tables with data in your materialization using the
The value of rollups is for creating and quickly accessing a ready dataset from the materialization that contains a subset of the dimensions defined in the materialization config. For example, if you have two dimensions in your materialization, country, and date, but you only want to group by date in a certain table, you can store that combination in a rollup table. Rollups stores data in tables that are prefixed with the same name as your materialization, with an additional hash or identifier to uniquely identify the table. The rollup tables will be used to quickly respond to queries that are an exact match of the dimensions contained in the roll-up.
Backfills should be done manually by running
materialize with the backfill dates.
Drop a Materialization
If you need to drop a materialization, you can use the
mf drop-materialization --materialization-name OR
mql drop-materialization --materialization-name command. It is best practice to drop materializations you are no longer using. Your metrics will try and pull from the materialization first, so if a materialization is outdated it could lead to incorrect metric values.