Data Source Configuration Best Practices
Introduction
Transform takes the physical data layer
of your data warehouse (rows and columns) and creates a semantic data layer
on top of it. MetricFlow creates this semantic layer based on your config files that describe how the data is stored in the data warehouse. This configuration is accomplished in YAML data source
configuration files. These are a list of common questions and best practices when attempting to optimize the performance of MetricFlow.
When should I use a SQL Table vs SQL Query?
SQL Table:
The table name in your data warehouse.
data_source:
name: data_source_name
description: My description
owners:
- owner_email1@company.com
sql_table: schema.table
SQL Query:
A query that runs against your data warehouse.
data_source:
name: data_source_name
description: My description
owners:
- owner_email1@company.com
sql_query: Select * from schema.table
Under the hood:
SQL Table is always preferred to a SQL Query because it pushes more logic construction into MetricFlow and enables more optimization. A SQL Query is materialized in a feature called "priming" to a table and then that table is treated as a SQL Table in MetricFlow.
Best Practices:
- Whenever possible use
sql_table
. This will be substantially more performant. - Whenever possible avoid joins in a
sql_query
. Instead, create a new data source. Transform's semantic layer will automate the joins for you! - If a
sql_query
is used, consider the most efficient mutability setting. By default, the primed table will be considered invalid on an hourly basis which makes an efficient mutability setting even more important.
What is an appropriate mutation setting for my data_source
?
Under the hood:
The mutability setting determines when the cache considers a data set invalid or when it needs to look at the underlying dataset to pull fresh data.
Full Mutation
By default, MetricFlow will assume all data constructed off of a full_mutation is invalid every hour. If you know when datasets are typically updated, say by 12 AM UTC each day, you should set an update_cron
for 1 PM UTC.
mutability:
type: full_mutation
type_params:
# 6:00 Pacific Time or 9:00 Eastern Time is 13:00 UTC
update_cron: 0 13 * * *
Immutable
If you decide to leverage the invalidate
command with Airflow, set your mutability to immutable
to prevent the cache from invalidating twice.
mutability:
type: immutable
Best Practices
- If you use
full_mutation
specify anupdate_cron
time to schedule the priming. If you do not, a priming query will be run every hour which can be expensive and may impact the experience your end users have consuming data through the APIs.
When do I use denormalized vs normalized tables?
Best Practices
- Either works! But, you will likely get more from MetricFlow by using
normalized
tables asdata sources
. A normalized table will maximize your ability to produce metrics to a variety of granularities consistently. - A denormalized table is already aggregated and therefore cannot produce metrics to nearly as many granularities.
- On the other hand, a raw table may not be consistent with other data sets in your warehouse.
- Transform's semantic layer is capable of most denormalization tasks (aggregation, filtering, joining) and we're constantly seeking to fill any gaps.
Example
For an Example Data Source and Metrics written against a data model using these best practices, see the Introduction to MetricFlow