Data Sources

Overview

Transform's framework is a metric semantic layer defined via abstractions in YAML. Defining your data begins with the data source. You can think of data sources as nodes in a graph, connected by join keys as edges.

A data source is usually a table within your data warehouse. For each data source, create a configuration YAML file in your Transform repository. You may create as many data sources from a single table as you like, provided you give each data source a unique name.

See the example file below for a full config as well as a more detailed description of each field.

Example:

data_source: # This section includes metadata on the data source and the source definition.
  name: transactions # Define the name of the source. This name can be updated in the future.
  # Provide a detailed description of the data source here and include any important details.
  # Other configuration contributors will primarily use this description.
  # It is only surfaced in the UI under the lineage view on the metric page.
  description: |
    This table captures every transaction starting July
    02, 2014. Each row represents one transaction. There
    will be a new row for any cancellations or alterations.
    There is a transaction, order, and user id for
    every transaction. There is only one transaction id per
    transaction, but there can be many rows per order id and
    user id. The `ds` or date is reflected in UTC.
  # List the emails of the owners of this data source.
  # This list is currently not used for notifications, but it will be in the future.
  owners:
    - support@transformdata.io
  sql_table: demo_schema.transactions # The table from which the data source is constructed. You can also pass in the database name first.
  # sql_query: You can also pass in a SQL query here. This allows you to aggregate or filter the data before passing it into Transform.

  identifiers:
    # Define the primary, foreign, or unique key columns in your source that can be used to join to other data sources.
    # Each join key should have a type of either primary, foreign, or unique.
    # Additionally, keys can be referenced directly by column name from the data source, or they can be altered using an expression.
    - name: id_transaction
      type: primary
    - name: id_order
      type: foreign
    - name: id_user
      type: foreign
      expr: SUBSTRING(id_order FROM 2)

  measures:
    # Define the fields from your data source to be aggregated as inputs to metrics (e.g. in metrics.yaml).
    # Each measure has an aggregation and optional description. A measure can simply reference a column or may be calculated using a SQL expression.
    # Measures have a default expression of their name. The available aggregations are sum, max, min, count_distinct, and sum_boolean.
    - name: transaction_amount_usd
      description: The total USD value of the transaction.
      agg: sum
    - name: transactions
      description: The total number of transactions.
      expr: "1"
      agg: sum
    - name: quick_buy_amount_usd
      description: The total USD value of the transactions that were purchased using the "quick buy" button.
      expr: CASE WHEN transaction_type_name = 'quick buy' THEN transaction_amount_usd ELSE 0 END
      agg: sum
    - name: quick_buy_transactions
      description: The total transactions bought as quick buy.
      expr: quick_buy_flag
      agg: sum_boolean

  dimensions:
    # Define the dimensions from your source. Dimensions are qualitative values such as names, dates, or geographical data.
    # Dimensions provide context to measures and are associated with metrics created from those measures to provide "metric by dimension" data slicing.
    # Dimensions can either directly reference a column or may be calculated using a SQL expression.
    - name: ds
      type: time
      type_params:
        is_primary: true # true or false
        time_granularity: day # day, week, month, quarter, or year
    - name: quick_buy_transaction
      type: categorical
      expr: |
        CASE
          WHEN transaction_type_name = 'quick_buy' THEN 'quick_buy'
          ELSE 'not_quick_buy'
        END

  mutability:
    # Define the appropriate mutability for this data source. Mutability refers to how the data underlying this configuration (a sql table or the results of a sql query) changes.
    type: full_mutation
    type_params:
      # 6:00 Pacific Time or 9:00 Eastern Time is 13:00 UTC
      update_cron: 0 13 * * *

Data Source

This section includes metadata on the data source and the source definition.

Name (required)

Define the name of the source. The Transform model refers to this name when identifying a data source. You may update this name at any time, but every data source must have a unique name, as the MetricFlow framework uses names to differentiate data sources from one another.
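
Example, using the name from the config above:

name: transactions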

Description (optional)

Provide a detailed description of the data source here and include any important details. Other configuration contributors will primarily use this description. It is only surfaced in the UI under the lineage view on the metric page. Note the use of the pipe operator (|), which allows for multiple lines in the description.

Example:

description: |
  This table captures every transaction starting July
  02, 2014. Each row represents one transaction. There
  will be a new row for any cancellations or alterations.
  There is a transaction, order, and user id for
  every transaction. There is only one transaction id per
  transaction, but there can be many rows per order id and
  user id. The `ds` or date is reflected in UTC.

Owners (optional)

List the emails of the owners of this data source. This list is currently not used for notifications, but it will be in the future.

Example:

owners:
  - support@transformdata.io

sql_table or sql_query (required - Choose One)

There are two types of data sources: sql_table or sql_query.

sql_table: Choose this option to indicate the schema and table name of the data table in your data warehouse.

sql_table: demo_schema.transactions

If you want to reference the database in data warehouses such as Snowflake, you can do so directly in the sql_table or sql_query (example below):

sql_table: demo_db.demo_schema.transactions

sql_query: If using a full query, use your local data warehouse query language.

tip

A table is preferred over a SQL query, as it allows Transform to perform more efficient queries against source tables. Where possible, use a table and move constraint logic into the metrics.
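
For example, rather than pre-filtering rows with a sql_query, you can point the data source at the full table and express the constraint in a measure, as the quick_buy_amount_usd measure from the example config does:

sql_table: demo_schema.transactions
measures:
  - name: quick_buy_amount_usd
    expr: CASE WHEN transaction_type_name = 'quick buy' THEN transaction_amount_usd ELSE 0 END
    agg: sum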

Below, we have an example of a data source that is utilizing a sql_query.

sql_query: |
  SELECT
    t.id AS id_transaction
    , t.id_order
    , t.id_user
    , t.transaction_amount
    , t.transaction_amount_usd
    , tt.transaction_type_name
    , t.ds
  FROM
    demo_schema.transactions t
  JOIN
    demo_schema.transaction_type tt
  ON
    tt.id = t.id_transaction_type
tip

If you have a fairly long query that extends over multiple lines, use the pipe (|) to indicate a multi-line value, as seen in the example above. YAML is also agnostic to the order of keys; you may arrange sections in any order.

dbt_model

If you use dbt for data transformations, you can use your dbt models as data sources. Follow the integration setup, and you'll be able to include your dbt model in the dbt_model parameter.

Note that you must still specify a value in the sql_table or sql_query parameter.

Example:

# Refers to the dbt model output in the form of "schema.table", where the target schema is specified in the dbt profiles.yml file.
sql_table: demo_dbt.transactions
# The format consists of target_schema.model_name, where target_schema is specified in the dbt profiles.yml file, and model_name refers to the dbt model (usually the filename itself).
# In most cases, the sql_table and dbt_model parameters will have the exact same text inputs, but the identifier of the model may be overridden using an alias model configuration. If an alias is present for the model, input target_schema.alias_model_name for the dbt_model parameter.
dbt_model: demo_dbt.transactions

Identifiers (required)

Within the identifiers section, define the primary, foreign, or unique key columns in your data source via the parameters name and type. Each join key must have a type of either primary, foreign, or unique. For further details, see the Identifiers section.

Identifier Types:

  • Primary: A primary key has exactly one record for each value of the key, and it covers every record in the table.

  • Unique: A unique key has at most one record for each value of the key, but it may cover only a subset of the records in the table. Unique keys can also contain nulls.

  • Foreign: A foreign key can appear zero, one, or many times in the table. Foreign keys can also contain nulls.

Sample configuration below:

  • Below, we have a data source with three identifier keys: transaction (primary), order (foreign), and user (foreign).
  • In the name parameter, you may reference the desired join key via the actual column name from the originating data table. You can also use the name parameter as an alias for renaming purposes and use the expr parameter to refer to either the original column name or a SQL expression of the column.
identifiers:
  - name: transaction
    type: primary
  - name: order
    type: foreign
    expr: id_order
  - name: user
    type: foreign
    expr: SUBSTRING(id_order FROM 2)

Note that the MetricFlow framework refers to identifiers (join keys) in a data source via the name parameter. Within a data source, identifier names must be unique, just as column names must be unique within a data table. Across data sources, identifier names do not have to be unique. For more details on how MetricFlow traverses the proper join paths via identifier names, see the Join Logic section.

MetricFlow also supports combining keys that cannot stand alone into a composite key (a common example would be event logs, where a truly unique identifier would be a combination of timestamp, event type keys, and machine IDs). For further details on composite keys, see the Identifiers section.
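
As an illustrative sketch only (see the Identifiers section for the authoritative syntax), a composite key nests its component keys under a single identifier. The id_event, event_ts, event_type, and id_machine names below are hypothetical:

identifiers:
  - name: id_event
    type: primary
    identifiers: # hypothetical component keys that together identify one event
      - name: event_ts
      - name: event_type
      - name: id_machine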

Dimensions

Dimensions refer to the various data cuts (slices and dices) you would like to see, or group by, for a metric. For further reading, see the Dimensions section.

When writing a query to calculate a metric, one often has to think through the various cuts and slices the metric should be grouped by (i.e. region, country, user role, department, etc.). This process is often time-consuming and prone to multiple back-and-forth sessions. MetricFlow reduces this work by traversing the appropriate join paths and generating the available dimensions for your metric.

Parameters to specify for dimensions include:

  • the name parameter, which can directly reference a column within the data source; alternatively, the dimension may be calculated using a SQL expression via the expr parameter.
  • the type parameter, which indicates the type of dimension. Currently, Transform supports categorical and time dimensions. Categorical dimensions are qualitative values, such as names or geographical data. Time dimensions are dates of varying granularity.

Sample configuration: Below, we have a dimension named ds, specified via the type parameter as a time dimension. We also have a second, categorical dimension named quick_buy_transaction, which derives from a SQL expression in the expr parameter.

dimensions:
  - name: ds
    type: time
    type_params:
      is_primary: True
      time_granularity: day
  - name: quick_buy_transaction
    type: categorical
    expr: |
      CASE
        WHEN transaction_type_name = 'quick_buy'
          THEN 1
        ELSE 0
      END

Similar to identifiers, dimensions in the MetricFlow framework are identified via the name parameter. Therefore, dimension names within a data source must be unique, but dimension names across data sources are not required to be unique, as MetricFlow will figure out the appropriate dimensions via identifiers (see the Join Logic section).

Note for Time dimensions: For data sources with a measure involved, a primary time dimension is required (notice the is_primary: True parameter).

  • Set the is_primary parameter to True to indicate the time dimension that is to be the primary, or preferred, time dimension for a measure (or metric).
    • Example: Suppose you have a list of clients in a table with two date columns: one for the date the client was created in the system, and the other for the client contract closing date. In MetricFlow, you may specify both of these time dimensions in the data source and indicate which date column is the preferred time dimension via is_primary: True (see the sketch after this list).
  • The primary time dimension is used for graphing the x-axis within the Transform UI. For materializations, the primary time dimension is used as the common time dimension for all metrics.
  • Transform currently supports the following granularities: day, week, month, quarter, and year.
  • For more discussion on time dimensions, see the Dimensions section.
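
To make the client example concrete, here is a minimal sketch; the created_at and contract_closed_at column names are hypothetical:

dimensions:
  - name: created_at
    type: time
    type_params:
      is_primary: True # preferred time dimension; used for the x-axis and materializations
      time_granularity: day
  - name: contract_closed_at
    type: time
    type_params:
      is_primary: False
      time_granularity: day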

Measures

Measures are aggregations that can serve directly as your final metric or as the building blocks for a metric.

Parameters associated with measures include:

  • name: Required. Provide a name for the measure. Measure names must be unique across all data sources in your Transform model.
  • description: Optional. Provide a description for the measure.
  • agg: The currently available aggregations in Transform are sum, max, min, count_distinct, and sum_boolean.
  • expr: You may directly reference an underlying column name from the table itself, or provide a SQL expression to calculate or derive a new column.

Example:

measures:
  - name: transaction_amount_usd
    description: The total USD value of the transaction.
    agg: sum
  - name: transactions
    description: The total number of transactions.
    expr: "1"
    agg: sum
  - name: quick_buy_amount_usd
    description: The total USD value of the transactions that were purchased using the "quick buy" button.
    expr: CASE WHEN transaction_type_name = 'quick buy' THEN transaction_amount_usd ELSE 0 END
    agg: sum
  - name: quick_buy_transactions
    description: The total transactions bought as quick buy.
    expr: quick_buy_flag
    agg: sum_boolean

Mutability Configuration

caution

An incorrect mutability setting will lead to inaccurate metric values. To avoid this, be sure to understand the various options and the update process for the underlying dataset.

Mutability refers to how the underlying data in the configuration (a sql table or the results of a sql query) changes. For a more detailed discussion, please see the Mutability section.

A quick overview of the available options for mutability:

immutable: the data source rarely changes and MetricFlow only runs the query once.

  • Example: A data source referring to state abbreviations and names will rarely change, thus suiting an immutable setting.
  • This is the setting to use if leveraging the mql invalidate-caches command with Airflow.
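
A minimal sketch of an immutable configuration, assuming no type_params are needed when the data never requires a scheduled refresh:

mutability:
  type: immutable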

full_mutation: the data changes frequently and needs to be updated on a periodic basis. You may specify a cron schedule for when the underlying cache should be refreshed.
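
This is the full_mutation block from the example config above; the cron expression schedules a daily cache refresh at 13:00 UTC:

mutability:
  type: full_mutation
  type_params:
    update_cron: 0 13 * * * # minute hour day-of-month month day-of-week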

For more information, please visit Mutability Settings.