The default solver

caution

This feature is still in Beta 🧪. As such, you should not expect it to be 100% stable or free of bugs. Any public CLI or Python interfaces may change without prior notice.

If you find any bugs or feel like something is not behaving as it should, feel free to open an issue on the MetricFlow GitHub repo.

As mentioned in the overview, after all rules are done processing, we still need some policy to coerce rule signals into a final type for each column. This is the job of the InferenceSolver.

The default solver provided by MetricFlow is the WeightedTypeTreeInferenceSolver. As mentioned, you shouldn't need to implement your own solver unless you have a really specific case at hand. That said, understanding how the default solver works should make it easier to write well-functioning rule sets and to debug cases where there is confusion.

The type tree

You might have noticed InferenceSignal has a type_node parameter that takes in an InferenceTypeNode. The "node" terminology derives from the fact that data source inference organizes column types in a tree hierarchy. Whenever a rule says some column might be some type, it actually refers to that type as a node in the type tree.

The type tree is constructed with two very basic rules in mind:

  • Nodes belonging to separate branches in the tree are always conflicting. In other words, a column cannot be both of type A and B if A and B are not on the same branch.
  • Child nodes are always specializations of the parent node's type.

Here's what the type tree actually looks like:

Data source inference type tree

Some examples of the type tree rules in action:

  • A column cannot be both a dimension and an identifier because they are on separate branches in the tree (they are siblings).
  • Similarly, a column cannot be both a foreign identifier and a measure at the same time.
  • primary identifier is a child of unique identifier because it specializes the parent type, i.e., a primary identifier is always also a unique identifier. By transitivity, a primary identifier is always also an identifier (duh).

You can already start to see how the type tree can be used to solve a column's type and to check for conflicts or confusion in a list of signals.
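
To make the two rules concrete, here is a minimal sketch of the tree as a parent map. The map is an assumption reconstructed from the nodes named on this page (the actual tree may contain more nodes, and names such as `UNKNOWN` for the root and `MEASURE.UNKNOWN` are guesses). `PARENT`, `ancestors`, `is_specialization` and `conflicts` are hypothetical helpers, not part of MetricFlow.

```python
from typing import Dict, List, Optional

# Hypothetical parent map: each node points at its parent, the root at None.
PARENT: Dict[str, Optional[str]] = {
    "UNKNOWN": None,
    "ID.UNKNOWN": "UNKNOWN",
    "ID.UNIQUE": "ID.UNKNOWN",
    "ID.PRIMARY": "ID.UNIQUE",
    "ID.FOREIGN": "ID.UNKNOWN",
    "DIMENSION.UNKNOWN": "UNKNOWN",
    "DIMENSION.TIME": "DIMENSION.UNKNOWN",
    "DIMENSION.CATEGORICAL": "DIMENSION.UNKNOWN",
    "MEASURE.UNKNOWN": "UNKNOWN",
}


def ancestors(node: str) -> List[str]:
    """Return the node itself followed by all of its ancestors up to the root."""
    chain: List[str] = []
    current: Optional[str] = node
    while current is not None:
        chain.append(current)
        current = PARENT[current]
    return chain


def is_specialization(child: str, parent: str) -> bool:
    """Rule 2: a node specializes every node on its path to the root."""
    return parent in ancestors(child)


def conflicts(a: str, b: str) -> bool:
    """Rule 1: two nodes conflict unless one lies on the other's branch."""
    return not (is_specialization(a, b) or is_specialization(b, a))
```

Under these assumptions, `conflicts("ID.FOREIGN", "MEASURE.UNKNOWN")` is true, while `conflicts("ID.PRIMARY", "ID.UNIQUE")` is false because the two nodes sit on the same branch.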

Traversing the type tree

The WeightedTypeTreeInferenceSolver makes use of these properties to traverse the tree and find the most appropriate type for a column given a list of signals for it. It performs the following steps for each column (ignoring complementary signals for now):

  1. Using a weight attribution function, assign a numerical weight to each type node according to the confidence values of its signals (LOW, MEDIUM, HIGH, VERY_HIGH);
  2. Calculate cumulative weights for each node. The cumulative weight of each node is given by the sum of its children's cumulative weights plus its own weight;
  3. Starting from the root, traverse the tree through the path with the most weight, stopping at either a leaf or a node which only has children with weight zero. It can also stop if there is a "weight bifurcation", that is, two paths with approximately equal weight with respect to a threshold, which indicates confusion.

This process allows the solver to confidently produce outputs when signals all point to the same type, while still reaching the most specific type the signals agree on when there is conflicting information. It also allows it to produce a list of reasons as to why a column was (or wasn't) resolved to a certain type, as well as a detailed explanation for why something went wrong.
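
The three steps can be sketched in a few lines of Python. This is not MetricFlow's implementation: the tree is a simplified assumption built from the node names used on this page, the weights and threshold are illustrative, and `CHILDREN`, `cumulative_weights` and `solve` are hypothetical names. The sketch also assumes that a child's share is measured against the combined cumulative weight of its siblings.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed, simplified type tree (parent -> children).
CHILDREN: Dict[str, List[str]] = {
    "UNKNOWN": ["ID.UNKNOWN", "DIMENSION.UNKNOWN", "MEASURE.UNKNOWN"],
    "ID.UNKNOWN": ["ID.UNIQUE", "ID.FOREIGN"],
    "ID.UNIQUE": ["ID.PRIMARY"],
    "DIMENSION.UNKNOWN": ["DIMENSION.TIME", "DIMENSION.CATEGORICAL"],
}

# Step 1: an illustrative weight attribution function.
WEIGHTS = {"LOW": 1, "MEDIUM": 2, "HIGH": 3, "VERY_HIGH": 5}


def cumulative_weights(signals: List[Tuple[str, str]]) -> Dict[str, int]:
    """Step 2: own weight plus the cumulative weight of every child."""
    own: Dict[str, int] = defaultdict(int)
    for node, confidence in signals:
        own[node] += WEIGHTS[confidence]

    total: Dict[str, int] = {}

    def visit(node: str) -> int:
        total[node] = own[node] + sum(visit(c) for c in CHILDREN.get(node, []))
        return total[node]

    visit("UNKNOWN")
    return total


def solve(signals: List[Tuple[str, str]], threshold: float = 0.9) -> str:
    """Step 3: follow the heaviest path until a leaf, zero weight, or confusion."""
    total = cumulative_weights(signals)
    node = "UNKNOWN"
    while True:
        children = CHILDREN.get(node, [])
        children_weight = sum(total[c] for c in children)
        if children_weight == 0:
            return node  # leaf, or only zero-weight children
        best = max(children, key=lambda c: total[c])
        if total[best] / children_weight < threshold:
            return node  # weight bifurcation: the solver is confused here
        node = best
```

Returning the node where traversal stops keeps the sketch short; a fuller version would also record why it stopped, so that the reasons can be reported per column.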

Example 1

Let's take a careful look at how WeightedTypeTreeInferenceSolver would behave in a real scenario.

Consider the following:

  • Our weight attribution function assigns scores 1, 2, 3 and 5 to LOW, MEDIUM, HIGH and VERY_HIGH confidences, respectively;
  • Our minimum weight threshold is 90%. In other words, the solver stops at an internal node and considers itself confused if no single child holds at least 90% of the combined weight of that node's children;
  • Our rule set has produced an ID.PRIMARY signal with VERY_HIGH confidence and an ID.UNIQUE signal with HIGH confidence.

The corresponding weight attribution would be

Example 1: Weight attribution

which would make the solver follow the path

Example 1: Solver path

and result in an ID.PRIMARY final type for the target column.
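
The arithmetic behind this example can be reproduced with a short calculation. It is only a sketch for this specific case: it walks the single branch that carries weight, and the root name `UNKNOWN` as well as the variable names are assumptions.

```python
WEIGHTS = {"LOW": 1, "MEDIUM": 2, "HIGH": 3, "VERY_HIGH": 5}

# Own weights from the two signals in this example.
own = {"ID.PRIMARY": WEIGHTS["VERY_HIGH"], "ID.UNIQUE": WEIGHTS["HIGH"]}

# Path from the most specific signalled node up to the (assumed) root.
chain = ["ID.PRIMARY", "ID.UNIQUE", "ID.UNKNOWN", "UNKNOWN"]

# No other branch has weight, so each node's cumulative weight is its own
# weight plus the cumulative weight of the node below it on the chain.
cumulative = {}
running = 0
for node in chain:
    running += own.get(node, 0)
    cumulative[node] = running

print(cumulative)
# {'ID.PRIMARY': 5, 'ID.UNIQUE': 8, 'ID.UNKNOWN': 8, 'UNKNOWN': 8}
```

Assuming shares are measured against the combined weight of a node's children, the heaviest child holds 100% at every step (8 of 8, 8 of 8, then 5 of 5), which clears the 90% threshold each time.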

Example 2

Using the same solver and weight attribution function, let's look at what happens when there are contradicting (conflicting) signals.

Suppose our rule set has produced a DIMENSION.TIME signal with HIGH confidence and a DIMENSION.CATEGORICAL signal with MEDIUM confidence.

The corresponding weight attribution would be

Example 2: Weight attribution

which would make the solver follow the path

Example 2: Solver path

and result in a DIMENSION.UNKNOWN final type for the target column.
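
The reason the solver stops at DIMENSION.UNKNOWN can be checked numerically. The snippet below is a sketch of the comparison at that node only, assuming the 90% threshold from Example 1 and a share measured against the combined weight of the two children.

```python
WEIGHTS = {"LOW": 1, "MEDIUM": 2, "HIGH": 3, "VERY_HIGH": 5}
THRESHOLD = 0.9

# Weights of the two children of DIMENSION.UNKNOWN in this example.
children = {
    "DIMENSION.TIME": WEIGHTS["HIGH"],
    "DIMENSION.CATEGORICAL": WEIGHTS["MEDIUM"],
}

children_total = sum(children.values())
best = max(children, key=children.get)
best_share = children[best] / children_total

# The heaviest child falls short of the threshold, so the solver stays put.
confused = best_share < THRESHOLD

print(best, best_share, confused)
# DIMENSION.TIME 0.6 True
```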

Note about complementary signals

Complementary signals are signals that should only be considered together with other signals. To account for this, the WeightedTypeTreeInferenceSolver does not propagate a complementary signal's weight to ancestor nodes, so those signals do not influence decision-making at higher levels in the tree.
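
One way to picture this is to keep complementary weight at the node itself and leave it out of what is passed up to the parent. The sketch below follows that description only; it is not MetricFlow's code, the tree is a simplified assumption, and `cumulative_weights` and the `(node, confidence, is_complementary)` tuple layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed, simplified type tree (parent -> children).
CHILDREN: Dict[str, List[str]] = {
    "UNKNOWN": ["ID.UNKNOWN", "DIMENSION.UNKNOWN"],
    "ID.UNKNOWN": ["ID.UNIQUE", "ID.FOREIGN"],
    "ID.UNIQUE": ["ID.PRIMARY"],
    "DIMENSION.UNKNOWN": ["DIMENSION.TIME", "DIMENSION.CATEGORICAL"],
}
WEIGHTS = {"LOW": 1, "MEDIUM": 2, "HIGH": 3, "VERY_HIGH": 5}

Signal = Tuple[str, str, bool]  # (type node, confidence, is_complementary)


def cumulative_weights(signals: List[Signal]) -> Dict[str, int]:
    propagating: Dict[str, int] = defaultdict(int)  # counts for ancestors too
    local_only: Dict[str, int] = defaultdict(int)   # complementary weight
    for node, confidence, is_complementary in signals:
        bucket = local_only if is_complementary else propagating
        bucket[node] += WEIGHTS[confidence]

    totals: Dict[str, int] = {}

    def visit(node: str) -> int:
        passed_up = propagating[node] + sum(
            visit(child) for child in CHILDREN.get(node, [])
        )
        # The node itself sees its complementary weight, its parent does not.
        totals[node] = passed_up + local_only[node]
        return passed_up

    visit("UNKNOWN")
    return totals


with_support = cumulative_weights(
    [("ID.UNIQUE", "HIGH", False), ("ID.PRIMARY", "VERY_HIGH", True)]
)
alone = cumulative_weights([("ID.PRIMARY", "VERY_HIGH", True)])

print(with_support["ID.PRIMARY"], with_support["ID.UNIQUE"], with_support["UNKNOWN"])
# 5 3 3
print(alone["ID.PRIMARY"], alone["UNKNOWN"])
# 5 0
```

In the first case the regular ID.UNIQUE signal gives the identifier branch its weight and the complementary signal only matters once the traversal reaches ID.UNIQUE. In the second case the root has no weight at all, so a traversal starting there would never reach the complementary signal.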

Troubleshooting

Now that you know how the default solver works, here is how to solve some common inference errors.

Error: No signals were extracted for this column

None of your rules produced signals for that column, so the solver has no information about it and thus cannot resolve its type. If this error is happening too frequently, make sure your rule set is broad and complete enough.

Error: Inference solver could not determine a type for this column

This means the solver stopped at an internal node due to two or more paths having similar weights. You could solve this by:

  1. Lowering the solver's confusion tolerance threshold. Beware: if you lower this too much, you might start getting weird or wrong results;
  2. Investigating which rules are producing the confusing signals and calibrating their confidence values. You might also consider making some signals complementary if they are specific enough;
  3. Tuning your weight attribution function.
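
To get a feel for how each remedy changes the outcome, it can help to compute the share held by the heaviest child under different settings. The helpers `top_share` and `is_confused` below are hypothetical, the numbers are illustrative, and the sketch assumes confusion means the heaviest child holds less than the threshold share of its siblings' combined weight.

```python
from typing import Dict


def top_share(child_weights: Dict[str, float]) -> float:
    """Share of the combined weight held by the heaviest child."""
    total = sum(child_weights.values())
    return max(child_weights.values()) / total if total else 0.0


def is_confused(child_weights: Dict[str, float], threshold: float) -> bool:
    return top_share(child_weights) < threshold


# Baseline: a HIGH (3) signal against a MEDIUM (2) signal, 90% threshold.
baseline = {"DIMENSION.TIME": 3, "DIMENSION.CATEGORICAL": 2}
print(is_confused(baseline, 0.9))   # True

# 1. Lower the threshold.
print(is_confused(baseline, 0.55))  # False

# 2. Recalibrate confidences: VERY_HIGH (5) against LOW (1) moves the share
#    from 3/5 to 5/6, which is closer but still short of 90%.
recalibrated = {"DIMENSION.TIME": 5, "DIMENSION.CATEGORICAL": 1}
print(is_confused(recalibrated, 0.9))  # True

# 3. Use a steeper weight function, e.g. MEDIUM=10 and HIGH=100.
steeper = {"DIMENSION.TIME": 100, "DIMENSION.CATEGORICAL": 10}
print(is_confused(steeper, 0.9))    # False
```

The remedies can be combined: a small change to the threshold together with better-calibrated confidences is usually less risky than pushing any single knob to an extreme.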

Lots of FIXME fields in the generated config files

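Some rules are producing generic type nodes (such as ID.UNKNOWN or DIMENSION.UNKNOWN), but you have no rule to further differentiate subtypes (like ID.PRIMARY or DIMENSION.CATEGORICAL). Adding rules that emit the more specific types for those columns should reduce the number of FIXME fields.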