The default solver
caution
This feature is still in Beta 🧪. As such, you should not expect it to be 100% stable or free of bugs. Any public CLI or Python interfaces may change without prior notice.
If you find any bugs or feel like something is not behaving as it should, feel free to open an issue on the MetricFlow GitHub repo.
As mentioned in the overview, after all rules are done processing, we still need some policy to coerce rule signals into a final type for each column. This is the job of the InferenceSolver.
The default solver provided by MetricFlow is the WeightedTypeTreeInferenceSolver. As mentioned, you shouldn't need to implement your own solver unless you have a really specific case at hand. That said, understanding how the default solver works should make it easier to write well-functioning rule sets and to debug cases where there is confusion.
The type tree
You might have noticed InferenceSignal has a type_node parameter that takes in an InferenceTypeNode. The "node" terminology derives from the fact that data source inference organizes column types in a tree hierarchy. Whenever a rule says some column might be of some type, it actually refers to that type as a node in the type tree.
The type tree is constructed with two very basic rules in mind:
- Nodes belonging to separate branches in the tree are always conflicting. In other words, a column cannot be both of type A and B if A and B are not on the same branch.
- Child nodes are always specializations of the parent node's type.
Here's what the type tree actually looks like:
Some examples of the type tree rules in action:
- A column cannot be both a dimension and an identifier because they are on separate branches in the tree (they are siblings).
- Similarly, a column cannot be both a foreign identifier and a measure at the same time.
- primary identifier is a child of unique identifier because it specializes the parent type, i.e., a primary identifier is always also a unique identifier. By transitivity, a primary identifier is always also an identifier (duh).
You can already start to see how the type tree can be used to solve a column's type and to check for conflicts or confusion in a list of signals.
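The two tree rules can be sketched in a few lines of Python. This is a minimal illustration, not MetricFlow's implementation: the PARENT map, the root name UNKNOWN, and the helper functions are hypothetical, and the exact shape of the tree is an assumption based on the node names used on this page.

```python
from typing import Dict, List, Optional

# Hypothetical parent map for illustration; the tree shape is an assumption.
PARENT: Dict[str, Optional[str]] = {
    "UNKNOWN": None,  # assumed root
    "ID.UNKNOWN": "UNKNOWN",
    "ID.UNIQUE": "ID.UNKNOWN",
    "ID.PRIMARY": "ID.UNIQUE",
    "ID.FOREIGN": "ID.UNKNOWN",
    "DIMENSION.UNKNOWN": "UNKNOWN",
    "DIMENSION.TIME": "DIMENSION.UNKNOWN",
    "DIMENSION.CATEGORICAL": "DIMENSION.UNKNOWN",
    "MEASURE.UNKNOWN": "UNKNOWN",
}


def ancestors(node: str) -> List[str]:
    """Return the node followed by all of its ancestors up to the root."""
    chain: List[str] = []
    current: Optional[str] = node
    while current is not None:
        chain.append(current)
        current = PARENT[current]
    return chain


def conflicts(a: str, b: str) -> bool:
    """Two types conflict unless one lies on the other's path to the root."""
    return a not in ancestors(b) and b not in ancestors(a)


print(conflicts("ID.FOREIGN", "MEASURE.UNKNOWN"))  # True: separate branches
print(conflicts("ID.PRIMARY", "ID.UNIQUE"))  # False: child specializes parent
```

Checking for conflicts this way only needs the path to the root, which is why a parent map is enough for the sketch.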
Traversing the type tree
The WeightedTypeTreeInferenceSolver makes use of these properties to traverse the tree and find the most appropriate type for a column given a list of signals for it. It performs the following steps for each column (ignoring complementary signals for now):
- Using a weight attribution function, assign numerical weights to each type node according to the confidence values (LOW, MEDIUM, HIGH, VERY_HIGH) of its signals;
- Calculate cumulative weights for each node. The cumulative weight of each node is given by the sum of its children's cumulative weights plus its own weight;
- Starting from the root, traverse the tree through the path with the most weight, stopping at either a leaf or a node which only has children with weight zero. It can also stop if there is a "weight bifurcation", that is, two paths with approximately equal weight with respect to a threshold (which indicates confusion).
This process allows the solver to confidently produce outputs when signals all point to the same type while still getting to the most specific type when there is conflicting information. It also allows it to produce a list of reasons as to why a column was (or wasn't) resolved to a certain type, as well as a detailed explanation for why something went wrong.
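The three steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual WeightedTypeTreeInferenceSolver code: the CHILDREN map, the CONFIDENCE_WEIGHTS scores, the solve function, and the reading of the threshold as "share of the siblings' total weight" are all hypothetical choices made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical tree (assumed shape) and weight attribution scores.
CHILDREN: Dict[str, List[str]] = {
    "UNKNOWN": ["ID.UNKNOWN", "DIMENSION.UNKNOWN", "MEASURE.UNKNOWN"],
    "ID.UNKNOWN": ["ID.UNIQUE", "ID.FOREIGN"],
    "ID.UNIQUE": ["ID.PRIMARY"],
    "DIMENSION.UNKNOWN": ["DIMENSION.TIME", "DIMENSION.CATEGORICAL"],
}
CONFIDENCE_WEIGHTS: Dict[str, int] = {"LOW": 1, "MEDIUM": 2, "HIGH": 3, "VERY_HIGH": 5}


def cumulative(node: str, own: Dict[str, int]) -> int:
    """Step 2: own weight plus the cumulative weights of all children."""
    return own.get(node, 0) + sum(cumulative(c, own) for c in CHILDREN.get(node, []))


def solve(signals: List[Tuple[str, str]], threshold: float = 0.9) -> List[str]:
    """Return the path from the root to the node where the traversal stops."""
    # Step 1: turn (type node, confidence) signals into per-node weights.
    own: Dict[str, int] = {}
    for node, confidence in signals:
        own[node] = own.get(node, 0) + CONFIDENCE_WEIGHTS[confidence]

    # Step 3: walk down from the root along the heaviest child.
    path = ["UNKNOWN"]
    while True:
        child_weights = {c: cumulative(c, own) for c in CHILDREN.get(path[-1], [])}
        total = sum(child_weights.values())
        if total == 0:
            return path  # leaf, or every child has weight zero
        best = max(child_weights, key=lambda c: child_weights[c])
        if child_weights[best] < threshold * total:
            return path  # weight bifurcation: no child is dominant enough
        path.append(best)


print(solve([("ID.PRIMARY", "VERY_HIGH"), ("ID.UNIQUE", "HIGH")]))
print(solve([("DIMENSION.TIME", "HIGH"), ("DIMENSION.CATEGORICAL", "MEDIUM")]))
```

The last element of the returned path is the resolved type, and the path itself can serve as a simple record of why the solver ended up there.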
Example 1
Let's take a careful look at how WeightedTypeTreeInferenceSolver would behave in a real scenario.
Consider the following:
- Our weight attribution function assigns scores 1, 2, 3 and 5 to LOW, MEDIUM, HIGH and VERY_HIGH confidences, respectively;
- Our minimum weight threshold is 90%. In other words, the solver stops at an internal node and considers itself confused if there is no single sibling with at least 90% of the weight in that node;
- Our rule set has produced an ID.PRIMARY signal with VERY_HIGH confidence and an ID.UNIQUE signal with HIGH confidence.
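The corresponding weight attribution would be 5 for ID.PRIMARY and 3 for ID.UNIQUE, giving ID.UNIQUE a cumulative weight of 8 since it includes its child ID.PRIMARY. This would make the solver follow the path down the identifier branch through ID.UNIQUE to ID.PRIMARY and result in an ID.PRIMARY final type for the target column.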
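The arithmetic behind Example 1 can be checked by hand. The snippet below is a minimal sketch that assumes the identifier branch is ordered ID.UNKNOWN, ID.UNIQUE, ID.PRIMARY from root to leaf; the variable names are illustrative and not part of MetricFlow.

```python
# Scores from Example 1's weight attribution function.
SCORES = {"LOW": 1, "MEDIUM": 2, "HIGH": 3, "VERY_HIGH": 5}

# Own weights produced by the two signals (nodes without signals weigh 0).
own = {
    "ID.UNKNOWN": 0,
    "ID.UNIQUE": SCORES["HIGH"],
    "ID.PRIMARY": SCORES["VERY_HIGH"],
}

# Assumed branch order, leaf first, so each node is visited after its child.
branch_leaf_to_root = ["ID.PRIMARY", "ID.UNIQUE", "ID.UNKNOWN"]

cumulative = {}
running_total = 0
for node in branch_leaf_to_root:
    running_total += own[node]  # own weight plus the child's cumulative weight
    cumulative[node] = running_total

print(cumulative)  # {'ID.PRIMARY': 5, 'ID.UNIQUE': 8, 'ID.UNKNOWN': 8}
```

At every step of the descent the heaviest child holds all of the weight among its siblings, so the 90% threshold is always met.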
Example 2
Using the same solver and weight attribution function, let's look at what happens when there are contradicting (conflicting) signals.
Suppose our rule set has produced a DIMENSION.TIME signal with HIGH confidence and a DIMENSION.CATEGORICAL signal with MEDIUM confidence.
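The corresponding weight attribution would be 3 for DIMENSION.TIME and 2 for DIMENSION.CATEGORICAL. This would make the solver follow the path into the dimension branch and stop there, since the heavier of the two signals holds only 3/5 = 60% of the weight, which is below the 90% threshold. The result is a DIMENSION.UNKNOWN final type for the target column.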
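The threshold check in Example 2 comes down to one comparison. The sketch below assumes the threshold is compared against the heavier sibling's share of the two siblings' combined weight; the resolves helper is hypothetical and only meant to show how the threshold changes the outcome.

```python
# Scores from the weight attribution function used in both examples.
SCORES = {"LOW": 1, "MEDIUM": 2, "HIGH": 3, "VERY_HIGH": 5}

time_weight = SCORES["HIGH"]  # DIMENSION.TIME signal
categorical_weight = SCORES["MEDIUM"]  # DIMENSION.CATEGORICAL signal
total = time_weight + categorical_weight

# Share of the siblings' weight held by the heavier signal.
share = max(time_weight, categorical_weight) / total


def resolves(threshold: float) -> bool:
    """True if the heavier sibling is dominant enough to keep descending."""
    return share >= threshold


print(share)  # 0.6
print(resolves(0.9))  # False: the solver stops at DIMENSION.UNKNOWN
print(resolves(0.5))  # True: the solver would continue to DIMENSION.TIME
```

Under this reading, a threshold low enough to accept a 60% share would let the solver pick DIMENSION.TIME, at the cost of accepting weaker evidence.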
Note about complementary signals
Complementary signals are signals that should only be considered together with other signals. To account for this, the WeightedTypeTreeInferenceSolver does not propagate a complementary signal's weight to ancestor nodes, so those signals do not influence decision-making at higher levels in the tree.
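One way to picture this is to keep two weights per node: one that is passed up to ancestors and one that only counts at the node itself. The sketch below is an assumption about how such bookkeeping could work, not MetricFlow's actual code; the CHILDREN map and both helper functions are hypothetical.

```python
from typing import Dict, List

# Hypothetical slice of the identifier branch (assumed shape).
CHILDREN: Dict[str, List[str]] = {
    "ID.UNKNOWN": ["ID.UNIQUE"],
    "ID.UNIQUE": ["ID.PRIMARY"],
}


def propagated_weight(node: str, regular: Dict[str, int]) -> int:
    """Weight a node passes up to its ancestors: regular signals only."""
    children = CHILDREN.get(node, [])
    return regular.get(node, 0) + sum(propagated_weight(c, regular) for c in children)


def local_weight(node: str, regular: Dict[str, int], complementary: Dict[str, int]) -> int:
    """Weight of a node when compared with its siblings.

    Complementary weight counts here but is never passed further up.
    """
    return propagated_weight(node, regular) + complementary.get(node, 0)


regular = {"ID.UNIQUE": 3}  # regular HIGH signal
complementary = {"ID.PRIMARY": 3}  # complementary HIGH signal

print(local_weight("ID.PRIMARY", regular, complementary))  # 3
print(propagated_weight("ID.UNIQUE", regular))  # 3, not 6
print(propagated_weight("ID.UNKNOWN", regular))  # 3, not 6
```

In this sketch the complementary ID.PRIMARY signal can still pull the solver from ID.UNIQUE down to ID.PRIMARY, but it adds nothing to the identifier branch when that branch competes with dimensions or measures.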
Troubleshooting
Now that you know how the default solver works, here is how to solve some common inference errors.
Error: No signals were extracted for this column
None of your rules produced signals for that column, so the solver has no information about it and thus cannot resolve its type. If this error is happening too frequently, make sure your rule set is broad and complete enough.
Error: Inference solver could not determine a type for this column
This means the solver stopped at an internal node due to two or more paths having similar weights. You could solve this by:
- Lowering the solver's confusion tolerance threshold. Beware that if you lower this too much, you might start getting weird or wrong results;
- Investigating which rules are producing the confusing signals and calibrating their confidence values. You might also consider making some signals complementary if they are specific enough;
- Tuning your weight attribution function.
Lots of FIXME fields in the generated config files
Some rules are producing generic type nodes (such as ID.UNKNOWN or DIMENSION.UNKNOWN), but you have no rule to further differentiate subtypes (like ID.PRIMARY or DIMENSION.CATEGORICAL).