Alignment¶
Alignment decides which predicted rows are compared against which gold rows. The library has two alignment strategies: indexed matching and entity matching.
IndexAligner¶
IndexAligner is used for SINGLE_FEATURE and SINGLE_ENTITY.
Current behavior:
- reads run_config.index_key_name
- builds a map from each gold key value to one gold row index
- iterates through predicted rows and matches rows with the same key
- returns matched pairs with similarity score 1.0
- also returns unmatched predicted indices and unmatched gold indices
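The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the library's code: the function name align_by_index is hypothetical, rows are plain dicts rather than DataFrame rows, and the handling of duplicate keys (the last gold row wins, and a gold row is matched at most once) is an assumption.

```python
from typing import Any, Dict, List, Tuple


def align_by_index(
    predicted_rows: List[Dict[str, Any]],
    gold_rows: List[Dict[str, Any]],
    index_key_name: str,
) -> Tuple[List[Tuple[int, int, float]], List[int], List[int]]:
    """Hypothetical sketch of index alignment on plain dict rows."""
    # Map each gold key value to one gold row index.
    # Assumption: with duplicate gold keys, the later row overwrites the earlier one.
    gold_index_by_key = {row[index_key_name]: i for i, row in enumerate(gold_rows)}

    matched: List[Tuple[int, int, float]] = []
    unmatched_predicted: List[int] = []
    used_gold = set()
    for pred_index, row in enumerate(predicted_rows):
        gold_index = gold_index_by_key.get(row[index_key_name])
        # Assumption: a gold row can be matched by at most one predicted row.
        if gold_index is None or gold_index in used_gold:
            unmatched_predicted.append(pred_index)
            continue
        matched.append((pred_index, gold_index, 1.0))  # exact key match scores 1.0
        used_gold.add(gold_index)

    unmatched_gold = [i for i in range(len(gold_rows)) if i not in used_gold]
    return matched, unmatched_predicted, unmatched_gold


predicted = [{"id": "a", "v": 1}, {"id": "x", "v": 2}, {"id": "c", "v": 3}]
gold = [{"id": "a", "v": 1}, {"id": "b", "v": 2}, {"id": "c", "v": 4}]
matched, unmatched_predicted, unmatched_gold = align_by_index(predicted, gold, "id")
```

In this example the predicted row with key "x" and the gold row with key "b" have no counterpart, so both end up in the unmatched lists.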
Important implications¶
- matching is exact equality on the key field
- duplicate key values in the gold DataFrame are ambiguous, because the map keeps only one gold row index per key value; avoid them
- unmatched rows are not included in per-feature scoring
EntityAligner¶
EntityAligner is used for MULTI_ENTITY.
Current behavior:
- build candidate predicted/gold pairs
- compute similarity for each pair
- drop pairs below minimum_similarity_threshold
- optionally keep only the top maximum_candidate_pairs
- shuffle candidates using a fixed seed
- stable-sort by similarity descending
- greedily choose non-overlapping pairs
The result is deterministic for a fixed seed and input order.
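The pipeline above might look roughly like the following sketch. The function name greedy_align and the closeness similarity function are hypothetical, "top" is assumed to mean highest similarity, and a local random.Random instance stands in for whatever seeding the library performs.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")
Pair = Tuple[int, int, float]  # (predicted index, gold index, similarity)


def greedy_align(
    predicted: Sequence[T],
    gold: Sequence[T],
    similarity: Callable[[T, T], float],
    minimum_similarity_threshold: float,
    maximum_candidate_pairs: Optional[int] = None,
    random_tie_breaker_seed: int = 0,
) -> Tuple[List[Pair], List[int], List[int]]:
    """Hypothetical sketch of threshold, shuffle, stable sort, greedy pick."""
    # Build and score every predicted/gold pair; drop pairs below the threshold.
    candidates: List[Pair] = []
    for p, pred_item in enumerate(predicted):
        for g, gold_item in enumerate(gold):
            score = similarity(pred_item, gold_item)
            if score >= minimum_similarity_threshold:
                candidates.append((p, g, score))

    # Optional cap. Assumption: "top" means the highest-similarity pairs.
    if maximum_candidate_pairs is not None:
        candidates = sorted(candidates, key=lambda c: c[2], reverse=True)
        candidates = candidates[:maximum_candidate_pairs]

    # Seeded shuffle, then a stable sort: ties keep their shuffled order.
    random.Random(random_tie_breaker_seed).shuffle(candidates)
    candidates.sort(key=lambda c: c[2], reverse=True)

    # Greedily accept pairs whose predicted and gold rows are both still free.
    matched: List[Pair] = []
    used_predicted, used_gold = set(), set()
    for p, g, score in candidates:
        if p not in used_predicted and g not in used_gold:
            matched.append((p, g, score))
            used_predicted.add(p)
            used_gold.add(g)

    unmatched_predicted = [p for p in range(len(predicted)) if p not in used_predicted]
    unmatched_gold = [g for g in range(len(gold)) if g not in used_gold]
    return matched, unmatched_predicted, unmatched_gold


def closeness(a: float, b: float) -> float:
    # Hypothetical similarity: 1.0 for identical numbers, falling to 0.0.
    return max(0.0, 1.0 - abs(a - b))


matched, unmatched_predicted, unmatched_gold = greedy_align(
    [1.0, 1.1], [1.0, 5.0], closeness, minimum_similarity_threshold=0.5
)
```

Both predicted values compete for the first gold value here. The higher-scoring pair wins, and the losing predicted row is left unmatched because its only other candidate fell below the threshold.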
Matching modes¶
Exact matching¶
When matching_mode="exact":
- the aligner checks only features where is_mandatory_for_matching=True
- if any mandatory feature is unequal, pair similarity is 0.0
- otherwise pair similarity is 1.0
This is strict matching. There is no partial credit.
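As a rough illustration of the rule, assuming entities are plain dicts and the mandatory feature names are passed in as a list (the function name exact_pair_similarity is hypothetical):

```python
from typing import Any, Iterable, Mapping


def exact_pair_similarity(
    predicted: Mapping[str, Any],
    gold: Mapping[str, Any],
    mandatory_features: Iterable[str],
) -> float:
    """Hypothetical sketch: all-or-nothing comparison on mandatory features."""
    for name in mandatory_features:
        # Any mismatch on a mandatory feature rules the pair out entirely.
        if predicted.get(name) != gold.get(name):
            return 0.0
    # Non-mandatory features are never inspected.
    return 1.0


predicted_entity = {"name": "Ada", "city": "London"}
gold_entity = {"name": "Ada", "city": "Paris"}

name_only = exact_pair_similarity(predicted_entity, gold_entity, ["name"])
name_and_city = exact_pair_similarity(predicted_entity, gold_entity, ["name", "city"])
```

With only name mandatory, the differing city is ignored and the pair scores 1.0. Making city mandatory as well drops the same pair to 0.0.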
Weighted matching¶
When matching_mode="weighted":
- every feature contributes a similarity score
- the final score is a weighted average
- weights come from FeatureRule.weight_for_matching
Feature-level similarity rules:
- text: token-set overlap after normalization
- category: 1.0 if equal, else 0.0
- number: 1.0 if equal under tolerance rules, else 0.0
- date: 1.0 if equal under date tolerance rules, else 0.0
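A small worked example may help show how per-feature scores combine. The sketch below makes several assumptions the text does not confirm: normalization is lowercasing plus whitespace splitting, token-set overlap is Jaccard similarity, and the weighted average is the sum of weight times score divided by the sum of weights. Both function names are hypothetical.

```python
from typing import Dict


def token_set_overlap(a: str, b: str) -> float:
    # Assumed normalization: lowercase, split on whitespace.
    # Assumed overlap measure: Jaccard (intersection over union).
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def weighted_pair_similarity(
    feature_scores: Dict[str, float], weights: Dict[str, float]
) -> float:
    # Weighted average over the features that were scored.
    total_weight = sum(weights[name] for name in feature_scores)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(weights[name] * score for name, score in feature_scores.items())
    return weighted_sum / total_weight


feature_scores = {
    "name": token_set_overlap("Acme Corp Ltd", "acme corp"),  # text: 2 of 3 tokens shared
    "country": 1.0,  # category: equal
    "revenue": 0.0,  # number: outside tolerance
}
weights = {"name": 2.0, "country": 1.0, "revenue": 1.0}
pair_similarity = weighted_pair_similarity(feature_scores, weights)
```

Under these assumptions the pair scores (2 × 2/3 + 1 × 1.0 + 1 × 0.0) / 4 = 7/12, or about 0.58, so it would pass a threshold of 0.5 but not one of 0.6.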
Threshold behavior¶
minimum_similarity_threshold applies after pair similarity is computed.
- pairs below the threshold are ignored entirely
- pairs at or above the threshold remain candidates for greedy selection
Practical effect:
- a high threshold increases precision of entity matching
- a low threshold increases recall of candidate pairs but can create more ambiguous competition
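The effect can be seen on a handful of made-up scores. This is only a sketch; filter_candidates is a hypothetical helper and the scores are invented for illustration.

```python
from typing import List, Tuple

Pair = Tuple[int, int, float]  # (predicted index, gold index, similarity)


def filter_candidates(scored_pairs: List[Pair], minimum_similarity_threshold: float) -> List[Pair]:
    # Pairs at or above the threshold stay; everything else is ignored.
    return [pair for pair in scored_pairs if pair[2] >= minimum_similarity_threshold]


scored = [(0, 0, 0.95), (0, 1, 0.60), (1, 0, 0.55), (1, 1, 0.90)]

strict = filter_candidates(scored, 0.9)  # only the two strong pairs survive
loose = filter_candidates(scored, 0.5)   # all four pairs compete in greedy selection
```

With the strict threshold each predicted row has a single candidate. With the loose one, every predicted row competes for every gold row and the outcome depends on the greedy ordering.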
Determinism and tie-breaking¶
For weighted matching, the aligner:
- seeds Python's random module with random_tie_breaker_seed
- shuffles candidate pairs
- then sorts by similarity descending using a stable sort
This means equal-score pairs are resolved reproducibly for a given seed.
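The interaction between the seeded shuffle and the stable sort can be demonstrated in isolation. In this sketch, order_candidates is a hypothetical helper, and it uses a local random.Random instance rather than seeding the global random module, which is an assumption made to keep the example free of side effects.

```python
import random
from typing import List, Tuple

Pair = Tuple[int, int, float]  # (predicted index, gold index, similarity)


def order_candidates(pairs: List[Pair], random_tie_breaker_seed: int) -> List[Pair]:
    ordered = list(pairs)
    # The seeded shuffle decides the relative order of equal-score pairs.
    random.Random(random_tie_breaker_seed).shuffle(ordered)
    # list.sort is stable, so ties keep the order the shuffle gave them.
    ordered.sort(key=lambda pair: pair[2], reverse=True)
    return ordered


pairs = [(0, 0, 0.8), (0, 1, 0.8), (1, 0, 0.8), (1, 1, 0.9)]
first_run = order_candidates(pairs, random_tie_breaker_seed=7)
second_run = order_candidates(pairs, random_tie_breaker_seed=7)
```

The highest-scoring pair always comes first, and the three tied pairs appear in the same order on every run with the same seed. A different seed may order the ties differently.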
What alignment affects downstream¶
- matched pairs feed the per-feature metrics and row accuracy calculations
- unmatched predicted and unmatched gold entities feed the multi-entity presence summary
- indexed tasks do not expose entity presence metrics because alignment is assumed to be explicit