Alignment¶

Alignment decides which predicted rows are compared against which gold rows. The library has two alignment strategies: indexed matching and entity matching.

`IndexAligner`¶

IndexAligner is used for SINGLE_FEATURE and SINGLE_ENTITY.

Current behavior:

reads run_config.index_key_name
builds a map from each gold key value to one gold row index
iterates through predicted rows and matches rows with the same key
returns matched pairs with similarity score 1.0
also returns unmatched predicted indices and unmatched gold indices

Important implications¶

matching is exact equality on the key field
duplicate key values in the gold DataFrame are ambiguous and should be avoided
unmatched rows are not included in per-feature scoring

`EntityAligner`¶

EntityAligner is used for MULTI_ENTITY.

Current behavior:

build candidate predicted/gold pairs
compute similarity for each pair
drop pairs below minimum_similarity_threshold
optionally keep only the top maximum_candidate_pairs
shuffle candidates using a fixed seed
stable-sort by similarity descending
greedily choose non-overlapping pairs

The result is deterministic for a fixed seed and input order.

Matching modes¶

Exact matching¶

When matching_mode="exact":

the aligner checks only features where is_mandatory_for_matching=True
if any mandatory feature is unequal, pair similarity is 0.0
otherwise pair similarity is 1.0

This is strict matching. There is no partial credit.

Weighted matching¶

When matching_mode="weighted":

every feature contributes a similarity score
the final score is a weighted average
weights come from FeatureRule.weight_for_matching

Feature-level similarity rules:

text: token-set overlap after normalization
category: 1.0 if equal, else 0.0
number: 1.0 if equal under tolerance rules, else 0.0
date: 1.0 if equal under date tolerance rules, else 0.0

Threshold behavior¶

minimum_similarity_threshold applies after pair similarity is computed.

pairs below the threshold are ignored entirely
pairs at or above the threshold remain candidates for greedy selection

Practical effect:

a high threshold increases precision of entity matching
a low threshold increases recall of candidate pairs but can create more ambiguous competition

Determinism and tie-breaking¶

For weighted matching, the aligner:

seeds Python's random module with random_tie_breaker_seed
shuffles candidate pairs
then sorts by similarity descending using a stable sort

This means equal-score pairs are resolved reproducibly for a given seed.

What alignment affects downstream¶

matched pairs feed the per-feature metrics and row accuracy calculations
unmatched predicted and unmatched gold entities feed the multi-entity presence summary
indexed tasks do not expose entity presence metrics because alignment is assumed to be explicit