Evaluate a Multi-Entity Task¶
Scenario¶
You have lists of predicted and gold entities and there is no stable shared key between them. The evaluator must first decide which predicted entity matches which gold entity.
Runnable example¶
from typing import Optional
from pydantic import BaseModel
from extraction_testing.config import MatchingConfig
from extraction_testing import FeatureRule, RunConfig, TaskType, evaluate
class ContractRecord(BaseModel):
contract_title: str
contract_amount: Optional[float]
predicted_records = [
ContractRecord(contract_title="Supply Agreement Alpha", contract_amount=100000.0),
ContractRecord(contract_title="Maintenance Contract Beta", contract_amount=50520.0),
ContractRecord(contract_title="Acquisition Accord Zeta", contract_amount=300000.0),
]
gold_records = [
ContractRecord(contract_title="Supply Agreement Alpha", contract_amount=100000.0),
ContractRecord(contract_title="Maintenance Contract Beta", contract_amount=50500.0),
ContractRecord(contract_title="Licensing Agreement Epsilon", contract_amount=150000.0),
]
run_config = RunConfig(
task_type=TaskType.MULTI_ENTITY,
feature_rules=[
FeatureRule(feature_name="contract_title", feature_type="text", weight_for_matching=2.0),
FeatureRule(
feature_name="contract_amount",
feature_type="number",
numeric_absolute_tolerance=100.0,
weight_for_matching=1.0,
),
],
matching_config=MatchingConfig(
matching_mode="weighted",
minimum_similarity_threshold=0.6,
),
)
result_bundle = evaluate(predicted_records, gold_records, run_config)
print(result_bundle.per_feature_metrics_data_frame)
print(result_bundle.total_metrics_data_frame)
print("row_accuracy:", result_bundle.row_accuracy_value)
print(result_bundle.entity_detection_summary)
print(result_bundle.matched_pairs_data_frame)
What to expect¶
- the first two predicted entities should match the first two gold entities
- the third predicted entity should remain unmatched
- the third gold entity should remain missed
So the entity summary should show:
matched_countof2- one extra prediction
- one missed gold entity
precision_entitiesandrecall_entitiesboth around0.6667
The matched rows should still score well at the feature level because the amount difference for the beta contract is within the configured tolerance.
When this guide applies¶
Use this workflow when:
- entity identity must be inferred from content
- you need both feature-level quality and entity-detection quality
- row order should not be treated as identity