Interpret Results¶

Scenario¶

You already have a ResultBundle and need to decide what the tables and summary fields are telling you.

Runnable example¶

from pydantic import BaseModel

from extraction_testing import FeatureRule, RunConfig, TaskType, evaluate


class ArticleRecord(BaseModel):
    row_identifier: int
    headline_text: str
    author_name: str


predicted_records = [
    ArticleRecord(row_identifier=1, headline_text="Market Rally", author_name="Jane Doe"),
    ArticleRecord(row_identifier=2, headline_text="Local Sports Win", author_name="J. Smith"),
]

gold_records = [
    ArticleRecord(row_identifier=1, headline_text="Market Rally", author_name="Jane Doe"),
    ArticleRecord(row_identifier=2, headline_text="Local Sports Win", author_name="John Smith"),
]

run_config = RunConfig(
    task_type=TaskType.SINGLE_ENTITY,
    feature_rules=[
        FeatureRule(feature_name="headline_text", feature_type="text"),
        FeatureRule(
            feature_name="author_name",
            feature_type="text",
            # Count the predicted "J. Smith" as a match for the gold "John Smith".
            alias_map={"J. Smith": "John Smith"},
        ),
    ],
    index_key_name="row_identifier",
)

result_bundle = evaluate(predicted_records, gold_records, run_config)

print(result_bundle.per_feature_metrics_data_frame)
print(result_bundle.total_metrics_data_frame)
print("row_accuracy:", result_bundle.row_accuracy_value)

How to read the result¶

per_feature_metrics_data_frame tells you which feature is strong or weak
total_metrics_data_frame gives a compact summary across features
row_accuracy_value tells you how often the entire row was correct at once

In this example:

both features should score perfectly: headline_text matches exactly in both rows, and the alias map lets the predicted "J. Smith" count as the gold "John Smith" in the second row
row_accuracy_value should be 1.0 because both rows are then fully correct

If you removed the alias map:

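author_name would score lower because "J. Smith" would no longer match "John Smith"
total metrics would drop, since they summarize across features
row accuracy would also drop because only the first of the two rows would still be fully correct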
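
To make the arithmetic behind that comparison concrete, here is a minimal sketch that recomputes the two cases by hand in plain Python. It is an illustration only: the score helper and its exact-match-after-aliasing rule are assumptions made for this example, not extraction_testing's implementation, and the library's text matching may differ.

```python
# Illustrative stand-in, NOT extraction_testing's code: exact string match
# after applying an alias map to the predicted value.
gold = {
    1: {"headline_text": "Market Rally", "author_name": "Jane Doe"},
    2: {"headline_text": "Local Sports Win", "author_name": "John Smith"},
}
predicted = {
    1: {"headline_text": "Market Rally", "author_name": "Jane Doe"},
    2: {"headline_text": "Local Sports Win", "author_name": "J. Smith"},
}
features = ["headline_text", "author_name"]


def score(alias_map):
    """Return (per-feature accuracy, row accuracy) for the toy records above."""
    # For every row and feature, record whether the prediction matches gold.
    matches = {
        row_id: {
            name: alias_map.get(predicted[row_id][name], predicted[row_id][name])
            == gold_row[name]
            for name in features
        }
        for row_id, gold_row in gold.items()
    }
    # Per-feature accuracy: share of rows where that one feature is correct.
    per_feature = {
        name: sum(row[name] for row in matches.values()) / len(matches)
        for name in features
    }
    # Row accuracy: share of rows where every feature is correct at once.
    row_accuracy = sum(all(row.values()) for row in matches.values()) / len(matches)
    return per_feature, row_accuracy


with_alias = score({"J. Smith": "John Smith"})
without_alias = score({})

print("with alias:   ", with_alias)
print("without alias:", without_alias)
```

Under these assumptions, dropping the alias costs author_name one of its two rows, and that single miss is enough to make the second row count as incorrect for row accuracy.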

Questions to ask when reading results¶

Is the failure concentrated in one feature or spread across many?
Is row accuracy much lower than per-feature accuracy?
For multi-entity tasks, is the real problem entity matching rather than field extraction?
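
The first two questions can be sketched with a small hand-built comparison. The match outcomes below are hypothetical, and summarize is a helper written for this illustration rather than part of extraction_testing. It shows two runs with the same row accuracy where the errors are either concentrated in one feature or spread across both.

```python
def summarize(match_rows):
    """match_rows: list of dicts mapping feature name -> True if correct."""
    feature_names = list(match_rows[0])
    per_feature = {
        name: sum(row[name] for row in match_rows) / len(match_rows)
        for name in feature_names
    }
    # A row only counts when every feature in it is correct.
    row_accuracy = sum(all(row.values()) for row in match_rows) / len(match_rows)
    return per_feature, row_accuracy


# Hypothetical outcomes: all errors fall on author_name.
concentrated = [
    {"headline_text": True, "author_name": False},
    {"headline_text": True, "author_name": False},
    {"headline_text": True, "author_name": True},
    {"headline_text": True, "author_name": True},
]

# Hypothetical outcomes: one error in each feature, on different rows.
spread = [
    {"headline_text": False, "author_name": True},
    {"headline_text": True, "author_name": False},
    {"headline_text": True, "author_name": True},
    {"headline_text": True, "author_name": True},
]

concentrated_summary = summarize(concentrated)
spread_summary = summarize(spread)

print("concentrated:", concentrated_summary)
print("spread:      ", spread_summary)
```

Both runs leave half of the rows fully correct, but the per-feature numbers point to different fixes: the concentrated case suggests working on one feature, while the spread case shows how row accuracy can sit well below every individual feature score.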
