# extraction-testing
extraction-testing is a typed Python library for evaluating structured extraction outputs against gold data. It supports single-label classification-style checks, single-entity field comparison, and multi-entity matching with per-feature scoring, aggregate metrics, and optional human-readable logs.
## Supported task types
- `SINGLE_FEATURE`: evaluate one extracted feature per indexed record
- `SINGLE_ENTITY`: evaluate multiple features for one indexed entity per record
- `MULTI_ENTITY`: evaluate lists of predicted and gold entities, using entity matching before scoring
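The three task types correspond to three record shapes. A dependency-free sketch of those shapes (plain dataclasses standing in for the Pydantic models the library expects; the class and field names here are illustrative assumptions, not the library's schema):

```python
from dataclasses import dataclass, field

# Illustrative record shapes only -- not the library's actual models.

@dataclass
class SingleFeatureRecord:
    # SINGLE_FEATURE: one extracted value per indexed record
    record_id: int
    label: str

@dataclass
class Invoice:
    # An example entity with multiple features
    vendor: str
    total: float

@dataclass
class SingleEntityRecord:
    # SINGLE_ENTITY: several features for one entity per record
    record_id: int
    invoice: Invoice

@dataclass
class MultiEntityRecord:
    # MULTI_ENTITY: a list of entities; predicted and gold lists are
    # matched against each other before per-feature scoring
    record_id: int
    line_items: list[Invoice] = field(default_factory=list)
```

The practical difference: `SINGLE_FEATURE` and `SINGLE_ENTITY` can compare records positionally by index, while `MULTI_ENTITY` must first decide which predicted entity corresponds to which gold entity.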
## Start here
### Why this library exists
This library is designed to make evaluation-driven development practical for extraction workflows. In many teams, building a first workflow is manageable, but building the comparison logic, metric calculations, and reporting layer around it is the part that gets skipped or delayed.
extraction-testing exists to standardize that evaluation layer so teams can spend more effort on the gold set, on the workflow itself, and on structured iteration with domain experts.
### Evaluation flow
- Define one or more `FeatureRule` objects for the fields you want to compare.
- Build a `RunConfig` with the correct `task_type` and any required alignment settings.
- Call `evaluate(predicted_records, gold_records, run_config)` with lists of Pydantic models.
- Inspect the returned `ResultBundle` tables and, if needed, write a text log with `RunLogger`.
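The flow above can be sketched end to end with toy stand-ins. The names (`FeatureRule`, `RunConfig`, `evaluate`, `task_type`) come from this page, but the constructors, scoring logic, and return type below are simplified assumptions, not the library's real API:

```python
from dataclasses import dataclass

# Toy stand-ins mirroring the documented flow. The real library returns a
# ResultBundle with metric tables; here we return a plain dict of
# per-feature exact-match accuracy for illustration.

@dataclass
class FeatureRule:
    name: str       # attribute to compare on each record

@dataclass
class RunConfig:
    task_type: str  # e.g. "SINGLE_ENTITY"
    rules: list     # where rules live on the real RunConfig is an assumption

def evaluate(predicted_records, gold_records, run_config):
    """Score records positionally, one accuracy value per FeatureRule."""
    scores = {}
    for rule in run_config.rules:
        hits = sum(
            getattr(p, rule.name) == getattr(g, rule.name)
            for p, g in zip(predicted_records, gold_records)
        )
        scores[rule.name] = hits / len(gold_records)
    return scores

@dataclass
class Record:
    vendor: str
    total: float

pred = [Record("Acme", 10.0), Record("Acme", 12.5)]
gold = [Record("Acme", 10.0), Record("ACME Corp", 12.5)]
cfg = RunConfig(task_type="SINGLE_ENTITY",
                rules=[FeatureRule("vendor"), FeatureRule("total")])
print(evaluate(pred, gold, cfg))  # vendor matches 1 of 2, total matches 2 of 2
```

Positional `zip` alignment only makes sense for indexed records; the `MULTI_ENTITY` task replaces it with entity matching before any feature is scored.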
### What the docs cover
- Getting Started shows installation, the shortest working example, and how to choose a task type.
- Concepts explains why the library exists and the semantics behind feature normalization, alignment, metrics, logging, and visualization.
- How-To Guides provide cookbook-style workflows for common evaluation tasks.
- API Reference maps the public package surface and will expand into module-level reference pages.