Configure Feature Rules¶
Scenario¶
You want one evaluation run to compare several feature types with the right normalization rules for each field.
Runnable example¶
from typing import Optional
from pydantic import BaseModel
from extraction_testing import FeatureRule, RunConfig, TaskType, evaluate
class ArticleRecord(BaseModel):
row_identifier: int
headline_text: str
view_count: Optional[int]
publish_date: Optional[str]
topic_label: str
feature_rules = [
FeatureRule(
feature_name="headline_text",
feature_type="text",
casefold_text=True,
strip_text=True,
remove_punctuation=True,
),
FeatureRule(
feature_name="view_count",
feature_type="number",
numeric_absolute_tolerance=5,
numeric_rounding_digits=0,
),
FeatureRule(
feature_name="publish_date",
feature_type="date",
date_tolerance_days=1,
),
FeatureRule(
feature_name="topic_label",
feature_type="category",
alias_map={"technology": "tech"},
),
]
predicted_records = [
ArticleRecord(
row_identifier=1,
headline_text="breaking market rally",
view_count=1003,
publish_date="2024-06-02",
topic_label="technology",
)
]
gold_records = [
ArticleRecord(
row_identifier=1,
headline_text="Breaking: Market Rally",
view_count=1000,
publish_date="2024-06-01",
topic_label="tech",
)
]
run_config = RunConfig(
task_type=TaskType.SINGLE_ENTITY,
feature_rules=feature_rules,
index_key_name="row_identifier",
)
result_bundle = evaluate(predicted_records, gold_records, run_config)
print(result_bundle.per_feature_metrics_data_frame)
What this configuration demonstrates¶
headline_textuses text normalization to ignore punctuation and caseview_countallows small numeric driftpublish_dateallows a one-day differencetopic_labeluses an alias map to collapse synonyms
All four features should compare as equal in this example.
Practical selection rules¶
- use
textfor free-form fields like titles or names - use
numberwhen tolerances or rounding matter - use
datewhen day-window matching matters - use
categoryfor canonical labels or enumerated values