cava_nlp.normalisation.normaliser¶

NormalisationResult dataclass ¶

Structured output returned by all normaliser compute() methods.

Attributes:

| Name | Type | Description |
|---|---|---|
| norm | str | The canonical string representation assigned as the token's NORM attribute. |
| attrs | Dict[str, Any] | A mapping of token extension names to values, written directly onto the merged token. Examples include kind="decimal", value=5.4, unit="kg", and date_obj=datetime(...). Because attrs is a free-form dictionary, each normaliser can attach arbitrary structured metadata relevant to its domain. |
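For illustration, here is a minimal sketch of the dataclass as described by the attribute table above. The field names norm and attrs come from the table; any defaults and the exact definition in cava_nlp may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Minimal sketch of NormalisationResult based on the attribute table above;
# the real class in cava_nlp may define extra fields or different defaults.
@dataclass
class NormalisationResult:
    norm: str                                            # canonical string, written to the token's NORM
    attrs: Dict[str, Any] = field(default_factory=dict)  # free-form token extension values

# A decimal normaliser might return:
result = NormalisationResult(norm="5.4", attrs={"kind": "decimal", "value": 5.4, "unit": "kg"})
```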
BaseNormalizer ¶

Defines the interface and processing flow used by normalisation components.
Each subclass implements a spaCy Matcher and a compute() method to extract
structured information from matched spans, plus optional token-level extensions.

Methods:

| Name | Description |
|---|---|
| compute | Override to define how to normalise the matched span. |
| get_spans | Returns matched spans after filtering overlaps. |
| apply | Applies the normaliser to the document, merges spans, assigns NORM, and populates token attributes from the compute() result. |
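The get_spans → compute → apply flow can be sketched without spaCy. In this framework-free sketch a regex stands in for the Matcher and a plain dict stands in for the merged token; everything beyond the three method names is a hypothetical stand-in, not the library's code.

```python
import re
from typing import Any, Dict, List, Tuple

# Framework-free sketch of the BaseNormalizer flow described above.
# Real subclasses use a spaCy Matcher; here a regex stands in for it,
# and a dict stands in for the merged token.
class SketchNormalizer:
    pattern = re.compile(r"\d+\.\d+")  # stand-in for the spaCy Matcher

    def get_spans(self, text: str) -> List[Tuple[int, int]]:
        # Return match spans; re.finditer never yields overlapping matches.
        return [m.span() for m in self.pattern.finditer(text)]

    def compute(self, span_text: str) -> Tuple[str, Dict[str, Any]]:
        # Override point: produce (norm, attrs) for the matched span.
        return span_text, {"kind": "decimal", "value": float(span_text)}

    def apply(self, text: str) -> List[Dict[str, Any]]:
        # Merge each span into one "token", assign its norm, attach attrs.
        tokens = []
        for start, end in self.get_spans(text):
            norm, attrs = self.compute(text[start:end])
            tokens.append({"text": text[start:end], "norm": norm, **attrs})
        return tokens

tokens = SketchNormalizer().apply("Temp is 36.9 today")
# tokens -> [{'text': '36.9', 'norm': '36.9', 'kind': 'decimal', 'value': 36.9}]
```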
DecimalNormalizer ¶

Bases: BaseNormalizer

Re-merges decimal numbers split apart by this pipeline's harsh tokenization; with spaCy's default tokenizer this would be built-in behaviour.

Example: "Temp is 36.9 today" → token ["36.9"] with norm="36.9", value=36.9, kind="decimal".
SciNotNormalizer ¶

Bases: BaseNormalizer

Example: "WCC was 10^9 today" → token ["10^9"] with norm="10.0^9", value=1_000_000_000, kind="scientific", base=10.0, exp=9.
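A plain-Python sketch of the parsing the example above implies. The split on "^" and the float() coercion of the base (which yields the "10.0^9" norm) are assumptions inferred from the example, not the library's implementation.

```python
from typing import Any, Dict, Tuple

def parse_sci_not(token_text: str) -> Tuple[str, Dict[str, Any]]:
    # Split "10^9" into base and exponent; coercing the base to float is
    # what makes the norm read "10.0^9" as in the example (assumed behaviour).
    base_str, exp_str = token_text.split("^")
    base, exp = float(base_str), int(exp_str)
    norm = f"{base}^{exp}"
    attrs = {"kind": "scientific", "value": base ** exp, "base": base, "exp": exp}
    return norm, attrs

norm, attrs = parse_sci_not("10^9")
# norm == "10.0^9"; attrs["value"] == 1_000_000_000.0 (a float here, since base is a float)
```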
DateNormalizer ¶
Bases: BaseNormalizer
TimeNormalizer ¶
Bases: BaseNormalizer
UnitNormalizer ¶
Bases: BaseNormalizer