cava_nlp.normalisation.normaliser¶

NormalisationResult dataclass ¶

Structured output returned by all normaliser compute() methods.

Attributes:

| Name | Type | Description |
|---|---|---|
| norm | str | The canonical string representation assigned as the token's NORM attribute. |
| attrs | Dict[str, Any] | A mapping of token extension names to values, written directly onto the merged token. Examples include kind="decimal", value=5.4, unit="kg", and date_obj=datetime(...). Because attrs is a free-form dictionary, each normaliser can attach arbitrary structured metadata relevant to its domain. |
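For illustration, here is a minimal sketch of the dataclass as described by the attribute table above. The field names norm and attrs come from the table; any defaults and the exact definition in cava_nlp may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Minimal sketch of NormalisationResult based on the attribute table above;
# the real class in cava_nlp may define extra fields or different defaults.
@dataclass
class NormalisationResult:
    norm: str                                            # canonical string, written to the token's NORM
    attrs: Dict[str, Any] = field(default_factory=dict)  # free-form token extension values

# A decimal normaliser might return:
result = NormalisationResult(norm="5.4", attrs={"kind": "decimal", "value": 5.4, "unit": "kg"})
```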
BaseNormalizer ¶

Defines the interface and processing flow used by normalisation components.
Each subclass implements a spaCy Matcher and a compute() method to extract
structured information from matched spans, plus optional token-level extensions.

Methods:

| Name | Description |
|---|---|
| compute | Override to define how to normalise the matched span. |
| get_spans | Returns matched spans after filtering overlaps. |
| apply | Applies the normaliser to the document, merges spans, assigns NORM, and populates token attributes from the compute() result. |
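The get_spans → compute → apply flow can be sketched without spaCy. In this framework-free sketch a regex stands in for the Matcher and a plain dict stands in for the merged token; everything beyond the three method names is a hypothetical stand-in, not the library's code.

```python
import re
from typing import Any, Dict, List, Tuple

# Framework-free sketch of the BaseNormalizer flow described above.
# Real subclasses use a spaCy Matcher; here a regex stands in for it,
# and a dict stands in for the merged token.
class SketchNormalizer:
    pattern = re.compile(r"\d+\.\d+")  # stand-in for the spaCy Matcher

    def get_spans(self, text: str) -> List[Tuple[int, int]]:
        # Return match spans; re.finditer never yields overlapping matches.
        return [m.span() for m in self.pattern.finditer(text)]

    def compute(self, span_text: str) -> Tuple[str, Dict[str, Any]]:
        # Override point: produce (norm, attrs) for the matched span.
        return span_text, {"kind": "decimal", "value": float(span_text)}

    def apply(self, text: str) -> List[Dict[str, Any]]:
        # Merge each span into one "token", assign its norm, attach attrs.
        tokens = []
        for start, end in self.get_spans(text):
            norm, attrs = self.compute(text[start:end])
            tokens.append({"text": text[start:end], "norm": norm, **attrs})
        return tokens

tokens = SketchNormalizer().apply("Temp is 36.9 today")
# tokens -> [{'text': '36.9', 'norm': '36.9', 'kind': 'decimal', 'value': 36.9}]
```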
DecimalNormalizer ¶

Bases: BaseNormalizer

Re-merges decimal numbers split apart by this pipeline's harsh tokenization; with spaCy's default tokenizer this would be built-in behaviour.

Example: "Temp is 36.9 today" → token ["36.9"] with norm="36.9", value=36.9, kind="decimal".
SciNotNormalizer ¶

Bases: BaseNormalizer

Example: "WCC was 10^9 today" → token ["10^9"] with norm="10.0^9", value=1_000_000_000, kind="scientific", base=10.0, exp=9.
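A plain-Python sketch of the parsing the example above implies. The split on "^" and the float() coercion of the base (which yields the "10.0^9" norm) are assumptions inferred from the example, not the library's implementation.

```python
from typing import Any, Dict, Tuple

def parse_sci_not(token_text: str) -> Tuple[str, Dict[str, Any]]:
    # Split "10^9" into base and exponent; coercing the base to float is
    # what makes the norm read "10.0^9" as in the example (assumed behaviour).
    base_str, exp_str = token_text.split("^")
    base, exp = float(base_str), int(exp_str)
    norm = f"{base}^{exp}"
    attrs = {"kind": "scientific", "value": base ** exp, "base": base, "exp": exp}
    return norm, attrs

norm, attrs = parse_sci_not("10^9")
# norm == "10.0^9"; attrs["value"] == 1_000_000_000.0 (a float here, since base is a float)
```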
DateNormalizer ¶
Bases: BaseNormalizer
TimeNormalizer ¶
Bases: BaseNormalizer
UnitNormalizer ¶
Bases: BaseNormalizer