cava_nlp.normalisation.normaliser

NormalisationResult dataclass

NormalisationResult(norm: str, attrs: Dict[str, Any] = dict())

Structured output returned by all normaliser compute() methods.

Attributes:

Name Type Description
norm str

The canonical string representation assigned as the token's NORM attribute after merging. This replaces token.norm_ for the merged span, enabling downstream components to work with a consistent, normalised textual representation (e.g., "5.4", "10^9", "2024-01-03").

attrs Dict[str, Any]

A mapping of token extension names to values. These are written directly into the merged token using:

setattr(token._, ext_name, value)

Examples include:

- kind="decimal"
- value=5.4
- unit="kg"
- date_obj=datetime(...)

Because attrs is a free-form dictionary, each normaliser can specify arbitrary structured metadata relevant to its domain.
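As a rough illustration of the shape described above, the following sketch declares an equivalent dataclass and copies attrs onto a token-like object. It is not the library's source: the `field(default_factory=dict)` default is an assumption (a plain `dict()` default is rejected by `dataclasses`), `write_attrs` is a hypothetical helper, and `SimpleNamespace` stands in for a spaCy token so the snippet runs without spaCy installed.

```python
from dataclasses import dataclass, field
from types import SimpleNamespace
from typing import Any, Dict


@dataclass
class NormalisationResult:
    """Stand-in with the documented fields; the real class may differ."""
    norm: str
    # Assumed default: dataclasses require a factory for mutable defaults.
    attrs: Dict[str, Any] = field(default_factory=dict)


def write_attrs(token: Any, result: NormalisationResult) -> None:
    """Hypothetical helper: copy each attrs entry onto token._ via setattr."""
    for ext_name, value in result.attrs.items():
        setattr(token._, ext_name, value)


# SimpleNamespace mimics the `token._` extension namespace for illustration.
token = SimpleNamespace(_=SimpleNamespace())
result = NormalisationResult(norm="5.4", attrs={"kind": "decimal", "value": 5.4})
write_attrs(token, result)
print(token._.kind, token._.value)
```

With a real spaCy token, each extension name would need to be registered first (for example with `Token.set_extension`) before `setattr` on `token._` succeeds.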

BaseNormalizer

BaseNormalizer(nlp)

Defines the interface and processing flow used by normalisation components.

Each subclass supplies a spaCy Matcher and a compute() method that extracts structured information from matched spans, plus optional token-level extensions.

Methods:

Name Description
compute

Override to define how to normalise the matched span.

get_spans

Returns matched spans after filtering overlaps.

apply

Applies the normaliser to the document, merges spans, assigns NORM, and populates token attributes from the compute() result.

compute

compute(span) -> NormalisationResult

Override in a subclass to return the NormalisationResult for a matched span.

apply

apply(doc)

Run matcher on doc and merge spans.
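Merging requires non-overlapping spans, which is why overlaps are filtered before the merge step. The sketch below shows one plausible policy (prefer the longest match, then the earliest start) on plain `(start, end)` token offsets. The function name `filter_overlaps` and the policy itself are assumptions; the page does not say how BaseNormalizer resolves overlaps. spaCy ships `spacy.util.filter_spans`, which applies a similar longest-first rule to Span objects.

```python
from typing import List, Tuple

Match = Tuple[int, int]  # (start, end) token offsets, end exclusive


def filter_overlaps(matches: List[Match]) -> List[Match]:
    """Keep the longest matches first and drop anything overlapping them."""
    # Longest first; among equal lengths, the earlier start wins.
    ordered = sorted(matches, key=lambda m: (m[1] - m[0], -m[0]), reverse=True)
    kept: List[Match] = []
    seen = set()  # token indices already claimed by a kept match
    for start, end in ordered:
        if any(i in seen for i in range(start, end)):
            continue
        kept.append((start, end))
        seen.update(range(start, end))
    return sorted(kept)  # restore document order


print(filter_overlaps([(2, 3), (2, 5), (4, 6), (7, 8)]))
# → [(2, 5), (7, 8)]
```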

DecimalNormalizer

DecimalNormalizer(nlp)

Bases: BaseNormalizer

Needed because the aggressive tokenization in this pipeline splits decimal numbers apart; spaCy's core tokenizer would otherwise handle them.

Example: "Temp is 36.9 today" yields the token ["36.9"] with norm="36.9", value=36.9, kind="decimal".
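A minimal sketch of what a decimal compute() step might produce for the example above. It assumes the tokenizer has split the number into the pieces "36", "." and "9", which is not stated on this page; `compute_decimal` is a hypothetical free function operating on strings rather than a spaCy span, and the dataclass is a stand-in for the documented one.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class NormalisationResult:
    """Stand-in for the documented result type."""
    norm: str
    attrs: Dict[str, Any] = field(default_factory=dict)


def compute_decimal(pieces: List[str]) -> NormalisationResult:
    """Join the split token texts and parse them as a float (hypothetical)."""
    text = "".join(pieces)
    return NormalisationResult(
        norm=text,
        attrs={"kind": "decimal", "value": float(text)},
    )


# Assumed token split for "36.9"; the real pipeline may split differently.
result = compute_decimal(["36", ".", "9"])
print(result.norm, result.attrs)
```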

apply

apply(doc)

Run matcher on doc and merge spans.

SciNotNormalizer

SciNotNormalizer(nlp)

Bases: BaseNormalizer

Example: "WCC was 10^9 today" yields the token ["10^9"] with norm="10.0^9", value=1_000_000_000, kind="scientific", base=10.0, exp=9.
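The attribute values in the example can be reproduced with a short sketch. It assumes the matched text has the form `base^exp` with an integer exponent and that the base is stored as a float, which would explain the "10.0^9" norm; `compute_scinot` is a hypothetical function, not the library's implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class NormalisationResult:
    """Stand-in for the documented result type."""
    norm: str
    attrs: Dict[str, Any] = field(default_factory=dict)


def compute_scinot(text: str) -> NormalisationResult:
    """Parse 'base^exp' text into the documented attributes (hypothetical)."""
    base_text, exp_text = text.split("^")
    base = float(base_text)  # assumed: base kept as a float
    exp = int(exp_text)      # assumed: integer exponent
    return NormalisationResult(
        norm=f"{base}^{exp}",
        attrs={
            "kind": "scientific",
            "value": base ** exp,
            "base": base,
            "exp": exp,
        },
    )


result = compute_scinot("10^9")
print(result.norm, result.attrs["value"])
```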

apply

apply(doc)

Run matcher on doc and merge spans.

DateNormalizer

DateNormalizer(nlp)

Bases: BaseNormalizer

apply

apply(doc)

Run matcher on doc and merge spans.

TimeNormalizer

TimeNormalizer(nlp)

Bases: BaseNormalizer

apply

apply(doc)

Run matcher on doc and merge spans.

UnitNormalizer

UnitNormalizer(nlp)

Bases: BaseNormalizer

apply

apply(doc)

Run matcher on doc and merge spans.

ClinicalNormalizer

ClinicalNormalizer(nlp)