cava_nlp
cava_nlp is a spaCy pipeline designed for real-world clinical text, with
a particular focus on cancer-related text: pathology reports, progress notes,
registry extracts, and free-text fields shaped by clinical workflows rather than linguistic norms.
It prioritises text that is:
- notation-heavy (e.g. `mg/kg`, `10^9`, `ECOG 1`)
- inconsistently spaced or punctuated
- rich in abbreviations and symbols
- dependent on structured meaning across critical token types (e.g. dates, dosages, measurements)
1. Create a CaVa pipeline
cava_nlp pipelines are constructed explicitly and do not rely on pretrained language models.
Out of the box, a CaVa pipeline provides:
- deterministic clinical tokenisation
- whitespace normalisation
- email masking prior to tokenisation
- medSpaCy sentence splitting
The original structure of the text is preserved while protecting tokenisation and downstream rules from incidental artefacts.
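One way to picture how masking and whitespace normalisation can leave the original structure intact is to replace characters one-for-one, so that every offset in the cleaned text still points at the same place in the raw text. The sketch below uses only the standard library. The function names, the simplified email pattern, and the one-for-one replacement strategy are assumptions made for illustration, not cava_nlp's actual implementation.

```python
import re

# Deliberately simple email pattern, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def mask_emails(text: str, fill: str = "x") -> str:
    """Replace each email with a same-length placeholder so offsets do not shift."""
    return EMAIL_RE.sub(lambda m: fill * len(m.group(0)), text)


def normalise_whitespace(text: str) -> str:
    """Map each tab, newline, or non-breaking space to one plain space."""
    return re.sub(r"[\t\r\n\u00a0]", " ", text)


raw = "Seen by dr.smith@example.org\ttoday.\nECOG 1"
clean = normalise_whitespace(mask_emails(raw))

# Same length, so a character offset in `clean` is valid in `raw` as well.
print(len(raw) == len(clean))
print(clean)
```

Because the lengths match, a span found in the cleaned text can be mapped straight back onto the raw note without an alignment table.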
2. Add clinical normalisation
Clinical text often expresses structured meaning across multiple tokens
(e.g. decimals, dates, units).
The clinical normaliser merges these spans and assigns canonical forms.
Inspecting tokens:
Example output:
Key points:
- spans may be merged into single tokens
- `token.norm_` provides the canonical representation
- structured values are stored on token extensions
- the original text remains intact
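The key points above can be illustrated without spaCy. The following is a minimal sketch of the idea only: the `Tok` class, the `merge_decimals` helper, and the choice to treat a comma between digits as a decimal separator are all hypothetical and are not part of cava_nlp.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Tok:
    text: str                      # original surface text, never modified
    norm: str                      # canonical form
    value: Optional[float] = None  # structured value, if any


def merge_decimals(pieces: List[str]) -> List[Tok]:
    """Merge runs such as ["10", ",", "5"] into a single numeric token."""
    out: List[Tok] = []
    i = 0
    while i < len(pieces):
        window = pieces[i:i + 3]
        if (len(window) == 3 and window[0].isdigit()
                and window[1] in {".", ","} and window[2].isdigit()):
            # Assumption: a comma between digits is a decimal separator.
            norm = f"{window[0]}.{window[2]}"
            out.append(Tok("".join(window), norm, float(norm)))
            i += 3
        else:
            out.append(Tok(pieces[i], pieces[i].lower()))
            i += 1
    return out


tokens = merge_decimals(["Hb", "10", ",", "5", "g/dL"])
for tok in tokens:
    print(tok.text, tok.norm, tok.value)
```

The merged token keeps its surface text (`10,5`) while exposing a canonical form and a numeric value, which mirrors the split between original text, `token.norm_`, and token extensions described above.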
3. Normalisation produces span groups
Normalised tokens are grouped automatically:
These span groups allow downstream components to reason over normalised clinical concepts without re-parsing raw text.
4. Add a rule engine
Rule engines layer domain-specific meaning on top of normalisation. Each engine is added independently.
```python
# `nlp` is the CaVa pipeline created in step 1.
nlp.add_pipe(
    "rule_engine",
    name="ecog_value",
    config={
        "engine_config_path": None,
        "component_name": "ecog_status",
    },
)
```
Processing text:
Results:
```python
[(ent.text, ent.label_) for ent in doc.ents]
# [('ECOG 1', 'ecog_status')]
```
Structured values are available via span attributes:
```python
doc.spans["ecog_status"][0]._.value
# 1
```
Rule engines support:
- value extraction and aggregation
- exclusions and fallbacks
- literal or computed assignments
- emission of both entities and span groups
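To make the ideas of value extraction, exclusions, fallbacks, and literal versus computed assignments more concrete, here is a small regex-based sketch. It is an assumption-laden illustration and does not reflect cava_nlp's configuration schema or internals: the `Rule` and `Hit` classes, the `apply_rule` function, the ECOG pattern, and the `ECOG-ACRIN` exclusion are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class Rule:
    label: str
    pattern: re.Pattern                   # group 1 captures the value, if present
    compute: Callable[[re.Match], Any]    # computed assignment from the match
    exclude: Optional[re.Pattern] = None  # matches starting with this are dropped
    fallback: Any = None                  # literal used when no value is captured


@dataclass
class Hit:
    text: str
    label: str
    value: Any


def apply_rule(text: str, rule: Rule) -> List[Hit]:
    hits: List[Hit] = []
    for m in rule.pattern.finditer(text):
        # Exclusion: skip mentions that begin an excluded phrase.
        if rule.exclude is not None and rule.exclude.match(text, m.start()):
            continue
        value = rule.compute(m) if m.group(1) is not None else rule.fallback
        hits.append(Hit(m.group(0), rule.label, value))
    return hits


ecog = Rule(
    label="ecog_status",
    pattern=re.compile(r"\bECOG(?:\s+PS)?(?:\s*[:=]?\s*([0-4])\b)?", re.IGNORECASE),
    compute=lambda m: int(m.group(1)),
    exclude=re.compile(r"ECOG-ACRIN", re.IGNORECASE),
    fallback="unknown",
)

hits = apply_rule("ECOG 1 today. Enrolled via ECOG-ACRIN. ECOG not recorded.", ecog)
for hit in hits:
    print(hit)
```

In this sketch the first mention yields a computed integer value, the trial-group mention is excluded, and the mention without a score receives the literal fallback.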
Relevant contributions
- Tokenisation: clinically oriented, deterministic, and explicit.
- Normalisation: merge spans, assign canonical forms, attach structure.
- Rule engines: map patterns to domain meaning and values based on configuration definitions.