Oncology NLP Resource Library
Reusable prompts, extractors, schemas, and tools for clinical oncology NLP
Prompt Curation¶
Collaborative Prompt Library¶
A cross-site approach for reliable, versioned, and reproducible prompt engineering
Goal
Set up and maintain a shared prompt library
- Prompts are reproducible across teams and compute environments
- Outputs follow a common schema, so downstream pipelines (ETL, dashboards, validation interfaces) can rely on them
- Methods can be peer-reviewed, improved, and benchmarked transparently
- Tools and patterns can be re-used across cancer domains
- Version control tracks lineage of every change to prompt logic or output structure
This collaborative approach lets teams contribute LLM-based development into a shared ecosystem.
Prompt Library Operations¶
Prompt-O is a lightweight linkML-based toolchain with a few simple wrapper tools for common use-cases.
Each extraction target (e.g., conditions, performance status, radiotherapy region) is represented using a LinkML schema
LinkML provides
- A clear data contract
- Strong typing (e.g., enums, ranges, nested objects)
- Ability to generate pydantic models via
gen-pydantic - Portable schemas that can be reused (and extended) outside this project
- Validation of LLM outputs
- Automatic JSON conversion
- Type-safe integration with downstream python code
Note
These schemas define what an LLM must return. They do not dictate how extraction is done.
Prompts themselves are authored and stored in plain YAML files.
Each prompt definition file contains:
system: instructions to the modelinstruction: task descriptionoutput_model: the LinkML/Pydantic model to use (seeks tree_root class, or if not, defaults to camel case version of the file name)examples: zero-shot, few-shot, or scenario-based examples
name: oncology_condition_extraction
prompt_type: few_shot
system: >
Act as a medical data entry specialist.
You are reading notes from a clinical record that contains ...
instruction: >
Extract all conditions mentioned in the text...
output_model: condition_model
examples:
- input: "This 72-year-old male has a new diagnosis of prostate cancer"
output: |
{"conditions": [{"label":"prostate cancer", "verbatim_name": "prostate cancer", "codable_name": "prostate cancer", "who_diagnosed": "subject", "status":"active", "is_negated": false}]}
class WhoDiagnosed(str, Enum):
family = "family"
subject = "subject"
unknown = "unknown"
class OMOPEnum(ConfiguredBaseModel):
linkml_meta: ClassVar[LinkMLMeta] = LinkMLMeta({'abstract': True, 'from_schema': 'omop:convention'})
concept_name: Optional[str] = Field(default=None, json_schema_extra = { "linkml_meta": {'domain_of': ['OMOPEnum', 'ConditionDiagnosisSeverity']} })
concept_id: Optional[str] = Field(default=None, json_schema_extra = { "linkml_meta": {'annotations': {'prompt.skip': {'tag': 'prompt.skip', 'value': 'true'}},
'comments': ['this is populated during the grounding and normalization step'],
'domain_of': ['OMOPEnum']} })
class ConditionDiagnosisSeverity(OMOPEnum):
linkml_meta: ClassVar[LinkMLMeta] = LinkMLMeta({'annotations': {'meaning': {'tag': 'meaning', 'value': 'concept_id'}},'from_schema': 'ohdsi:condition'})
concept_name: Optional[ConditionDiagnosisSeverityEnum] = Field(default=None, json_schema_extra = { "linkml_meta": {'domain_of': ['OMOPEnum', 'ConditionDiagnosisSeverity']} })
concept_id: Optional[str] = Field(default=None, description="""concept_id for the enum label""", json_schema_extra = { "linkml_meta": {'annotations': {'prompt.skip': {'tag': 'prompt.skip', 'value': 'true'}},
'comments': ['this is populated during the grounding and normalization step'],
'domain_of': ['OMOPEnum']} })
class Condition(ConfiguredBaseModel):
linkml_meta: ClassVar[LinkMLMeta] = LinkMLMeta({'from_schema': 'ohdsi:condition'})
label: Optional[str] = Field(default=None, json_schema_extra = { "linkml_meta": {'domain_of': ['OMOPHierarchy', 'Condition']} })
verbatim_name: Optional[str] = Field(default=None, json_schema_extra = { "linkml_meta": {'domain_of': ['Condition']} })
who_diagnosed: Optional[WhoDiagnosed] = Field(default='unknown', json_schema_extra = { "linkml_meta": {'domain_of': ['Condition'], 'ifabsent': 'string("unknown")'} })
is_negated: Optional[bool] = Field(default=None, json_schema_extra = { "linkml_meta": {'domain_of': ['Condition']} })
severity: Optional[ConditionDiagnosisSeverity] = Field(default=None, json_schema_extra = { "linkml_meta": {'domain_of': ['Condition']} })
Recommended Workflow¶
- Identify an endpoint e.g. ECOG, radiotherapy region, variant of interest
- Define or update the LinkML schema, ensuring output fields match analytic needs
- The schema definitions can be submitted to the OntoGPT tool-specific configuration folder
- Generate the Pydantic model
- Create a new prompt file
- Write or refine the system prompt + examples, includin few-shot examples whenever possible
- Validate prompts
- Submit draft to prompt library for network validation & iteration
Note: Detailed instructions for steps 3 through 6 can be found at https://github.com/AustralianCancerDataNetwork/prompt-o