nnUNet Data Preparation#

nnUNet is a self-configuring method for deep learning-based biomedical image segmentation. However, it requires data to be laid out in a specific way on the file system. In this notebook, we demonstrate functionality for preparing a dataset converted by PyDicer for training with nnUNet.

Note: PyDicer currently only supports nnUNet v1. Contributions adding support for nnUNet v2 are welcome.

[1]:
try:
    from pydicer import PyDicer
except ImportError:
    !pip install pydicer
    from pydicer import PyDicer

import os
import logging

from pathlib import Path

from pydicer.utils import fetch_converted_test_data

from pydicer.dataset.nnunet import NNUNetDataset
from pydicer.dataset.structureset import StructureSet

Setup nnUNet#

Consult the nnUNet documentation for details on how to install nnUNet, set up folder paths and conduct model training. The dataset will be prepared in the nnUNet_raw_data_base directory. If this is already set in your environment, you can remove the following cell. For demonstration purposes, we set our nnUNet_raw_data_base to a scratch directory.

[2]:
os.environ["nnUNet_raw_data_base"] = "./nnScratch"
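
If the variable may already be set in your environment, a small guard avoids overwriting it. The following is a minimal sketch using only the standard library; the helper name `ensure_raw_data_base` is hypothetical and is not part of PyDicer or nnUNet.

```python
import os


def ensure_raw_data_base(default="./nnScratch"):
    """Return nnUNet_raw_data_base, setting it to `default` only if it is unset."""
    # setdefault leaves an existing value untouched
    return os.environ.setdefault("nnUNet_raw_data_base", default)


raw_data_base = ensure_raw_data_base()
print(f"nnUNet_raw_data_base = {raw_data_base}")
```

With this guard the cell can stay in the notebook: an existing setting is respected and the scratch directory is only used as a fallback.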

Setup PyDicer#

For this example, we will use the LCTSC test data which has already been converted using PyDicer. We also initialise our PyDicer object.

For working with nnUNet, we set the PyDicer logging verbosity to INFO, so that we can see the relevant output being generated by the tool.

[3]:
working_directory = fetch_converted_test_data("./testdata_lctsc", dataset="LCTSC")
pydicer = PyDicer(working_directory)
pydicer.set_verbosity(logging.INFO)
Working directory %s aready exists, won't download test data.

Define Structures#

PyDicer uses the structure name mapping functionality to determine which structures to train the nnUNet model for. Here we add a structure name mapping for this task.

[4]:
mapping_id = "nnunet_lctsc"
mapping = {
    "Esophagus": [],
    "Heart": [],
    "Lung_L": ["L_Lung", "Lung_Left"],
    "Lung_R": ["Lung_Right"],
    "SpinalCord": ["SC"],
}

pydicer.add_structure_name_mapping(
    mapping_id=mapping_id,
    mapping_dict=mapping
)
pydicer.utils - INFO - Adding mapping for project in testdata_lctsc/.pydicer
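
To illustrate what such a mapping expresses, the sketch below resolves a raw structure name to its standard name. It is an illustration of the concept only, not PyDicer's internal implementation: the function `resolve_structure_name` is hypothetical, and exact (case-sensitive) matching is an assumption.

```python
def resolve_structure_name(name, mapping):
    """Return the standard name for `name`, or None if it is not mapped."""
    for standard_name, variations in mapping.items():
        # Assumption: names must match exactly, including case
        if name == standard_name or name in variations:
            return standard_name
    return None


mapping = {
    "Esophagus": [],
    "Heart": [],
    "Lung_L": ["L_Lung", "Lung_Left"],
    "Lung_R": ["Lung_Right"],
    "SpinalCord": ["SC"],
}

for raw_name in ["Lung_Left", "SC", "Heart", "Body"]:
    print(raw_name, "->", resolve_structure_name(raw_name, mapping))
```

Here "Body" resolves to None because it is neither a standard name nor a listed variation, which is the kind of gap the structure name check later in this notebook is meant to surface.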

Initialise NNUNetDataset object#

The NNUNetDataset class provides the functionality to prepare a dataset from PyDicer data. Here we create an object of this class for use in this example. Check out the documentation for more information on how the NNUNetDataset class works.

[5]:
nnunet_task_id = 123
nnunet_task_name = "LCTSC_Test"
nnunet_task_description = "A dummy nnUNet task for demonstration purposes"

nnunet = NNUNetDataset(
    working_directory,
    nnunet_task_id,
    nnunet_task_name,
    nnunet_task_description,
    mapping_id=mapping_id
)
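
The task ID and task name together determine the name of the task folder that nnUNet v1 expects. As a rough sketch of that naming convention (the helper `task_folder_name` is hypothetical, and zero-padding the ID to three digits is an assumption based on the nnUNet v1 convention rather than on PyDicer's code):

```python
def task_folder_name(task_id, task_name):
    """Build a folder name of the form TaskXXX_Name."""
    # Assumption: the task ID is zero-padded to three digits
    return f"Task{task_id:03d}_{task_name}"


print(task_folder_name(123, "LCTSC_Test"))  # prints Task123_LCTSC_Test
```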

Inspect Dataset#

Our NNUNetDataset tool expects to have exactly one image and one structure set per patient (multi-modal training not yet supported, contributions welcome). Let’s fetch our converted DataFrame to confirm that this is the case.

If your dataset isn’t yet in such a state, you can use the dataset preparation module in PyDicer to prepare a subset of data. Once the dataset is prepared, pass the dataset_name argument when creating the NNUNetDataset object above.

[6]:
df = pydicer.read_converted_data(working_directory)
df
[6]:
sop_instance_uid hashed_uid modality patient_id series_uid for_uid referenced_sop_instance_uid path
0 1.3.6.1.4.1.14519.5.2.1.7014.4598.263565217029... 88c5ef CT LCTSC-Train-S1-006 1.3.6.1.4.1.14519.5.2.1.7014.4598.770569242984... 1.3.6.1.4.1.14519.5.2.1.7014.4598.325217978839... NaN testdata_lctsc/data/LCTSC-Train-S1-006/images/...
1 1.3.6.1.4.1.14519.5.2.1.7014.4598.161109536301... ceb111 RTSTRUCT LCTSC-Train-S1-006 1.3.6.1.4.1.14519.5.2.1.7014.4598.161109536301... 1.3.6.1.4.1.14519.5.2.1.7014.4598.325217978839... 1.3.6.1.4.1.14519.5.2.1.7014.4598.263565217029... testdata_lctsc/data/LCTSC-Train-S1-006/structu...
2 1.3.6.1.4.1.14519.5.2.1.7014.4598.217118546740... dd0026 CT LCTSC-Train-S1-008 1.3.6.1.4.1.14519.5.2.1.7014.4598.135073567254... 1.3.6.1.4.1.14519.5.2.1.7014.4598.243364093992... NaN testdata_lctsc/data/LCTSC-Train-S1-008/images/...
3 1.3.6.1.4.1.14519.5.2.1.7014.4598.225487254421... 48a970 RTSTRUCT LCTSC-Train-S1-008 1.3.6.1.4.1.14519.5.2.1.7014.4598.225487254421... 1.3.6.1.4.1.14519.5.2.1.7014.4598.243364093992... 1.3.6.1.4.1.14519.5.2.1.7014.4598.217118546740... testdata_lctsc/data/LCTSC-Train-S1-008/structu...
4 1.3.6.1.4.1.14519.5.2.1.7014.4598.235489581364... 914d57 CT LCTSC-Test-S1-102 1.3.6.1.4.1.14519.5.2.1.7014.4598.639871532605... 1.3.6.1.4.1.14519.5.2.1.7014.4598.408067568497... NaN testdata_lctsc/data/LCTSC-Test-S1-102/images/9...
5 1.3.6.1.4.1.14519.5.2.1.7014.4598.110977663386... 6c6ea4 RTSTRUCT LCTSC-Test-S1-102 1.3.6.1.4.1.14519.5.2.1.7014.4598.110977663386... 1.3.6.1.4.1.14519.5.2.1.7014.4598.408067568497... 1.3.6.1.4.1.14519.5.2.1.7014.4598.235489581364... testdata_lctsc/data/LCTSC-Test-S1-102/structur...
6 1.3.6.1.4.1.14519.5.2.1.7014.4598.140943693489... 738c1a CT LCTSC-Train-S1-005 1.3.6.1.4.1.14519.5.2.1.7014.4598.338518041666... 1.3.6.1.4.1.14519.5.2.1.7014.4598.964812114328... NaN testdata_lctsc/data/LCTSC-Train-S1-005/images/...
7 1.3.6.1.4.1.14519.5.2.1.7014.4598.214803404117... 68d663 RTSTRUCT LCTSC-Train-S1-005 1.3.6.1.4.1.14519.5.2.1.7014.4598.214803404117... 1.3.6.1.4.1.14519.5.2.1.7014.4598.964812114328... 1.3.6.1.4.1.14519.5.2.1.7014.4598.140943693489... testdata_lctsc/data/LCTSC-Train-S1-005/structu...
8 1.3.6.1.4.1.14519.5.2.1.7014.4598.141349678572... d91c84 CT LCTSC-Train-S1-004 1.3.6.1.4.1.14519.5.2.1.7014.4598.269433294341... 1.3.6.1.4.1.14519.5.2.1.7014.4598.313160975008... NaN testdata_lctsc/data/LCTSC-Train-S1-004/images/...
9 1.3.6.1.4.1.14519.5.2.1.7014.4598.595315284787... 61758b RTSTRUCT LCTSC-Train-S1-004 1.3.6.1.4.1.14519.5.2.1.7014.4598.595315284787... 1.3.6.1.4.1.14519.5.2.1.7014.4598.313160975008... 1.3.6.1.4.1.14519.5.2.1.7014.4598.141349678572... testdata_lctsc/data/LCTSC-Train-S1-004/structu...
10 1.3.6.1.4.1.14519.5.2.1.7014.4598.188371727865... 6834c9 CT LCTSC-Train-S1-001 1.3.6.1.4.1.14519.5.2.1.7014.4598.330486033168... 1.3.6.1.4.1.14519.5.2.1.7014.4598.109432485688... NaN testdata_lctsc/data/LCTSC-Train-S1-001/images/...
11 1.3.6.1.4.1.14519.5.2.1.7014.4598.267594131248... b5bddb RTSTRUCT LCTSC-Train-S1-001 1.3.6.1.4.1.14519.5.2.1.7014.4598.267594131248... 1.3.6.1.4.1.14519.5.2.1.7014.4598.109432485688... 1.3.6.1.4.1.14519.5.2.1.7014.4598.188371727865... testdata_lctsc/data/LCTSC-Train-S1-001/structu...
12 1.3.6.1.4.1.14519.5.2.1.7014.4598.318848546630... aa38e6 CT LCTSC-Train-S1-002 1.3.6.1.4.1.14519.5.2.1.7014.4598.234842392725... 1.3.6.1.4.1.14519.5.2.1.7014.4598.145984743865... NaN testdata_lctsc/data/LCTSC-Train-S1-002/images/...
13 1.3.6.1.4.1.14519.5.2.1.7014.4598.291449913947... f036b8 RTSTRUCT LCTSC-Train-S1-002 1.3.6.1.4.1.14519.5.2.1.7014.4598.291449913947... 1.3.6.1.4.1.14519.5.2.1.7014.4598.145984743865... 1.3.6.1.4.1.14519.5.2.1.7014.4598.318848546630... testdata_lctsc/data/LCTSC-Train-S1-002/structu...
14 1.3.6.1.4.1.14519.5.2.1.7014.4598.159497309342... 2bf2f9 CT LCTSC-Train-S1-003 1.3.6.1.4.1.14519.5.2.1.7014.4598.742383887179... 1.3.6.1.4.1.14519.5.2.1.7014.4598.803594135292... NaN testdata_lctsc/data/LCTSC-Train-S1-003/images/...
15 1.3.6.1.4.1.14519.5.2.1.7014.4598.127494043495... 8e34f9 RTSTRUCT LCTSC-Train-S1-003 1.3.6.1.4.1.14519.5.2.1.7014.4598.127494043495... 1.3.6.1.4.1.14519.5.2.1.7014.4598.803594135292... 1.3.6.1.4.1.14519.5.2.1.7014.4598.159497309342... testdata_lctsc/data/LCTSC-Train-S1-003/structu...
16 1.3.6.1.4.1.14519.5.2.1.7014.4598.333558838494... 666be6 CT LCTSC-Test-S1-101 1.3.6.1.4.1.14519.5.2.1.7014.4598.106943890850... 1.3.6.1.4.1.14519.5.2.1.7014.4598.171234424242... NaN testdata_lctsc/data/LCTSC-Test-S1-101/images/6...
17 1.3.6.1.4.1.14519.5.2.1.7014.4598.929007819506... cc682f RTSTRUCT LCTSC-Test-S1-101 1.3.6.1.4.1.14519.5.2.1.7014.4598.280355341349... 1.3.6.1.4.1.14519.5.2.1.7014.4598.171234424242... 1.3.6.1.4.1.14519.5.2.1.7014.4598.333558838494... testdata_lctsc/data/LCTSC-Test-S1-101/structur...
18 1.3.6.1.4.1.14519.5.2.1.7014.4598.168227681755... 5adf40 CT LCTSC-Train-S1-007 1.3.6.1.4.1.14519.5.2.1.7014.4598.335375068555... 1.3.6.1.4.1.14519.5.2.1.7014.4598.187709044743... NaN testdata_lctsc/data/LCTSC-Train-S1-007/images/...
19 1.3.6.1.4.1.14519.5.2.1.7014.4598.284338872489... ed6686 RTSTRUCT LCTSC-Train-S1-007 1.3.6.1.4.1.14519.5.2.1.7014.4598.284338872489... 1.3.6.1.4.1.14519.5.2.1.7014.4598.187709044743... 1.3.6.1.4.1.14519.5.2.1.7014.4598.168227681755... testdata_lctsc/data/LCTSC-Train-S1-007/structu...

Check Dataset#

The check_dataset function confirms that we have one image and one structure set per patient in our dataset.

[7]:
nnunet.check_dataset()
pydicer.dataset.nnunet - INFO - Dataset OK
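
The rule being checked can be sketched in a few lines of plain Python. This is not how check_dataset is implemented; it is a hypothetical helper that works on rows with the patient_id and modality columns of the converted data, and it assumes the image modality is CT as in this example dataset.

```python
from collections import Counter


def find_invalid_patients(rows):
    """Return patients that do not have exactly one CT and one RTSTRUCT."""
    counts = Counter((row["patient_id"], row["modality"]) for row in rows)
    patients = {row["patient_id"] for row in rows}
    return sorted(
        p for p in patients
        if counts[(p, "CT")] != 1 or counts[(p, "RTSTRUCT")] != 1
    )


rows = [
    {"patient_id": "LCTSC-Train-S1-006", "modality": "CT"},
    {"patient_id": "LCTSC-Train-S1-006", "modality": "RTSTRUCT"},
    {"patient_id": "LCTSC-Train-S1-008", "modality": "CT"},
    {"patient_id": "LCTSC-Train-S1-008", "modality": "RTSTRUCT"},
]
print(find_invalid_patients(rows))  # prints []
```

A patient with a second CT series, or with no structure set at all, would appear in the returned list.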

Split Dataset#

Here we randomly split our dataset into a training and testing set. You can specify the training_cases and testing_cases to use in the split_dataset function. If these aren’t supplied, the train_test_split function from sklearn will be used. You can pass keyword arguments to this function via the split_dataset function.

[8]:
nnunet.split_dataset()
pydicer.dataset.nnunet - INFO - Dataset split OK
pydicer.dataset.nnunet - INFO - Training cases: ['LCTSC-Train-S1-006', 'LCTSC-Train-S1-003', 'LCTSC-Test-S1-102', 'LCTSC-Train-S1-007', 'LCTSC-Train-S1-004', 'LCTSC-Train-S1-005', 'LCTSC-Train-S1-002']
pydicer.dataset.nnunet - INFO - Testing cases: ['LCTSC-Test-S1-101', 'LCTSC-Train-S1-008', 'LCTSC-Train-S1-001']
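
If you would rather fix the split yourself so that it is reproducible, you can build the two lists up front. The sketch below uses only the standard library; `split_cases`, its `test_fraction` and `seed` parameters, and the way the split size is rounded are all assumptions made for illustration.

```python
import random


def split_cases(cases, test_fraction=0.3, seed=42):
    """Shuffle case IDs reproducibly and split into training and testing lists."""
    shuffled = sorted(cases)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


cases = [f"LCTSC-Train-S1-00{i}" for i in range(1, 9)]
cases += ["LCTSC-Test-S1-101", "LCTSC-Test-S1-102"]

training_cases, testing_cases = split_cases(cases)
print(len(training_cases), len(testing_cases))  # prints 7 3

# The lists could then be supplied via the training_cases and testing_cases
# arguments mentioned above, for example:
# nnunet.split_dataset(training_cases=training_cases, testing_cases=testing_cases)
```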

Check for Duplicate Data#

Now that the dataset is split, we must ensure that none of the training_cases are present in the testing_cases. Even when cases have different IDs, anonymisation may have assigned the same patient two different IDs. The check_duplicates_train_test function checks the imaging data to ensure there are no duplicates.

[9]:
nnunet.check_duplicates_train_test()
pydicer.dataset.nnunet - INFO - No duplicate images found in training and testing sets
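
One simple way to detect identical images that carry different IDs is to compare hashes of the image data. The sketch below shows that idea on in-memory bytes. How check_duplicates_train_test actually compares images is not described here, so treat this as a stand-in technique; `find_duplicate_images` is a hypothetical helper and it only catches exact byte-for-byte duplicates.

```python
import hashlib


def find_duplicate_images(training_images, testing_images):
    """Return (train_id, test_id) pairs whose image bytes are identical."""
    train_hashes = {}
    for case_id, data in training_images.items():
        digest = hashlib.sha256(data).hexdigest()
        train_hashes.setdefault(digest, []).append(case_id)

    duplicates = []
    for case_id, data in testing_images.items():
        digest = hashlib.sha256(data).hexdigest()
        for train_id in train_hashes.get(digest, []):
            duplicates.append((train_id, case_id))
    return duplicates


training_images = {"case_A": b"\x00\x01\x02", "case_B": b"\x03\x04\x05"}
testing_images = {"case_C": b"\x00\x01\x02"}  # same bytes as case_A under a different ID
print(find_duplicate_images(training_images, testing_images))  # prints [('case_A', 'case_C')]
```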

Check Structure Names#

The nnUNet requires that all structures are present for all cases (missing structures are not supported). The check_structure_names function will output a grid indicating where structures might be missing (or a structure name mapping is missing).

If any structures are missing for any case, this should be resolved (by adding a structure name mapping or removing the case from the dataset) before proceeding.

[10]:
df_results = nnunet.check_structure_names()
df_results
[10]:
  patient_id struct_hash Esophagus Heart Lung_L Lung_R SpinalCord
0 LCTSC-Train-S1-006 ceb111 1 1 1 1 1
1 LCTSC-Train-S1-008 48a970 1 1 1 1 1
2 LCTSC-Test-S1-102 6c6ea4 1 1 1 1 1
3 LCTSC-Train-S1-005 68d663 1 1 1 1 1
4 LCTSC-Train-S1-004 61758b 1 1 1 1 1
5 LCTSC-Train-S1-001 b5bddb 1 1 1 1 1
6 LCTSC-Train-S1-002 f036b8 1 1 1 1 1
7 LCTSC-Train-S1-003 8e34f9 1 1 1 1 1
8 LCTSC-Test-S1-101 cc682f 1 1 1 1 1
9 LCTSC-Train-S1-007 ed6686 1 1 1 1 1
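
For larger datasets it can help to list only the problem cases rather than scan the grid by eye. The following sketch works on plain dictionaries, one per row of the grid. The helper `find_missing_structures` and the second example row are hypothetical, and treating any value other than 1 as "missing" is an assumption, since the grid above contains only 1s.

```python
def find_missing_structures(results, structure_names):
    """Map each patient_id to the list of structures not flagged as present."""
    missing = {}
    for row in results:
        # Assumption: 1 means present, anything else means missing
        absent = [name for name in structure_names if row.get(name, 0) != 1]
        if absent:
            missing[row["patient_id"]] = absent
    return missing


structure_names = ["Esophagus", "Heart", "Lung_L", "Lung_R", "SpinalCord"]
results = [
    {"patient_id": "LCTSC-Train-S1-006", "Esophagus": 1, "Heart": 1,
     "Lung_L": 1, "Lung_R": 1, "SpinalCord": 1},
    # Hypothetical case with a missing structure, for illustration only
    {"patient_id": "EXAMPLE-001", "Esophagus": 1, "Heart": 1,
     "Lung_L": 0, "Lung_R": 1, "SpinalCord": 1},
]
print(find_missing_structures(results, structure_names))  # prints {'EXAMPLE-001': ['Lung_L']}
```

If the grid is a pandas DataFrame, `df_results.to_dict("records")` produces rows in this form.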

Check for Overlapping Structures#

nnUNet (v1) cannot handle overlapping structures. Where structures overlap, PyDicer assigns the overlapping voxels to the smaller structure (to assign them to the larger structure instead, set nnunet.assign_overlap_to_largest=True).

The check_overlapping_structures function will log any structures which are overlapping and will be affected by this rule.

[11]:
nnunet.check_overlapping_structures()
Esophagus overlaps with Heart for patient LCTSC-Train-S1-006 structure set ceb111
Esophagus overlaps with Lung_R for patient LCTSC-Train-S1-006 structure set ceb111
Heart overlaps with Lung_L for patient LCTSC-Train-S1-006 structure set ceb111
Heart overlaps with Lung_R for patient LCTSC-Train-S1-006 structure set ceb111
Lung_L overlaps with Lung_R for patient LCTSC-Train-S1-006 structure set ceb111
Esophagus overlaps with Heart for patient LCTSC-Train-S1-008 structure set 48a970
Heart overlaps with Lung_L for patient LCTSC-Train-S1-008 structure set 48a970
Heart overlaps with Lung_R for patient LCTSC-Train-S1-008 structure set 48a970
Esophagus overlaps with Heart for patient LCTSC-Test-S1-102 structure set 6c6ea4
Esophagus overlaps with Lung_L for patient LCTSC-Test-S1-102 structure set 6c6ea4
Esophagus overlaps with Lung_R for patient LCTSC-Test-S1-102 structure set 6c6ea4
Heart overlaps with Lung_L for patient LCTSC-Test-S1-102 structure set 6c6ea4
Heart overlaps with Lung_R for patient LCTSC-Test-S1-102 structure set 6c6ea4
Heart overlaps with Lung_L for patient LCTSC-Train-S1-005 structure set 68d663
Heart overlaps with Lung_R for patient LCTSC-Train-S1-005 structure set 68d663
Lung_L overlaps with Lung_R for patient LCTSC-Train-S1-005 structure set 68d663
Esophagus overlaps with Heart for patient LCTSC-Train-S1-004 structure set 61758b
Esophagus overlaps with Lung_L for patient LCTSC-Train-S1-004 structure set 61758b
Esophagus overlaps with Lung_R for patient LCTSC-Train-S1-004 structure set 61758b
Heart overlaps with Lung_L for patient LCTSC-Train-S1-004 structure set 61758b
Heart overlaps with Lung_R for patient LCTSC-Train-S1-004 structure set 61758b
Lung_L overlaps with Lung_R for patient LCTSC-Train-S1-004 structure set 61758b
Esophagus overlaps with Heart for patient LCTSC-Train-S1-001 structure set b5bddb
Esophagus overlaps with Lung_R for patient LCTSC-Train-S1-001 structure set b5bddb
Heart overlaps with Lung_L for patient LCTSC-Train-S1-001 structure set b5bddb
Heart overlaps with Lung_R for patient LCTSC-Train-S1-001 structure set b5bddb
Esophagus overlaps with Lung_R for patient LCTSC-Train-S1-002 structure set f036b8
Heart overlaps with Lung_L for patient LCTSC-Train-S1-002 structure set f036b8
Heart overlaps with Lung_R for patient LCTSC-Train-S1-002 structure set f036b8
Lung_L overlaps with Lung_R for patient LCTSC-Train-S1-002 structure set f036b8
Esophagus overlaps with Heart for patient LCTSC-Train-S1-003 structure set 8e34f9
Esophagus overlaps with Lung_L for patient LCTSC-Train-S1-003 structure set 8e34f9
Esophagus overlaps with Lung_R for patient LCTSC-Train-S1-003 structure set 8e34f9
Heart overlaps with Lung_L for patient LCTSC-Train-S1-003 structure set 8e34f9
Heart overlaps with Lung_R for patient LCTSC-Train-S1-003 structure set 8e34f9
Lung_L overlaps with Lung_R for patient LCTSC-Train-S1-003 structure set 8e34f9
Esophagus overlaps with Heart for patient LCTSC-Test-S1-101 structure set cc682f
Heart overlaps with Lung_L for patient LCTSC-Test-S1-101 structure set cc682f
Heart overlaps with Lung_R for patient LCTSC-Test-S1-101 structure set cc682f
Esophagus overlaps with Heart for patient LCTSC-Train-S1-007 structure set ed6686
Esophagus overlaps with Lung_L for patient LCTSC-Train-S1-007 structure set ed6686
Esophagus overlaps with Lung_R for patient LCTSC-Train-S1-007 structure set ed6686
Heart overlaps with Lung_L for patient LCTSC-Train-S1-007 structure set ed6686
Heart overlaps with Lung_R for patient LCTSC-Train-S1-007 structure set ed6686
pydicer.dataset.nnunet - WARNING - Overlapping structures were detected
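
The effect of the overlap rule can be sketched with structures represented as sets of voxel coordinates. This is an illustration of the rule only, not PyDicer's implementation (which operates on image masks); `resolve_overlaps` and its `assign_to_largest` parameter are hypothetical names.

```python
def resolve_overlaps(structures, assign_to_largest=False):
    """Give each voxel to exactly one structure.

    `structures` maps a structure name to a set of voxel coordinates.
    """
    # Process structures so that the preferred one claims contested voxels first:
    # smallest first by default, largest first if assign_to_largest is True
    order = sorted(structures, key=lambda name: len(structures[name]),
                   reverse=assign_to_largest)
    claimed = set()
    resolved = {}
    for name in order:
        resolved[name] = structures[name] - claimed
        claimed |= resolved[name]
    return resolved


structures = {
    "Heart": {(0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 0, 3)},
    "Esophagus": {(0, 0, 3), (0, 0, 4)},
}
resolved = resolve_overlaps(structures)
print(sorted(resolved["Esophagus"]))  # prints [(0, 0, 3), (0, 0, 4)]
print(sorted(resolved["Heart"]))      # prints [(0, 0, 0), (0, 0, 1), (0, 0, 2)]
```

The contested voxel (0, 0, 3) goes to the smaller Esophagus. Favouring small structures keeps them from being eroded by larger neighbours, which matters when the small structure is the harder one to segment.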

Prepare nnUNet Dataset#

Now that all checks are complete, we can proceed with preparing the nnUNet dataset. Take a look in the dataset directory after the cell finishes running to confirm that everything worked as expected.

[12]:
nnunet_dataset_path = nnunet.prepare_dataset()
print(f"Dataset prepared in: {nnunet_dataset_path}")
Dataset prepared in: nnScratch/nnUNet_raw_data/Task123_LCTSC_Test
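
A quick way to inspect the prepared directory is to count the files in each sub-folder. The sketch below demonstrates this on a temporary directory so that it runs on its own. `summarise_dataset_dir` is a hypothetical helper, and the folder names used in the demonstration (imagesTr, labelsTr, imagesTs, dataset.json) are an assumption based on the nnUNet v1 raw data layout; check the nnUNet documentation for the authoritative structure.

```python
import tempfile
from pathlib import Path


def summarise_dataset_dir(dataset_path):
    """Count the files in each sub-folder and note top-level files."""
    summary = {}
    for entry in sorted(Path(dataset_path).iterdir()):
        if entry.is_dir():
            summary[entry.name] = sum(1 for f in entry.rglob("*") if f.is_file())
        else:
            summary[entry.name] = "file"
    return summary


# Demonstration on a throwaway directory that mimics the assumed layout
with tempfile.TemporaryDirectory() as tmp:
    task_dir = Path(tmp) / "Task123_LCTSC_Test"
    for folder in ("imagesTr", "labelsTr", "imagesTs"):
        (task_dir / folder).mkdir(parents=True)
    (task_dir / "imagesTr" / "case_0000.nii.gz").touch()
    (task_dir / "labelsTr" / "case.nii.gz").touch()
    (task_dir / "dataset.json").write_text("{}")
    print(summarise_dataset_dir(task_dir))
```

On a real run you would call `summarise_dataset_dir(nnunet_dataset_path)` and compare the counts against the number of training and testing cases.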

Prepare nnUNet Training Scripts#

Consult the nnUNet documentation for information on model training. The generate_training_scripts function can help by preparing a script for training the nnUNet models on the dataset which was prepared.

[13]:
# Add some additional commands at the top of the script (useful for activating a virtual
# environment)
script_header = [
    '# source /path/to/venv/bin/activate',
]

script_path = nnunet.generate_training_scripts(script_header=script_header)
print(f"Training script ready in: {script_path}")

# Set the logging verbosity back to NOTSET
pydicer.set_verbosity(logging.NOTSET)
Training script ready in: train_123_LCTSC_Test.sh