Dataset Preparation#

When working with real-world DICOM datasets, you will often need to clean the dataset. You will often have several image series, structure sets and even dose grids for each patient. However, you typically want to select one relevant DICOM object in each category.

To help solve this, PyDicer provides a dataset preparation module which can be used to extract a subset of data from your overall set. Two example use cases where this might be useful are:

  • Analysing dose to structures for a radiotherapy treatment: You will want to extract the dose grid which was calculated from the plan used to treat the patient, as well as the linked structure set and planning CT image.

  • Validating an auto-segmentation tool: A structure set may have been prepared for the purposes of validation and saved off with a specific SeriesDescription. You would select the latest structure set with that description, as well as the linked image series, to perform the auto-segmentation validation.

As you will see in the examples below, you can also provide your own logic to extract subsets of data using PyDicer.

[1]:
try:
    from pydicer import PyDicer
except ImportError:
    !pip install pydicer
    from pydicer import PyDicer

from pathlib import Path
import pandas as pd

from pydicer.utils import fetch_converted_test_data

Setup PyDicer#

As in some other examples, we will use the HNSCC test data, which has already been converted and is downloaded into the testdata_hnscc directory. We also set up our PyDicer object.

[2]:
working_directory = fetch_converted_test_data("./testdata_hnscc", dataset="HNSCC")

pydicer = PyDicer(working_directory)

Explore data#

When we use the read_converted_data function, by default it will return all data which has been converted and is stored in the testdata_hnscc/data directory.

Let’s use this function and output the entire DataFrame of converted data to see what we have available in this dataset.

[3]:
df = pydicer.read_converted_data()
df
[3]:
sop_instance_uid hashed_uid modality patient_id series_uid for_uid referenced_sop_instance_uid path
0 1.3.6.1.4.1.14519.5.2.1.1706.8040.240263316258... 72b0f9 CT HNSCC-01-0199 1.3.6.1.4.1.14519.5.2.1.1706.8040.261759476368... 1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819... NaN testdata_hnscc/data/HNSCC-01-0199/images/72b0f9
1 1.3.6.1.4.1.14519.5.2.1.1706.8040.264264397186... c16e76 RTDOSE HNSCC-01-0199 1.3.6.1.4.1.14519.5.2.1.1706.8040.233527028792... 1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819... 1.3.6.1.4.1.14519.5.2.1.1706.8040.287865632112... testdata_hnscc/data/HNSCC-01-0199/doses/c16e76
2 1.3.6.1.4.1.14519.5.2.1.1706.8040.287865632112... 664e96 RTPLAN HNSCC-01-0199 1.3.6.1.4.1.14519.5.2.1.1706.8040.137463901488... 1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819... 1.3.6.1.4.1.14519.5.2.1.1706.8040.166429645421... testdata_hnscc/data/HNSCC-01-0199/plans/664e96
3 1.3.6.1.4.1.14519.5.2.1.1706.8040.166429645421... 06e49c RTSTRUCT HNSCC-01-0199 1.3.6.1.4.1.14519.5.2.1.1706.8040.243934637013... 1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819... 1.3.6.1.4.1.14519.5.2.1.1706.8040.240263316258... testdata_hnscc/data/HNSCC-01-0199/structures/0...
4 1.3.6.1.4.1.14519.5.2.1.1706.8040.334001018535... c4ffd0 CT HNSCC-01-0176 1.3.6.1.4.1.14519.5.2.1.1706.8040.151938046710... 1.3.6.1.4.1.14519.5.2.1.1706.8040.120880328745... NaN testdata_hnscc/data/HNSCC-01-0176/images/c4ffd0
5 1.3.6.1.4.1.14519.5.2.1.1706.8040.107072817915... 8e0da9 CT HNSCC-01-0176 1.3.6.1.4.1.14519.5.2.1.1706.8040.176143398282... 1.3.6.1.4.1.14519.5.2.1.1706.8040.216161306702... NaN testdata_hnscc/data/HNSCC-01-0176/images/8e0da9
6 1.3.6.1.4.1.14519.5.2.1.1706.8040.133948865586... ec4aec CT HNSCC-01-0176 1.3.6.1.4.1.14519.5.2.1.1706.8040.192899726585... 1.3.6.1.4.1.14519.5.2.1.1706.8040.216161306702... NaN testdata_hnscc/data/HNSCC-01-0176/images/ec4aec
7 1.3.6.1.4.1.14519.5.2.1.1706.8040.469610481459... 33c44a CT HNSCC-01-0176 1.3.6.1.4.1.14519.5.2.1.1706.8040.244362210503... 1.3.6.1.4.1.14519.5.2.1.1706.8040.310630617866... NaN testdata_hnscc/data/HNSCC-01-0176/images/33c44a
8 1.3.6.1.4.1.14519.5.2.1.1706.8040.169033525924... 833a74 RTDOSE HNSCC-01-0176 1.3.6.1.4.1.14519.5.2.1.1706.8040.279793773343... 1.3.6.1.4.1.14519.5.2.1.1706.8040.706719210726... 1.3.6.1.4.1.14519.5.2.1.1706.8040.470253980284... testdata_hnscc/data/HNSCC-01-0176/doses/833a74
9 1.3.6.1.4.1.14519.5.2.1.1706.8040.267291308489... bf3fba RTDOSE HNSCC-01-0176 1.3.6.1.4.1.14519.5.2.1.1706.8040.283706688235... 1.3.6.1.4.1.14519.5.2.1.1706.8040.566662631858... 1.3.6.1.4.1.14519.5.2.1.1706.8040.173917268454... testdata_hnscc/data/HNSCC-01-0176/doses/bf3fba
10 1.3.6.1.4.1.14519.5.2.1.1706.8040.173917268454... 6f7db7 RTPLAN HNSCC-01-0176 1.3.6.1.4.1.14519.5.2.1.1706.8040.120111576192... 1.3.6.1.4.1.14519.5.2.1.1706.8040.566662631858... 1.3.6.1.4.1.14519.5.2.1.1706.8040.323156708629... testdata_hnscc/data/HNSCC-01-0176/plans/6f7db7
11 1.3.6.1.4.1.14519.5.2.1.1706.8040.470253980284... a6b346 RTPLAN HNSCC-01-0176 1.3.6.1.4.1.14519.5.2.1.1706.8040.318927873561... 1.3.6.1.4.1.14519.5.2.1.1706.8040.706719210726... 1.3.6.1.4.1.14519.5.2.1.1706.8040.403955456521... testdata_hnscc/data/HNSCC-01-0176/plans/a6b346
12 1.3.6.1.4.1.14519.5.2.1.1706.8040.403955456521... cbbf5b RTSTRUCT HNSCC-01-0176 1.3.6.1.4.1.14519.5.2.1.1706.8040.276897558084... 1.3.6.1.4.1.14519.5.2.1.1706.8040.120880328745... 1.3.6.1.4.1.14519.5.2.1.1706.8040.334001018535... testdata_hnscc/data/HNSCC-01-0176/structures/c...
13 1.3.6.1.4.1.14519.5.2.1.1706.8040.323156708629... 6d2934 RTSTRUCT HNSCC-01-0176 1.3.6.1.4.1.14519.5.2.1.1706.8040.495627765798... 1.3.6.1.4.1.14519.5.2.1.1706.8040.310630617866... 1.3.6.1.4.1.14519.5.2.1.1706.8040.469610481459... testdata_hnscc/data/HNSCC-01-0176/structures/6...
14 1.3.6.1.4.1.14519.5.2.1.1706.8040.418136430763... b281ea CT HNSCC-01-0019 1.3.6.1.4.1.14519.5.2.1.1706.8040.233510441938... 1.3.6.1.4.1.14519.5.2.1.1706.8040.290727775603... NaN testdata_hnscc/data/HNSCC-01-0019/images/b281ea
15 1.3.6.1.4.1.14519.5.2.1.1706.8040.242809596262... 309e1a RTDOSE HNSCC-01-0019 1.3.6.1.4.1.14519.5.2.1.1706.8040.777975715563... 1.3.6.1.4.1.14519.5.2.1.1706.8040.290727775603... 1.3.6.1.4.1.14519.5.2.1.1706.8040.254865609982... testdata_hnscc/data/HNSCC-01-0019/doses/309e1a
16 1.3.6.1.4.1.14519.5.2.1.1706.8040.254865609982... 57b99f RTPLAN HNSCC-01-0019 1.3.6.1.4.1.14519.5.2.1.1706.8040.202542618630... 1.3.6.1.4.1.14519.5.2.1.1706.8040.290727775603... 1.3.6.1.4.1.14519.5.2.1.1706.8040.168221415040... testdata_hnscc/data/HNSCC-01-0019/plans/57b99f
17 1.3.6.1.4.1.14519.5.2.1.1706.8040.168221415040... 7cdcd9 RTSTRUCT HNSCC-01-0019 1.3.6.1.4.1.14519.5.2.1.1706.8040.103450757970... 1.3.6.1.4.1.14519.5.2.1.1706.8040.290727775603... 1.3.6.1.4.1.14519.5.2.1.1706.8040.418136430763... testdata_hnscc/data/HNSCC-01-0019/structures/7...
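A quick way to see which patients need a selection step is to count the data objects per patient and modality. The following is a minimal sketch in plain pandas. It uses a stand-in DataFrame (df_example, a name introduced here for illustration) that mirrors only the patient_id and modality columns of the table above; in the notebook you could apply the same calls to df.

```python
import pandas as pd

# Stand-in for the DataFrame returned by read_converted_data(), reduced to the
# two columns needed here. The rows mirror the table above.
rows = (
    [("HNSCC-01-0199", m) for m in ["CT", "RTDOSE", "RTPLAN", "RTSTRUCT"]]
    + [("HNSCC-01-0176", "CT")] * 4
    + [("HNSCC-01-0176", m) for m in ["RTDOSE", "RTPLAN", "RTSTRUCT"] for _ in range(2)]
    + [("HNSCC-01-0019", m) for m in ["CT", "RTDOSE", "RTPLAN", "RTSTRUCT"]]
)
df_example = pd.DataFrame(rows, columns=["patient_id", "modality"])

# Count data objects per patient (rows) and modality (columns)
modality_counts = pd.crosstab(df_example["patient_id"], df_example["modality"])
print(modality_counts)

# Patients with more than one object of any modality need a selection step
needs_selection = modality_counts[(modality_counts > 1).any(axis=1)].index.tolist()
print(needs_selection)
```

Here only HNSCC-01-0176 has more than one object per modality, which is the situation the preparation functions in the following sections are meant to resolve.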

Prepare dose data#

Here we use the dataset preparation module to extract the latest dose grid by date along with the linked structure sets and planning image series. We refer to this subset of data as dose_project.

We use the built-in data extraction function named rt_latest_dose.

[4]:
dose_project_name = "dose_project"
pydicer.dataset.prepare(dose_project_name, "rt_latest_dose")

Once the cell above has finished running, the dataset has been prepared. You can explore the dataset in the testdata_hnscc/dose_project directory. Take note of two things:

  • The converted.csv file stored for each patient now only includes the data objects which have been selected as part of this subset of data.

  • The data object folders are not actual folders, but symbolic links to the original data found in the testdata_hnscc/data directory. This way, data isn’t duplicated but the folder structure remains easy to navigate.

Note: Symbolic links are supported on Unix-based (Linux, macOS) operating systems only. They won’t work on Windows; however, you can still use the prepared dataset, which is tracked in the converted.csv files.
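To illustrate why symbolic links avoid duplicating data, here is a minimal sketch using only the Python standard library. It is not PyDicer's implementation: the folder names (PAT01, abc123, my_project) and the file name are hypothetical, and everything is created in a temporary directory. It assumes an operating system where creating symbolic links is permitted.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Hypothetical original data object folder containing one file
    original = root / "data" / "PAT01" / "images" / "abc123"
    original.mkdir(parents=True)
    (original / "CT.nii.gz").write_text("image data placeholder")

    # Folder for the data subset: it holds a link to the original, not a copy
    link = root / "my_project" / "PAT01" / "images" / "abc123"
    link.parent.mkdir(parents=True)
    link.symlink_to(original, target_is_directory=True)

    # The link resolves to the original folder and exposes the same files
    is_link = link.is_symlink()
    same_target = link.resolve() == original.resolve()
    linked_files = sorted(p.name for p in link.iterdir())

print(is_link, same_target, linked_files)
```

The subset folder can be browsed like a normal directory tree, while the file contents exist only once on disk.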

Load prepared Dataset#

By supplying the dataset_name to the read_converted_data function, we obtain a DataFrame containing only the data objects that are part of that subset.

[5]:
df_dose_project = pydicer.read_converted_data(dataset_name=dose_project_name)
df_dose_project
[5]:
sop_instance_uid hashed_uid modality patient_id series_uid for_uid referenced_sop_instance_uid path
0 1.3.6.1.4.1.14519.5.2.1.1706.8040.264264397186... c16e76 RTDOSE HNSCC-01-0199 1.3.6.1.4.1.14519.5.2.1.1706.8040.233527028792... 1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819... 1.3.6.1.4.1.14519.5.2.1.1706.8040.287865632112... testdata_hnscc/data/HNSCC-01-0199/doses/c16e76
1 1.3.6.1.4.1.14519.5.2.1.1706.8040.287865632112... 664e96 RTPLAN HNSCC-01-0199 1.3.6.1.4.1.14519.5.2.1.1706.8040.137463901488... 1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819... 1.3.6.1.4.1.14519.5.2.1.1706.8040.166429645421... testdata_hnscc/data/HNSCC-01-0199/plans/664e96
2 1.3.6.1.4.1.14519.5.2.1.1706.8040.166429645421... 06e49c RTSTRUCT HNSCC-01-0199 1.3.6.1.4.1.14519.5.2.1.1706.8040.243934637013... 1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819... 1.3.6.1.4.1.14519.5.2.1.1706.8040.240263316258... testdata_hnscc/data/HNSCC-01-0199/structures/0...
3 1.3.6.1.4.1.14519.5.2.1.1706.8040.240263316258... 72b0f9 CT HNSCC-01-0199 1.3.6.1.4.1.14519.5.2.1.1706.8040.261759476368... 1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819... NaN testdata_hnscc/data/HNSCC-01-0199/images/72b0f9
4 1.3.6.1.4.1.14519.5.2.1.1706.8040.169033525924... 833a74 RTDOSE HNSCC-01-0176 1.3.6.1.4.1.14519.5.2.1.1706.8040.279793773343... 1.3.6.1.4.1.14519.5.2.1.1706.8040.706719210726... 1.3.6.1.4.1.14519.5.2.1.1706.8040.470253980284... testdata_hnscc/data/HNSCC-01-0176/doses/833a74
5 1.3.6.1.4.1.14519.5.2.1.1706.8040.470253980284... a6b346 RTPLAN HNSCC-01-0176 1.3.6.1.4.1.14519.5.2.1.1706.8040.318927873561... 1.3.6.1.4.1.14519.5.2.1.1706.8040.706719210726... 1.3.6.1.4.1.14519.5.2.1.1706.8040.403955456521... testdata_hnscc/data/HNSCC-01-0176/plans/a6b346
6 1.3.6.1.4.1.14519.5.2.1.1706.8040.403955456521... cbbf5b RTSTRUCT HNSCC-01-0176 1.3.6.1.4.1.14519.5.2.1.1706.8040.276897558084... 1.3.6.1.4.1.14519.5.2.1.1706.8040.120880328745... 1.3.6.1.4.1.14519.5.2.1.1706.8040.334001018535... testdata_hnscc/data/HNSCC-01-0176/structures/c...
7 1.3.6.1.4.1.14519.5.2.1.1706.8040.334001018535... c4ffd0 CT HNSCC-01-0176 1.3.6.1.4.1.14519.5.2.1.1706.8040.151938046710... 1.3.6.1.4.1.14519.5.2.1.1706.8040.120880328745... NaN testdata_hnscc/data/HNSCC-01-0176/images/c4ffd0
8 1.3.6.1.4.1.14519.5.2.1.1706.8040.242809596262... 309e1a RTDOSE HNSCC-01-0019 1.3.6.1.4.1.14519.5.2.1.1706.8040.777975715563... 1.3.6.1.4.1.14519.5.2.1.1706.8040.290727775603... 1.3.6.1.4.1.14519.5.2.1.1706.8040.254865609982... testdata_hnscc/data/HNSCC-01-0019/doses/309e1a
9 1.3.6.1.4.1.14519.5.2.1.1706.8040.254865609982... 57b99f RTPLAN HNSCC-01-0019 1.3.6.1.4.1.14519.5.2.1.1706.8040.202542618630... 1.3.6.1.4.1.14519.5.2.1.1706.8040.290727775603... 1.3.6.1.4.1.14519.5.2.1.1706.8040.168221415040... testdata_hnscc/data/HNSCC-01-0019/plans/57b99f
10 1.3.6.1.4.1.14519.5.2.1.1706.8040.168221415040... 7cdcd9 RTSTRUCT HNSCC-01-0019 1.3.6.1.4.1.14519.5.2.1.1706.8040.103450757970... 1.3.6.1.4.1.14519.5.2.1.1706.8040.290727775603... 1.3.6.1.4.1.14519.5.2.1.1706.8040.418136430763... testdata_hnscc/data/HNSCC-01-0019/structures/7...
11 1.3.6.1.4.1.14519.5.2.1.1706.8040.418136430763... b281ea CT HNSCC-01-0019 1.3.6.1.4.1.14519.5.2.1.1706.8040.233510441938... 1.3.6.1.4.1.14519.5.2.1.1706.8040.290727775603... NaN testdata_hnscc/data/HNSCC-01-0019/images/b281ea

Notice that we now only have one data object of each modality per patient in our dose_project subset. We are now ready to work with that subset (e.g. extract dose metrics).
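In the table above, the objects of each patient are linked through the referenced_sop_instance_uid column: the dose grid references the plan, the plan references the structure set, and the structure set references the image. The sketch below follows such a chain in plain pandas. The UIDs are hypothetical short placeholders (the real ones are truncated in the output above), and follow_references is a helper written for this illustration, not a PyDicer function.

```python
import pandas as pd

# Hypothetical, shortened UIDs standing in for the real (truncated) ones above
df_example = pd.DataFrame(
    [
        {"sop_instance_uid": "dose-1", "modality": "RTDOSE", "referenced_sop_instance_uid": "plan-1"},
        {"sop_instance_uid": "plan-1", "modality": "RTPLAN", "referenced_sop_instance_uid": "struct-1"},
        {"sop_instance_uid": "struct-1", "modality": "RTSTRUCT", "referenced_sop_instance_uid": "ct-1"},
        {"sop_instance_uid": "ct-1", "modality": "CT", "referenced_sop_instance_uid": None},
    ]
)


def follow_references(df, start_uid):
    """Return the modalities visited when following references from start_uid."""
    by_uid = df.set_index("sop_instance_uid")
    chain = []
    seen = set()
    uid = start_uid
    # Stop at a missing reference, an unknown UID, or a UID we already visited
    while pd.notna(uid) and uid in by_uid.index and uid not in seen:
        seen.add(uid)
        row = by_uid.loc[uid]
        chain.append(row["modality"])
        uid = row["referenced_sop_instance_uid"]
    return chain


chain = follow_references(df_example, "dose-1")
print(chain)
```

This sketch assumes each sop_instance_uid appears only once in the DataFrame.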

Prepare Structure Dataset#

In the next example, we only want to extract structure sets and their associated images. This might be useful when training or validating an auto-segmentation model.

In this example, we not only select the latest structure set by date, but we also specify the StudyDescription values (from the DICOM metadata) of the data objects we want to select. To achieve this, we use the built-in rt_latest_struct function, which will also extract the image series linked to the selected structure set.

Observe the output of the following cell and explore the testdata_hnscc directory. We now have one structure set and the linked image for each patient.

[6]:
# Define the dataset name and the study description values to match
structure_project_name = "structure_project"
study_descriptions = [
    "RT SIMULATION"
]

# Prepare the subset of data
pydicer.dataset.prepare(
    structure_project_name,
    "rt_latest_struct",
    StudyDescription=study_descriptions
)

# Load the data subset and display the DataFrame
df_structure_project = pydicer.read_converted_data(dataset_name=structure_project_name)
df_structure_project
[6]:
sop_instance_uid hashed_uid modality patient_id series_uid for_uid referenced_sop_instance_uid path
0 1.3.6.1.4.1.14519.5.2.1.1706.8040.166429645421... 06e49c RTSTRUCT HNSCC-01-0199 1.3.6.1.4.1.14519.5.2.1.1706.8040.243934637013... 1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819... 1.3.6.1.4.1.14519.5.2.1.1706.8040.240263316258... testdata_hnscc/data/HNSCC-01-0199/structures/0...
1 1.3.6.1.4.1.14519.5.2.1.1706.8040.240263316258... 72b0f9 CT HNSCC-01-0199 1.3.6.1.4.1.14519.5.2.1.1706.8040.261759476368... 1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819... NaN testdata_hnscc/data/HNSCC-01-0199/images/72b0f9
2 1.3.6.1.4.1.14519.5.2.1.1706.8040.403955456521... cbbf5b RTSTRUCT HNSCC-01-0176 1.3.6.1.4.1.14519.5.2.1.1706.8040.276897558084... 1.3.6.1.4.1.14519.5.2.1.1706.8040.120880328745... 1.3.6.1.4.1.14519.5.2.1.1706.8040.334001018535... testdata_hnscc/data/HNSCC-01-0176/structures/c...
3 1.3.6.1.4.1.14519.5.2.1.1706.8040.334001018535... c4ffd0 CT HNSCC-01-0176 1.3.6.1.4.1.14519.5.2.1.1706.8040.151938046710... 1.3.6.1.4.1.14519.5.2.1.1706.8040.120880328745... NaN testdata_hnscc/data/HNSCC-01-0176/images/c4ffd0
4 1.3.6.1.4.1.14519.5.2.1.1706.8040.168221415040... 7cdcd9 RTSTRUCT HNSCC-01-0019 1.3.6.1.4.1.14519.5.2.1.1706.8040.103450757970... 1.3.6.1.4.1.14519.5.2.1.1706.8040.290727775603... 1.3.6.1.4.1.14519.5.2.1.1706.8040.418136430763... testdata_hnscc/data/HNSCC-01-0019/structures/7...
5 1.3.6.1.4.1.14519.5.2.1.1706.8040.418136430763... b281ea CT HNSCC-01-0019 1.3.6.1.4.1.14519.5.2.1.1706.8040.233510441938... 1.3.6.1.4.1.14519.5.2.1.1706.8040.290727775603... NaN testdata_hnscc/data/HNSCC-01-0019/images/b281ea
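The selection logic described in this section (match a description, then keep the latest object per patient) can be illustrated in plain pandas. This is only a sketch of the idea, not how rt_latest_struct is implemented: the column names study_description and struct_date, the patient IDs and the dates are all made up for the example.

```python
import pandas as pd

# Hypothetical structure sets with made-up descriptions and dates
df_structs = pd.DataFrame(
    [
        {"patient_id": "PAT01", "hashed_uid": "aaa111", "study_description": "RT SIMULATION", "struct_date": "2001-01-10"},
        {"patient_id": "PAT01", "hashed_uid": "bbb222", "study_description": "RT SIMULATION", "struct_date": "2001-03-05"},
        {"patient_id": "PAT01", "hashed_uid": "ccc333", "study_description": "DIAGNOSTIC", "struct_date": "2001-06-01"},
        {"patient_id": "PAT02", "hashed_uid": "ddd444", "study_description": "RT SIMULATION", "struct_date": "2002-02-20"},
    ]
)
df_structs["struct_date"] = pd.to_datetime(df_structs["struct_date"])

# Keep only structure sets whose description is in the list of wanted values
wanted = ["RT SIMULATION"]
df_match = df_structs[df_structs["study_description"].isin(wanted)]

# Sort newest first, then take the first row of each patient
df_latest = (
    df_match.sort_values("struct_date", ascending=False)
    .groupby("patient_id")
    .head(1)
    .sort_values("patient_id")
)
selected = df_latest["hashed_uid"].tolist()
print(selected)
```

Note that the newest structure set of PAT01 (ccc333) is not selected, because its description does not match.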

Prepare Dataset from DataFrame#

In some scenarios, you may want to simply perform some filtering on the DataFrame returned by the read_converted_data function and generate a subset of data based on that.

In the following cell, a subset of data named image_project is generated by filtering the DataFrame to keep only CT images.

After running the following cell, explore the testdata_hnscc/image_project directory to confirm that only image objects were selected.

[7]:
# Read the converted DataFrame and filter only CT images
df = pydicer.read_converted_data()
df_ct = df[df.modality=="CT"]

# Prepare a data subset using this filtered DataFrame
image_project_name = "image_project"
pydicer.dataset.prepare_from_dataframe(image_project_name, df_ct)

# Load the data subset and display the DataFrame
df_image_project = pydicer.read_converted_data(dataset_name=image_project_name)
df_image_project

[7]:
sop_instance_uid hashed_uid modality patient_id series_uid for_uid referenced_sop_instance_uid path
0 1.3.6.1.4.1.14519.5.2.1.1706.8040.240263316258... 72b0f9 CT HNSCC-01-0199 1.3.6.1.4.1.14519.5.2.1.1706.8040.261759476368... 1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819... NaN testdata_hnscc/data/HNSCC-01-0199/images/72b0f9
1 1.3.6.1.4.1.14519.5.2.1.1706.8040.334001018535... c4ffd0 CT HNSCC-01-0176 1.3.6.1.4.1.14519.5.2.1.1706.8040.151938046710... 1.3.6.1.4.1.14519.5.2.1.1706.8040.120880328745... NaN testdata_hnscc/data/HNSCC-01-0176/images/c4ffd0
2 1.3.6.1.4.1.14519.5.2.1.1706.8040.107072817915... 8e0da9 CT HNSCC-01-0176 1.3.6.1.4.1.14519.5.2.1.1706.8040.176143398282... 1.3.6.1.4.1.14519.5.2.1.1706.8040.216161306702... NaN testdata_hnscc/data/HNSCC-01-0176/images/8e0da9
3 1.3.6.1.4.1.14519.5.2.1.1706.8040.133948865586... ec4aec CT HNSCC-01-0176 1.3.6.1.4.1.14519.5.2.1.1706.8040.192899726585... 1.3.6.1.4.1.14519.5.2.1.1706.8040.216161306702... NaN testdata_hnscc/data/HNSCC-01-0176/images/ec4aec
4 1.3.6.1.4.1.14519.5.2.1.1706.8040.469610481459... 33c44a CT HNSCC-01-0176 1.3.6.1.4.1.14519.5.2.1.1706.8040.244362210503... 1.3.6.1.4.1.14519.5.2.1.1706.8040.310630617866... NaN testdata_hnscc/data/HNSCC-01-0176/images/33c44a
5 1.3.6.1.4.1.14519.5.2.1.1706.8040.418136430763... b281ea CT HNSCC-01-0019 1.3.6.1.4.1.14519.5.2.1.1706.8040.233510441938... 1.3.6.1.4.1.14519.5.2.1.1706.8040.290727775603... NaN testdata_hnscc/data/HNSCC-01-0019/images/b281ea
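Filters can be combined before the DataFrame is handed to prepare_from_dataframe. The sketch below shows one such combination with pandas boolean masks, keeping only images and structure sets for a chosen list of patients. It runs on a small stand-in DataFrame with hypothetical patient IDs and hashed UIDs; with real data you would build the mask on the DataFrame returned by read_converted_data.

```python
import pandas as pd

# Stand-in for the converted DataFrame, with hypothetical values
df_example = pd.DataFrame(
    [
        {"patient_id": "PAT01", "modality": "CT", "hashed_uid": "aaa111"},
        {"patient_id": "PAT01", "modality": "RTSTRUCT", "hashed_uid": "bbb222"},
        {"patient_id": "PAT01", "modality": "RTDOSE", "hashed_uid": "ccc333"},
        {"patient_id": "PAT02", "modality": "CT", "hashed_uid": "ddd444"},
        {"patient_id": "PAT02", "modality": "RTSTRUCT", "hashed_uid": "eee555"},
    ]
)

# Combine two conditions: modality is one of several values AND patient is in a list
keep_modalities = ["CT", "RTSTRUCT"]
keep_patients = ["PAT01"]
mask = df_example["modality"].isin(keep_modalities) & df_example["patient_id"].isin(keep_patients)

df_subset = df_example[mask]
subset_ids = df_subset["hashed_uid"].tolist()
print(subset_ids)
```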

Define Custom Preparation Function#

In more complex use cases you may want to define your own logic for extracting data objects into a subset. For example, you may have an additional DataFrame containing treatment start dates of patients, and you would like to select the dose grid, structure set and image series which are closest to that date.

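In the following cells, we prepare a subset of data named clinical. We first create a dummy set of clinical tabular data, df_clinical, which stores each patient’s stage and RT start date.

We use the information in df_clinical to select patients who are stage 1-3. For each of those patients, the function defined below keeps the latest dose grid together with its linked structure set and image. Note that rt_start_date is merged into the DataFrame but is not used in the selection as written; choosing the dose grid nearest to the treatment start date would require comparing obj_date against rt_start_date.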

[8]:
# Define some dummy clinical data
df_clinical = pd.DataFrame([
    {
        "patient_id": "HNSCC-01-0199",
        "stage": 2,
        "rt_start_date": "2002-10-28",
    },
    {
        "patient_id": "HNSCC-01-0176",
        "stage": 1,
        "rt_start_date": "2009-03-02",
    },
    {
        "patient_id": "HNSCC-01-0019",
        "stage": 4,
        "rt_start_date": "1998-07-10",
    },
])

# Convert date to a datetime object
df_clinical['rt_start_date'] = pd.to_datetime(df_clinical['rt_start_date'], format='%Y-%m-%d')
[9]:
# Import some pydicer utility functions that we'll need
from pydicer.utils import load_object_metadata, determine_dcm_datetime

# Define a function which accepts the converted DataFrame as input and returns a filtered DataFrame
# of objects to keep in the data subset. This function also takes the clinical DataFrame as input.
def extract_clinical_data(df_data, df_clinical):

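    # Merge the clinical data with our data objects. A left merge keeps the rows in the same
    # order as df_data, so the row labels collected below still line up with df_data
    # (this assumes df_data has a default integer index).
    df = pd.merge(df_data, df_clinical, on="patient_id", how="left")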

    # Filter out patients who aren't stage 1-3
    df = df[(df.stage >= 1) & (df.stage <= 3)]

    # Determine the date of each data object
    df["obj_date"] = df.apply(lambda row: determine_dcm_datetime(load_object_metadata(row)), axis=1)

    # List to track row indices we will keep
    keep_rows = []

    # Sort all objects by date in descending order, so that the first dose grid found for each
    # patient is the latest one. We then link the structure set and image series to it.
    df = df.sort_values("obj_date", ascending=False)

    # Loop the data by patient to select the data objects
    for patient_id, df_pat in df.groupby("patient_id"):

        df_doses = df_pat[df_pat.modality=="RTDOSE"]

        # If there are no dose grids, we skip this patient
        if len(df_doses) == 0:
            continue

        # Otherwise, we select the first dose grid (which is the latest since they are sorted)
        # to keep
        dose_row = df_doses.iloc[0]

        df_linked_structs = pydicer.get_structures_linked_to_dose(dose_row)

        # Skip patient if no linked structure sets are found
        if len(df_linked_structs) == 0:
            continue

        # Finally, find the image linked to the structure set
        struct_row = df_linked_structs.iloc[0]

        df_linked_images = df[df.sop_instance_uid==struct_row.referenced_sop_instance_uid]

        # Skip if no images found
        if len(df_linked_images) == 0:
            continue

        image_row = df_linked_images.iloc[0]

        # Store the indices of these data objects
        keep_rows.append(image_row.name)
        keep_rows.append(struct_row.name)
        keep_rows.append(dose_row.name)

    # Return only the rows of the data objects we want to keep in the data subset
    return df_data.loc[keep_rows]
[10]:
clinical_project_name = "clinical"

# Prepare the subset of data using our custom function
pydicer.dataset.prepare(
    clinical_project_name,
    extract_clinical_data,
    df_clinical=df_clinical
)

# Load the data subset and display the DataFrame
df_clinical_project = pydicer.read_converted_data(dataset_name=clinical_project_name)
df_clinical_project
[10]:
sop_instance_uid hashed_uid modality patient_id series_uid for_uid referenced_sop_instance_uid path
0 1.3.6.1.4.1.14519.5.2.1.1706.8040.240263316258... 72b0f9 CT HNSCC-01-0199 1.3.6.1.4.1.14519.5.2.1.1706.8040.261759476368... 1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819... NaN testdata_hnscc/data/HNSCC-01-0199/images/72b0f9
1 1.3.6.1.4.1.14519.5.2.1.1706.8040.166429645421... 06e49c RTSTRUCT HNSCC-01-0199 1.3.6.1.4.1.14519.5.2.1.1706.8040.243934637013... 1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819... 1.3.6.1.4.1.14519.5.2.1.1706.8040.240263316258... testdata_hnscc/data/HNSCC-01-0199/structures/0...
2 1.3.6.1.4.1.14519.5.2.1.1706.8040.264264397186... c16e76 RTDOSE HNSCC-01-0199 1.3.6.1.4.1.14519.5.2.1.1706.8040.233527028792... 1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819... 1.3.6.1.4.1.14519.5.2.1.1706.8040.287865632112... testdata_hnscc/data/HNSCC-01-0199/doses/c16e76
3 1.3.6.1.4.1.14519.5.2.1.1706.8040.334001018535... c4ffd0 CT HNSCC-01-0176 1.3.6.1.4.1.14519.5.2.1.1706.8040.151938046710... 1.3.6.1.4.1.14519.5.2.1.1706.8040.120880328745... NaN testdata_hnscc/data/HNSCC-01-0176/images/c4ffd0
4 1.3.6.1.4.1.14519.5.2.1.1706.8040.403955456521... cbbf5b RTSTRUCT HNSCC-01-0176 1.3.6.1.4.1.14519.5.2.1.1706.8040.276897558084... 1.3.6.1.4.1.14519.5.2.1.1706.8040.120880328745... 1.3.6.1.4.1.14519.5.2.1.1706.8040.334001018535... testdata_hnscc/data/HNSCC-01-0176/structures/c...
5 1.3.6.1.4.1.14519.5.2.1.1706.8040.169033525924... 833a74 RTDOSE HNSCC-01-0176 1.3.6.1.4.1.14519.5.2.1.1706.8040.279793773343... 1.3.6.1.4.1.14519.5.2.1.1706.8040.706719210726... 1.3.6.1.4.1.14519.5.2.1.1706.8040.470253980284... testdata_hnscc/data/HNSCC-01-0176/doses/833a74
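The custom function above keeps the latest dose grid for each patient. If you instead want the dose grid nearest to the treatment start date, as mentioned at the start of this section, one option is to compute the absolute difference between the two dates and take the minimum per patient. The following is a minimal sketch in plain pandas under stated assumptions: the patient IDs, hashed UIDs and dates are made up, and days_from_start is a column name introduced here for illustration.

```python
import pandas as pd

# Hypothetical dose grids with made-up dates
df_doses = pd.DataFrame(
    [
        {"patient_id": "PAT01", "hashed_uid": "d1", "obj_date": "2009-02-20"},
        {"patient_id": "PAT01", "hashed_uid": "d2", "obj_date": "2009-06-15"},
        {"patient_id": "PAT02", "hashed_uid": "d3", "obj_date": "2002-10-01"},
    ]
)
# Hypothetical treatment start dates, one row per patient
df_starts = pd.DataFrame(
    [
        {"patient_id": "PAT01", "rt_start_date": "2009-03-02"},
        {"patient_id": "PAT02", "rt_start_date": "2002-10-28"},
    ]
)
df_doses["obj_date"] = pd.to_datetime(df_doses["obj_date"])
df_starts["rt_start_date"] = pd.to_datetime(df_starts["rt_start_date"])

# Attach each patient's start date to their dose grids
df = df_doses.merge(df_starts, on="patient_id", how="left")

# Absolute distance (in days) between each dose grid and the start date
df["days_from_start"] = (df["obj_date"] - df["rt_start_date"]).abs().dt.days

# For each patient, pick the row label with the smallest distance
nearest_idx = df.groupby("patient_id")["days_from_start"].idxmin()
df_nearest = df.loc[nearest_idx]
nearest = dict(zip(df_nearest["patient_id"], df_nearest["hashed_uid"]))
print(nearest)
```

For PAT01, the nearest dose grid (d1) is not the latest one (d2), which shows how the two selection rules can differ.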