Dataset Preparation#
When working with real-world DICOM datasets, you will often need to clean the dataset before using it. You will frequently have several image series, structure sets and even dose grids for each patient. However, you typically want to select only one relevant DICOM object in each category.
To help solve this, PyDicer provides a dataset preparation module which can be used to extract a subset of data from your overall set. Two example use cases where this might be useful are:
- Analysing dose to structures for a radiotherapy treatment: You will want to extract the dose grid which was calculated from the plan used to treat the patient, as well as the linked structure set and planning CT image.
- Validating an auto-segmentation tool: A structure set may have been prepared for the purposes of validation and saved off with a specific SeriesDescription. You select the latest structure set with that description as well as the linked image series to perform the auto-segmentation validation.
As you will see in the examples below, you can also provide your own logic to extract subsets of data using PyDicer.
[1]:
try:
    from pydicer import PyDicer
except ImportError:
    !pip install pydicer
    from pydicer import PyDicer

from pathlib import Path
import pandas as pd
from pydicer.utils import fetch_converted_test_data
Setup PyDicer#
As in some other examples, we will use the HNSCC test data, which has already been converted and is downloaded into the testdata_hnscc directory. We also set up our PyDicer object.
[2]:
working_directory = fetch_converted_test_data("./testdata_hnscc", dataset="HNSCC")
pydicer = PyDicer(working_directory)
Explore data#
When we use the read_converted_data function, by default it will return all data which has been converted and is stored in the testdata_hnscc/data directory.
Let’s use this function and output the entire DataFrame of converted data to see what we have available in this dataset.
[3]:
df = pydicer.read_converted_data()
df
[3]:
| | sop_instance_uid | hashed_uid | modality | patient_id | series_uid | for_uid | referenced_sop_instance_uid | path |
---|---|---|---|---|---|---|---|---|
0 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.240263316258... | 72b0f9 | CT | HNSCC-01-0199 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.261759476368... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819... | NaN | testdata_hnscc/data/HNSCC-01-0199/images/72b0f9 |
1 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.264264397186... | c16e76 | RTDOSE | HNSCC-01-0199 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.233527028792... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.287865632112... | testdata_hnscc/data/HNSCC-01-0199/doses/c16e76 |
2 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.287865632112... | 664e96 | RTPLAN | HNSCC-01-0199 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.137463901488... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.166429645421... | testdata_hnscc/data/HNSCC-01-0199/plans/664e96 |
3 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.166429645421... | 06e49c | RTSTRUCT | HNSCC-01-0199 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.243934637013... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.240263316258... | testdata_hnscc/data/HNSCC-01-0199/structures/0... |
4 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.334001018535... | c4ffd0 | CT | HNSCC-01-0176 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.151938046710... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.120880328745... | NaN | testdata_hnscc/data/HNSCC-01-0176/images/c4ffd0 |
5 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.107072817915... | 8e0da9 | CT | HNSCC-01-0176 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.176143398282... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.216161306702... | NaN | testdata_hnscc/data/HNSCC-01-0176/images/8e0da9 |
6 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.133948865586... | ec4aec | CT | HNSCC-01-0176 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.192899726585... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.216161306702... | NaN | testdata_hnscc/data/HNSCC-01-0176/images/ec4aec |
7 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.469610481459... | 33c44a | CT | HNSCC-01-0176 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.244362210503... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.310630617866... | NaN | testdata_hnscc/data/HNSCC-01-0176/images/33c44a |
8 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.169033525924... | 833a74 | RTDOSE | HNSCC-01-0176 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.279793773343... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.706719210726... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.470253980284... | testdata_hnscc/data/HNSCC-01-0176/doses/833a74 |
9 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.267291308489... | bf3fba | RTDOSE | HNSCC-01-0176 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.283706688235... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.566662631858... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.173917268454... | testdata_hnscc/data/HNSCC-01-0176/doses/bf3fba |
10 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.173917268454... | 6f7db7 | RTPLAN | HNSCC-01-0176 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.120111576192... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.566662631858... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.323156708629... | testdata_hnscc/data/HNSCC-01-0176/plans/6f7db7 |
11 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.470253980284... | a6b346 | RTPLAN | HNSCC-01-0176 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.318927873561... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.706719210726... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.403955456521... | testdata_hnscc/data/HNSCC-01-0176/plans/a6b346 |
12 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.403955456521... | cbbf5b | RTSTRUCT | HNSCC-01-0176 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.276897558084... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.120880328745... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.334001018535... | testdata_hnscc/data/HNSCC-01-0176/structures/c... |
13 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.323156708629... | 6d2934 | RTSTRUCT | HNSCC-01-0176 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.495627765798... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.310630617866... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.469610481459... | testdata_hnscc/data/HNSCC-01-0176/structures/6... |
14 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.418136430763... | b281ea | CT | HNSCC-01-0019 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.233510441938... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.290727775603... | NaN | testdata_hnscc/data/HNSCC-01-0019/images/b281ea |
15 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.242809596262... | 309e1a | RTDOSE | HNSCC-01-0019 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.777975715563... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.290727775603... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.254865609982... | testdata_hnscc/data/HNSCC-01-0019/doses/309e1a |
16 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.254865609982... | 57b99f | RTPLAN | HNSCC-01-0019 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.202542618630... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.290727775603... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.168221415040... | testdata_hnscc/data/HNSCC-01-0019/plans/57b99f |
17 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.168221415040... | 7cdcd9 | RTSTRUCT | HNSCC-01-0019 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.103450757970... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.290727775603... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.418136430763... | testdata_hnscc/data/HNSCC-01-0019/structures/7... |
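To see at a glance which patients have more than one object per modality, the converted DataFrame can be tabulated with pandas. The following is a minimal sketch and not part of PyDicer: df_example is a hand-made stand-in for the output of read_converted_data, holding only the patient_id and modality values shown in the table above.

```python
import pandas as pd

# Hand-made stand-in for the DataFrame returned by read_converted_data
# (only the two columns needed here; the values mirror the table above)
df_example = pd.DataFrame({
    "patient_id": ["HNSCC-01-0199"] * 4 + ["HNSCC-01-0176"] * 10 + ["HNSCC-01-0019"] * 4,
    "modality": (
        ["CT", "RTDOSE", "RTPLAN", "RTSTRUCT"]
        + ["CT"] * 4 + ["RTDOSE"] * 2 + ["RTPLAN"] * 2 + ["RTSTRUCT"] * 2
        + ["CT", "RTDOSE", "RTPLAN", "RTSTRUCT"]
    ),
})

# Count the objects of each modality for each patient
counts = pd.crosstab(df_example["patient_id"], df_example["modality"])

# Patients with more than one object in any modality need a selection step
needs_selection = counts[(counts > 1).any(axis=1)].index.tolist()

print(counts)
print(needs_selection)
```

In this dataset only HNSCC-01-0176 has several objects per modality, which is why a selection step is needed before analysis.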
Prepare dose data#
Here we use the dataset preparation module to extract the latest dose grid by date along with the linked structure sets and planning image series. We refer to this subset of data as dose_project.
We use the built-in data extraction function, named rt_latest_dose.
[4]:
dose_project_name = "dose_project"
pydicer.dataset.prepare(dose_project_name, "rt_latest_dose")
Once the cell above has finished running, the dataset has been prepared. You can explore the dataset in the testdata_hnscc/dose_project directory. Take note of two things:
- The converted.csv file stored for each patient now only includes the data objects which have been selected as part of this subset of data.
- The data object folders are not actual folders, but symbolic links to the original data found in the testdata_hnscc/data directory. This way, data isn’t duplicated but the folder structure remains easy to navigate.
Note: Symbolic links are supported on Unix-based (Linux, macOS) operating systems only. These won’t work on Windows; however, you can still use the prepared dataset, which is tracked in the converted.csv files.
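To check which entries in a prepared dataset directory are symbolic links, and where they point, Python's pathlib can be used. The sketch below does not touch the real test data: it builds a made-up directory layout in a temporary folder (the folder names only imitate the ones above), and describe_links is a hypothetical helper written for this illustration. It assumes an operating system on which symbolic links can be created.

```python
import tempfile
from pathlib import Path


def describe_links(directory):
    """Map each symlink name directly under `directory` to its resolved target."""
    return {p.name: p.resolve() for p in Path(directory).iterdir() if p.is_symlink()}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Made-up "original" data object folder
    original = root / "data" / "HNSCC-01-0199" / "images" / "72b0f9"
    original.mkdir(parents=True)

    # Made-up subset folder containing a symbolic link to the original
    subset_images = root / "dose_project" / "HNSCC-01-0199" / "images"
    subset_images.mkdir(parents=True)
    (subset_images / "72b0f9").symlink_to(original, target_is_directory=True)

    expected = original.resolve()
    links = describe_links(subset_images)
    print(links)
```

Because the link resolves to the original folder, nothing is copied on disk when a subset is prepared this way.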
Load prepared Dataset#
By supplying the dataset_name to the read_converted_data function, we obtain a DataFrame containing only the data objects that are part of that subset.
[5]:
df_dose_project = pydicer.read_converted_data(dataset_name=dose_project_name)
df_dose_project
[5]:
| | sop_instance_uid | hashed_uid | modality | patient_id | series_uid | for_uid | referenced_sop_instance_uid | path |
---|---|---|---|---|---|---|---|---|
0 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.264264397186... | c16e76 | RTDOSE | HNSCC-01-0199 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.233527028792... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.287865632112... | testdata_hnscc/data/HNSCC-01-0199/doses/c16e76 |
1 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.287865632112... | 664e96 | RTPLAN | HNSCC-01-0199 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.137463901488... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.166429645421... | testdata_hnscc/data/HNSCC-01-0199/plans/664e96 |
2 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.166429645421... | 06e49c | RTSTRUCT | HNSCC-01-0199 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.243934637013... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.240263316258... | testdata_hnscc/data/HNSCC-01-0199/structures/0... |
3 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.240263316258... | 72b0f9 | CT | HNSCC-01-0199 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.261759476368... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819... | NaN | testdata_hnscc/data/HNSCC-01-0199/images/72b0f9 |
4 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.169033525924... | 833a74 | RTDOSE | HNSCC-01-0176 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.279793773343... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.706719210726... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.470253980284... | testdata_hnscc/data/HNSCC-01-0176/doses/833a74 |
5 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.470253980284... | a6b346 | RTPLAN | HNSCC-01-0176 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.318927873561... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.706719210726... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.403955456521... | testdata_hnscc/data/HNSCC-01-0176/plans/a6b346 |
6 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.403955456521... | cbbf5b | RTSTRUCT | HNSCC-01-0176 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.276897558084... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.120880328745... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.334001018535... | testdata_hnscc/data/HNSCC-01-0176/structures/c... |
7 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.334001018535... | c4ffd0 | CT | HNSCC-01-0176 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.151938046710... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.120880328745... | NaN | testdata_hnscc/data/HNSCC-01-0176/images/c4ffd0 |
8 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.242809596262... | 309e1a | RTDOSE | HNSCC-01-0019 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.777975715563... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.290727775603... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.254865609982... | testdata_hnscc/data/HNSCC-01-0019/doses/309e1a |
9 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.254865609982... | 57b99f | RTPLAN | HNSCC-01-0019 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.202542618630... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.290727775603... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.168221415040... | testdata_hnscc/data/HNSCC-01-0019/plans/57b99f |
10 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.168221415040... | 7cdcd9 | RTSTRUCT | HNSCC-01-0019 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.103450757970... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.290727775603... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.418136430763... | testdata_hnscc/data/HNSCC-01-0019/structures/7... |
11 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.418136430763... | b281ea | CT | HNSCC-01-0019 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.233510441938... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.290727775603... | NaN | testdata_hnscc/data/HNSCC-01-0019/images/b281ea |
Notice that we now only have one data object of each modality per patient in our dose_project subset. We are now ready to work with that subset (e.g. to extract dose metrics).
Prepare Structure Dataset#
In the next example, we only want to extract structure sets and their associated images. This might be useful when training or validating an auto-segmentation model.
In this example, we not only select the latest structure set by date, but we also specify the StudyDescription values in the DICOM metadata of the data objects we want to select. To achieve this, we use the built-in rt_latest_struct function, which will also extract the image series linked to the selected structure set.
Observe the output of the following cell and explore the testdata_hnscc directory. We now have one structure set and the linked image for each patient.
[6]:
# Define the dataset name and the study description values to match
structure_project_name = "structure_project"
study_descriptions = [
    "RT SIMULATION"
]

# Prepare the subset of data
pydicer.dataset.prepare(
    structure_project_name,
    "rt_latest_struct",
    StudyDescription=study_descriptions
)

# Load the data subset and display the DataFrame
df_structure_project = pydicer.read_converted_data(dataset_name=structure_project_name)
df_structure_project
[6]:
| | sop_instance_uid | hashed_uid | modality | patient_id | series_uid | for_uid | referenced_sop_instance_uid | path |
---|---|---|---|---|---|---|---|---|
0 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.166429645421... | 06e49c | RTSTRUCT | HNSCC-01-0199 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.243934637013... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.240263316258... | testdata_hnscc/data/HNSCC-01-0199/structures/0... |
1 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.240263316258... | 72b0f9 | CT | HNSCC-01-0199 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.261759476368... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819... | NaN | testdata_hnscc/data/HNSCC-01-0199/images/72b0f9 |
2 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.403955456521... | cbbf5b | RTSTRUCT | HNSCC-01-0176 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.276897558084... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.120880328745... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.334001018535... | testdata_hnscc/data/HNSCC-01-0176/structures/c... |
3 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.334001018535... | c4ffd0 | CT | HNSCC-01-0176 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.151938046710... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.120880328745... | NaN | testdata_hnscc/data/HNSCC-01-0176/images/c4ffd0 |
4 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.168221415040... | 7cdcd9 | RTSTRUCT | HNSCC-01-0019 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.103450757970... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.290727775603... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.418136430763... | testdata_hnscc/data/HNSCC-01-0019/structures/7... |
5 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.418136430763... | b281ea | CT | HNSCC-01-0019 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.233510441938... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.290727775603... | NaN | testdata_hnscc/data/HNSCC-01-0019/images/b281ea |
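The idea of "latest object per patient, restricted to matching metadata values" can be expressed in a few lines of pandas. The sketch below is an illustration of that idea only, not PyDicer's implementation of rt_latest_struct. The table df_structs, its study_description and obj_date columns, and the latest_matching helper are all assumptions made up for this example.

```python
import pandas as pd

# Hypothetical metadata table: one row per structure set. The column names
# study_description and obj_date are assumptions made for this sketch.
df_structs = pd.DataFrame({
    "patient_id": ["A", "A", "A", "B"],
    "hashed_uid": ["s1", "s2", "s3", "s4"],
    "study_description": ["RT SIMULATION", "RT SIMULATION", "PET CT", "RT SIMULATION"],
    "obj_date": pd.to_datetime(["2001-01-05", "2001-02-10", "2001-03-01", "2003-06-20"]),
})


def latest_matching(df, descriptions):
    """Keep, per patient, the most recent row whose study description matches."""
    # Drop rows whose description is not in the list of accepted values
    df_match = df[df["study_description"].isin(descriptions)]
    # Newest first, so the first row of each patient group is the latest
    df_sorted = df_match.sort_values("obj_date", ascending=False)
    return df_sorted.groupby("patient_id").head(1).sort_values("patient_id")


df_latest = latest_matching(df_structs, ["RT SIMULATION"])
print(df_latest)
```

For patient A the newest structure set overall (s3) is skipped because its description does not match, so s2 is kept instead.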
Prepare Dataset from DataFrame#
In some scenarios, you may want to simply perform some filtering on the DataFrame returned by the read_converted_data function and generate a subset of data based on that.
In the following cell, a subset of data named image_project is generated by filtering the DataFrame to keep only CT images.
After running the following cell, explore the testdata_hnscc/image_project directory to confirm that only image objects were selected.
[7]:
# Read the converted DataFrame and filter only CT images
df = pydicer.read_converted_data()
df_ct = df[df.modality=="CT"]
# Prepare a data subset using this filtered DataFrame
image_project_name = "image_project"
pydicer.dataset.prepare_from_dataframe(image_project_name, df_ct)
# Load the data subset and display the DataFrame
df_image_project = pydicer.read_converted_data(dataset_name=image_project_name)
df_image_project
[7]:
| | sop_instance_uid | hashed_uid | modality | patient_id | series_uid | for_uid | referenced_sop_instance_uid | path |
---|---|---|---|---|---|---|---|---|
0 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.240263316258... | 72b0f9 | CT | HNSCC-01-0199 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.261759476368... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819... | NaN | testdata_hnscc/data/HNSCC-01-0199/images/72b0f9 |
1 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.334001018535... | c4ffd0 | CT | HNSCC-01-0176 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.151938046710... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.120880328745... | NaN | testdata_hnscc/data/HNSCC-01-0176/images/c4ffd0 |
2 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.107072817915... | 8e0da9 | CT | HNSCC-01-0176 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.176143398282... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.216161306702... | NaN | testdata_hnscc/data/HNSCC-01-0176/images/8e0da9 |
3 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.133948865586... | ec4aec | CT | HNSCC-01-0176 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.192899726585... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.216161306702... | NaN | testdata_hnscc/data/HNSCC-01-0176/images/ec4aec |
4 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.469610481459... | 33c44a | CT | HNSCC-01-0176 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.244362210503... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.310630617866... | NaN | testdata_hnscc/data/HNSCC-01-0176/images/33c44a |
5 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.418136430763... | b281ea | CT | HNSCC-01-0019 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.233510441938... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.290727775603... | NaN | testdata_hnscc/data/HNSCC-01-0019/images/b281ea |
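The filter does not have to be a single condition. As a hedged sketch, the example below keeps only those CT images that are referenced by a structure set, using the sop_instance_uid and referenced_sop_instance_uid columns seen in the tables above. df_example is a small made-up stand-in for the converted DataFrame, with shortened placeholder UIDs.

```python
import pandas as pd

# Small stand-in for the converted DataFrame (placeholder UIDs for readability)
df_example = pd.DataFrame({
    "sop_instance_uid": ["img-1", "img-2", "img-3", "struct-1"],
    "modality": ["CT", "CT", "CT", "RTSTRUCT"],
    "patient_id": ["P1", "P1", "P1", "P1"],
    "referenced_sop_instance_uid": [None, None, None, "img-2"],
})

# UIDs of the images that a structure set points at
df_structs = df_example[df_example["modality"] == "RTSTRUCT"]
referenced_uids = set(df_structs["referenced_sop_instance_uid"].dropna())

# Keep CT images only if a structure set references them
is_ct = df_example["modality"] == "CT"
is_referenced = df_example["sop_instance_uid"].isin(referenced_uids)
df_ct_with_struct = df_example[is_ct & is_referenced]

print(df_ct_with_struct)
```

A DataFrame filtered like this on the real converted data could then be passed to prepare_from_dataframe in the same way as df_ct above.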
Define Custom Preparation Function#
In more complex use cases you may want to define your own logic for extracting data objects into a subset. For example, you may have an additional DataFrame containing treatment start dates of patients, and you would like to select the dose grid, structure set and image series which are closest to that date.
In the following cell, we prepare a subset of data named clinical. We create a dummy set of clinical tabular data, df_clinical, which stores each patient’s stage and RT start date.
We use the information in df_clinical to select patients who are stage 1-3. For each of these patients, the function below keeps the latest dose grid along with its linked structure set and image series; the same pattern can be adapted to select the dose grid nearest to the treatment start date.
[8]:
# Define some dummy clinical data
df_clinical = pd.DataFrame([
    {
        "patient_id": "HNSCC-01-0199",
        "stage": 2,
        "rt_start_date": "2002-10-28",
    },
    {
        "patient_id": "HNSCC-01-0176",
        "stage": 1,
        "rt_start_date": "2009-03-02",
    },
    {
        "patient_id": "HNSCC-01-0019",
        "stage": 4,
        "rt_start_date": "1998-07-10",
    },
])

# Convert date to a datetime object
df_clinical['rt_start_date'] = pd.to_datetime(df_clinical['rt_start_date'], format='%Y-%m-%d')
[9]:
# Import some pydicer utility functions that we'll need
from pydicer.utils import load_object_metadata, determine_dcm_datetime

# Define a function which accepts the converted DataFrame as input and returns a filtered DataFrame
# of objects to keep in the data subset. This function also takes the clinical DataFrame as input.
def extract_clinical_data(df_data, df_clinical):

    # Merge the clinical data with our data objects. A left merge keeps one row per data object
    # in the original order (assuming one clinical row per patient), so we can restore the
    # original index labels, which are used at the end to select rows from df_data.
    df = pd.merge(df_data, df_clinical, on="patient_id", how="left")
    df.index = df_data.index

    # Filter out patients who aren't stage 1-3
    df = df[(df.stage >= 1) & (df.stage <= 3)].copy()

    # Determine the date of each data object
    df["obj_date"] = df.apply(lambda row: determine_dcm_datetime(load_object_metadata(row)), axis=1)

    # List to track row indices we will keep
    keep_rows = []

    # Sort the data objects by date in descending order, so we can select the first (latest)
    # dose grid and link the structure set and image series to use for the data subset.
    df = df.sort_values("obj_date", ascending=False)

    # Loop the data by patient to select the data objects
    for patient_id, df_pat in df.groupby("patient_id"):

        df_doses = df_pat[df_pat.modality == "RTDOSE"]

        # If there are no dose grids, we skip this patient
        if len(df_doses) == 0:
            continue

        # Otherwise, we select the first dose grid (which is the latest since they are sorted)
        # to keep
        dose_row = df_doses.iloc[0]

        df_linked_structs = pydicer.get_structures_linked_to_dose(dose_row)

        # Skip patient if no linked structure sets are found
        if len(df_linked_structs) == 0:
            continue

        # Finally, find the image linked to the structure set
        struct_row = df_linked_structs.iloc[0]

        df_linked_images = df[df.sop_instance_uid == struct_row.referenced_sop_instance_uid]

        # Skip if no images found
        if len(df_linked_images) == 0:
            continue

        image_row = df_linked_images.iloc[0]

        # Store the indices of these data objects
        keep_rows.append(image_row.name)
        keep_rows.append(struct_row.name)
        keep_rows.append(dose_row.name)

    # Return only the rows of the data objects we want to keep in the data subset
    return df_data.loc[keep_rows]
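The function above sorts by date and keeps the latest dose grid for each patient. Where the object closest to a known date is wanted instead, such as the treatment start date mentioned at the start of this section, the selection step could be swapped for something like the following. This is a minimal pandas sketch on made-up data: df_doses, df_start, the obj_date values and the nearest_to_start helper are all hypothetical and only mirror the column names used in this example.

```python
import pandas as pd

# Hypothetical dose grid dates (obj_date) and treatment start dates
df_doses = pd.DataFrame({
    "patient_id": ["P1", "P1", "P2"],
    "hashed_uid": ["d1", "d2", "d3"],
    "obj_date": pd.to_datetime(["2002-10-01", "2002-12-15", "2009-02-27"]),
})
df_start = pd.DataFrame({
    "patient_id": ["P1", "P2"],
    "rt_start_date": pd.to_datetime(["2002-10-28", "2009-03-02"]),
})


def nearest_to_start(df_objects, df_dates):
    """Return, per patient, the row whose obj_date is closest to rt_start_date."""
    df = df_objects.merge(df_dates, on="patient_id", how="inner")
    # Absolute gap in whole days between the object date and the start date
    df["gap_days"] = (df["obj_date"] - df["rt_start_date"]).abs().dt.days
    # idxmin gives the index label of the smallest gap within each patient
    nearest_idx = df.groupby("patient_id")["gap_days"].idxmin()
    return df.loc[nearest_idx].drop(columns="gap_days")


df_nearest = nearest_to_start(df_doses, df_start)
print(df_nearest)
```

For patient P1 the first dose grid is 27 days from the start date and the second is 48 days away, so d1 is selected even though d2 is the later one.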
[10]:
clinical_project_name = "clinical"
# Prepare the subset of data using our custom function
pydicer.dataset.prepare(
clinical_project_name,
extract_clinical_data,
df_clinical=df_clinical
)
# Load the data subset and display the DataFrame
df_clinical_project = pydicer.read_converted_data(dataset_name=clinical_project_name)
df_clinical_project
[10]:
| | sop_instance_uid | hashed_uid | modality | patient_id | series_uid | for_uid | referenced_sop_instance_uid | path |
---|---|---|---|---|---|---|---|---|
0 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.240263316258... | 72b0f9 | CT | HNSCC-01-0199 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.261759476368... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819... | NaN | testdata_hnscc/data/HNSCC-01-0199/images/72b0f9 |
1 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.166429645421... | 06e49c | RTSTRUCT | HNSCC-01-0199 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.243934637013... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.240263316258... | testdata_hnscc/data/HNSCC-01-0199/structures/0... |
2 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.264264397186... | c16e76 | RTDOSE | HNSCC-01-0199 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.233527028792... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.287865632112... | testdata_hnscc/data/HNSCC-01-0199/doses/c16e76 |
3 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.334001018535... | c4ffd0 | CT | HNSCC-01-0176 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.151938046710... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.120880328745... | NaN | testdata_hnscc/data/HNSCC-01-0176/images/c4ffd0 |
4 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.403955456521... | cbbf5b | RTSTRUCT | HNSCC-01-0176 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.276897558084... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.120880328745... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.334001018535... | testdata_hnscc/data/HNSCC-01-0176/structures/c... |
5 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.169033525924... | 833a74 | RTDOSE | HNSCC-01-0176 | 1.3.6.1.4.1.14519.5.2.1.1706.8040.279793773343... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.706719210726... | 1.3.6.1.4.1.14519.5.2.1.1706.8040.470253980284... | testdata_hnscc/data/HNSCC-01-0176/doses/833a74 |