Cancer Imaging Dataset for Treatment Response Prediction (CITRP)
This dataset contains cancer imaging data for predicting clinical response versus non-response to treatment in patients with various types of cancer. The images span multiple modalities, including CT, MRI, and PET, and carry labeled annotations of treatment response. The dataset lets researchers develop machine learning models that predict treatment efficacy, in support of personalized medicine and better clinical decision-making.
LINKS
DATA MANIFEST AUTHORS
- Data Links:
- Primary Data Portal Link
- Google Cloud Storage (GCS) Bucket
- Documentation (If different):
- Dataset Overview & Guidelines Link
VERSION INFORMATION
- Current Version: v1.2
- DOI: CITRP-2025-V1
- Release Date: 2025-02-01
- Last Updated: 2025-02-18
KEYWORDS
- Cancer Imaging
- Treatment Response Prediction
- Machine Learning
- Oncology
- MRI, CT, PET Scans
- Medical Imaging
- Personalized Medicine
- Response vs Non-response
- Contact dataset authors for ways to contribute
OWNER/PUBLISHER
- Medical Data Collaborative (MDC)
- Cancer Research Institute (CRI)
CONTACT DETAILS
Dataset Contacts:
- Dr. Jane Doe (Lead Author, CRI)
- Dr. John Smith (Co-author, University of Medicine)
- Dr. Emily Johnson (Co-author, Oncology Department)
FUNDING OR GRANT SUMMARY(IES)
National Cancer Institute (NCI) Grant #12345: "Predictive Models for Cancer Treatment Response"
NIH R01 Grant #67890: "Data Sharing and Imaging for Cancer Therapy Optimization"
The project also received support from a collaboration with the Cancer Research Institute's Imaging Lab.
Inclusion:
- Age: Patients aged 18 and older at the time of treatment.
- Diagnosis: Histologically confirmed cancer diagnosis.
- Treatment: Patients receiving one of the pre-defined treatments (chemotherapy, immunotherapy, radiation).
- Imaging: At least two sets of medical imaging available (baseline and follow-up).
- Consent: Signed informed consent for data use in research.
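The inclusion criteria above can be expressed as a simple eligibility check. The following is a minimal sketch only: the `PatientRecord` layout, its field names, and the timepoint strings are hypothetical and are not taken from the dataset schema.

```python
from dataclasses import dataclass
from typing import Tuple

# Treatments named in the inclusion criteria.
ELIGIBLE_TREATMENTS = {"chemotherapy", "immunotherapy", "radiation"}


@dataclass
class PatientRecord:
    """Hypothetical record layout, for illustration only."""
    age_at_treatment: int
    histologically_confirmed: bool
    treatment: str
    imaging_timepoints: Tuple[str, ...]
    consent_signed: bool


def meets_inclusion_criteria(p: PatientRecord) -> bool:
    """Return True only if every listed inclusion criterion is satisfied."""
    # Both a baseline and a follow-up imaging set must be present.
    has_required_imaging = {"baseline", "follow-up"} <= set(p.imaging_timepoints)
    return (
        p.age_at_treatment >= 18
        and p.histologically_confirmed
        and p.treatment in ELIGIBLE_TREATMENTS
        and has_required_imaging
        and p.consent_signed
    )
```

A record that fails any single criterion, for example one with only a baseline scan, is rejected.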
Confidentiality: Yes, the data includes sensitive patient information protected under doctor-patient confidentiality. Any personally identifiable information (PII) has been anonymized or pseudonymized to meet the standards of HIPAA (Health Insurance Portability and Accountability Act) in the U.S. However, specific imaging data could still pose risks of re-identification if not handled appropriately.
Considerations for external resources utilized:
- Public Datasets: Some imaging data are drawn from public cancer imaging repositories (e.g., The Cancer Imaging Archive). The use of these datasets must comply with the respective data-sharing agreements and terms of use.
- Proprietary Software: The dataset was pre-processed using proprietary software tools, which may require licensing for future use.
REGULATORY
OTHER CONSIDERATIONS
- REB/IRB approval for data collection?: Yes
- Were subjects consented?: Yes
- Was consent revocable?: No
- Considerations needed for future data updates: None
- Samples were collected from patients in the following countries:
- United States, Canada, and Germany, with a focus on urban hospitals and specialized cancer treatment centers. These regions represent a mix of healthcare systems and patient demographics.
- Known ELS issues introduced by preprocessing:
- Bias in Sampling: Preprocessing steps may unintentionally introduce sampling bias if certain groups (e.g., non-white patients or other minority ethnicities) are underrepresented due to exclusion criteria or underreporting in medical records.
- Data Anonymization: While efforts have been made to anonymize all personal data, some records may inadvertently retain re-identifiable information, particularly in the imaging data.
- Known problematic proxies:
- Socioeconomic Status: Proxy variables related to socioeconomic status (e.g., insurance type, income levels) may correlate with race/ethnicity, potentially leading to biased outcomes in predictive models.
- Geographical Location: Patients from urban areas may be overrepresented and rural patients underrepresented, which limits how well the dataset generalizes to broader populations.
- Was a data protection impact analysis performed?: No
Collected Sensitive Data Fields
| Field Name | Definition |
| --- | --- |
| Race | Self-reported race/ethnicity information, which may be relevant for studying disparities in treatment responses. |
| Gender | Gender (male/female) data, which could impact response predictions for certain cancer types due to biological differences. |
| Age | Patients' age is available, and the dataset includes both pediatric and adult populations, making age a relevant factor in treatment response. |
| Medical History | Data on prior medical conditions and treatments that could be considered personally identifiable information (PII) under HIPAA regulations. |
- IP / Terms of Use:
- The intellectual property (IP) for the dataset resides with the originating institutions, but the dataset is made available under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Researchers must provide proper attribution and ensure that the dataset is not used for commercial purposes without additional permissions.
- Third party intellectual property considerations:
- Some imaging data incorporated in the dataset were obtained from third-party repositories, such as The Cancer Imaging Archive. These data are subject to the terms of use provided by the respective repository, which typically allow for non-commercial research use only.
- Export controls or other regulations impacting dataset access/download or storage:
- Yes, some of the dataset's components, particularly those involving genomic data, are subject to export control regulations under U.S. law. The use and sharing of this data outside the United States must comply with U.S. export laws, and specific restrictions may apply to international collaborations.
Description of the data collection methodology: Data Sources: Data were obtained from multiple hospitals and cancer centers specializing in oncology treatments. This includes electronic health records (EHR), medical imaging archives, and genomic sequencing data.
Collection Techniques: Imaging data were collected during routine clinical visits at pre-specified timepoints (baseline, post-treatment, follow-up). Clinical data were sourced from EHRs, and genomic data were collected via whole-exome sequencing during routine cancer treatment workflows.
Link to MOP: [Insert URL or description of the manual] (e.g., https://www.dataset-mop.com)
Specific data collection devices:
- CT Scanner: Siemens SOMATOM Force (Model No. 12345, Software Version 1.0). Acquisition parameters: slice thickness of 1.0 mm, 120 kVp, 200 mA.
- MRI Scanner: GE Signa Premier 3T (Model No. 67890, Software Version 2.1). Imaging protocols: T1-weighted, T2-weighted, contrast-enhanced imaging sequences.
- Genomic Sequencing: Illumina NovaSeq 6000, with a minimum read depth of 30x per sample.
Dataset Curation: Manual Curation: Imaging features (e.g., tumor size, shape, and texture) were manually extracted by radiologists from CT and MRI images. These features were then used as input variables for predictive models.
Clinical Data Curation: Patient clinical records were manually reviewed by a team of clinicians to ensure accuracy in variables such as tumor stage, treatment regimens, and response classification.
Automated Feature Extraction: For genomic data, mutations were identified using bioinformatics pipelines. These features were automatically extracted and curated based on predefined algorithms for mutation discovery and variant annotation.
Dataset Validation:
- Targeted Resequencing: Whole-exome sequencing was performed for a subset of patients to validate key mutations identified in initial genome-wide association studies (GWAS). This aimed to confirm the presence of mutations that were initially discovered through broader genomic data.
- Expert Review: Imaging data were validated by a team of radiologists and oncologists who reviewed the accuracy of tumor segmentation and treatment response classifications.
- Cross-validation with External Datasets: The dataset was cross-validated against external benchmarks from other publicly available cancer imaging datasets, including those from The Cancer Imaging Archive (TCIA), to ensure consistency and reliability.
Data collection timeframe: January 1, 2018 to December 31, 2023
Preprocessing steps/workflow:
- Preprocessing for Imaging: All medical imaging data were standardized to a common resolution and normalized for intensity. Tumor regions of interest (ROIs) were manually outlined by radiologists and used for feature extraction.
- Preprocessing for Genomic Data: Raw sequencing reads were aligned to the reference genome using BWA (Burrows-Wheeler Aligner), and variant calling was performed using GATK (Genome Analysis Toolkit).
- Preprocessing for Clinical Data: Clinical data were cleaned and anonymized. Missing data were handled using multiple imputation methods where applicable.
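The imaging step above (resampling to a common resolution, then intensity normalization) can be sketched as follows. This is an illustration under stated assumptions, not the pipeline actually used: the datasheet does not name the target voxel spacing, the interpolation method, or the normalization scheme, so the 1.0 mm isotropic target, nearest-neighbour resampling, and z-score normalization below are all assumptions, as are the function names.

```python
import numpy as np


def resample_nearest(volume, spacing_mm, target_spacing_mm=(1.0, 1.0, 1.0)):
    """Resample a 3D volume to a target voxel spacing (nearest neighbour).

    The 1.0 mm isotropic default is an assumed value for illustration.
    """
    volume = np.asarray(volume)
    # Output size along each axis so that the physical extent is preserved.
    new_shape = [
        max(1, int(round(n * s / t)))
        for n, s, t in zip(volume.shape, spacing_mm, target_spacing_mm)
    ]
    # For each output index, pick the closest input index on that axis.
    index_grids = [
        np.minimum((np.arange(m) * t / s).round().astype(int), n - 1)
        for m, n, s, t in zip(new_shape, volume.shape, spacing_mm, target_spacing_mm)
    ]
    return volume[np.ix_(*index_grids)]


def zscore_normalize(volume):
    """Shift and scale intensities to zero mean and unit variance."""
    volume = np.asarray(volume, dtype=np.float64)
    std = volume.std()
    if std == 0:
        # A constant volume carries no contrast; return zeros instead of NaN.
        return np.zeros_like(volume)
    return (volume - volume.mean()) / std
```

In practice a medical imaging library with proper interpolation would likely replace `resample_nearest`; the sketch only shows the order of operations.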
Original predictive task associated with the labels: The original predictive task is to classify cancer patients into two categories, responder or non-responder, based on their treatment response. This classification task serves as the basis for building predictive models to forecast response from imaging and clinical data.
Are the actual labels provided in the dataset or only summarized?: The actual treatment response labels (responder/non-responder) are included in the dataset for each patient, along with the corresponding clinical and imaging data that were used to determine the label.
CLASS FREQUENCY OF LABELS
Responder: 40% (N=100)
Non-Responder: 60% (N=150)
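Because the classes are imbalanced (100 responders versus 150 non-responders), a model that always predicts the majority class already reaches 60% accuracy. One common mitigation is inverse-frequency class weighting; the sketch below uses the same heuristic as the "balanced" option in scikit-learn, and the function name is illustrative rather than part of the dataset tooling.

```python
# Label counts as reported in the class frequency section.
LABEL_COUNTS = {"responder": 100, "non_responder": 150}


def balanced_class_weights(counts):
    """Weight each class by total / (number_of_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


def majority_baseline_accuracy(counts):
    """Accuracy of a trivial model that always predicts the largest class."""
    return max(counts.values()) / sum(counts.values())


weights = balanced_class_weights(LABEL_COUNTS)
baseline = majority_baseline_accuracy(LABEL_COUNTS)
```

With these counts the responder class receives the larger weight (1.25), and any reported model accuracy should be compared against the 0.6 majority baseline.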
Guidelines for determining label values: Responder/Non-Responder Classification: The determination is based on tumor size changes measured by MRI or CT scan according to RECIST 1.1 criteria.
Responder: ≥30% decrease from baseline in the sum of diameters of target lesions.
Non-Responder: <30% decrease, stable disease, or progression.
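The labeling rule above reduces to a threshold on the percent change in the target-lesion sum. The sketch below implements only that threshold as stated in this datasheet; it does not cover other parts of RECIST 1.1 such as new lesions or non-target lesions, and the function and parameter names are hypothetical.

```python
def percent_change(baseline_sum_mm: float, follow_up_sum_mm: float) -> float:
    """Percent change in the target-lesion sum relative to baseline."""
    if baseline_sum_mm <= 0:
        raise ValueError("baseline sum of target lesions must be positive")
    return 100.0 * (follow_up_sum_mm - baseline_sum_mm) / baseline_sum_mm


def response_label(baseline_sum_mm: float, follow_up_sum_mm: float) -> str:
    """Apply the datasheet rule: a decrease of at least 30% is a responder."""
    change = percent_change(baseline_sum_mm, follow_up_sum_mm)
    return "responder" if change <= -30.0 else "non-responder"
```

For example, a target-lesion sum that shrinks from 50 mm to 35 mm is a 30% decrease and is labeled a responder, while 50 mm to 36 mm (a 28% decrease) is labeled a non-responder.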
Details of software used for generating labels: The OncoAI Platform was used for initial tumor segmentation and response prediction based on imaging data. This software uses deep learning algorithms to detect and quantify tumor regions, and the output is integrated with clinical data to classify treatment response.
Link: OncoAI Platform
Labels with corresponding gold-standard benchmarks or proxies:
- Responder vs Non-Responder labels were evaluated against the Response Evaluation Criteria in Solid Tumors (RECIST 1.1) and against review by expert pathologists, who examined histopathology slides to confirm tumor response at the cellular level.
Rater agreement/disagreement: Rater Agreement: The concordance between radiologists and pathologists in labeling treatment response was evaluated using Cohen’s kappa coefficient. The coefficient was 0.85, indicating strong agreement between raters.
Disagreements: In cases of rater disagreement, the labels were reviewed by a senior clinician for final determination.
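For readers who want to reproduce an agreement statistic of this kind, Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance from each rater's label frequencies. The sketch below is a generic implementation with toy labels; it does not use the dataset's rater annotations, and the function name is illustrative.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of cases")
    n = len(rater_a)
    # Fraction of cases on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from the product of each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; agreement is total.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

With toy labels `["R", "R", "N", "N"]` and `["R", "N", "N", "N"]`, observed agreement is 0.75 and chance agreement is 0.5, giving a kappa of 0.5.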