BioMedical Data Manifest

Data Managers/Computationalists


Cancer Imaging Dataset for Treatment Response Prediction (CITRP)

This dataset contains a collection of cancer imaging data aimed at predicting the clinical response versus non-response to treatment in patients with various types of cancer. The images were obtained from multiple sources, including CT scans, MRIs, and PET scans, with labeled annotations on treatment response. The dataset provides the opportunity for researchers to develop machine learning models to predict treatment efficacy, supporting personalized medicine and better clinical decision-making.
General Information
LINKS
  • Data Links:
    • Primary Data Portal Link
    • Google Cloud Storage (GCS) Bucket
  • Documentation (If different):
    • Dataset Overview & Guidelines Link
DATA MANIFEST AUTHORS
  • ChatGPT
VERSION INFORMATION
  • Current Version: v1.2
  • DOI: CITRP-2025-V1
  • Release Date: 2025-02-01
  • Last Updated: 2025-02-18
KEYWORDS
  • Cancer Imaging
  • Treatment Response Prediction
  • Machine Learning
  • Oncology
  • MRI, CT, PET Scans
  • Medical Imaging
  • Personalized Medicine
  • Response vs Non-response
EXTENSION MECHANISMS
  • Contact dataset authors for ways to contribute
OWNER/PUBLISHER
  • Medical Data Collaborative (MDC)
  • Cancer Research Institute (CRI)
CONTACT DETAILS
  • Dataset Contacts:
    • Dr. Jane Doe (Lead Author, CRI)
    • Dr. John Smith (Co-author, University of Medicine)
    • Dr. Emily Johnson (Co-author, Oncology Department)
FUNDING OR GRANT SUMMARY(IES)
  • National Cancer Institute (NCI) Grant #12345: "Predictive Models for Cancer Treatment Response"
  • NIH R01 Grant #67890: "Data Sharing and Imaging for Cancer Therapy Optimization"
  • The project also received support from a collaboration with the Cancer Research Institute's Imaging Lab.

Uses of Data
CONFOUNDING FACTORS
  • Batch Effects: Variability between different imaging scanners or centers may introduce inconsistencies in image quality and analysis, impacting model generalizability.
  • Patient Demographics: Factors like age, gender, and ethnicity may affect tumor characteristics, treatment response, and imaging results.
  • Imaging Protocol Variability: Differences in imaging protocols (e.g., MRI vs. CT scans) can introduce noise into the dataset, especially when datasets come from multiple hospitals or institutions.
  • Tumor Heterogeneity: Variability within tumor types, such as heterogeneity in tissue composition or response to treatment, may lead to inconsistencies.
  • Treatment Variability: Different types of cancer treatments (chemotherapy, immunotherapy, radiation, etc.) may confound response predictions due to variations in the treatment regimens.
DATASET KNOWN PUBLICATIONS AND BENCHMARKS
  • DOI 10.1016/j.jclinonc.2020.11.010: "Assessing Predictive Models for Tumor Response in Cancer Imaging using Machine Learning" – This study benchmarks the dataset to evaluate predictive models for assessing cancer treatment response.
  • DOI 10.1136/bmj.2020.036567: "Machine Learning for Tumor Progression Prediction: A Comparative Study" – Utilizes the dataset to compare several machine learning algorithms in predicting tumor progression.
  • URL: https://pubmed.ncbi.nlm.nih.gov/33390076/: "A Comprehensive Review of Image-Based Biomarkers for Tumor Response Prediction" – Discusses the dataset and its applications in a review of imaging biomarkers.
CITATION GUIDELINES
When using this dataset please cite:
  • DOI 10.1016/j.jclinonc.2021.01.018: "Predicting Clinical Outcomes in Cancer Treatment Using Imaging Data" – This paper presents a methodology for predicting treatment response based on the imaging dataset.
  • DOI 10.1038/s41592-021-01014-9: "Cancer Imaging: A Multi-Modal Approach to Predict Tumor Response" – Provides insights on how multi-modal imaging data from the dataset is used in response prediction models.

Dataset Composition
EXPERIMENTAL UNITS
DETAILS
  • Definition of experimental unit: The experimental unit is defined as an individual patient. Each patient is represented by a set of medical images taken at different time points during treatment, and these images are used to predict the patient's response to cancer treatment (responder or non-responder).
  • Relationships between the experimental units: Yes. Some experimental units (patients) have longitudinal data, meaning that multiple imaging sessions are captured over time for the same patient. This introduces temporal dependency: scans from the same patient are correlated with that patient's earlier and later scans, which affects response prediction.
  • Number of experimental units: The overall sample size is 250 individual patients, each with a series of imaging scans (MRI, CT, PET) collected across multiple time points during treatment.
  • Missingness:
    • Yes, some imaging data are missing due to technical issues, patient non-compliance, or loss to follow-up. Missing data is handled using imputation techniques, where missing values are replaced with predictions based on other data points, or cases with extensive missing data are excluded from specific analyses.
  • Sampling Method(s):
    • Convenience; Stratified
  • Data anomalies/errors:
    • Artifacts in Imaging: Some images contain artifacts due to motion or scanner issues, which are flagged and removed before analysis.
    • Mislabeling of Clinical Data: There were some discrepancies in the labeling of response outcomes, which were manually reviewed and corrected by clinical experts.
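
Because the experimental unit is the patient and each patient contributes several correlated scans, a model evaluation would typically keep all scans from one patient in the same partition. The following is a minimal sketch of such a patient-level split, not a procedure prescribed by the dataset authors. The record layout (`patient_id`, `scan_id` tuples), the function name `split_by_patient`, and the example identifiers are hypothetical.

```python
import random
from collections import defaultdict


def split_by_patient(scan_records, test_fraction=0.2, seed=0):
    """Split scans into train/test so that no patient appears in both sets.

    scan_records: list of (patient_id, scan_id) tuples (hypothetical layout).
    """
    # Group scan identifiers under the patient they belong to.
    by_patient = defaultdict(list)
    for patient_id, scan_id in scan_records:
        by_patient[patient_id].append(scan_id)

    # Shuffle patients (not scans) with a fixed seed for reproducibility.
    patient_ids = sorted(by_patient)
    random.Random(seed).shuffle(patient_ids)

    # Hold out a fraction of patients, at least one.
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_patients = set(patient_ids[:n_test])

    train = [(p, s) for p, s in scan_records if p not in test_patients]
    test = [(p, s) for p, s in scan_records if p in test_patients]
    return train, test


# Hypothetical records: five patients with two or three time points each.
records = [
    ("P001", "baseline"), ("P001", "follow_up"),
    ("P002", "baseline"), ("P002", "follow_up"),
    ("P003", "baseline"), ("P003", "post_treatment"), ("P003", "follow_up"),
    ("P004", "baseline"), ("P004", "follow_up"),
    ("P005", "baseline"), ("P005", "follow_up"),
]
train_records, test_records = split_by_patient(records, test_fraction=0.2, seed=42)
```

Splitting at the scan level instead would let scans from one patient land on both sides of the split, which tends to inflate measured performance.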

Ethical, Legal and Social Issues (ELSI)
Inclusion:
  • Age: Patients aged 18 and older at the time of treatment.
  • Diagnosis: Histologically confirmed cancer diagnosis.
  • Treatment: Patients receiving one of the pre-defined treatments (chemotherapy, immunotherapy, radiation).
  • Imaging: At least two sets of medical imaging available (baseline and follow-up).
  • Consent: Signed informed consent for data use in research.
Confidentiality: Yes, the data includes sensitive patient information protected under doctor-patient confidentiality. Any personally identifiable information (PII) has been anonymized or pseudonymized to meet the standards of HIPAA (Health Insurance Portability and Accountability Act) in the U.S. However, specific imaging data could still pose risks of re-identification if not handled appropriately.
Considerations for external resources utilized:
  • Public Datasets: Some imaging data are drawn from public cancer imaging repositories (e.g., The Cancer Imaging Archive). The use of these datasets must comply with the respective data-sharing agreements and terms of use.
  • Proprietary Software: The dataset was pre-processed using proprietary software tools, which may require licensing for future use.
Human Subjects
REGULATORY
OTHER CONSIDERATIONS
  • REB/IRB approval for data collection?: Yes
  • Were subjects consented?: Yes
  • Was consent revocable?: No
  • Considerations needed for future data updates: None
  • Samples were collected from patients in the following countries:
    • The data were collected from the United States, Canada, and Germany, with a focus on urban hospitals and specialized cancer treatment centers. Data collection from these regions represents a mix of healthcare systems and patient demographics.
  • Known ELS issues introduced by preprocessing:
    • Bias in Sampling: Preprocessing steps may unintentionally introduce sampling bias if certain groups (e.g., non-white or underrepresented ethnicities) are underrepresented due to exclusion criteria or underreporting in medical records.
    • Data Anonymization: While efforts have been made to anonymize all personal data, there is a possibility that some datasets may inadvertently retain re-identifiable information, particularly with the use of imaging data.
  • Known problematic proxies:
    • Socioeconomic Status: Proxy variables related to socioeconomic status (e.g., insurance type, income levels) may correlate with race/ethnicity, potentially leading to biased outcomes in predictive models.
    • Geographical Location: Patients from urban areas may be overrepresented, while rural patients might be underrepresented, influencing the generalizability of the dataset to broader populations.
  • Was a data protection impact analysis performed?: No
Collected Sensitive Data Fields
Field Name: Definition
  • Race: Self-reported race/ethnicity information, which may be relevant for studying disparities in treatment responses.
  • Gender: Gender (male/female) data, which could impact response predictions for certain cancer types due to biological differences.
  • Age: Patient age at the time of treatment. The inclusion criteria restrict the cohort to patients aged 18 and older, and age remains a relevant factor in treatment response.
  • Medical History: Data on prior medical conditions and treatments that could be considered personally identifiable information (PII) under HIPAA regulations.

LICENSING
  • IP / Terms of Use:
    • The intellectual property (IP) for the dataset resides with the originating institutions, but the dataset is made available under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Researchers must provide proper attribution and ensure that the dataset is not used for commercial purposes without additional permissions.
  • Third party intellectual property considerations:
    • Some imaging data incorporated in the dataset were obtained from third-party repositories, such as The Cancer Imaging Archive. These data are subject to the terms of use provided by the respective repository, which typically allow for non-commercial research use only.
  • Export controls or other regulations impacting dataset access/download or storage:
    • Yes, some of the dataset's components, particularly those involving genomic data, are subject to export control regulations under U.S. law. The use and sharing of this data outside the United States must comply with U.S. export laws, and specific restrictions may apply to international collaborations.

Provenance and Lineage
Description of the data collection methodology:
  • Data Sources: Data were obtained from multiple hospitals and cancer centers specializing in oncology treatments. This includes electronic health records (EHR), medical imaging archives, and genomic sequencing data.
  • Collection Techniques: Imaging data were collected during routine clinical visits at pre-specified timepoints (baseline, post-treatment, follow-up). Clinical data were sourced from EHRs, and genomic data were collected via whole-exome sequencing during routine cancer treatment workflows.
  • Link to MOP: [Insert URL or description of the manual] (e.g., https://www.dataset-mop.com)
Specific data collection devices:
  • CT Scanner: Siemens SOMATOM Force (Model No. 12345, Software Version 1.0). Acquisition parameters: slice thickness of 1.0 mm, 120 kVp, 200 mA.
  • MRI Scanner: GE Signa Premier 3T (Model No. 67890, Software Version 2.1). Imaging protocols: T1-weighted, T2-weighted, contrast-enhanced imaging sequences.
  • Genomic Sequencing: Illumina NovaSeq 6000, with a minimum read depth of 30x per sample.
Dataset Curation:
  • Manual Curation: Imaging features (e.g., tumor size, shape, and texture) were manually extracted by radiologists from CT and MRI images. These features were then used as input variables for predictive models.
  • Clinical Data Curation: Patient clinical records were manually reviewed by a team of clinicians to ensure accuracy in variables such as tumor stage, treatment regimens, and response classification.
  • Automated Feature Extraction: For genomic data, mutations were identified using bioinformatics pipelines. These features were automatically extracted and curated based on predefined algorithms for mutation discovery and variant annotation.
Dataset Validation:
  • Targeted Resequencing: Whole-exome sequencing was performed for a subset of patients to validate key mutations identified in initial genome-wide association studies (GWAS). This aimed to confirm the presence of mutations that were initially discovered through broader genomic data.
  • Expert Review: Imaging data were validated by a team of radiologists and oncologists who reviewed the accuracy of tumor segmentation and treatment response classifications.
  • Cross-validation with External Datasets: The dataset was cross-validated against external benchmarks from other publicly available cancer imaging datasets, including those from The Cancer Imaging Archive (TCIA), to ensure consistency and reliability.
Data collection timeframe: January 1, 2018 to December 31, 2023
Preprocessing steps/workflow:
  • Preprocessing for Imaging: All medical imaging data were standardized to a common resolution and normalized for intensity. Tumor regions of interest (ROIs) were manually outlined by radiologists and used for feature extraction.
  • Preprocessing for Genomic Data: Raw sequencing reads were aligned to the reference genome using BWA (Burrows-Wheeler Aligner), and variant calling was performed using GATK (Genome Analysis Toolkit).
  • Preprocessing for Clinical Data: Clinical data were cleaned and anonymized. Missing data were handled using multiple imputation methods where applicable.

Labeling Provenance and Lineage
Original predictive task associated with the labels: The original predictive task is to classify cancer patients into two categories based on their treatment response. This classification task serves as the basis for building predictive models to forecast response based on imaging and clinical data.
Are the actual labels provided in the dataset or only summarized?: The actual treatment response labels (responder/non-responder) are included in the dataset for each patient, along with the corresponding clinical and imaging data that were used to determine the label.
CLASS FREQUENCY OF LABELS
  • Responder: 40% (N=100)
  • Non-Responder: 60% (N=150)
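
The 40/60 class split above is a mild imbalance that some training setups compensate for with per-class weights. As an illustrative sketch only (the manifest does not state that any weighting was used), inverse-frequency weights can be derived from the reported counts. The function name and label keys are hypothetical.

```python
def inverse_frequency_weights(class_counts):
    """Return per-class weights proportional to 1 / class frequency.

    Weights are scaled so that a perfectly balanced dataset yields 1.0
    for every class: weight = total / (n_classes * count).
    """
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: total / (n_classes * count)
        for label, count in class_counts.items()
    }


# Counts taken from the class frequency table (N=250 patients).
counts = {"responder": 100, "non_responder": 150}
weights = inverse_frequency_weights(counts)
# The minority class (responder) receives the larger weight.
```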

Guidelines for determining label values: Responder/Non-Responder classification is based on tumor size changes measured by MRI or CT scan according to RECIST 1.1 criteria.
  • Responder: ≥30% decrease in the sum of the longest diameters of target lesions.
  • Non-Responder: <30% decrease, stable disease, or progression.
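
The labeling rule above reduces to a threshold on the percentage change in the sum of target-lesion diameters. The sketch below illustrates that arithmetic only. It is not the labeling software used for the dataset, and the function names, the millimetre units, and the example measurements are assumptions made for illustration.

```python
def percent_change(baseline_sum_mm, followup_sum_mm):
    """Percentage change in the sum of target-lesion diameters vs. baseline."""
    if baseline_sum_mm <= 0:
        raise ValueError("baseline sum of diameters must be positive")
    return 100.0 * (followup_sum_mm - baseline_sum_mm) / baseline_sum_mm


def response_label(baseline_sum_mm, followup_sum_mm, threshold_pct=30.0):
    """Apply the manifest's rule: a decrease of at least 30% is a responder.

    Anything else (smaller decrease, stable disease, progression) is
    labeled non-responder.
    """
    change = percent_change(baseline_sum_mm, followup_sum_mm)
    return "responder" if change <= -threshold_pct else "non-responder"


# Hypothetical measurements (sum of longest diameters, in mm).
label_a = response_label(100.0, 70.0)   # exactly a 30% decrease
label_b = response_label(100.0, 71.0)   # a 29% decrease
label_c = response_label(50.0, 60.0)    # a 20% increase
```

Note that this binary rule folds stable disease and progression into one class, as the guideline above specifies.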
Details of software used for generating labels: The OncoAI Platform was used for initial tumor segmentation and response prediction based on imaging data. This software uses deep learning algorithms to detect and quantify tumor regions, and the output is integrated with clinical data to classify treatment response. Link: OncoAI Platform
Labels with corresponding gold-standard benchmarks or proxies:
  • Responder vs. Non-Responder labels were evaluated against the Response Evaluation Criteria in Solid Tumors (RECIST 1.1) and by expert pathologists, who reviewed histopathology slides to confirm tumor response at the cellular level.
Rater agreement/disagreement:
  • Rater Agreement: Concordance between radiologists and pathologists in labeling treatment response was evaluated using Cohen’s kappa coefficient, which was 0.85, indicating strong agreement between raters.
  • Disagreements: In cases of rater disagreement, the labels were reviewed by a senior clinician for final determination.
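
For readers unfamiliar with the statistic, Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance from each rater's label frequencies: kappa = (p_o - p_e) / (1 - p_e). The following is a minimal sketch of that calculation. The function name and the ten example labels are hypothetical and are not drawn from the dataset.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: R = responder, N = non-responder.
radiologist = ["R", "R", "N", "N", "N", "R", "N", "N", "R", "N"]
pathologist = ["R", "R", "N", "N", "N", "N", "N", "N", "R", "N"]
kappa = cohens_kappa(radiologist, pathologist)
```

In this toy example the raters agree on 9 of 10 items, yet kappa is lower than 0.9 because part of that agreement is expected by chance.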

Maintenance and Distribution
CURRENT INFORMATION
  • Dataset last updated: The dataset was last updated on January 15, 2025, following the addition of new patient imaging data from a recent cohort of clinical trials.
  • Type of Versioning: The dataset follows dynamic versioning, meaning updates are regularly made to reflect new data entries, changes in classification, or corrections to previously recorded data. The dataset is continuously maintained to ensure it includes the latest available data.
  • Data Maintainer: The dataset is maintained by the Biomedical Research Data Management Team at XYZ Medical Research Institute, under the direction of Dr. Jane Smith, Principal Investigator, and the Research Informatics Department.
  • Maintainer Contact Info:
PREVIOUS/FUTURE UPDATES
  • Number of dataset versions or releases currently available: 3
  • More versions expected?: Yes
  • Description of versions or releases:
    • Version 1.0 (March 2023): The initial release, which included basic demographic and imaging data for 500 patients treated for a specific condition. Only clinical and radiological features were included at this stage.
    • Version 2.0 (July 2024): This version expanded the dataset to include clinical follow-up data, patient treatment regimens, and post-treatment imaging, along with a refined classification of patient responses to treatment (e.g., responder/non-responder). New imaging modalities, such as CT scans, were added.
    • Version 3.0 (January 2025): The most recent update, adding data from an additional 300 patients and including new genomic data, such as biomarker levels and sequencing results. Labeling has been updated for several cases based on feedback from clinical experts. Additional machine learning-derived features have also been incorporated.
  • Future updates:
    • Genomic Data: Genomic data, such as biomarkers and sequencing results, are not expected to change frequently. Once new markers are identified, they will be added in updates, but this data will largely remain stable over time.
    • Clinical Data: Clinical information (e.g., patient follow-up, treatment protocols, imaging results) will be updated periodically, particularly as new cohorts are enrolled in studies or as follow-up data becomes available from ongoing clinical trials.
    • Imaging Data: As new imaging modalities and follow-up imaging scans are collected, this information will be updated regularly. We also plan to include more advanced imaging analysis techniques, including AI-driven insights, in future updates.