TCGA-LAML BioMedical Data Manifest

GDC Project ID: TCGA-LAML

GDC Summary page not available

General Information

LINKS

Data Links:
- Raw:
  - dbGaP
  - Genomic Data Commons
- Processed:
  - Genomic Data Commons
Documentation (If different):
- For GDC-specific workflows, see this link; otherwise, see the accompanying dataset manuscripts

DATA MANIFEST AUTHORS

Data Manifest aggregated from GDC

VERSION INFORMATION

Current Version: Data Release 42.0
DOI: N/A
Release Date: January 30, 2025
Last Updated: N/A

KEYWORDS

Hematopoietic and reticuloendothelial systems
Myeloid Leukemias

EXTENSION MECHANISMS

Contact Dataset Owners/Publishers for ways to contribute

OWNER/PUBLISHER

CONTACT DETAILS

Dataset Contacts:

DATASET ORIGINAL USE

CONCERNS AND LIMITATIONS

General Research Use: Use of the data is limited only by the terms of the model Data Use Certification.

CONFOUNDING FACTORS

DATASET KNOWN PUBLICATIONS AND BENCHMARKS

When using this dataset please cite:

Use of data from dbGaP: Please cite/reference the use of dbGaP data by including the dbGaP accession phs000178
Acknowledgement Statement: The results published here are in whole or part based upon data generated by The Cancer Genome Atlas managed by the NCI and NHGRI. Information about TCGA can be found at http://cancergenome.nih.gov.

DATA SUBJECT(S)

DATASET SNAPSHOT

DETAILS

Each experimental unit refers to a patient sample
Each patient (Case) can have multiple experimental units, corresponding, for example, to different timepoints or tissue sources.

**Above:** Overall statistics of the open dataset.
Open Access
Size of Dataset	11.159 GB
Number of Cases	200
Number of Files	4,099
File Types	BCR Biotab,TSV,TXT,CEL,IDAT,BCR XML,MAF

**Above:** Overall statistics of the controlled dataset.
Controlled Access
Size of Dataset	43.119 TB
Number of Cases	200
Number of Files	4,740
File Types	VCF,TSV,MAF,BEDPE,BAM,CEL
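
The open and controlled file sets summarized above can, in principle, be listed through the public GDC API. The following is a minimal sketch, not the aggregator's actual method: it only builds the query parameters for the `files` endpoint and sends no request. The helper name `build_files_query` and the chosen `fields` list are illustrative assumptions, and counts returned by a live query may differ from Data Release 42.0.

```python
import json

# Public GDC API endpoint for file metadata.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_files_query(project_id, access, size=100):
    """Build query parameters for the GDC files endpoint (hypothetical helper).

    No network request is made here; pass the result as `params` to an
    HTTP GET against GDC_FILES_ENDPOINT if you want to run the query.
    """
    filters = {
        "op": "and",
        "content": [
            {"op": "in",
             "content": {"field": "cases.project.project_id", "value": [project_id]}},
            {"op": "in",
             "content": {"field": "access", "value": [access]}},
        ],
    }
    return {
        "filters": json.dumps(filters),
        # Assumed selection of fields; adjust to the metadata you need.
        "fields": "file_id,file_name,data_format,file_size",
        "format": "JSON",
        "size": str(size),
    }


params = build_files_query("TCGA-LAML", "open")
print(params["filters"])
```

Swapping `"open"` for `"controlled"` targets the controlled-access tier; downloading those files still requires dbGaP authorization.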

Modalities:
- Simple Nucleotide Variation
- Sequencing Reads
- Biospecimen
- Clinical
- Copy Number Variation
- Transcriptome Profiling
- DNA Methylation
- Structural Variation
Missingness:
- Clinical data may be missing due to lack of access to records or curation; missing values are typically represented by blank, NA, or Unknown entries
- Not all genomics data is captured for each patient sample
- Additionally, all genomic assays may have regions below the limit of detection (LOD) or background
Sampling:
- N/A
Data anomalies/errors:
- N/A
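
Because missing clinical values appear under several spellings, it may help to map them to a single sentinel before analysis. The sketch below assumes only the tokens named above (blank, NA, Unknown); the function names and the example record are hypothetical, and other placeholders may occur in real files.

```python
# Tokens treated as missing, compared case-insensitively after stripping.
# Only the spellings named in this manifest are included; extend as needed.
MISSING_TOKENS = frozenset({"", "na", "unknown"})


def normalize_missing(value, missing_tokens=MISSING_TOKENS):
    """Return None for values that denote missing data, else the stripped text."""
    if value is None:
        return None
    text = str(value).strip()
    return None if text.lower() in missing_tokens else text


def clean_record(record):
    """Apply normalize_missing to every field of one clinical record."""
    return {key: normalize_missing(val) for key, val in record.items()}


# Hypothetical record for illustration; not taken from the dataset.
example = {"race": "Unknown", "gender": "male", "days_to_death": " ", "vital_status": "NA"}
cleaned = clean_record(example)
print(cleaned)
```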

SUMMARY OF INSTANCES AND DATATYPES

Sample Types vs Modalities

**Above:** Number of instances per sample type and modality for open access data
Open Access Modalities
Instance Type	Simple Nucleotide Variation	Copy Number Variation	Transcriptome Profiling	DNA Methylation
Buccal Cell Normal	1	0	0	0
Primary Blood Derived Cancer - Peripheral Blood	149	200	203	194
Solid Tissue Normal	143	200	0	0

**Above:** Number of instances per sample type and modality for controlled access data
Controlled Access Modalities
Instance Type	Simple Nucleotide Variation	Copy Number Variation	Structural Variation	Transcriptome Profiling
Buccal Cell Normal	1	0	0	0
Primary Blood Derived Cancer - Peripheral Blood	270	200	151	151
Solid Tissue Normal	261	200	0	0

Additional Notes: Not all of these instances are used in a given analysis. See the respective publications or other details in this document for more information.

Sample Types vs Diagnoses

**Above:** Number of instances per sample type and diagnosis
Instance Type	Myeloid Leukemias
Buccal Cells	1
Peripheral Blood NOS	356
Solid Tissue	340

Clinical Variables vs Diagnoses

**Above:** Number of cases per clinical variable and diagnosis
Clinical Variable	Myeloid Leukemias
race	200
gender	200
ethnicity	200
vital_status	200
age_at_index	200
days_to_birth	200
age_is_obfuscated	200
days_to_death	120
state	200
synchronous_malignancy	200
days_to_diagnosis	200
tissue_or_organ_of_origin	200
age_at_diagnosis	200
primary_diagnosis	200
prior_malignancy	200
year_of_diagnosis	200
state	200
prior_treatment	200
diagnosis_is_primary_disease	200
morphology	200
classification_of_tumor	200
fab_morphology_code	200
icd_10_code	200
site_of_resection_or_biopsy	200
calgb_risk_group	197
chemical_exposure_type	4
exposure_type	4
state	4
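
The counts above can be read as per-variable completeness over the 200 cases. The sketch below computes that fraction for the variables that are not populated for every case; the variable names and counts are copied from the table, while the helper `completeness` is illustrative. A low value is not necessarily a quality problem: `days_to_death`, for instance, presumably applies only to deceased patients.

```python
TOTAL_CASES = 200

# Number of cases with a value, copied from the table above
# (only variables populated for fewer than all 200 cases).
cases_with_value = {
    "days_to_death": 120,
    "calgb_risk_group": 197,
    "chemical_exposure_type": 4,
    "exposure_type": 4,
}


def completeness(n_with_value, total=TOTAL_CASES):
    """Fraction of cases for which a clinical variable is populated."""
    return n_with_value / total


fractions = {name: completeness(n) for name, n in cases_with_value.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%}")
```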

Specific Diagnosis vs Diagnoses

**Above:** Number of instances per specific diagnosis and diagnosis
Primary Diagnosis	Myeloid Leukemias
Acute megakaryoblastic leukaemia	14
Acute monocytic leukemia	77
Acute myeloid leukemia with maturation	154
Acute myeloid leukemia without maturation	138
Acute myeloid leukemia, M6 type	20
Acute myeloid leukemia, NOS	4
Acute myeloid leukemia, minimal differentiation	65
Acute myelomonocytic leukemia	148
Acute promyelocytic leukaemia, t(15;17)(q22;q11-12)	77
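
As a quick consistency check, the per-diagnosis instance counts above should sum to the same total as the per-sample-type counts in the Sample Types vs Diagnoses table. The sketch below simply re-adds the numbers as printed in this manifest; it makes no claim about how the aggregator derived them. Note that these are instance counts, not the 200 cases.

```python
# Instance counts per primary diagnosis, copied from the table above.
by_primary_diagnosis = {
    "Acute megakaryoblastic leukaemia": 14,
    "Acute monocytic leukemia": 77,
    "Acute myeloid leukemia with maturation": 154,
    "Acute myeloid leukemia without maturation": 138,
    "Acute myeloid leukemia, M6 type": 20,
    "Acute myeloid leukemia, NOS": 4,
    "Acute myeloid leukemia, minimal differentiation": 65,
    "Acute myelomonocytic leukemia": 148,
    "Acute promyelocytic leukaemia, t(15;17)(q22;q11-12)": 77,
}

# Instance counts per sample type, copied from the Sample Types vs Diagnoses table.
by_sample_type = {
    "Buccal Cells": 1,
    "Peripheral Blood NOS": 356,
    "Solid Tissue": 340,
}

total_by_diagnosis = sum(by_primary_diagnosis.values())
total_by_sample_type = sum(by_sample_type.values())
print(total_by_diagnosis, total_by_sample_type)
```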

Ethical, Legal and Social Issues (ELSI)

Inclusion: TCGA utilizes a strict set of criteria for inclusion into the study due to the rigorous and comprehensive nature of the work being performed. Tumor samples and a matched source of germline DNA are curated and processed by the Biospecimen Core Resource (BCR), a centralized site that reviews sample data and processes all samples to ensure consistent pathology assessment and generation of molecular analytes (DNA and RNA). TCGA focuses on primary untreated tumors that were snap frozen upon collection. All tumors must have a matched normal sample from the same patient. In many cases, the matched normal is a sample of the patient's blood. Once at the BCR, all samples are subjected to a quality control protocol before they are accepted for full analysis into the TCGA pipeline. Each sample is reviewed by a pathologist to confirm the diagnosis and that the sample meets inclusion criteria. Specifically, TCGA requires that samples contain at least 60% tumor nuclei and have less than 20% necrotic tissue. Once the sample passes the pathology review, nucleic acids are isolated and genotyping is performed so that each tumor sample is properly associated with the correct normal tissue. An important goal in establishing this central resource is to ensure that molecular analytes (i.e., DNA and RNA) extracted from tissue samples are of consistent and high quality. Next, these analytes undergo a molecular quality control process and are then distributed to TCGA Cancer Genome Characterization Centers and Genome Sequencing Centers for genomic analysis. All samples in TCGA have been collected and utilized following strict policies and guidelines for the protection of human subjects, informed consent and IRB review of protocols.
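
The two numeric pathology thresholds stated above can be written as a simple predicate. This is only an illustration of the stated cutoffs (at least 60% tumor nuclei, less than 20% necrosis); the function and parameter names are hypothetical, and the actual BCR review involves a pathologist and further quality control steps not modeled here.

```python
MIN_TUMOR_NUCLEI_PCT = 60.0  # "at least 60% tumor nuclei"
MAX_NECROSIS_PCT = 20.0      # "less than 20% necrotic tissue" (exclusive bound)


def meets_numeric_criteria(tumor_nuclei_pct, necrosis_pct):
    """Check only the two percentage thresholds quoted in the inclusion text."""
    return (tumor_nuclei_pct >= MIN_TUMOR_NUCLEI_PCT
            and necrosis_pct < MAX_NECROSIS_PCT)


# Hypothetical samples for illustration.
print(meets_numeric_criteria(75.0, 5.0))
print(meets_numeric_criteria(60.0, 20.0))
```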

Confidentiality: Usage of Open Access data has been deemed safe for general usage with minimal risk to patients. Restricted Access data could potentially be used to re-identify a given patient and is therefore subject to special access requirements.

Considerations for external resources utilized: N/A

REGULATORY

OTHER CONSIDERATIONS

REB/IRB approval for data collection?: Generally, all patients in these studies have provided informed consent; see dbGaP for more details

Samples were collected from patients in the following countries:
- N/A
Known ELS issues introduced by preprocessing: N/A
Known problematic proxies: N/A
Was a data protection impact analysis performed?: N/A

Collected Sensitive Data Fields
Field Name	Definition	Distribution
gender	Text designations that identify gender. Gender is described as the assemblage of properties that distinguish people on the basis of their societal roles. [Explanatory Comment 1: Identification of gender is based upon self-report and may come from a form, questionnaire, interview, etc.]
race	An arbitrary classification of a taxonomic group that is a division of a species. It usually arises as a consequence of geographical isolation within a species and is characterized by shared heredity, physical attributes and behavior, and in the case of humans, by common history, nationality, or geographic distribution. The provided values are based on the categories defined by the U.S. Office of Management and Budget and used by the U.S. Census Bureau.
ethnicity	An individual's self-described social and cultural grouping, specifically whether an individual describes themselves as Hispanic or Latino. The provided values are based on the categories defined by the U.S. Office of Management and Budget and used by the U.S. Census Bureau.
age_at_diagnosis	Age at the time of diagnosis expressed in number of days since birth.
submitter_id	All submitted IDs can potentially contain identifiable information
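
Since `age_at_diagnosis` is expressed in days since birth, analyses that need years must convert it. The sketch below assumes an average year length of 365.25 days, which is a common convention and not something this manifest specifies; the function name is hypothetical.

```python
DAYS_PER_YEAR = 365.25  # assumed average year length, not defined by the manifest


def age_in_years(age_at_diagnosis_days):
    """Convert an age in days to fractional years; pass through missing values."""
    if age_at_diagnosis_days is None:
        return None
    return age_at_diagnosis_days / DAYS_PER_YEAR


# Hypothetical value for illustration; not taken from the dataset.
print(int(age_in_years(20089)))  # whole years completed
```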

Intentionality of sensitive human attribute collection: N/A

IP / Terms of Use:
- Raw data available from the Genomic Data Commons with access granted through dbGaP and adherence to their policies
Third party intellectual property considerations:
- N/A
Export controls or other regulations impacting dataset access/download or storage:
- N/A

Provenance and Lineage

Description of the data collection methodology: N/A

Specific data collection devices:

Dataset Curation: N/A

Dataset Validation:

Data collection timeframe: N/A

Preprocessing steps/workflow:

Pre-processing: Generation and processing of raw data is described in detail in existing publications
Post-processing and derived features: Generation of processed data by the Genomic Data Commons is described here
