This README.txt file was generated on 2022-08-03 by Anders Dohlman at Duke University.

GENERAL INFORMATION

1. Title of Dataset
"The Cancer Microbiome Atlas (TCMA): A Pan-Cancer Comparative Analysis to Distinguish Organ-Associated Microbiota from Equiprevalent Contaminants"

2. Author Information

A. Principal Investigator Contact Information
Name: Xiling Shen
Institution: Duke University
Address: Durham, NC
Email: xiling.shen@duke.edu, xiling.shen@xilis.net

B. First Author Contact Information
Name: Anders Dohlman
Institution: Duke University
Address: Durham, NC
Email: anders.dohlman@duke.edu, abdohlman@gmail.com

3. Dates of data collection: 2017-09-01 to 2020-09-01.

4. Keywords
Human microbiome, The Cancer Genome Atlas, Colorectal cancer, Contamination, Host-microbe interactions, Multi-omics, Pan-cancer

5. Information about funding sources that supported the collection of the data:
This study was supported by NIH R35GM122465, DK119795 and DARPA W911NF1920111.

DATA & FILE OVERVIEW

Updates as of 6/26/2022:
The decontamination method has been updated. Because of its low sample size, the ESCA dataset is now decontaminated using the tissue-resident species identified by analyzing the STAD dataset (which has a similar microbial composition). We feel this improves the ESCA dataset, which was previously quite sparse.

Additionally, the way decontamination mixtures are calculated for higher-order taxa has changed. Previously, all mixtures were estimated from species-level classifications. Mixtures are now calculated from the ratio of endogenous-to-contaminant reads at the taxonomic rank immediately below (genus from species, family from genus, order from family, etc.). This should improve estimates for higher-order taxa.
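
To make the rank-by-rank calculation concrete, here is a minimal Python sketch of one possible reading of the update above. It assumes that each taxon carries separate counts of endogenous and contaminant reads, and that a parent taxon's mixture is the pooled endogenous fraction of its children one rank below. The function name, the dictionary layout, and the example counts are all hypothetical; this is an illustration, not the TCMA pipeline code.

```python
from collections import defaultdict


def parent_mixtures(child_reads, parent_of):
    """Pool child-level reads into an endogenous fraction per parent taxon.

    child_reads: {child_taxon: (endogenous_reads, contaminant_reads)}
    parent_of:   {child_taxon: parent_taxon}, exactly one rank up
    Returns {parent_taxon: endogenous / (endogenous + contaminant)}.
    """
    totals = defaultdict(lambda: [0.0, 0.0])
    for child, (endogenous, contaminant) in child_reads.items():
        parent = parent_of[child]
        totals[parent][0] += endogenous
        totals[parent][1] += contaminant

    mixtures = {}
    for parent, (endogenous, contaminant) in totals.items():
        total = endogenous + contaminant
        # A parent with no reads gets no estimate instead of a division by zero.
        mixtures[parent] = endogenous / total if total > 0 else None
    return mixtures


# Made-up species-level counts: (endogenous reads, contaminant reads).
species_reads = {
    "species_a": (90, 10),
    "species_b": (30, 70),
    "species_c": (0, 50),
}
species_to_genus = {
    "species_a": "genus_x",
    "species_b": "genus_x",
    "species_c": "genus_y",
}

# Genus-level mixtures are derived from the species level only.
genus_mix = parent_mixtures(species_reads, species_to_genus)
print(genus_mix)
```

Under this reading, family-level mixtures would come from applying the same function to genus-level counts, and so on up the taxonomy, rather than returning to the species level each time.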
Fields reference:
assay: experimental strategy used to generate compositions (WGS or WXS)
tissue-type: type of sample analyzed ("tissue" or "blood")
data-level: level to which data are aggregated (sample, case, file/sequencing-run)
reads: raw read counts
rpm: reads-per-million
relabund: relative abundance

1. File List:

Taxonomy dictionary (tsv): Index of NCBI taxonomy IDs and corresponding species names, taxonomy, rank, etc.
metadata/taxonomy.txt

Metadata files (tsv): Metadata information at the file, sample, and case levels.
metadata/metadata.TCMA.file.txt
metadata/metadata.TCMA.sample.txt
metadata/metadata.TCMA.case.txt

Data files (tsv): Bacterial abundance data for each taxon in the dataset.
Format: "bacteria.....txt"
WGS/file/bacteria.unambiguous.decontam.tissue.file.reads.txt
WGS/file/bacteria.unambiguous.decontam.tissue.file.rpm.txt
WGS/file/bacteria.unambiguous.decontam.tissue.file.rpm.relabund.txt
WGS/file/bacteria.unambiguous.decontam.blood.file.reads.txt
WGS/file/bacteria.unambiguous.decontam.blood.file.rpm.txt
WGS/file/bacteria.unambiguous.decontam.blood.file.rpm.relabund.txt
WGS/sample/bacteria.unambiguous.decontam.tissue.sample.reads.txt
WGS/sample/bacteria.unambiguous.decontam.tissue.sample.rpm.txt
WGS/sample/bacteria.unambiguous.decontam.tissue.sample.rpm.relabund.txt
WGS/sample/bacteria.unambiguous.decontam.blood.sample.reads.txt
WGS/sample/bacteria.unambiguous.decontam.blood.sample.rpm.txt
WGS/sample/bacteria.unambiguous.decontam.blood.sample.rpm.relabund.txt
WGS/case/bacteria.unambiguous.decontam.tissue.case.reads.txt
WGS/case/bacteria.unambiguous.decontam.tissue.case.rpm.txt
WGS/case/bacteria.unambiguous.decontam.tissue.case.rpm.relabund.txt
WGS/case/bacteria.unambiguous.decontam.blood.case.reads.txt
WGS/case/bacteria.unambiguous.decontam.blood.case.rpm.txt
WGS/case/bacteria.unambiguous.decontam.blood.case.rpm.relabund.txt
WXS/file/bacteria.unambiguous.decontam.tissue.file.reads.txt
WXS/file/bacteria.unambiguous.decontam.tissue.file.rpm.txt
WXS/file/bacteria.unambiguous.decontam.tissue.file.rpm.relabund.txt
WXS/file/bacteria.unambiguous.decontam.blood.file.reads.txt
WXS/file/bacteria.unambiguous.decontam.blood.file.rpm.txt
WXS/file/bacteria.unambiguous.decontam.blood.file.rpm.relabund.txt
WXS/sample/bacteria.unambiguous.decontam.tissue.sample.reads.txt
WXS/sample/bacteria.unambiguous.decontam.tissue.sample.rpm.txt
WXS/sample/bacteria.unambiguous.decontam.tissue.sample.rpm.relabund.txt
WXS/sample/bacteria.unambiguous.decontam.blood.sample.reads.txt
WXS/sample/bacteria.unambiguous.decontam.blood.sample.rpm.txt
WXS/sample/bacteria.unambiguous.decontam.blood.sample.rpm.relabund.txt
WXS/case/bacteria.unambiguous.decontam.tissue.case.reads.txt
WXS/case/bacteria.unambiguous.decontam.tissue.case.rpm.txt
WXS/case/bacteria.unambiguous.decontam.tissue.case.rpm.relabund.txt
WXS/case/bacteria.unambiguous.decontam.blood.case.reads.txt
WXS/case/bacteria.unambiguous.decontam.blood.case.rpm.txt
WXS/case/bacteria.unambiguous.decontam.blood.case.rpm.relabund.txt

PhyloSeq objects (Rds): "...Rds"
WGS/file/physeq.bacteria.unambiguous.decontam.tissue.file.reads.species.Rds
WGS/file/physeq.bacteria.unambiguous.decontam.tissue.file.rpm.species.Rds
WGS/file/physeq.bacteria.unambiguous.decontam.tissue.file.rpm.relabund.species.Rds
WGS/file/physeq.bacteria.unambiguous.decontam.blood.file.reads.species.Rds
WGS/file/physeq.bacteria.unambiguous.decontam.blood.file.rpm.species.Rds
WGS/file/physeq.bacteria.unambiguous.decontam.blood.file.rpm.relabund.species.Rds
WGS/sample/physeq.bacteria.unambiguous.decontam.tissue.sample.reads.species.Rds
WGS/sample/physeq.bacteria.unambiguous.decontam.tissue.sample.rpm.species.Rds
WGS/sample/physeq.bacteria.unambiguous.decontam.tissue.sample.rpm.relabund.species.Rds
WGS/sample/physeq.bacteria.unambiguous.decontam.blood.sample.reads.species.Rds
WGS/sample/physeq.bacteria.unambiguous.decontam.blood.sample.rpm.species.Rds
WGS/sample/physeq.bacteria.unambiguous.decontam.blood.sample.rpm.relabund.species.Rds
WGS/case/physeq.bacteria.unambiguous.decontam.tissue.case.reads.species.Rds
WGS/case/physeq.bacteria.unambiguous.decontam.tissue.case.rpm.species.Rds
WGS/case/physeq.bacteria.unambiguous.decontam.tissue.case.rpm.relabund.species.Rds
WGS/case/physeq.bacteria.unambiguous.decontam.blood.case.reads.species.Rds
WGS/case/physeq.bacteria.unambiguous.decontam.blood.case.rpm.species.Rds
WGS/case/physeq.bacteria.unambiguous.decontam.blood.case.rpm.relabund.species.Rds
WXS/file/physeq.bacteria.unambiguous.decontam.tissue.file.reads.species.Rds
WXS/file/physeq.bacteria.unambiguous.decontam.tissue.file.rpm.species.Rds
WXS/file/physeq.bacteria.unambiguous.decontam.tissue.file.rpm.relabund.species.Rds
WXS/file/physeq.bacteria.unambiguous.decontam.blood.file.reads.species.Rds
WXS/file/physeq.bacteria.unambiguous.decontam.blood.file.rpm.species.Rds
WXS/file/physeq.bacteria.unambiguous.decontam.blood.file.rpm.relabund.species.Rds
WXS/sample/physeq.bacteria.unambiguous.decontam.tissue.sample.reads.species.Rds
WXS/sample/physeq.bacteria.unambiguous.decontam.tissue.sample.rpm.species.Rds
WXS/sample/physeq.bacteria.unambiguous.decontam.tissue.sample.rpm.relabund.species.Rds
WXS/sample/physeq.bacteria.unambiguous.decontam.blood.sample.reads.species.Rds
WXS/sample/physeq.bacteria.unambiguous.decontam.blood.sample.rpm.species.Rds
WXS/sample/physeq.bacteria.unambiguous.decontam.blood.sample.rpm.relabund.species.Rds
WXS/case/physeq.bacteria.unambiguous.decontam.tissue.case.reads.species.Rds
WXS/case/physeq.bacteria.unambiguous.decontam.tissue.case.rpm.species.Rds
WXS/case/physeq.bacteria.unambiguous.decontam.tissue.case.rpm.relabund.species.Rds
WXS/case/physeq.bacteria.unambiguous.decontam.blood.case.reads.species.Rds
WXS/case/physeq.bacteria.unambiguous.decontam.blood.case.rpm.species.Rds
WXS/case/physeq.bacteria.unambiguous.decontam.blood.case.rpm.relabund.species.Rds

2. File format
All ".txt" files are tab-separated text (tsv). Each ".Rds" file (R single-object storage) contains a phyloseq object, which can be opened in R with the phyloseq package (https://joey711.github.io/phyloseq/index.html).

3. Data relationship
Data columns and metadata rows are matched by assay/tissue-type/data-level. Data rows contain NCBI taxonomy IDs defined in taxonomy.txt.

4. Date the files were created
2020-09-29

5. Date the files were updated
2022-07-26

6. Relation to external datasets
Data sample and case barcodes match the TCGA naming conventions. For file-level data, the TCGA experiment UUID is given. All metadata were obtained through the GDC API. Metadata variable descriptions can be found in the GDC documentation (https://docs.gdc.cancer.gov/API/).

SHARING AND ACCESS INFORMATION

1. Licenses
The data are available under Creative Commons Zero (CC0). Users are free to use the database and its content in new and different ways, provided they give attribution to the source of the data and/or the database.

2. Links to publications
DOI: 10.1016/j.chom.2020.12.001

3. Links to publicly accessible locations
Explore the database at http://tcma.pratt.duke.edu

4. Recommended citation for the data
Dohlman, A.B., Arguijo Mendoza, D., Ding, S., Gao, M., Dressman, H., Iliev, I.D., Lipkin, S.M., and Shen, X. (2020). The cancer microbiome atlas: a pan-cancer comparative analysis to distinguish tissue-resident microbiota from contaminants. Cell Host Microbe. 10.1016/j.chom.2020.12.001.

METHODOLOGICAL INFORMATION

1. Description of methods used for collection/generation of data:
Studying the microbial composition of internal organs and their associations with disease remains challenging due to the difficulty of acquiring clinical biopsies. We designed a statistical model to analyze the prevalence of species across sample types from The Cancer Genome Atlas (TCGA), revealing that species equiprevalent across sample types are predominantly contaminants, bearing unique signatures from each TCGA-designated sequencing center. Removing such species mitigated batch effects and isolated the tissue-resident microbiome, which was validated with original matched TCGA samples. “Mixed-evidence” species can be further distinguished by gene copy and nucleotide variants. We thus present The Cancer Microbiome Atlas (TCMA), a collection of curated, decontaminated microbial compositions of oropharyngeal, esophageal, gastrointestinal, and colorectal tissues. This led to discovery of prognostic species and blood signatures of mucosal barrier injuries, and enabled systematic matched microbe-host multi-omics analyses, which will help guide future studies of the microbiome’s role in human health and disease.

2. Information of data processing
After isolating the tissue-resident population, file-level data were aggregated to the sample and case levels by averaging and renormalizing relative abundances. Reads-per-million (RPM) values were calculated using the total number of primary reads in each sequencing run.
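
The processing steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used to build TCMA: the run, sample, and taxon names are invented placeholders, the RPM formula (reads x 1,000,000 / primary reads) is the conventional definition rather than one spelled out in this README, and the mapping from sequencing runs to samples stands in for the file-level metadata.

```python
from collections import defaultdict


def reads_per_million(reads, primary_reads):
    """Scale raw read counts by the run's total number of primary reads."""
    return {taxon: count * 1e6 / primary_reads for taxon, count in reads.items()}


def relative_abundance(values):
    """Normalize a profile so that its values sum to 1."""
    total = sum(values.values())
    if total == 0:
        return {taxon: 0.0 for taxon in values}
    return {taxon: value / total for taxon, value in values.items()}


def aggregate(relabund_by_file, group_of):
    """Average relative abundances over the files in each group, then renormalize."""
    grouped = defaultdict(list)
    for file_id, profile in relabund_by_file.items():
        grouped[group_of[file_id]].append(profile)

    result = {}
    for group, profiles in grouped.items():
        taxa = set().union(*profiles)
        mean = {
            taxon: sum(p.get(taxon, 0.0) for p in profiles) / len(profiles)
            for taxon in taxa
        }
        result[group] = relative_abundance(mean)
    return result


# Hypothetical file-level read counts (one profile per sequencing run).
file_reads = {
    "run_1": {"taxon_1": 30, "taxon_2": 10},
    "run_2": {"taxon_1": 5, "taxon_2": 15},
    "run_3": {"taxon_1": 8, "taxon_2": 2},
}
# Hypothetical metadata: total primary reads and parent sample for each run.
primary_reads = {"run_1": 2_000_000, "run_2": 1_000_000, "run_3": 4_000_000}
file_to_sample = {"run_1": "sample_A", "run_2": "sample_A", "run_3": "sample_B"}

rpm = {f: reads_per_million(r, primary_reads[f]) for f, r in file_reads.items()}
relabund = {f: relative_abundance(v) for f, v in rpm.items()}
sample_relabund = aggregate(relabund, file_to_sample)
print(sample_relabund)
```

Passing a run-to-case mapping instead of the run-to-sample mapping would give case-level profiles in the same way. The final renormalization changes nothing for this toy input, because every file-level profile already sums to 1; it matters only when the averaged profiles do not.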