Multi-omics infrastructure and data for R/Bioconductor
Levi Waldron
Sept 29, 2017
Why Bioconductor?
1,400 packages on a backbone of data structures
The Genomic Ranges algebra
Huber, W. et al. Orchestrating high-throughput genomic analysis with
Bioconductor. Nat. Methods 12, 115–121 (2015).
The integrative data container SummarizedExperiment
Bioconductor core data classes
• Rectangular feature x sample data
– SummarizedExperiment::SummarizedExperiment()
– (RNAseq count matrix, microarray, …)
• Genomic coordinates
– GenomicRanges::GRanges() (1-based, closed interval)
• DNA / RNA / AA sequences
– Biostrings::*StringSet()
• Gene sets
– GSEABase::GeneSet() GSEABase::GeneSetCollection()
• Single cell data
– SingleCellExperiment::SingleCellExperiment()
• Mass spec data
– MSnbase::MSnExp()
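A minimal sketch of the first two classes in the list above. The toy counts, sample names, and coordinates are invented for illustration and are not from the talk. A SummarizedExperiment couples a feature x sample matrix with sample annotations, and a GRanges stores 1-based, closed intervals, so a range from 101 to 200 covers 100 positions:

```r
library(SummarizedExperiment)  # also attaches GenomicRanges, IRanges, S4Vectors

# Toy count matrix: 3 features (rows) x 2 samples (columns); values are made up
counts <- matrix(c(10L, 0L, 5L, 7L, 3L, 1L), nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))

# Sample annotations; row names match the matrix column names
sample_info <- DataFrame(condition = c("control", "treated"),
                         row.names = c("s1", "s2"))

se <- SummarizedExperiment(assays = list(counts = counts),
                           colData = sample_info)

# Genomic coordinates are 1-based and closed: both endpoints are included
gr <- GRanges(seqnames = "chr1", ranges = IRanges(start = 101, end = 200))

dim(se)     # features x samples
width(gr)   # number of positions covered by the interval
```

Subsetting `se` by column keeps the assay matrix and the sample annotations in step, which is the main reason to prefer the container over a bare matrix.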
Credit: Marcel Ramos
Diseases, platforms, and data types of TCGA
33 diseases
50 platforms
19 data types
Multi-assay experiments can be complex
The need for MultiAssayExperiment
Need a core data structure to:
– harmonize single-assay data structures
– relate multiple assays & clinical data
– handle missing and replicate observations
– accommodate ID-based and range-based data
– support on-disk representations of big data
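The requirements above can be illustrated with a small, hypothetical example (patient IDs, assay names, and values are invented, not taken from the talk). Two assays are linked to one patient table through a sample map, and one patient is deliberately missing from the second assay:

```r
library(MultiAssayExperiment)
library(S4Vectors)

# Patient-level (clinical) data, one row per patient; values are made up
patients <- DataFrame(sex = c("M", "F", "F"),
                      row.names = c("P1", "P2", "P3"))

# Two toy assays with assay-specific column names; P2 has no copy-number data
rna <- matrix(1:6, nrow = 2,
              dimnames = list(c("geneA", "geneB"),
                              c("rna_P1", "rna_P2", "rna_P3")))
cn <- matrix(c(0, 1, -1, 0), nrow = 2,
             dimnames = list(c("geneA", "geneB"), c("cn_P1", "cn_P3")))

# The sample map relates each assay column ("colname") to a patient ("primary")
smap <- DataFrame(
    assay   = factor(c("rna", "rna", "rna", "cn", "cn")),
    primary = c("P1", "P2", "P3", "P1", "P3"),
    colname = c("rna_P1", "rna_P2", "rna_P3", "cn_P1", "cn_P3"))

mae <- MultiAssayExperiment(experiments = list(rna = rna, cn = cn),
                            colData = patients,
                            sampleMap = smap)

# Keep only patients observed in every assay
complete <- intersectColumns(mae)
rownames(colData(complete))
```

Because the map is explicit, assays may use their own column identifiers and need not cover the same patients.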
MultiAssayExperiment design
Credit: Marcel Ramos
Ramos et al. Software for the Integration of Multi-omics Experiments in Bioconductor. Cancer Research (In Press).
TCGA as MultiAssayExperiments
Access from www.github.com/waldronlab/MultiAssayExperiment
… 33 cancer types
TCGA as MultiAssayExperiments
> acc
A MultiAssayExperiment object of 9 listed
experiments with user-defined names and respective classes.
Containing an ExperimentList class object of length 9:
[1] RNASeq2GeneNorm: ExpressionSet with 20501 rows and 79 columns
[2] miRNASeqGene: ExpressionSet with 1046 rows and 80 columns
[3] CNASNP: RaggedExperiment with 79861 rows and 180 columns
[4] CNVSNP: RaggedExperiment with 21052 rows and 180 columns
[5] Methylation: SummarizedExperiment with 485577 rows and 80 columns
[6] RPPAArray: ExpressionSet with 192 rows and 46 columns
[7] Mutations: RaggedExperiment with 20166 rows and 90 columns
[8] gistica: SummarizedExperiment with 24776 rows and 90 columns
[9] gistict: SummarizedExperiment with 24776 rows and 90 columns
Features:
experiments() - obtain the ExperimentList instance
colData() - the primary/phenotype DataFrame
sampleMap() - the sample availability DataFrame
`$`, `[`, `[[` - extract colData columns, subset, or experiment
*Format() - convert into a long or wide DataFrame
assays() - convert ExperimentList to a SimpleList of matrices
>
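The accessors listed above can be tried on `miniACC`, the small adrenocortical carcinoma example data set shipped with the MultiAssayExperiment package. This is a sketch rather than the code behind the slide; the column `race` and the two assay names are assumed to be present in the installed version of `miniACC`:

```r
library(MultiAssayExperiment)
data(miniACC)

experiments(miniACC)            # the ExperimentList
colData(miniACC)[1:4, 1:4]      # patient-level DataFrame
sampleMap(miniACC)              # which assay column belongs to which patient

table(miniACC$race)                   # `$` reads a colData column
rna <- miniACC[["RNASeq2GeneNorm"]]   # `[[` extracts a single experiment

# `[` subsets by (features, samples, assays); here only the assay slot is used
two <- miniACC[, , c("RNASeq2GeneNorm", "gistict")]

mats <- assays(two)             # SimpleList of plain matrices
vapply(mats, ncol, integer(1))  # number of columns per assay
```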
The MultiAssayExperiment API
Credit: Marcel Ramos
For building visualizations
UpSet plot (generalized Venn diagram) of assay overlap for TCGA adrenocortical carcinoma
> library(MultiAssayExperiment)
> data(miniACC)
> upsetSamples(miniACC)
For multi-omics analysis
> mae <- mae[, , c("Mutations", "gistict")]
> mae <- intersectColumns(mae)
> mae$cnload <- colMeans(abs(assay(mae[["gistict"]])))
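The three lines above assume an existing object `mae` and that the assay columns are in the same order as the rows of `colData`. The following self-contained variant is a sketch only: it uses the packaged `miniACC` data as a stand-in for the full TCGA object, and matches each patient to its assay column through the sample map instead of relying on order:

```r
library(MultiAssayExperiment)
data(miniACC)

# Keep two assays, then only patients with data in both
mae <- miniACC[, , c("Mutations", "gistict")]
mae <- intersectColumns(mae)

# Copy-number load: mean absolute GISTIC score per assay column
cn <- assay(mae[["gistict"]])
cnload <- colMeans(abs(cn), na.rm = TRUE)

# Map assay columns back to patients via the sample map
smap <- sampleMap(mae)
smap <- smap[smap$assay == "gistict", ]
idx <- match(rownames(colData(mae)), smap$primary)

# Store the result as a new patient-level column
mae$cnload <- unname(cnload[smap$colname[idx]])
summary(mae$cnload)
```

If a patient had several replicate columns, `match()` would pick the first; a real analysis would need an explicit rule for replicates.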
Davoli et al. Tumor aneuploidy correlates with markers of immune evasion and with reduced response to immunotherapy. Science 355 (2017).
For integrating remotely stored data
> st <- ldblock::stack1kg()  # create a URL reference to 1000 Genomes content in AWS S3
> multiban <- MultiAssayExperiment(
    list(meth = banovichSE, snp = st),
    colData = colData(banovichSE))
> multibanfocus <- multiban[rowRanges(banovichSE)["cg04793911"], , ]
> assoc <- cisAssoc(multibanfocus[["meth"]],
    TabixFile(files(multibanfocus[["snp"]])))
Using tabix-indexed SNP VCFs from the 1000 Genomes Project on Amazon S3
Credit: Vince Carey
A big software engineering effort
Past curated*Data Bioconductor packages
• curatedOvarianData
– 30 datasets, > 3K unique samples
– survival, surgical debulking, histology...
• curatedCRCData
– 34 datasets, ~4K unique samples
– many annotated for MSS, gender, stage, age, N, M
• curatedBladderData
– 12 datasets, ~1,200 unique samples
– many annotated for stage, grade, OS
curatedMetagenomicData: motivation
• Increasing amount of public data
• Can be fast and free, but hard to use:
– fastq files from NCBI, EBI, ...
– bioinformatic expertise
– computational resources
– manual curation / standardization
• Wanted to make acquisition of curated, ready-to-use public data easy and reproducible
curatedMetagenomicData: pipeline
• Raw fastq files: download (~57 TB)
– 13 datasets
– 2,875 samples
• Uniform processing (offline, high computational load pipeline, > 120k CPU hours)
– MetaPhlAn2: species abundance, marker presence, marker abundance
– HUMAnN2: gene family abundance, metabolic pathway abundance, metabolic pathway presence
• Manual curation
– study metadata (age, body site, disease, etc.) converted to standardized metadata
• Integration: integrated Bioconductor ExpressionSet objects
– per-patient microbiome data
– per-patient metadata
– experiment-wide metadata
– automatic documentation
• ExperimentHub product
– Amazon S3 cloud distribution
– tag-based searching
– dataset snapshot dates
– automatic local caching
• User experience
– convenience download functions
– megabytes-sized datasets
– differential abundance, diversity metrics, clustering, machine learning
https://waldronlab.github.io/curatedMetagenomicData/
curatedMetagenomicData: use
One dataset from R:
> curatedMetagenomicData("HMP_2012.metaphlan_bugs_list.stool", relab = FALSE)
Many datasets from R:
> curatedMetagenomicData("HMP_2012.metaphlan_bugs_list.*")
Command-line:
$ curatedMetagenomicData -p "HMP_2012.metaphlan_bugs_list.*"
Pasolli/Schiffer/Manghi et al. Accessible, Curated Metagenomic Data Through ExperimentHub. Nature Methods (In Press).
Supervised disease classification
Credit: Edoardo Pasolli
Unsupervised clustering
Credit: Audrey Renson
Unsupervised clustering
Credit: Audrey Renson
Meta-analysis
(Partial) validation of reported associations between genera and BMI
Credit: Lucas Schiffer
Beaumont M et al. Heritable components of the human fecal microbiome are associated with visceral fat. Genome Biol. 2016;17:189.
Meta-analysis
“Protective” bacteria for CRC
• Lower in stool samples of CRC cases compared to healthy controls
curatedMetagenomicData summary
• 25 datasets (5,716 samples) available
• Six data products per dataset
– Three taxonomy-based from MetaPhlAn2
– Three functional from HUMAnN2
• Reproduce all analyses in manuscript at:
– https://waldronlab.github.io/curatedMetagenomicData/analyses/
• Lowest barrier to entry, highest level of
curation of any microbiome data resource
Pasolli/Schiffer/Manghi et al., bioRxiv 103085
Future work
• Integrated databases as HDF5, indexed remote files
– fast remote slicing of ranges, genes, gene families...
• Distribute TCGA, cBioPortal through ExperimentHub
– omics and clinical data as MultiAssayExperiments
• Curated microbial signatures / BugSigDB
Thank you
• Lab (www.waldronlab.org / www.waldronlab.github.io)
– Lucas Schiffer (curatedMetagenomicData), Marcel Ramos (MultiAssayExperiment)
– Audrey Renson, Andy Samedy, Rimsha Azar, Carmen Rodriguez, Tiffany Chan, Abzal Bacchus, Jaya Amatya, Ludwig Geistlinger
• Collaborators
– Nicola Segata lab
• Francesco Beghini, Edoardo Pasolli, Paolo Manghi
– Heidi Jones, Jennifer Dowd, Sharon Perlman, Lorna Thorpe,
Robert Burk Lab (NYC-HANES)
– Valerie Obenchain, Martin Morgan (Bioconductor core team)
• CUNY High-performance Computing Center


Editor's Notes

  • #2: Hi. I’d like to introduce you to MultiAssayExperiment, a framework for the representation and analysis of multi-omics experiments in Bioconductor.
  • #3: For anyone unfamiliar with Bioconductor, it is a suite of over a thousand packages for statistical analysis and visualization of high-throughput biological data, accessible via the R programming language and unified by a backbone of core data structures designed for the requirements of specific genomic data types. Core developers provide this key set of data structures, which are efficient and well tested, and contributed packages are expected to use them where applicable. For example, the Genomic Ranges system provides a representation and algebra for any data associated with genomic coordinates. Efficient in-memory and on-disk representations are available. Integrative data containers such as SummarizedExperiment integrate high-throughput data with, for example, gene annotations, sample data such as clinical information, and experimental metadata, and can even represent multiple assays. In this case, however, the assays must be matrix-like and of identical dimensions. Until now, Bioconductor lacked a core data structure to provide a framework for analysis and development of tools for multi-omics experiments.
  • #5: This work was motivated by the need to simplify general statistical analysis and development of bioinformatic tools for a study as complex as the Cancer Genome Atlas, where 33 cancers were assayed on many platforms to generate different types of data, but also to provide a simplified framework for more easily reproducible and less error-prone analysis of simpler experiments involving just a couple of complementary assays and clinical data.
  • #6: A core data structure was needed to * harmonize existing structures for different types of data, * relate assays with each other and clinical data * handle the reality that such experiments are often incomplete and missing observations on some assays, and also may contain replicates, time series, or matched normal, * accommodate data that are indexed by IDs such as genes and data indexed by genomic ranges, * and support on-disk representations for big data
  • #7: MultiAssayExperiment addresses these challenges by relating a table of information about subjects, say clinical and pathological data, to a series of genomic data sets of arbitrary shape and even non-tabular data, via a map or a network relating these. This sounds complex and it can be, but from the analyst’s perspective, there is an API that will be familiar to users of R, and that abstracts this complexity from the user. Constructing, accessing, subsetting, data management or manipulation, and combining and reshaping into forms usable by standard tools become straightforward.
  • #10: To help those wanting to analyze TCGA data, we’ve constructed MultiAssayExperiments for 33 cancer types. Each cancer type is represented by a single object containing all the most commonly used, unrestricted data. These objects are immediately usable, even on most laptops, with the API shown on the previous page.
  • #11: To give you an idea of what this looks like, here is a sort of complex Venn diagram of just four of the assays for GBM. Although GISTIC copy number and microRNA are each assayed on about 600 samples, only a fraction of these cases have data available for both, and an even smaller fraction have data for all four of GISTIC, microRNA, methylation, and RNA-seq.
  • #12: MultiAssayExperiment supports analyses of a single assay or of combined assays, such as this reproduction of the result from Davoli et al. that cancer types with high levels of aneuploidy often show a positive correlation of mutation load and chromosomal instability, perhaps due to a higher tolerance of deleterious mutations, as shown here in orange for breast cancer. In contrast, tumors with a hypermutator phenotype rarely display extensive chromosomal instability, resulting in a negative correlation of mutation load and chromosomal instability in cancer types where hypermutation is common (shown in grey for colon adenocarcinoma).
  • #13: Larger files, such as SNPs in VCF format, demonstrated here from the 1000 genomes project because this is unrestricted data, can be analyzed for example in this SNP/methylation association study, in chunks from an on-disk representation. This data format, by the way, was supported by default without any modification of MultiAssayExperiment, as is any data class meeting a few minimum requirements.
  • #17: https://upload.wikimedia.org/wikipedia/commons/thumb/7/7e/Funnel_Mech.svg/667px-Funnel_Mech.svg.png https://pixabay.com/en/cheering-happy-jumping-people-297419/
  • #26: met Jin Xu from East China Normal University, Shanghai