Multi-omics infrastructure and data for R/Bioconductor

Levi Waldron
Sept 29, 2017

Why Bioconductor?
• 1,400 packages on a backbone of data structures
• The Genomic Ranges algebra
• The integrative data container SummarizedExperiment
Huber, W. et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods 12, 115–121 (2015).

Bioconductor core data classes
• Rectangular feature x sample data
– SummarizedExperiment::SummarizedExperiment()
– (RNAseq count matrix, microarray, …)
• Genomic coordinates
– GenomicRanges::GRanges() (1-based, closed interval)
• DNA / RNA / AA sequences
– Biostrings::*StringSet()
• Gene sets
– GSEABase::GeneSet(), GSEABase::GeneSetCollection()
• Single cell data
– SingleCellExperiment::SingleCellExperiment()
• Mass spec data
– MSnbase::MSnExp()
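
As a rough illustration of the first two containers in the list above, the sketch below builds a toy GRanges and wraps a made-up count matrix in a SummarizedExperiment. The gene names, sample names, and values are invented for illustration, and the code assumes the GenomicRanges and SummarizedExperiment packages are installed.

```r
library(GenomicRanges)
library(SummarizedExperiment)

# Two toy ranges; coordinates are 1-based and closed, so 1..4 spans 4 bases
gr <- GRanges(seqnames = c("chr1", "chr1"),
              ranges = IRanges(start = c(1, 5), end = c(4, 10)),
              strand = c("+", "-"))
names(gr) <- c("geneA", "geneB")
width(gr)

# Made-up count matrix: features in rows, samples in columns
counts <- matrix(c(10L, 0L, 3L, 7L, 5L, 2L), nrow = 2,
                 dimnames = list(names(gr), c("s1", "s2", "s3")))
sample_info <- DataFrame(condition = c("ctrl", "case", "case"),
                         row.names = colnames(counts))

# One object holds the assay, the feature ranges, and the sample annotation
se <- SummarizedExperiment(assays = list(counts = counts),
                           rowRanges = gr, colData = sample_info)

# Subsetting keeps all three parts in sync
se_case <- se[, se$condition == "case"]
dim(se_case)
```

Because GRanges intervals are 1-based and closed, a BED record with chromStart 0 and chromEnd 4 corresponds to start = 1, end = 4 here.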

Multi-assay experiments can be complex
Diseases, platforms, and data types of TCGA:
• 33 diseases
• 50 platforms
• 19 data types
Credit: Marcel Ramos

The need for MultiAssayExperiment
Need a core data structure to:
– harmonize single-assay data structures
– relate multiple assays & clinical data
– handle missing and replicate observations
– accommodate ID-based and range-based data
– support on-disk representations of big data
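
One way to picture the "relate multiple assays & clinical data" and "replicate observations" requirements is the sample map: a table linking each assay column to a patient. The sketch below is a minimal, hypothetical example (patient IDs, column names, and values are invented) and assumes the MultiAssayExperiment package is installed.

```r
library(MultiAssayExperiment)

# Hypothetical data: patient P1 was profiled twice, P2 once
expr <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2,
               dimnames = list(c("geneA", "geneB"),
                               c("P1_a", "P1_b", "P2_a")))
pheno <- DataFrame(age = c(50, 61), row.names = c("P1", "P2"))

# The sample map links each assay column (colname) to a patient (primary)
smap <- DataFrame(assay = factor(rep("expr", 3)),
                  primary = c("P1", "P1", "P2"),
                  colname = colnames(expr))

mae <- MultiAssayExperiment(experiments = list(expr = expr),
                            colData = pheno, sampleMap = smap)

sampleMap(mae)       # one row per assay column
anyReplicated(mae)   # flags assays with more than one column per patient
```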

MultiAssayExperiment design
Credit: Marcel Ramos
Ramos et al. Software for the Integration of Multi-omics Experiments in Bioconductor. Cancer Research (In Press).

TCGA as MultiAssayExperiments
Access from www.github.com/waldronlab/MultiAssayExperiment
… 33 cancer types

TCGA as MultiAssayExperiments
> acc
A MultiAssayExperiment object of 9 listed
experiments with user-defined names and respective classes.
Containing an ExperimentList class object of length 9:
[1] RNASeq2GeneNorm: ExpressionSet with 20501 rows and 79 columns
[2] miRNASeqGene: ExpressionSet with 1046 rows and 80 columns
[3] CNASNP: RaggedExperiment with 79861 rows and 180 columns
[4] CNVSNP: RaggedExperiment with 21052 rows and 180 columns
[5] Methylation: SummarizedExperiment with 485577 rows and 80 columns
[6] RPPAArray: ExpressionSet with 192 rows and 46 columns
[7] Mutations: RaggedExperiment with 20166 rows and 90 columns
[8] gistica: SummarizedExperiment with 24776 rows and 90 columns
[9] gistict: SummarizedExperiment with 24776 rows and 90 columns
Features:
experiments() - obtain the ExperimentList instance
colData() - the primary/phenotype DataFrame
sampleMap() - the sample availability DataFrame
`$`, `[`, `[[` - extract colData columns, subset, or experiment
*Format() - convert into a long or wide DataFrame
assays() - convert ExperimentList to a SimpleList of matrices
>

The MultiAssayExperiment API
Credit: Marcel Ramos

For building visualizations
UpSet plot (a Venn diagram alternative) of sample overlap across assays for TCGA adrenocortical carcinoma
> library(MultiAssayExperiment)
> data(miniACC)
> upsetSamples(miniACC)

For multi-omics analysis
> mae <- mae[, , c("Mutations", "gistict")]  # keep only these two assays
> mae <- intersectColumns(mae)  # keep patients with data in both assays
> mae$cnload <- colMeans(abs(assay(mae[["gistict"]])))  # per-patient mean absolute GISTIC score
Davoli et al. Tumor aneuploidy correlates with markers of immune evasion and with reduced response to immunotherapy. Science 355, (2017).
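
The three-step pattern above can be tried on toy data without downloading TCGA. The sketch below is a minimal illustration, not the analysis from the talk: the patients and values are made up, the assay names simply mirror those on the slide, and plain matrices stand in for the real assay classes. It assumes the MultiAssayExperiment package is installed.

```r
library(MultiAssayExperiment)

patients <- paste0("P", 1:4)

# Toy assays: every patient has mutation calls, only P1-P3 have copy number
mut <- matrix(c(1, 0, 0, 1, 1, 1, 0, 0), nrow = 2,
              dimnames = list(c("TP53", "CTNNB1"), patients))
cn <- matrix(c(-1, 0, 1, 2, 0, 0), nrow = 2,
             dimnames = list(c("segA", "segB"), patients[1:3]))
pheno <- DataFrame(age = c(50, 61, 47, 70), row.names = patients)

# With column names matching colData row names, the constructor
# builds the sample map itself
toy <- MultiAssayExperiment(experiments = list(Mutations = mut, gistict = cn),
                            colData = pheno)

toy <- toy[, , c("Mutations", "gistict")]  # select assays by name
toy <- intersectColumns(toy)               # drops P4, who lacks copy number

# Mean absolute copy-number score per patient, stored as a colData column;
# this relies on assay columns and colData rows being in the same order
toy$cnload <- colMeans(abs(toy[["gistict"]]))

colData(toy)
```

Storing the derived score in colData keeps it attached to the patients through any later subsetting of the object.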

For integrating remotely stored data
Using tabix-indexed SNP VCFs from 1000 genomes on Amazon S3
> st <- ldblock::stack1kg()  # create a URL referencing 1000 genomes content in AWS S3
> multiban <- MultiAssayExperiment(
+   list(meth = banovichSE, snp = st),
+   colData = colData(banovichSE))
> multibanfocus <- multiban[rowRanges(banovichSE)["cg04793911"], , ]
> assoc <- cisAssoc(multibanfocus[["meth"]],
+   TabixFile(files(multibanfocus[["snp"]])))
Credit: Vince Carey

A big software engineering effort

Past curated*Data Bioconductor packages
• curatedOvarianData
– 30 datasets, > 3K unique samples
– survival, surgical debulking, histology...
• curatedCRCData
– 34 datasets, ~4K unique samples
– many annotated for MSS, gender, stage, age, N, M
• curatedBladderData
– 12 datasets, ~1,200 unique samples
– many annotated for stage, grade, OS

curatedMetagenomicData: motivation
• Increasing amount of public data
• Can be fast and free, but hard to use:
– fastq files from NCBI, EBI, ...
– bioinformatic expertise
– computational resources
– manual curation / standardization
• Wanted to make acquisition of curated, ready-to-use public data easy and reproducible

curatedMetagenomicData: pipeline
• Inputs
– Raw fastq files: 13 datasets, 2,875 samples; download (~57 TB)
– Study metadata: age, body site, disease, etc.
• Offline high computational load pipeline (> 120 kH CPU)
– Uniform processing with MetaPhlAn2 and HUMAnN2
– MetaPhlAn2: species abundance, marker presence, marker abundance
– HUMAnN2: gene family abundance, metabolic pathway abundance, metabolic pathway presence
– Manual curation into standardized metadata
• Integration
– Integrated Bioconductor ExpressionSet objects
– Per-patient microbiome data, per-patient metadata, experiment-wide metadata
– Automatic documentation
• ExperimentHub product
– Amazon S3 cloud distribution
– Tag-based searching
– Dataset snapshot dates
– Automatic local caching
• User experience
– Convenience download functions
– Megabytes-sized datasets
– Differential abundance, diversity metrics, clustering, machine learning
https://waldronlab.github.io/curatedMetagenomicData/

curatedMetagenomicData: use
One dataset from R:
> curatedMetagenomicData("HMP_2012.metaphlan_bugs_list.stool", relab=FALSE)
Many datasets from R:
> curatedMetagenomicData("HMP_2012.metaphlan_bugs_list.*")
Command-line:
$ curatedMetagenomicData -p "HMP_2012.metaphlan_bugs_list.*"
Pasolli/Schiffer/Manghi et al. Accessible, Curated Metagenomic Data Through ExperimentHub. Nature Methods (In Press).

Supervised disease classification
Credit: Edoardo Pasolli

Unsupervised clustering
Credit: Audrey Renson

Unsupervised clustering (continued)
Credit: Audrey Renson

Meta-analysis
(Partial) validation of reported associations between genera and BMI
Beaumont M et al. Heritable components of the human fecal microbiome are associated with visceral fat. Genome Biol. 2016;17:189.
Credit: Lucas Schiffer

Meta-analysis
“Protective” bacteria for CRC
• Lower in stool samples of CRC cases compared to healthy controls

curatedMetagenomicData summary
• 25 datasets (5,716 samples) available
• Six data products per dataset
• Three taxonomy-based from MetaPhlAn2
• Three functional from HUMAnN2
• Reproduce all analyses in manuscript at:
– https://waldronlab.github.io/curatedMetagenomicData/analyses/
• Lowest barrier to entry, highest level of
curation of any microbiome data resource
Pasolli/Schiffer/Manghi et al., bioRxiv 103085

Future work
• Integrated databases as HDF5, indexed remote files
– fast remote slicing of ranges, genes, gene families...
• Distribute TCGA, cBioPortal through ExperimentHub
– omics and clinical data as MultiAssayExperiments
• Curated microbial signatures / BugSigDB

Thank you
• Lab (www.waldronlab.org / www.waldronlab.github.io)
– Lucas Schiffer (curatedMetagenomicData), Marcel Ramos
(MultiAssayExperiment)
– Audrey Renson, Andy Samedy, Rimsha Azar, Carmen Rodriguez,
Tiffany Chan, Abzal Bacchus, Jaya Amatya, Ludwig Geistlinger
• Collaborators
– Nicola Segata lab
• Francesco Beghini, Edoardo Pasolli, Paolo Manghi
– Heidi Jones, Jennifer Dowd, Sharon Perlman, Lorna Thorpe,
Robert Burk Lab (NYC-HANES)
– Valerie Obenchain, Martin Morgan (Bioconductor core team)
• CUNY High-performance Computing Center
