Gene Expression Databases - 525 - 2016

This document provides an overview of gene expression databases. It discusses how gene expression is measured using microarrays and RNA-seq. It then describes what gene expression databases are, examples that exist, how they differ, and how they can be used. Key points include that databases store gene expression data and provide access to datasets and individual expression levels. Integrated databases contain additional sample data and analytical tools. Specialized databases focus on specific research areas and provide pre-computed analyses to make the data more accessible and useful.

Gene expression databases

Sean Eddy, PhD


Outline
•  How to measure gene expression?
–  Microarrays/RNA-seq
•  What are gene expression databases?
•  Which ones exist?
•  How do they differ?
•  How can they be used?
•  What needs to be taken care of?
•  What can I do with specialized databases?
Microarrays: the beginning of high-throughput gene expression
•  Compartmentalized chips with sequences bound to the surface
•  Sample is applied to surface
•  Hybridizes to complementary sequence
•  Hybridizations are quantified
–  No binding, “no” signal
–  Can only find what is specifically searched for
–  Usually represented as n x m matrix
Figure: http://www.nature.com/nrd/journal/v1/n12/images/nrd961-f1.gif
RNA-seq: the future of high-throughput gene expression
•  Samples (cDNA libraries) are applied to a sequencer
•  Millions of sequences are generated and mapped onto a genome
•  Sequence reads are quantified
–  No sequence = no expression
–  Can find novel transcripts, splice isoforms, fusion genes, etc.
•  Quality of mapping depends largely on library prep and decisions made on how RNA is initially processed.
Gene expression databases
•  Repositories for gene expression data
–  Mostly microarray and now RNAseq
–  Primarily for storage
–  Curated or un-curated
–  Access to data on different levels:
•  Datasets
•  Individual levels

•  Integrated databases
–  Contain array data and additional data of the samples
–  Array data tends to be more annotated
–  More analytical tools
–  Smaller (more QC and curation needed)
–  Often no direct data access
Why do they exist?
•  Transparency/reproducibility of publications
–  Journals require data to be available for analysis
–  Nowadays raw data is required
–  Databases offer single resource and standardized access
•  Data was generated for a specific purpose, but is not limited to that purpose
–  Can be reanalyzed in a different context
–  Can be combined with other datasets
–  Can be used as independent validation
Gene expression repository examples
•  Gene Expression Omnibus (www.ncbi.nlm.nih.gov/geo/)
–  1,117,462 samples, 3848 datasets
•  ArrayExpress (www.ebi.ac.uk/arrayexpress/)
•  Princeton University MicroArray database (PUMAdb)
–  40084 experiments, 6598 made public
•  NCBI SRA, ENA and Princeton HTseq for NGS data
What is in a gene expression database?
•  Gene expression data in different forms:
–  Resolution:
•  Gene level
•  Transcript level
•  Exon level
–  And / or raw data
–  Comprehensiveness
•  Targeted arrays
•  Whole genome arrays
–  Different platforms (microarrays, RNAseq)
•  Generally only gene expression, may have limited sample information
Where does the data come from?
•  Expression profiles of
–  Patients
–  Model systems
–  Cell cultures
•  Data used for publication
–  Most journals now require raw data submission
–  Very coarse quality control (peer review)
–  QC depends mostly on authors
•  Datasets submitted without publication
–  Little or no QC
•  Most datasets are tailored towards a specific question
Example: GEO GSE32591
•  Go to http://www.ncbi.nlm.nih.gov/geo/
•  Enter GSE32591 into search box
•  Click on “Analyze with GEO2R”
–  How would you set up the groups for analysis?
–  What do you get?
•  Does that make sense? How can results be verified?
•  Go to “value distribution” tab
–  What do you see?
–  What are possible explanations?
What can be done with GEO?
•  Programmatic access for data download
–  http://www.ncbi.nlm.nih.gov/geo/info/geo_paccess.html (GEO)
–  http://www.ebi.ac.uk/arrayexpress/help/programmatic_access.html (ArrayExpress)
•  Pre-computed analyses and on-the-fly analyses
–  Search by gene across all GEO experiments
–  Search by experiment to retrieve cluster analysis
–  Search by gene sequence for matching expression profiles
•  Described by Barrett and Edgar, Methods Mol. Biol. 2006, “Mining Microarray Data at NCBI’s Gene Expression Omnibus (GEO)”
–  http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1619899/
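As a rough illustration of programmatic access, the sketch below builds an NCBI E-utilities `esearch` query against the GEO DataSets database (`db=gds`) for the accession GSE32591. The helper names `build_geo_search_url` and `fetch_json` are made up for this example, and the `[ACCN]` field tag and the layout of the JSON response are assumptions that should be checked against the current E-utilities documentation.

```python
import json
import urllib.parse
import urllib.request

# Base URL of the E-utilities esearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_geo_search_url(accession, retmax=20):
    """Build an esearch URL for a GEO accession (illustrative helper)."""
    params = {
        "db": "gds",                    # GEO DataSets database
        "term": f"{accession}[ACCN]",   # assumed field tag for accessions
        "retmode": "json",
        "retmax": retmax,
    }
    return EUTILS_ESEARCH + "?" + urllib.parse.urlencode(params)


def fetch_json(url, timeout=30):
    """Download and parse a JSON response (needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


url = build_geo_search_url("GSE32591")
print(url)
# result = fetch_json(url)  # uncomment when network access is available
```

Building the URL separately from fetching it keeps the query easy to inspect before any request is sent.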
What questions can be answered?
•  If you download: anything
–  Only limited by your knowledge, skills, resources
•  Pre-computed results
–  Preselected analysis methods/sample groups
–  Generally within one dataset
•  On-the-fly analyses
–  Sets of genes that cluster under the given conditions
–  Sample properties may not be entirely transparent.
What can be answered by doing it yourself?
•  The quality of the data
–  Is part of the data low quality?
–  Does some of the data not fit into the set (e.g. batch effect, outliers for other reasons)?
–  Is it adequately processed?
•  What is the relationship between expression data and non-expression variables?
–  How is my gene (of interest) associated with experimental treatments, clinical parameters?
•  What are patterns across datasets?
–  Does my finding hold up across similar analyses in independent datasets?
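One way to start on the data-quality questions is to compare the value distribution of each sample against the rest of the set. The sketch below is a simple heuristic, not an established QC module: it flags samples whose median lies far from the median of all sample medians, measured in units of the median absolute deviation (MAD). The function name, the threshold of 3 and the example values are all assumptions for illustration.

```python
from statistics import median


def flag_outlier_samples(samples, threshold=3.0):
    """Return names of samples whose median deviates strongly from the set.

    `samples` maps a sample name to its list of expression values.
    """
    sample_medians = {name: median(values) for name, values in samples.items()}
    center = median(sample_medians.values())
    mad = median(abs(m - center) for m in sample_medians.values())
    if mad == 0:
        # All typical samples agree exactly: flag anything that differs.
        return [name for name, m in sample_medians.items() if m != center]
    return [
        name
        for name, m in sample_medians.items()
        if abs(m - center) / mad > threshold
    ]


# Hypothetical data: four log2-scale samples and one left on the raw scale.
samples = {
    "s1": [6.8, 7.0, 7.3],
    "s2": [7.0, 7.2, 7.5],
    "s3": [6.5, 6.9, 7.4],
    "s4": [6.9, 7.1, 7.2],
    "s5": [110.0, 150.0, 180.0],
}
print(flag_outlier_samples(samples))  # ['s5']
```

A flagged sample is only a prompt to look closer: it may reflect a batch effect, a missing log transformation, or genuine biology.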
Why do you have to do it yourself?
•  Quality control:
–  QC parameters are often glossed over in papers and in microarray submissions
–  For Affymetrix, QC modules are freely available and widely accepted in the bioinformatics community
–  Other array types have distinct, but also similar properties
–  http://www.nature.com/nbt/focus/maqc/index.html
•  Relations to non-expression data variables
–  Data is often not standardized within fields
Why not?
•  Analysis across datasets:
–  Because:…. How?
–  Need to find a common standard for identification
–  Values need to be made comparable
•  If absolute expression values are used, dynamic range can be a problem
•  If ratios are used, information about expression level is lost
–  Non-expression data even worse
Who is the target group for doing it yourself?
•  Users with experience in expression data
–  Crucial information (STUFF) is missing
Why is this a problem?
•  Excludes investigators with good hypotheses but lacking bioinformatic skills
How to fix that?
•  Specialized databases
–  Datasets are easier to find
•  Datasets relevant to specific areas are collected in one place
–  NephroSeq for renal disease
–  Oncomine for cancer
–  Datasets are standardized and expertly curated
•  Controlled vocabulary is introduced for non-expression data
•  Curation of expression possible by introducing standardized references and data transformations across datasets
–  Gene IDs/Gene Symbols as references
–  Z-transformation or median centering of log-transformed expression data
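The two transformations named in the last bullet can be sketched as follows. This is a minimal illustration applied to a single vector of values (for example one gene across the samples of a dataset). The slides do not say along which axis the databases standardize, so that choice, the pseudocount of 1 and the use of the population standard deviation are all assumptions.

```python
import math
from statistics import mean, median, pstdev


def log2_transform(values, pseudocount=1.0):
    """Log2-transform raw values; the pseudocount avoids log2(0)."""
    return [math.log2(v + pseudocount) for v in values]


def median_center(values):
    """Shift values so that their median becomes 0."""
    m = median(values)
    return [v - m for v in values]


def z_transform(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


raw = [1, 3, 7, 15, 31]            # hypothetical raw intensities
logged = log2_transform(raw)       # [1.0, 2.0, 3.0, 4.0, 5.0]
print(median_center(logged))       # [-2.0, -1.0, 0.0, 1.0, 2.0]
print(z_transform(logged))
```

Median centering keeps the original units (log2 differences), while the z-transformation also removes differences in spread, which helps when datasets come from platforms with different dynamic ranges.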
Nephroseq (www.nephroseq.org)
Oncomine (www.oncomine.com, www.oncomine.org)
NephroSeq and Oncomine
•  Pros:
–  Each focuses on one area of interest
–  Clinical data for many individual samples available
–  Advanced analysis using integrated systems biology tools in a pre-defined automated manner
–  Meta analysis possible
–  User friendly, freely accessible for academic users
–  Hypothesis-generating
•  Cons:
–  No raw data download
–  No programmatic access
–  Only predefined analyses
NephroSeq main page
•  26 datasets (2000 samples)
•  Analysis types: coexpression analysis, differential analysis, outlier analysis
Two Search Options
•  Gene specific search:
–  Gene
•  Dataset search:
–  Specific conditions/diseases
Gene Search
Gene summary view
•  NPHS2: encodes podocin, a podocyte-specific protein
•  Summary panels: Demographics, Dataset/Disease type
•  22 out of 33 analyses meet your threshold for NPHS2 in 2 out of 2 datasets
Four Basic Analysis Modes
•  Differential expression
•  Co-expression analysis
•  Outlier analysis
–  Heterogeneity within predefined groups
•  Concepts analysis
–  Gene set (Nephromine & third-party sources)
Gene Search
Differential expression (box graph)
Figure labels: Reference, Tubulointerstitial, Glomeruli
Gene Search
Correlation with clinical continuous variation
Gene: VCAN
Analysis type: GFR
Dataset type: Diabetes
P=7.71E-7, Correlation: -0.832

Legend
1. < 15 ml/min/1.73 m² (3)
2. 15 - 29 ml/min/1.73 m² (4)
3. 30 - 59 ml/min/1.73 m² (7)
4. 60 - 89 ml/min/1.73 m² (5)
5. > 90 ml/min/1.73 m² (3)
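A correlation like the one reported on this slide can be approximated with a plain Pearson coefficient. The slide does not state which coefficient the site computes, so Pearson is an assumption here, and the expression and GFR values below are invented purely to make the sketch runnable.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: log2 expression of one gene and GFR per sample.
expression = [9.1, 8.7, 8.0, 7.2, 6.9]
gfr = [12, 25, 45, 70, 95]
print(round(pearson_r(expression, gfr), 3))
```

A negative coefficient means expression falls as GFR rises, which is the direction reported for VCAN on the slide.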
Gene Search
Outlier analysis
Outlier analysis helps to identify an expression profile where a differential pattern is only seen in a fraction of samples of all patients within a disease type.

Why do we need it: if only 25% of patients show over-expression of a gene, this gene may not generate a significant p-value in a t-test comparing DN (diabetic nephropathy) relative to normal kidney.

How to do it: Transform all samples within a dataset, so that genes can be ranked by their expression from high to low. The data transformation is performed at certain percentile bins (75, 90 & 95%), and a line is drawn at the percentile of that analysis to define outliers.

For example, in an outlier analysis at the 75th percentile, the system draws a line at the point at which only the top 25% of samples extend above it.
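The percentile line described above can be sketched for a single gene as follows. This only illustrates the idea of drawing a cutoff at a percentile of all samples and then counting how many samples in each group exceed it. The exact transformation used by the database is not given in the slides, and the function names, the linear-interpolation percentile and the example values are assumptions.

```python
def percentile(values, pct):
    """Percentile with linear interpolation between ranked values."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values given")
    rank = (len(ordered) - 1) * pct / 100.0
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def fraction_above(values, cutoff):
    """Fraction of samples whose expression lies strictly above the cutoff."""
    return sum(v > cutoff for v in values) / len(values)


# Hypothetical log2 expression of one gene; 2 of 8 diabetic samples over-express.
controls = [5.0, 5.1, 5.2, 5.0, 5.1, 5.2, 5.1, 5.0]
diabetic = [5.0, 5.1, 5.2, 5.1, 5.0, 5.1, 9.0, 9.5]

cutoff = percentile(controls + diabetic, 75)
print(cutoff)                            # 5.2
print(fraction_above(controls, cutoff))  # 0.0
print(fraction_above(diabetic, cutoff))  # 0.25
```

In this toy case a quarter of the diabetic samples sit above the line and none of the controls do, a pattern that a comparison of group means can easily miss.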
Gene Search
Outlier analysis
Figure labels: Controls, Diabetic
Differential expression – dataset search
•  Results can be exported
Differential expression – dataset search – compare analysis
•  Compare different analyses
•  Data is standardized on upload (centered to 0 and standardized by variance)
•  All features are mapped to a common identifier (EntrezGeneID)
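Mapping features to a common identifier can be pictured as collapsing probe-level values to gene-level values. The sketch below averages all probes that map to the same gene ID and drops unmapped probes. Both choices are assumptions, since the slides do not say how multiple probes per gene are handled, and the probe names and gene IDs are placeholders, not real identifiers.

```python
from collections import defaultdict
from statistics import mean


def collapse_to_gene_ids(probe_values, probe_to_gene):
    """Average probe-level values per gene ID; unmapped probes are dropped."""
    grouped = defaultdict(list)
    for probe, value in probe_values.items():
        gene_id = probe_to_gene.get(probe)
        if gene_id is None:
            continue  # no common identifier available for this feature
        grouped[gene_id].append(value)
    return {gene_id: mean(values) for gene_id, values in grouped.items()}


# Placeholder probe names and gene IDs for illustration only.
probe_values = {"probe_a": 8.0, "probe_b": 9.0, "probe_c": 5.5, "probe_d": 7.0}
probe_to_gene = {"probe_a": "1001", "probe_b": "1001", "probe_c": "1002"}

print(collapse_to_gene_ids(probe_values, probe_to_gene))
# {'1001': 8.5, '1002': 5.5}
```

Once every dataset is keyed by the same identifier, values for a gene can be compared across platforms that use different probe designs.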
Meta analysis
•  Find out which genes are significantly more expressed in glomeruli compared to tubulointerstitium
•  Can you verify that with another dataset?
•  Or with more than one other dataset?
•  Does it matter if the datasets are different?
•  Can you imagine a use of this functionality for an exclusive filter (NOT)?
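At its simplest, combining analyses across datasets comes down to set operations on lists of significant genes. The sketch below shows an AND filter (genes found in every dataset) and a NOT filter (genes found in one list but absent from another). The function names and the gene labels are placeholders, and the real tool may rank or weight genes rather than use plain set membership.

```python
def consistent_genes(*gene_lists):
    """Genes reported as significant in every dataset (logical AND)."""
    if not gene_lists:
        return set()
    return set.intersection(*(set(genes) for genes in gene_lists))


def exclusive_genes(include, exclude):
    """Genes significant in `include` but NOT in `exclude`."""
    return set(include) - set(exclude)


# Placeholder gene labels standing in for significant genes per dataset.
dataset_1 = ["GENE_A", "GENE_B", "GENE_C"]
dataset_2 = ["GENE_B", "GENE_C", "GENE_D"]
dataset_3 = ["GENE_C", "GENE_E"]

print(sorted(consistent_genes(dataset_1, dataset_2)))             # ['GENE_B', 'GENE_C']
print(sorted(consistent_genes(dataset_1, dataset_2, dataset_3)))  # ['GENE_C']
print(sorted(exclusive_genes(dataset_1, dataset_3)))              # ['GENE_A', 'GENE_B']
```

Requiring agreement across more datasets shrinks the list but raises confidence, while the NOT filter isolates genes specific to one comparison.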
Example
Concepts Analysis
Concepts are sets of genes representing some aspect of biology.

Concepts are derived from both Nephromine gene expression signatures as well as third-party sources such as Gene Ontology, KEGG Pathways, Human Protein Reference Database, etc.

Users can upload a self-defined custom concept (a set of genes) to Nephromine to explore its association with Nephromine and third-party concepts.
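The association between two concepts can be thought of as a gene-set overlap test. The sketch below computes the overlap, a one-sided hypergeometric p-value and an odds ratio from the 2x2 table. The slides do not state which test Nephromine uses, so the hypergeometric test is an assumption, the gene labels and universe size are invented, and no multiple-testing correction (q-value) is attempted.

```python
from math import comb


def overlap_stats(concept_a, concept_b, universe_size):
    """Return (overlap, p_value, odds_ratio) for two gene sets.

    p_value is P(overlap >= observed) under a hypergeometric null,
    where `universe_size` is the number of genes that could be drawn.
    """
    a, b = set(concept_a), set(concept_b)
    both = len(a & b)
    only_a = len(a) - both
    only_b = len(b) - both
    neither = universe_size - len(a | b)
    if neither < 0:
        raise ValueError("universe is smaller than the union of the sets")
    total = comb(universe_size, len(b))
    tail = sum(
        comb(len(a), i) * comb(universe_size - len(a), len(b) - i)
        for i in range(both, min(len(a), len(b)) + 1)
    )
    p_value = tail / total
    if only_a == 0 or only_b == 0:
        odds_ratio = float("inf")
    else:
        odds_ratio = (both * neither) / (only_a * only_b)
    return both, p_value, odds_ratio


# Invented gene labels: two 5-gene concepts sharing 2 genes, 20 genes in total.
custom_concept = ["g0", "g1", "g2", "g3", "g4"]
other_concept = ["g3", "g4", "g5", "g6", "g7"]
print(overlap_stats(custom_concept, other_concept, universe_size=20))
```

With real concepts the universe is usually the set of genes measured on the platform, and that choice strongly affects the p-value.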

Concepts Analysis: upload a custom concept
•  Menu options: Upload Custom Concept, Manage My Concepts, Change password
•  Download the list (Podo-50-symbol) from C-tools to the desktop, then upload
•  Then press “Validate”
–  Concept (Podo-50-symbol) validated successfully.
•  Then press “Upload”
–  Concept (Podo-50-symbol) was successfully uploaded and can be viewed in My Concepts
•  Select (Podo-50-symbol) as primary concept now.
Concepts Summary View
•  Nephromine Concept Summary
–  4 concepts meet your threshold and are associated with the primary concept
•  Other (Non-Nephromine) Concept Summary
Concepts Analysis
•  COLL FSGS vs. Normal kidney (Hodgin FSGS)
–  Nephromine Gene Expression Signatures
–  Top 5% Under-expressed
–  P=1.54E-18, q=1.15E-14, Odds=18
Concepts Analysis: export formats
•  PowerPoint
•  Publication-quality graphic (SVG)
•  Excel - Analysis Comparison
•  Excel - Analysis Gene List
•  Excel - Dataset Detail
tranSMART
The Translational Challenge: Data Integration & Analysis
Athey and Omenn, 2009

tranSMART Platform: Enabling Translational research

tranSMART – A platform and community
•  Open-source and open-data translational biomedical research community
•  Biomedical Researchers, Developers, Service Providers
•  Clinician Researchers
tranSMART Platform: Academics and industry
•  2009: Johnson and Johnson
•  2010: Thomson Reuters
•  2010: Sage Bionetworks
•  2012: One Mind for Research
•  2012: St. Jude, Harvard, Johns Hopkins Univ.
•  2012: FDA
•  2012: Pfizer
tranSMART: controlled vocabulary
Subset selection
•  Two subsets are defined (Subset 1, Subset 2)
•  Can further specify with AND or exclusion
Summary statistics
Differentially expressed genes
•  Columns: gene symbols, P-values, fold change
•  Comparisons can be saved/emailed
tranSMART – why do we care?
•  Enables data exploration with low hurdles
•  Integrates many different data types
•  Has interfaces to real analysis tools
•  Provides a consistent data set
•  Can be run locally/institutional etc.
•  Can possibly be “shared” across institutions
–  McMurry et al, PLOS ONE: SHRINE: enabling nationally scalable multi-site disease studies
•  Go to: http://transmartfoundation.org/
Acknowledgements

Matthias Kretzler, Felix Eichinger, Wenjun Ju, Sebastian Martini, Viji Nair, Celine Berthier, Laura Mariani
Daniel R. Rhodes, Rodney Keteyian, Becky Steck, Colleen Kincaid-Beal, Rachel Dull
Homework for fun
•  Connectivity map
–  Use Diabetes vs. control (tubulointerstitium dataset)
–  Select top 1% overexpressed as primary concept
–  Compare to significantly overlapping concepts with Connectivity map
–  Can you find potential drug candidates? Are there any drugs that work for both glom. and tub?
–  What could be optimized? How will you plan further experiments to test your hypothesis?
