Gene Expression Databases - 525 - 2016

This document provides an overview of gene expression databases. It discusses how gene expression is measured using microarrays and RNA-seq. It then describes what gene expression databases are, examples that exist, how they differ, and how they can be used. Key points include that databases store gene expression data and provide access to datasets and individual expression levels. Integrated databases contain additional sample data and analytical tools. Specialized databases focus on specific research areas and provide pre-computed analyses to make the data more accessible and useful.

Uploaded by

arvind sharma matiyani


Gene expression databases

Sean Eddy, PhD

Outline
•  How to measure gene expression?
–  Microarrays/RNA-seq
•  What are gene expression databases?
•  Which ones exist?
•  How do they differ?
•  How can they be used?
•  What needs to be taken care of?
•  What can I do with specialized databases?
Microarrays: the beginning of high-throughput gene expression
•  Compartmentalized chips with sequences bound to the surface
•  Sample is applied to surface
•  Hybridizes to complementary sequence
•  Hybridizations are quantified
–  No binding, “no” signal
–  Can only find what is specifically searched for
–  Usually represented as n x m matrix
Figure source: http://www.nature.com/nrd/journal/v1/n12/images/nrd961-f1.gif
RNA-seq: the future of high-throughput gene expression
•  Samples (cDNA libraries) are applied to a sequencer
•  Millions of sequences are generated and mapped onto a genome
•  Sequence reads are quantified
–  No sequence = no expression
–  Can find novel transcripts, splice isoforms, fusion genes, etc.
•  Quality of mapping depends largely on library prep and decisions made on how RNA is initially processed.
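As a rough illustration of how mapped reads become expression values, the sketch below scales raw per-gene read counts to counts per million (CPM). The counts are invented for illustration, and CPM is only one common normalization; the slides do not prescribe a particular one.

```python
def counts_per_million(read_counts):
    """Scale raw per-gene read counts so that the sample sums to one million."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("sample has no mapped reads")
    return {gene: count * 1_000_000 / total for gene, count in read_counts.items()}


# Hypothetical read counts for one sample (not real data).
# A gene without any mapped reads ends up with an expression value of 0.
sample = {"NPHS2": 150, "VCAN": 50, "GENE_X": 0, "OTHER": 800}
cpm = counts_per_million(sample)
print(cpm)
```

A gene with zero reads stays at zero after scaling, which mirrors the "no sequence = no expression" point above.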
Gene expression databases
•  Repositories for gene expression data
–  Mostly microarray and now RNAseq
–  Primarily for storage
–  Curated or un-curated
–  Access to data on different levels:
•  Datasets
•  Individual levels

•  Integrated databases
–  Contain array data and additional data of the samples
–  Array data tends to be more annotated
–  More analytical tools
–  Smaller (more QC and curation needed)
–  Often no direct data access
Why do they exist
•  Transparency/reproducibility of publications
–  Journals require data to be available for analysis
–  Nowadays raw data is required
–  Databases offer single resource and standardized
access
•  Data was generated for a specific purpose, but is
not limited to that purpose
–  Can be reanalyzed in a different context
–  Can be combined with other datasets
–  Can be used as independent validation
Gene expression repository examples
•  Gene Expression Omnibus (www.ncbi.nlm.nih.gov/geo/)
–  1,117,462 samples, 3848 datasets
•  ArrayExpress (www.ebi.ac.uk/arrayexpress/)
•  Princeton University MicroArray database (PUMAdb)
–  40084 experiments, 6598 made public
•  NCBI SRA, ENA and Princeton HTseq for NGS data

What is in a gene expression
database?
•  Gene expression data in different forms:
–  Resolution:
•  Gene level
•  Transcript level
•  Exon level
–  And / or raw data
–  Comprehensiveness
•  Targeted arrays
•  Whole genome arrays
–  Different platforms (microarrays, RNAseq)
•  Generally only gene expression, may have limited sample information
Where does the data come from?
•  Expression profiles of
–  Patients
–  Model systems
–  Cell cultures
•  Data used for publication
–  Most journals now require raw data submission
–  Very coarse quality control (peer review)
–  QC depends mostly on authors
•  Datasets submitted without publication
–  Little or no QC
•  Most datasets are tailored towards a specific question
Example: GEO GSE32591
•  Go to http://www.ncbi.nlm.nih.gov/geo/
•  Enter GSE32591 into search box
•  Click on “Analyze with GEO2R”
–  How would you set up the groups for analysis?
–  What do you get?
•  Does that make sense? How can results be verified?
•  Go to “value distribution” tab
–  What do you see?
–  What are possible explanations?
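A similar value distribution check can be approximated offline once a dataset has been downloaded. The sketch below is a minimal example under stated assumptions: each sample is a plain list of expression values, the sample names and numbers are made up, and the cutoff of 1.0 is an arbitrary illustrative choice. It summarizes each sample by its median and flags samples whose median sits far from the rest.

```python
from statistics import median


def flag_shifted_samples(samples, max_shift=1.0):
    """Return names of samples whose median differs from the median of all
    sample medians by more than max_shift (same units as the data)."""
    medians = {name: median(values) for name, values in samples.items()}
    center = median(medians.values())
    return sorted(name for name, m in medians.items() if abs(m - center) > max_shift)


# Hypothetical log-scale expression values for four samples (not real GEO data).
samples = {
    "GSM_A": [6.1, 7.0, 8.2, 9.1, 7.5],
    "GSM_B": [6.0, 7.1, 8.0, 9.3, 7.4],
    "GSM_C": [6.2, 6.9, 8.1, 9.0, 7.6],
    "GSM_D": [9.5, 10.2, 11.8, 12.4, 10.9],
}
flagged = flag_shifted_samples(samples)
print(flagged)
```

A shifted sample is only a prompt for further inspection; possible explanations include batch effects, a different processing pipeline, or missing normalization.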
What can be done with GEO?
•  Programmatic access for data download
–  http://www.ncbi.nlm.nih.gov/geo/info/geo_paccess.html (GEO)
–  http://www.ebi.ac.uk/arrayexpress/help/programmatic_access.html (ArrayExpress)
•  Pre-computed analyses and on-the-fly analyses
–  Search by gene across all GEO experiments
–  Search by experiment to retrieve cluster analysis
–  Search by gene sequence for matching expression profiles
•  Described by Barrett and Edgar, Methods Mol. Biol. 2006, “Mining Microarray Data at NCBI’s Gene Expression Omnibus (GEO)”
–  http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1619899/
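For programmatic download, one option is to construct the location of a series matrix file from its accession. The sketch below assumes GEO's download directory layout, in which series are grouped into folders where the last three digits of the accession are replaced by "nnn", and it assumes a single-platform series whose file is named `<accession>_series_matrix.txt.gz`. The helper name is made up for this example, and the resulting URL should be checked against the GEO documentation before relying on it.

```python
import re


def geo_series_matrix_url(accession):
    """Build the assumed download URL of the series matrix file for a GSE accession."""
    match = re.fullmatch(r"GSE(\d+)", accession)
    if match is None:
        raise ValueError(f"not a GEO series accession: {accession!r}")
    digits = match.group(1)
    # Folder stub: drop the last three digits and append "nnn",
    # e.g. GSE32591 -> GSE32nnn, GSE123 -> GSEnnn.
    stub = "GSE" + digits[:-3] + "nnn"
    return (
        "https://ftp.ncbi.nlm.nih.gov/geo/series/"
        f"{stub}/{accession}/matrix/{accession}_series_matrix.txt.gz"
    )


url = geo_series_matrix_url("GSE32591")
print(url)
```

The function only builds a string and does not contact the server, so the download itself (and any error handling for missing files) is left to the caller.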
What questions can be answered?
•  If you download: anything
–  Only limited by your knowledge, skills, resources
•  Pre-computed results
–  Preselected analysis methods/ sample groups
–  Generally within one dataset
•  On-the-fly analyses
–  Sets of genes that cluster under the given conditions
–  Sample properties may not be entirely transparent.
What can be answered by doing it yourself?
•  The quality of the data
–  Is part of the data low quality?
–  Does some of the data not fit into the set (e.g. batch
effect, outliers for other reasons)
–  Is it adequately processed?
•  What is the relationship between expression data and non-expression variables?
–  How is my gene (of interest) associated with experimental treatments, clinical parameters?
•  What are patterns across datasets?
–  Does my finding hold up across similar analyses in
independent datasets?
Why do you have to do it yourself?
•  Quality control:
–  QC parameters are often glossed over in papers and in microarray submissions
–  For Affymetrix, QC modules are freely available and widely accepted in the bioinformatics community
–  Other array types have distinct, but also similar properties
–  http://www.nature.com/nbt/focus/maqc/index.html
•  Relations to non-expression data variables
–  Data is often not standardized within fields
Why not?
•  Analysis across datasets:
–  Because:…. How?
–  Need to find a common standard for identification
–  Values need to be made comparable
•  If absolute expression values are used, dynamic range can be a problem
•  If ratios are used, information about expression level is lost
–  Non-expression data even worse
Who is the target group for doing it
yourself?
•  Users with experience in expression data
–  Crucial informaVon (STUFF) is missing
Why is this a problem?
•  Excludes investigators with good hypotheses but lacking bioinformatics skills
How to fix that?
•  Specialized databases
–  Datasets are easier to find
•  Datasets relevant to specific areas are collected in one place
–  NephroSeq for renal disease
–  Oncomine for cancer
–  Datasets are standardized and expertly curated
•  Controlled vocabulary is introduced for non-expression data
•  Curation of expression possible by introducing standardized references and data transformations across datasets
–  Gene IDs/Gene Symbols as references
–  Z-transformation or median centering of log-transformed expression data
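The two transformations named above can be sketched in a few lines. This is a minimal illustration, not the procedure used by any particular database: the intensities are invented, and the z-transformation uses the population standard deviation, which is an assumption.

```python
from math import log2
from statistics import mean, median, pstdev


def median_center(values):
    """Subtract the median so the values are centered on 0."""
    m = median(values)
    return [v - m for v in values]


def z_transform(values):
    """Center on the mean and scale by the (population) standard deviation."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        raise ValueError("cannot z-transform constant values")
    return [(v - mu) / sd for v in values]


# Hypothetical raw intensities of the same genes in two datasets whose
# platforms differ tenfold in absolute signal (not real data).
dataset_a = [100, 200, 400, 800]
dataset_b = [1000, 2000, 4000, 8000]

log_a = [log2(v) for v in dataset_a]
log_b = [log2(v) for v in dataset_b]

centered_a, centered_b = median_center(log_a), median_center(log_b)
z_a, z_b = z_transform(log_a), z_transform(log_b)
print(centered_a, centered_b)
```

After log transformation the tenfold difference becomes a constant offset, so both median centering and z-transformation bring the two datasets onto the same scale. The price is that the absolute expression level is no longer visible.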
NephroSeq (www.nephroseq.org)
Oncomine (www.oncomine.com, www.oncomine.org)
NephroSeq and Oncomine
•  Pros:
–  Each focuses on one area of interest
–  Clinical data for many individual samples available
–  Advanced analysis using integrated systems biology tools in a pre-defined automated manner
–  Meta-analysis possible
–  User friendly, freely accessible for academic users
–  Hypothesis-generating
•  Cons:
–  No raw data download
–  No programmatic access
–  Only predefined analyses
NephroSeq main page
26 datasets (2000 samples)
Analysis type:
Coexpression analysis
Differential analysis
Outlier analysis
Two Search Options
•  Gene specific search:
–  Gene
•  Dataset search:
–  Specific conditions/diseases
Gene Search
Gene summary view
NPHS2: encodes podocin, a podocyte-specific protein
Demographics
Dataset/Disease type
22 out of 33 analyses meet your threshold for NPHS2 in 2 out of 2 datasets
Four Basic Analysis Modes
•  Differential expression
•  Co-expression analysis
•  Outlier analysis
–  Heterogeneity within predefined groups
•  Concepts analysis
–  Gene set (Nephromine & third-party sources)
Gene Search: Differential expression (box graph)
[Figure: box plot of expression by group; labels: Reference, Tubulointerstitial, Glomeruli]
Gene Search
Correlation with continuous clinical variables
Gene: VCAN
Analysis type: GFR
Dataset type: Diabetes
P=7.71E-7
Correlation: -0.832
Legend
1. < 15 ml/min/1.73m2 (3)
2. 15 - 29 ml/min/1.73m2 (4)
3. 30 - 59 ml/min/1.73m2 (7)
4. 60 - 89 ml/min/1.73m2 (5)
5. > 90 ml/min/1.73m2 (3)
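A correlation of this kind can be reproduced on downloaded data with a few lines of code. The sketch below assumes a Pearson correlation, which the slide does not state, and uses invented GFR and expression values rather than the data behind the figure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-patient values (not the data shown on the slide):
gfr = [12, 25, 45, 70, 95]            # ml/min/1.73m2
expression = [9.1, 8.4, 7.6, 6.9, 6.0]  # log2 expression of a gene of interest

r = pearson(gfr, expression)
print(round(r, 3))
```

A negative coefficient, as on the slide, means that expression of the gene tends to be higher in patients with lower GFR.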
Gene Search
Outlier analysis
Outlier analysis helps to identify an expression profile where a differential pattern is only seen in a fraction of the samples of all patients within a disease type.

Why do we need it: if only 25% of patients show over-expression of a gene, this gene may not generate a significant p-value in a t-test comparing DN relative to normal kidney.

How to do it: Transform all samples within a dataset so that genes can be ranked by their expression from high to low. The data transformation is performed at certain percentile bins (75, 90 & 95%), and a line is drawn at the percentile of that analysis to define outliers.

For example, in an outlier analysis at the 75th percentile, the system draws a line at the point at which only the top 25% of samples extend above it.
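The idea can be sketched as follows. The slides do not specify the transformation, so this example assumes a common choice: each gene is median-centered and scaled by its median absolute deviation (MAD), and the transformed value at the chosen percentile is used as the gene's outlier score. The function names and the expression values are invented for illustration.

```python
from statistics import median


def percentile(values, pct):
    """Percentile (0-100) of a non-empty list, using linear interpolation."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    rank = (len(ordered) - 1) * pct / 100.0
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def outlier_score(gene_values, pct=75):
    """Median-center and MAD-scale one gene, then return the scaled value at pct.
    Genes can be ranked by this score from high to low."""
    med = median(gene_values)
    mad = median(abs(v - med) for v in gene_values)
    if mad == 0:
        return 0.0  # no spread around the median: treat as no outlier signal
    scaled = [(v - med) / mad for v in gene_values]
    return percentile(scaled, pct)


# Hypothetical expression of two genes across eight patients (not real data).
uniform_gene = [4.8, 5.0, 5.1, 4.9, 5.2, 5.0, 5.1, 4.9]
outlier_gene = [4.8, 5.0, 5.1, 4.9, 5.2, 5.0, 9.0, 9.5]  # 2 of 8 over-express

for p in (75, 90, 95):
    print(p, outlier_score(uniform_gene, p), outlier_score(outlier_gene, p))
```

The gene that is over-expressed in only a quarter of the patients receives a much higher score than the uniformly expressed one, even though a comparison of group means could miss it.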
Gene Search: Outlier analysis
[Figure: outlier analysis plot; groups: Controls, Diabetic]
Differential expression – Dataset search
[Screenshot: dataset search results with an “Export” option]
Differential expression – dataset search – compare analysis
•  Compare different analyses
•  Data is standardized on upload (centered to 0 and standardized by variance)
•  All features are mapped to a common identifier (EntrezGeneID)
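Mapping features to a common identifier can be sketched as a lookup followed by a collapse step. The example below is an assumption-laden illustration: the probe names, the gene identifiers and the mapping table are placeholders, and averaging probes that map to the same gene is one possible rule that the slides do not specify.

```python
from collections import defaultdict
from statistics import mean


def collapse_to_gene_ids(probe_values, probe_to_gene):
    """Average probe-level values per gene ID; unmapped probes are dropped."""
    grouped = defaultdict(list)
    for probe, value in probe_values.items():
        gene_id = probe_to_gene.get(probe)
        if gene_id is not None:
            grouped[gene_id].append(value)
    return {gene_id: mean(values) for gene_id, values in grouped.items()}


# Placeholder identifiers and values, invented for this example.
probe_to_gene = {"probe_1": "ENTREZ_A", "probe_2": "ENTREZ_A", "probe_3": "ENTREZ_B"}
probe_values = {"probe_1": 6.0, "probe_2": 8.0, "probe_3": 5.0, "probe_4": 7.5}

gene_values = collapse_to_gene_ids(probe_values, probe_to_gene)
print(gene_values)
```

Once every dataset is keyed by the same gene identifier, results from different platforms can be placed side by side.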
Meta analysis
•  Find out which genes are significantly more expressed in glomeruli compared to tubulointerstitium
•  Can you verify that with another dataset?
•  Or with more than one other dataset?
•  Does it matter if the datasets are different?
•  Can you imagine a use of this functionality for an exclusive filter (NOT)?
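On exported gene lists, verification across datasets and the exclusive (NOT) filter reduce to set operations. The sketch below is a minimal illustration; the function name is made up, and the gene lists are invented rather than taken from any dataset.

```python
def consistent_genes(required, excluded=()):
    """Genes present in every 'required' list and in none of the 'excluded' lists."""
    required_sets = [set(genes) for genes in required]
    if not required_sets:
        return set()
    shared = set.intersection(*required_sets)
    for genes in excluded:
        shared -= set(genes)
    return shared


# Hypothetical lists of genes called "up in glomeruli" in two datasets,
# plus a list from an unrelated comparison used as a NOT filter.
dataset_1_up = ["NPHS1", "NPHS2", "PODXL", "WT1"]
dataset_2_up = ["NPHS2", "PODXL", "WT1", "VCAN"]
other_comparison = ["WT1"]

verified = consistent_genes([dataset_1_up, dataset_2_up])
filtered = consistent_genes([dataset_1_up, dataset_2_up], excluded=[other_comparison])
print(sorted(verified), sorted(filtered))
```

This only works if both lists use the same identifiers, which is why the mapping to a common gene ID matters.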
Example
Concepts Analysis
Concepts are sets of genes representing some aspect of biology.

Concepts are derived from both Nephromine gene expression signatures and third-party sources such as Gene Ontology, KEGG Pathways, Human Protein Reference Database, etc.

Users can upload a self-defined custom concept (a set of genes) to Nephromine to explore its association with Nephromine and third-party concepts.
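An association between two concepts can be quantified from the overlap of their gene sets. The sketch below assumes a one-sided Fisher's exact test (hypergeometric tail) and the odds ratio of the 2x2 overlap table; the slides do not say which statistic Nephromine uses, and the gene names and universe size are invented.

```python
from math import comb


def overlap_stats(concept_a, concept_b, universe_size):
    """Return (overlap, odds ratio, one-sided p-value) for two gene sets
    drawn from a universe of universe_size measured genes."""
    a, b = set(concept_a), set(concept_b)
    overlap = len(a & b)
    only_a = len(a) - overlap
    only_b = len(b) - overlap
    neither = universe_size - overlap - only_a - only_b
    if neither < 0:
        raise ValueError("universe_size is smaller than the union of the sets")
    if only_a * only_b == 0:
        odds = float("inf")
    else:
        odds = (overlap * neither) / (only_a * only_b)
    # P(X >= overlap) when len(b) genes are drawn at random from the universe.
    total = comb(universe_size, len(b))
    tail = sum(
        comb(len(a), i) * comb(universe_size - len(a), len(b) - i)
        for i in range(overlap, min(len(a), len(b)) + 1)
    )
    return overlap, odds, tail / total


# Hypothetical concepts (placeholder gene names) in a universe of 20 genes.
custom_concept = ["g1", "g2", "g3", "g4", "g5"]
signature = ["g3", "g4", "g5", "g6", "g7"]

overlap, odds, p_value = overlap_stats(custom_concept, signature, universe_size=20)
print(overlap, odds, p_value)
```

When many concepts are tested against one primary concept, the p-values would additionally need a multiple-testing correction, which is omitted here.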

Concepts Analysis: uploading a custom concept
•  Menu: Upload Custom Concept, Manage My Concepts
•  Download the list (Podo-50-symbol) from C-tools to the desktop, then upload it
•  Then press “Validate”
–  “Concept (Podo-50-symbol) validated successfully.”
•  Then press “Upload”
–  “Concept (Podo-50-symbol) was successfully uploaded and can be viewed in My Concepts”
•  Select (Podo-50-symbol) as primary concept now.

Concepts Summary View
Nephromine Concept Summary

4 concepts meet your threshold and are associated with the primary concept

Other (Non-Nephromine) Concept Summary

Concepts Analysis

COLL FSGS vs. Normal kidney

Nephromine Gene Expression Signatures
P=1.54E-18, q=1.15E-14, Odds=18
Top 5% Under-expressed
Hodgin FSGS
Concepts Analysis
PowerPoint
Publication-quality graphic (SVG)
Excel ‑ Analysis Comparison
Excel ‑ Analysis Gene List
Excel ‑ Dataset Detail
tranSMART
The Translational Challenge: Data Integration & Analysis

Athey and Omenn, 2009

tranSMART Platform: Enabling Translational research

tranSMART – A platform and community
•  Open-source and open-data translational biomedical research community
•  Biomedical Researchers,
Developers, Service
Providers
•  Clinician Researchers

tranSMART Platform: Academics and industry
•  2009: Johnson and Johnson
•  2010: Thomson Reuters
•  2010: Sage Bionetworks
•  2012: One Mind for Research
•  2012: FDA
•  2012: St. Jude, Harvard, Johns Hopkins Univ.
•  2012: Pfizer
tranSMART: controlled vocabulary
Subset selection
•  Can further specify with AND or exclusion
•  Subset 1, Subset 2
Summary statistics
Differentially expressed genes
•  Columns: Gene symbols, P-values, Fold change
•  Comparisons can be saved/emailed
tranSMART – why do we care?
•  Enables data exploration with low hurdles
•  Integrates many different data types
•  Has interfaces to real analysis tools
•  Provides a consistent data set
•  Can be run locally, institutionally, etc.
•  Can possibly be “shared” across institutions
–  McMurry et al., PLoS ONE: “SHRINE: Enabling Nationally Scalable Multi-Site Disease Studies”

•  Go to: http://transmartfoundation.org/
Acknowledgements

Matthias Kretzler Daniel R. Rhodes

Felix Eichinger Rodney Keteyian
Wenjun Ju Becky Steck
Sebastian Martini Colleen Kincaid-Beal
Viji Nair Rachel Dull
Celine Berthier
Laura Mariani
Becky Steck
Colleen Kincaid-Beal
Rachel Dull

Homework for fun
•  Connectivity map
–  Use Diabetes vs. control (tubulointerstitium dataset)
–  Select top 1% overexpressed as primary concept
–  Compare to significantly overlapping concepts with Connectivity map
–  Can you find potential drug candidates? Are there any drugs that work for both glom. and tub?
–  What could be optimized? How will you plan further experiments to test your hypothesis?
