
Gene expression analysis

Ulf Leser and Karin Zimmermann


Ulf Leser: Bioinformatics, Wintersemester 2010/2011 1
Last lecture

What are microarrays? Biomolecular devices that measure the transcriptome of a cell of interest.

Workflow of a microarray experiment: RNA extraction, reverse transcription to cDNA, labeling, hybridization to the microarray, scanning, spot detection, conversion of spot intensities to numeric values, normalization, analysis (today).

Normalization: assumes that the vast majority of genes are not differentially expressed between the two classes. Technical bias is removed so that the biological differences can be detected.


This lecture

Differential expression
Clustering
Standards in gene expression data management
Databases


Differential Expression - Motivation

Why find genes that behave differently in two classes (e.g. normal and tumor)?

A better understanding of the genetic circumstances that cause the difference (disease) hopefully leads to better therapy.

Detection of marker genes enables early recognition of diseases as well as recognition of disease subtypes.

Once a cause is identified, therapy can become more specific and more effective, with fewer side effects.


Differential Expression

We have:

N1,...,Nm: normal samples
T1,...,Tn: tumor samples

We look for: genes with significant differences between N and T.

Compare the values of gene X from group N with those of group T:

N = {n1,...,nm}
T = {t1,...,tn}

Many methods exist; here:

Fold change
t-test

[Figure: expression data with axes labeled "Sample" and "Gene"]


Visualization - Scatterplot

One point = one gene.

[Figure: three scatterplots of Sample 1 vs. Sample 2. Panels: "totally identical distribution"; "distribution of intensity differences"; "outlier: interesting genes"]


Fold Change

Definition Fold Change (FC):

$$FC = 2^{\left|\log_2\left(\frac{\mathrm{avg}(T)}{\mathrm{avg}(N)}\right)\right|}$$

The significance of a result is determined by a threshold on fc:

fc < 2: not interesting
2 < fc < 4: interesting
fc > 4: very interesting

Why log2? On the log2 scale, up- and down-regulation by the same factor receive the same score:

          mean(tumor)   mean(normal)   mean(t)/mean(n)   FC
gene x    16            1              16                16
gene y    0.0625        1              1/16              16
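
The table can be reproduced with a few lines of Python. This is a minimal sketch, assuming the symmetric reading of the definition that the table implies (a gene that is 16-fold down scores the same as one that is 16-fold up). The function name `fold_change` is chosen here for illustration.

```python
import math

def fold_change(tumor, normal):
    """Symmetric fold change: 2 ** |log2(mean(tumor) / mean(normal))|."""
    ratio = (sum(tumor) / len(tumor)) / (sum(normal) / len(normal))
    return 2 ** abs(math.log2(ratio))

# The two genes from the table: 16-fold up and 16-fold down.
fc_x = fold_change([16.0], [1.0])
fc_y = fold_change([0.0625], [1.0])
print(fc_x, fc_y)
```

Both genes receive the same score, so a single threshold on fc catches up- and down-regulated genes alike.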


Fold Change – Advantages / Disadvantages

+ intuitive measure
- independent of scatter (the variance within the groups is ignored)
- independent of absolute values (a 2-fold change looks the same at low and at high expression levels)

[Figures: expression (Exp) plots illustrating the two disadvantages]

→ A score based only on the means of the groups is not optimal; include the variance!

T-test – Hypothesis testing

Hypothesis
H0 Null hypothesis (the one we want to reject)
H1 Alternative hypothesis (logical opposite of H0)

Test statistic
Function of the sample that summarizes its characteristics in a single number
whose distribution under H0 is known.

Significance level
Probability of a false positive outcome of the test,
i.e. the error of rejecting a null hypothesis that is actually true.

P-Value
Probability of obtaining a test statistic at least as extreme as the observed one,
under the assumption that the null hypothesis holds.


Hypothesis testing – p value

[Figure: distribution of the test statistic with both tails marked; each tail is labeled "p value/2"]


T-test (Welch-test)

Assumption: the values are normally distributed (note that the standard Student's t-test
additionally assumes equal variances).

Test statistic:

$$t = \frac{\mathrm{mean}(N) - \mathrm{mean}(T)}{\sqrt{\frac{\mathrm{sd}(N)^2}{m} + \frac{\mathrm{sd}(T)^2}{n}}}$$

The greater |t|, the stronger the evidence for differential expression of gene X.

From t statistic to p value: the t value and the degrees of freedom determine the p value
(look-up tables); the p value is then compared with the significance level.


Example

N = {5,7,6,9,5}    T = {2,4,3,5,3}

Hypotheses    H0: µN − µT = 0    H1: µN − µT ≠ 0

Significance level    α = 0.05

Test statistic

$$t = \frac{\mathrm{mean}(N) - \mathrm{mean}(T)}{\sqrt{\frac{\mathrm{sd}(N)^2}{m} + \frac{\mathrm{sd}(T)^2}{n}}} = 3.3129$$

P-Value    p = 0.0126
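
The test statistic of this example can be checked with the Python standard library. This is a minimal sketch: the function name `welch_t` is chosen here, and the degrees-of-freedom formula (Welch-Satterthwaite) is an addition that does not appear on the slide. It is what a look-up of the p value would need.

```python
import math
import statistics

def welch_t(n_values, t_values):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    m, n = len(n_values), len(t_values)
    var_n = statistics.variance(n_values) / m   # sd(N)^2 / m
    var_t = statistics.variance(t_values) / n   # sd(T)^2 / n
    t = (statistics.mean(n_values) - statistics.mean(t_values)) / math.sqrt(var_n + var_t)
    df = (var_n + var_t) ** 2 / (var_n ** 2 / (m - 1) + var_t ** 2 / (n - 1))
    return t, df

t, df = welch_t([5, 7, 6, 9, 5], [2, 4, 3, 5, 3])
print(round(t, 4), round(df, 2))  # prints 3.3129 7.06
```

Turning t and df into a p value requires the t distribution. If SciPy is available, `scipy.stats.ttest_ind(N, T, equal_var=False)` performs the whole Welch test in one call.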


Further Methods

ANOVA: compares more than two groups as well as different factors.

SAM (Significance Analysis of Microarrays): an 'improvement' of the t-test, motivated by the fact that small variances can lead to very significant t-test results without a considerable fold change.

Rank Product: sort the genes by expression within each replicate and determine the geometric mean of each gene's ranks.


Multiple Testing Correction

Problem: Microarrays contain up to 20 000 genes, so α = 0.05 leads to about 20 000 * 0.05 = 1000 expected false positives (FPs) if no gene is differentially expressed.

Solution: Multiple testing correction. Two basic approaches:

1. Family-wise error rate (FWER): the probability of having at least one false positive in the set of results considered significant.
2. False discovery rate (FDR): the expected proportion of true null hypotheses among all rejected hypotheses, i.e. the expected proportion of incorrect rejections (type I errors).


Bonferroni (FWER)

Let N be the number of genes tested and p the p-value of a given probe.
The adjusted p-value is computed as:

padjusted = p*N

Only if the adjusted p-value is smaller than the pre-chosen significance level is the probe considered differentially expressed.

Very conservative test, rarely used in practice.
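
As a minimal sketch, the Bonferroni correction is a one-liner in Python. The function name and the example p-values are made up for illustration. Capping the result at 1 is a common convention that is not part of the formula above.

```python
def bonferroni(p_values):
    """Multiply every p-value by the number of tests, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]

# Hypothetical p-values for three probes.
adjusted = bonferroni([0.01, 0.04, 0.5])
significant = [p < 0.05 for p in adjusted]
print(adjusted, significant)
```

With only three tests, the probe with p = 0.04 already loses significance at α = 0.05, which shows how conservative the correction becomes for thousands of genes.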


Benjamini-Hochberg (FDR)

1. Choose a specific α (e.g. α = 0.05).
2. Rank all m p-values from smallest to largest.
3. Correct all p-values: BH(p_i) = p_i * m / i for i = 1,...,m.
4. p_i is significant if BH(p_i) ≤ α.

Example (m = 1000, α = 0.05):

Gene     p-value   rank   BH(p)                     Significant?
Gene A   0.00001   1      1000/1 * 0.00001 = 0.01   yes
Gene B   0.0004    2      1000/2 * 0.0004 = 0.2     no
Gene C   0.01      3      1000/3 * 0.01 = 3.33      no
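
The procedure can be sketched in plain Python as follows. The function name and the example p-values are hypothetical. Two details go beyond the steps listed above and follow common practice: adjusted values are capped at 1, and each adjusted value is limited by the adjusted value of the next larger p-value, so that the adjusted values keep the order of the raw ones.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order of the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that each adjusted value can be
    # limited by the adjusted value of the next larger p-value.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical p-values for four genes.
adj = benjamini_hochberg([0.001, 0.04, 0.03, 0.2])
significant = [a <= 0.05 for a in adj]
print(adj, significant)
```

In this example the raw correction for p = 0.03 would be 0.03 * 4 / 2 = 0.06, but it is lowered to the value of the next larger p-value (0.04 * 4 / 3 ≈ 0.053).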


Clustering - Motivation

High-dimensional data may contain all kinds of patterns and subgroup behavior that might represent biomedical phenomena (explorative analysis).

Clustering for quality control.

Expression patterns that are similar in spatial and temporal behavior → co-regulated / co-expressed genes (e.g. genes controlled by the same transcription factor).

Discover new disease subtypes by clustering samples.


Clustering

[Figure from Ramaswamy & Golub 2002]


Clustering - Overview

Classification (supervised learning): SVM, Bayes classifier, KNN
Clustering (unsupervised learning): hierarchical, k-means, SOM


Hierarchical Clustering - Algorithm

1. Choose a distance measure (e.g. Euclidean, Pearson correlation, etc.)
2. Set up the distance matrix S over all elements
3. Compute all pairwise distances in the matrix
4. while |S| > 1
5.   determine the pair (X,Y) with minimal distance
6.   compute the new value Z = avg(X,Y) (alternatives: single, average, or complete linkage)
7.   delete X and Y from S, insert Z into S
8.   compute the new distances of Z to all elements in S
9.   visualize X and Y as a pair
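
The loop above can be sketched in a few lines of Python. This is an illustrative version, not an efficient one: it assumes Euclidean distance and average linkage, recomputes distances on demand instead of maintaining the matrix S, and records the merges instead of drawing them. All function names and the example points are made up.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(c1, c2, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def hierarchical_clustering(points):
    """Agglomerative clustering; returns the merges in the order they happen."""
    clusters = [(i,) for i in range(len(points))]  # every point starts alone
    merges = []
    while len(clusters) > 1:
        # Determine the pair (X, Y) with minimal distance.
        x, y = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], points))
        clusters.remove(x)
        clusters.remove(y)
        clusters.append(x + y)  # the merged cluster Z
        merges.append((x, y))
    return merges

# Five hypothetical one-dimensional expression values.
points = [(0.0,), (0.1,), (5.0,), (5.2,), (10.0,)]
merges = hierarchical_clustering(points)
for x, y in merges:
    print(x, "+", y)
```

The list of merges is exactly the information a dendrogram displays: the two closest points are joined first, and the last merge forms the root.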


Hierarchical Clustering - graphical

[Figure: the distance matrix over the elements A,...,G shrinks with every merge while the dendrogram grows]

Merge sequence:

(B,D) → a
(E,F) → b
(A,b) → c
(C,G) → d
(d,c) → e
(a,e) → f


Hierarchical Clustering (HC)

Result: a binary tree; the clusters have to be determined by the user.

To make the clusters easier to determine, the length of a branch is drawn in proportion to the dissimilarity of the leaves it joins.

The quality of a clustering can then be determined by the ratio of the mean distance within a cluster to the mean distance to points outside the cluster. This ratio can also be used as a measure for the cluster borders.

The dendrogram is not unique: each of the n-1 inner nodes can be flipped, giving 2^(n-1) possible leaf orderings. An O(n^4) algorithm is known that finds the optimal ordering.
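
The quality ratio described above can be sketched as follows, assuming Euclidean distance and a cluster with at least two points. The function names and the example points are made up.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_quality(cluster, others):
    """Mean distance within the cluster divided by mean distance to outside points.

    Needs at least two points in the cluster and one point outside it.
    """
    within = [dist(a, b) for i, a in enumerate(cluster) for b in cluster[i + 1:]]
    between = [dist(a, b) for a in cluster for b in others]
    return (sum(within) / len(within)) / (sum(between) / len(between))

quality = cluster_quality([(0.0,), (1.0,)], [(10.0,), (11.0,)])
print(quality)
```

Small values indicate a compact cluster that is well separated from the remaining points.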


K means

1. Choose k random cluster centers μ1,...,μk.
2. For every x in the dataset S, determine the nearest cluster center and assign x to that cluster.
3. For every cluster Ci, compute its cost:
   $$\mathrm{cost}(C_i) = \sum_{r=1}^{|C_i|} d(\mu_i, x_{r,i})$$
4. Compute a new center μi for every cluster Ci:
   $$\mu_i = \frac{1}{|C_i|} \sum_{r=1}^{|C_i|} x_{r,i}$$
5. Repeat steps 2-4 until the cluster centers no longer change.
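
A minimal pure-Python sketch of these steps follows. It assumes Euclidean distance, picks k random data points as initial centers, and keeps the old center if a cluster becomes empty. The function names, the seed parameter, and the example data are made up for illustration.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(data, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(data, k)  # step 1: k random data points as centers
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda i: dist(centers[i], x))
            clusters[nearest].append(x)
        # Step 4: recompute the centers (an empty cluster keeps its old center).
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:  # step 5: stop once the centers are stable
            break
        centers = new_centers
    # Step 3: total cost of the final clustering.
    cost = sum(dist(centers[i], x) for i, c in enumerate(clusters) for x in c)
    return centers, clusters, cost

data = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, clusters, cost = k_means(data, k=2)
print(centers, cost)
```

Running the function with several seeds and keeping the result with the lowest cost is a simple way to reduce the influence of the initialization.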


K means

[Figure: k-means illustration. Source: http://www.itee.uq.edu.au/~comp4702/lectures/k-means_bis_1.jpg]


K means

Convergence to the globally optimal clustering is not assured; the algorithm may stop in a local optimum.

Cluster quality can be computed as the mean distance of the genes to their cluster centers, over all clusters.

The number of clusters has to be chosen in advance.

The initialization of the cluster centers has a great impact on the clustering quality, so run the algorithm with more than one initial constellation.


Standards

To judge the comparability of different experiments, detailed information on the individual steps is necessary:

RNA extraction,
reverse transcription to cDNA,
labeling,
hybridization to the microarray,
scanning,
spot detection,
conversion of spot intensities to numeric values,
normalization


MIAME

MIAME describes the Minimum Information About a Microarray Experiment that is needed to interpret the results of the experiment unambiguously and potentially to reproduce the experiment.

MIAME does not specify a particular format (→ use MAGE-TAB or MAGE-ML).

MIAME does not specify any particular terminology (→ use the MGED ontology).


MIAME Specification

1. Raw data (.CEL, .gpr)
2. Final processed (normalized) data
3. Sample annotation (incl. experimental factors and their values)
4. Experimental design including sample-data relationships (e.g. which hybridisations are technical or biological replicates)
5. Annotation of the array (e.g. gene identifiers, genomic coordinates, probe oligonucleotide sequences)
6. Laboratory and data processing protocols (e.g. which normalisation method was used)


Standards - Overview

                              DNA Microarray   High-throughput   In Situ Hybridization and   Tissue Microarray   Proteomics
                              Data             Sequencing Data   Immunohistochemistry Data   Data                Data

Minimum Information Spec.     MIAME            MINSEQE           MISFISHIE                   ???                 MIAPE
Data Model                    MAGE-OM          ?                 ?                           TMA-OM              PSI-OM
XML format                    MAGE-ML          ?                 ?                           TMA-DES             PSI-ML
TAB-delimited format          MAGE-TAB         ?                 ?                           TMA-TAB             ?
Controlled vocabulary         MGED ontology    ?                 ?                           ?                   ?


Databases

GEO (Gene Expression Omnibus)


ArrayExpress


GEO – Gene Expression Omnibus

NCBI public repository
RDBMS schema

Record   Content                                                  Source
GPL      platform description                                     submitted by manufacturer
GSM      raw/processed intensities from a single slide or chip    submitted by experimentalist
GSE      grouping of chip data, "a single experiment"             submitted by experimentalist
GDS      grouping of experiments                                  curated by NCBI


ArrayExpress (EMBL-EBI)


GEO vs. ArrayExpress

- both support MIAME-compliant submissions
- both provide a good way of making data publicly available, as often requested by journals
- GEO contains more data
- ArrayExpress provides analysis tools (and seq data?)


DLBCL Subtypes

Germinal center B-cell-like (GCB) and activated B-cell-like (ABC), with 5-year survival rates of 59% and 30%, respectively (Wright 2003).


DLBCL Subtypes

40 exon arrays of DLBCL patients, subtype unknown.

Do we see the division into subgroups with a different technology and different probes?


Summary

Combine t-test and fold change for optimal detection of differential expression.
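
One simple way to combine the two criteria is to require both a small p-value and a large fold change. The sketch below assumes that a p-value and a fold change have already been computed per gene. The function name, the gene names, and the numbers are hypothetical, and the default thresholds (α = 0.05, fold change of at least 2) follow the values used earlier in the lecture.

```python
def select_genes(results, alpha=0.05, min_fc=2.0):
    """Keep genes with p-value <= alpha and fold change >= min_fc."""
    return [gene for gene, (p, fc) in results.items() if p <= alpha and fc >= min_fc]

# Hypothetical (p-value, fold change) pairs per gene.
results = {
    "gene_a": (0.001, 4.2),   # significant and strongly changed
    "gene_b": (0.0005, 1.1),  # significant, but hardly any change
    "gene_c": (0.3, 8.0),     # large change, but not significant
}
selected = select_genes(results)
print(selected)
```

Only gene_a passes both filters. In a real analysis the p-values would first be corrected for multiple testing.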

More explorative analyses like clustering can detect patterns inherent in the expression data, such as co-regulated genes or new disease subtypes.

Public repositories like GEO and ArrayExpress offer a wealth of data.
