Gene Expression Analysis: Ulf Leser and Karin Zimmermann
Differential expression
Clustering
Standards in gene expression data management
Databases
Why find genes that behave differently in two classes (e.g. normal and tumor)?
Once a cause is identified, therapy can become more specific and more effective,
with fewer side effects.
We have: [Figure: expression measurements for Sample 1 and Sample 2]
Definition Fold Change (FC):
$FC = \log_2 \frac{\operatorname{avg}(T)}{\operatorname{avg}(N)}$
Significance of the result is determined by a threshold fc.
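Why log2?
          mean(tumor)  mean(normal)  mean(t)/mean(n)  FC
gene x    16           1             16               16
gene y    0.0625       1             1/16             16
The plain ratio is asymmetric (16 vs. 1/16) although both genes change 16-fold; on the log2 scale the two ratios become +4 and -4.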
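The two genes in the table can be used to sketch the log2 fold change in Python. This is a minimal illustration, not the lecture's own code: the function names and the threshold value of 1.0 are hypothetical, and each group holds a single value taken from the table.

```python
from math import log2
from statistics import mean


def log2_fold_change(tumor, normal):
    """Return log2(avg(T) / avg(N)) for two lists of expression values."""
    return log2(mean(tumor) / mean(normal))


def exceeds_threshold(fc, threshold=1.0):
    """Flag a gene whose absolute log2 fold change reaches the threshold.

    The default of 1.0 (a 2-fold change) is an illustrative assumption.
    """
    return abs(fc) >= threshold


# Group means taken from the table: gene x is 16-fold up, gene y 16-fold down.
fc_x = log2_fold_change([16], [1])
fc_y = log2_fold_change([1 / 16], [1])

print(fc_x, fc_y)  # symmetric around zero on the log2 scale
```

On the log2 scale, up- and down-regulation of the same magnitude differ only in sign, so a single threshold on the absolute value treats both directions alike.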
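+ intuitive measure
- independent of scatter: the spread of the expression values within each group is ignored
- independent of absolute values: a 2-fold change scores the same at low and at high expression levels
→ A score based only on the group means is not optimal; include the variance!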
Ulf Leser and Karin Zimmermann: Bioinformatics, Wintersemester 2010/2011 8
T-test – Hypothesis testing
Hypothesis
H0 Null hypothesis (the one we want to reject)
H1 Alternative hypothesis (logical opposite of H0)
Test statistic
Function of the sample that summarizes the characteristics of the latter
into one number with a known distribution.
Significance level
Probability for a false positive outcome of the test,
the error of rejecting a null hypothesis when it is actually true
P-Value
Probability of obtaining a test statistic at least as extreme as the observed one,
under the assumption that the null hypothesis holds.
[Figure: two-sided test; each tail of the distribution holds p-value/2]
Assumption: The values are normally distributed (note that the standard t-test
additionally assumes equal variances).
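Test statistic:
$t = \frac{\operatorname{mean}(N) - \operatorname{mean}(T)}{\sqrt{\frac{\operatorname{sd}(N)^2}{m} + \frac{\operatorname{sd}(T)^2}{n}}}$
where m and n are the numbers of samples in N and T.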
The greater |t|, the stronger the evidence for differential expression of gene X.
From t statistic to p-value: the t-value and the degrees of freedom determine the p-value
(look-up tables); the p-value is then compared with the significance level.
Example: N = {5, 7, 6, 9, 5}, T = {2, 4, 3, 5, 3}
Hypotheses: H0: µ_N − µ_T = 0, H1: µ_N − µ_T ≠ 0
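The example can be worked through with the statistic t = (mean(N) − mean(T)) / sqrt(sd(N)²/m + sd(T)²/n). The sketch below uses only the Python standard library; it assumes that sd is the sample standard deviation (divisor n − 1), and the function name is hypothetical. Turning t into a p-value additionally requires the t distribution (a look-up table or a statistics library), which is left out here.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(normal, tumor):
    """t = (mean(N) - mean(T)) / sqrt(sd(N)^2 / m + sd(T)^2 / n)."""
    m, n = len(normal), len(tumor)
    # statistics.variance is the sample variance, i.e. sd squared.
    standard_error = sqrt(variance(normal) / m + variance(tumor) / n)
    return (mean(normal) - mean(tumor)) / standard_error


N = [5, 7, 6, 9, 5]
T = [2, 4, 3, 5, 3]

t = t_statistic(N, T)
print(f"mean(N) = {mean(N)}, mean(T) = {mean(T)}, t = {t:.3f}")
```

For these values the means are 6.4 and 3.4, the sample variances are 2.8 and 1.3, and t comes out at roughly 3.31.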
Let N be the number of genes tested and p the p-value of a given probe.
One computes an adjusted p-value (Bonferroni correction) using:
$p_{adjusted} = p \cdot N$
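A minimal Python sketch of this adjustment follows. The function name and the example p-values are hypothetical, and capping the result at 1 is a common convention that the slide does not state.

```python
def adjust_p_values(p_values):
    """Multiply each p-value by the number of tests (p_adjusted = p * N).

    Capping at 1.0 is an added assumption so that the result
    remains a valid probability.
    """
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


# Hypothetical p-values for three probes.
raw = [0.001, 0.02, 0.5]
adjusted = adjust_p_values(raw)
print(adjusted)
```

With thousands of genes on a chip, the multiplication by N makes this a very strict criterion: only probes with very small raw p-values stay below the significance level.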
Ramaswamy & Golub 2002
Classification (supervised learning) vs. Clustering (unsupervised learning)
[Figure: hierarchical clustering of the objects A to G. In each step the two closest clusters in the distance matrix are merged and the matrix is recomputed; the distance values are not recoverable.]
Merge steps:
1. (B,D) → a
2. (E,F) → b
3. (A,b) → c
4. (C,G) → d
5. (d,c) → e
6. (a,e) → f
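The merging procedure shown on the slides can be sketched in a few lines of Python. Several things here are assumptions: the slides' distance values are not available, so the one-dimensional coordinates below are made up and chosen only so that the merge order matches the example; average linkage is likewise an assumed choice; and the function name is hypothetical.

```python
from itertools import combinations


def agglomerate(points, dist):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = {name: [name] for name in points}
    new_labels = iter("abcdefghijklmnopqrstuvwxyz")
    merges = []

    def linkage(c1, c2):
        # Average linkage (assumption): mean distance over all member pairs.
        pairs = [(x, y) for x in clusters[c1] for y in clusters[c2]]
        return sum(dist(points[x], points[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        label = next(new_labels)
        # Replace the two merged clusters by their union under a new label.
        clusters[label] = clusters.pop(c1) + clusters.pop(c2)
        merges.append((c1, c2, label))
    return merges


# Made-up 1-D positions, not taken from the slides.
points = {"A": 13.5, "B": 0.0, "C": 20.0, "D": 1.0,
          "E": 10.0, "F": 11.5, "G": 23.0}

merges = agglomerate(points, dist=lambda x, y: abs(x - y))
for left, right, label in merges:
    print(f"({left},{right}) -> {label}")
```

Recomputing all linkages in every round is simple but slow; it is meant to mirror the step-by-step distance matrices, not to scale to thousands of genes.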
For easier determination of clusters, the length of each branch is set in relation to the dissimilarity of
the leaves it joins.
The quality of the clustering can then be determined by the ratio of the mean distance within the
cluster to the mean distance to points not in the cluster. This ratio can also be used as a measure for the
cluster borders.
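This ratio can be sketched as follows. The function name and the example values are hypothetical, and the cluster is assumed to contain at least two points so that a within-cluster distance exists.

```python
from itertools import combinations
from statistics import mean


def cluster_quality(cluster, outside, dist):
    """Mean distance within the cluster divided by mean distance to outside points."""
    within = mean(dist(x, y) for x, y in combinations(cluster, 2))
    between = mean(dist(x, y) for x in cluster for y in outside)
    return within / between


def absolute_difference(x, y):
    return abs(x - y)


# Hypothetical 1-D expression values.
tight_cluster = [0.0, 1.0]
other_points = [10.0, 11.5]

ratio = cluster_quality(tight_cluster, other_points, absolute_difference)
print(f"within/between ratio: {ratio:.3f}")
```

A small ratio means the points in the cluster lie close together relative to their distance from everything else; a ratio near 1 suggests the cluster border is arbitrary.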
http://www.itee.uq.edu.au/~comp4702/lectures/k-means_bis_1.jpg
RNA extraction,
reverse transcription to cDNA,
labeling,
hybridization to microarray,
scanning,
spot detection,
spot intensity to numeric values,
normalization
Controlled vocabulary: MGED ontology    ?  ?  ?  ?
GPL: platform description
GSM: raw-processed intensities from a single chip
GSE: grouping of chip data, a single experiment
GDS: grouping of experiments
Wright 2003