Cluster Analysis for Gene Expression Data: Jiong Yang, EECS, Case Western Reserve University

This document outlines different clustering algorithms and approaches that can be used for analyzing gene expression data. It discusses gene-based clustering, sample-based clustering, and subspace clustering. Various clustering algorithms are described, including K-means, hierarchical clustering, self-organizing maps, graph-theoretical approaches, and model-based clustering. Both supervised and unsupervised methods for sample clustering are also covered. The document provides an overview of current research in clustering gene expression data.


Cluster Analysis for Gene Expression Data

Jiong Yang
EECS
Case Western Reserve University
Outline
• Introduction
• Clustering Algorithms
• Cluster Validation
• Current and Future Research Directions
Clusters & Clustering
• Clustering is the process of grouping data objects into a set of disjoint classes, called clusters, so that objects within a class have high similarity to each other, while objects in separate classes are more dissimilar.
Categories of gene expression data clustering
• Gene expression profile: a matrix whose rows correspond to genes and whose columns correspond to samples.
• Gene-based clustering: the genes are treated as the objects, while the samples are the features.
• Sample-based clustering: the samples are treated as the objects and the genes as the features.
• Subspace clustering: captures clusters formed by a subset of genes across a subset of samples.
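The three categories differ only in which slice of the expression matrix is handed to the clustering step. A minimal sketch, assuming a toy matrix with genes as rows and samples as columns (the values and the helper names `gene_view`, `sample_view` and `block` are hypothetical):

```python
# Hypothetical toy expression matrix: rows are genes, columns are samples.
expression = [
    [2.0, 4.0, 6.0],  # gene 0
    [1.0, 2.0, 3.0],  # gene 1
]

def gene_view(matrix):
    """Gene-based clustering: each object is a gene, its features are the samples."""
    return [list(row) for row in matrix]

def sample_view(matrix):
    """Sample-based clustering: each object is a sample, its features are the genes."""
    return [list(col) for col in zip(*matrix)]

def block(matrix, gene_idx, sample_idx):
    """Subspace clustering works on sub-matrices: a subset of genes on a subset of samples."""
    return [[matrix[g][s] for s in sample_idx] for g in gene_idx]
```

For example, `sample_view(expression)` yields three objects with two features each, while `block(expression, [1], [0, 2])` picks out gene 1 under samples 0 and 2.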
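Proximity measurement for gene expression data
• Each gene is represented as a vector $O_i = \{o_{ij} \mid 1 \le j \le p\}$, where $o_{ij}$ is the expression value of the $i$th gene under the $j$th sample and $p$ is the number of samples.
• Euclidean distance (Wang et al., 2002):
$$\mathrm{Euclidean}(O_i, O_j) = \sqrt{\sum_{d=1}^{p} (o_{id} - o_{jd})^2}$$
• Pearson’s correlation coefficient (Tang et al., 2002):
$$\mathrm{Pearson}(O_i, O_j) = \frac{\sum_{d=1}^{p} (o_{id} - \mu_{o_i})(o_{jd} - \mu_{o_j})}{\sqrt{\sum_{d=1}^{p} (o_{id} - \mu_{o_i})^2}\,\sqrt{\sum_{d=1}^{p} (o_{jd} - \mu_{o_j})^2}}$$
where $\mu_{o_i}$ and $\mu_{o_j}$ are the means of $O_i$ and $O_j$ over the $p$ samples.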
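The two proximity measures can be sketched directly from their definitions. This is a plain-Python illustration, assuming each gene is given as a list of `p` expression values; the function names are chosen here for illustration only.

```python
import math

def euclidean(oi, oj):
    """Euclidean distance between two expression vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(oi, oj)))

def pearson(oi, oj):
    """Pearson's correlation coefficient between two expression vectors."""
    mu_i = sum(oi) / len(oi)
    mu_j = sum(oj) / len(oj)
    num = sum((a - mu_i) * (b - mu_j) for a, b in zip(oi, oj))
    den = (math.sqrt(sum((a - mu_i) ** 2 for a in oi))
           * math.sqrt(sum((b - mu_j) ** 2 for b in oj)))
    if den == 0:
        # A constant vector has zero variance, so the correlation is undefined.
        raise ValueError("Pearson correlation is undefined for a constant vector")
    return num / den
```

Two genes with the same shape but different magnitudes, such as `[1, 2, 3]` and `[2, 4, 6]`, have a Pearson correlation of 1 even though their Euclidean distance is not zero, which is one reason correlation is often preferred for expression patterns.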
Gene-based clustering
• The purpose of clustering gene expression data is to reveal the natural data structures and gain some initial insights regarding data distribution.
• Clustering algorithms for gene expression data should be capable of extracting useful information from a high level of background noise.
• A good clustering algorithm may not only partition the data set but also provide some graphical representation of the cluster structure.
K-means (Tavazoie et al., 1999)
• Given a pre-specified number K, the algorithm partitions the data set into K disjoint subsets which minimize the following objective function:
$$E = \sum_{i=1}^{K} \sum_{O \in C_i} \lvert O - \mu_i \rvert^2$$
where $O$ is a data object in cluster $C_i$ and $\mu_i$ is the centroid (mean of objects) of $C_i$.
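A minimal sketch of how this objective is usually minimized, assuming the standard alternating scheme (Lloyd's iteration) with randomly chosen data objects as initial centroids. The slides do not specify the optimization procedure, so the initialization, the stopping rule and the parameter names are assumptions.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(data, k, iters=100, seed=0):
    """Return (labels, centroids, E), where E is the K-means objective value."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct data objects picked at random.
    centroids = [list(p) for p in rng.sample(data, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: each object joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in data]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(data, labels) if l == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean(members)
    error = sum(sq_dist(p, centroids[l]) for p, l in zip(data, labels))
    return labels, centroids, error
```

The result depends on the initial centroids, so in practice the algorithm is often run several times with different seeds and the partition with the smallest E is kept.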
Hierarchical clustering (Eisen et al., 1998)
• Generates a hierarchical series of nested clusters which can be graphically represented by a tree, called a dendrogram.
• Agglomerative approaches (bottom-up approach): single link, complete link and minimum-variance
• Divisive algorithms (top-down approach): deterministic annealing algorithm, graph-theoretical methods
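The bottom-up approach can be sketched as repeatedly merging the two closest clusters. The version below is a deliberately naive illustration, assuming Euclidean distance and supporting only single link and complete link; the function and parameter names are hypothetical, and the recorded merge list is what a dendrogram would be drawn from.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(data, num_clusters, linkage="single"):
    """Merge clusters bottom-up until num_clusters remain.

    Returns (clusters, merges): clusters are lists of object indices, and
    merges records (left, right, distance) for every merge in order.
    """
    if linkage == "single":
        agg = min  # distance between the closest pair of members
    elif linkage == "complete":
        agg = max  # distance between the farthest pair of members
    else:
        raise ValueError("linkage must be 'single' or 'complete'")
    clusters = [[i] for i in range(len(data))]
    merges = []
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = agg(euclidean(data[i], data[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges
```

Cutting the tree at different heights corresponds to stopping at different values of `num_clusters`; the merge distances in `merges` give the heights of the dendrogram nodes.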
Self-organizing map (Tamayo et al., 1999)
• Based on a single-layered neural network
• Each gene is mapped to a point in a high-dimensional space
• A number of “virtual centers” are chosen.
• An iterative process: for each gene, each center is moved toward the gene. If a center is closer to the gene, the center travels a larger distance.
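The iterative process can be sketched as follows. This is a simplified illustration, assuming the centers sit on a one-dimensional grid, that closeness is measured on that grid relative to the best-matching center, and that the learning rate and neighborhood radius shrink linearly over time; none of these choices come from the slides.

```python
import math
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_som(data, n_nodes, epochs=50, lr0=0.5, radius0=1.0, seed=0):
    """Train a 1-D self-organizing map and return the final centers."""
    rng = random.Random(seed)
    # Assumption: centers start at randomly chosen data points.
    nodes = [list(rng.choice(data)) for _ in range(n_nodes)]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # visit genes in random order
            frac = 1.0 - step / total_steps
            lr = lr0 * frac                     # decaying learning rate
            radius = max(radius0 * frac, 1e-3)  # decaying neighborhood radius
            # Best-matching center: the one closest to the gene.
            bmu = min(range(n_nodes), key=lambda n: sq_dist(x, nodes[n]))
            for n in range(n_nodes):
                # Centers near the best match on the grid move the most.
                h = math.exp(-((n - bmu) ** 2) / (2.0 * radius ** 2))
                nodes[n] = [w + lr * h * (a - w) for w, a in zip(nodes[n], x)]
            step += 1
    return nodes

def assign(data, nodes):
    """Map each gene to the index of its closest center."""
    return [min(range(len(nodes)), key=lambda n: sq_dist(x, nodes[n]))
            for x in data]
```

After training, `assign` gives the cluster of each gene, and neighboring centers on the grid tend to hold similar expression patterns.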
Graph-theoretical approaches
• Given a dataset X, a proximity matrix P, where $P[i, j] = \mathrm{proximity}(O_i, O_j)$, and a weighted graph $G(V, E)$, where each data point corresponds to a vertex, the problem of clustering a dataset can be converted into finding minimum cuts or maximal cliques in the graph.
• CLICK (Shamir et al., 2000) & CAST (Ben-Dor et al., 1999)
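To make the conversion from a proximity matrix to a graph concrete, here is a deliberately simplified sketch. It does not implement minimum cut, clique finding, CLICK or CAST; it only builds a graph by thresholding similarities and reports its connected components, which is the simplest graph-based grouping. The threshold and the function names are assumptions.

```python
def similarity_graph(proximity, threshold):
    """Adjacency lists: connect i and j when their similarity reaches threshold."""
    n = len(proximity)
    return {i: [j for j in range(n) if j != i and proximity[i][j] >= threshold]
            for i in range(n)}

def connected_components(graph):
    """Group vertices that are reachable from each other (depth-first search)."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        stack = [start]
        seen.add(start)
        component = []
        while stack:
            v = stack.pop()
            component.append(v)
            for w in graph[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(sorted(component))
    return components
```

Methods based on cuts or cliques refine this idea by also requiring that each group be densely connected internally, so that a single noisy edge cannot join two otherwise separate clusters.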
Model-based clustering (Yeung et al., 2001)
• The data set is assumed to come from a finite mixture of underlying probability distributions, with each component corresponding to a different cluster.
• The probabilistic feature of model-based clustering is particularly suitable for gene expression data.
• The assumption that the data set fits a specific distribution may not be true in many cases.
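For reference, a finite mixture is commonly written as below. The symbols are not taken from the slides: $G$ denotes the number of components, $\pi_k$ the mixing proportions and $\theta_k$ the parameters of component $k$. The Gaussian form of the components is shown as one common choice, and it is exactly the kind of distributional assumption that may not hold for a given data set.

```latex
f(x \mid \Theta) = \sum_{k=1}^{G} \pi_k \, f_k(x \mid \theta_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{G} \pi_k = 1,
\qquad f_k(x \mid \theta_k) = \mathcal{N}(x \mid \mu_k, \Sigma_k)
```

Each object is then assigned to the component with the highest posterior probability, and that probability gives a measure of how confidently a gene belongs to its cluster.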
A density-based hierarchical approach: DHC (Jiang et al., 2003)
• The basic idea is to consider a cluster as a high-dimensional dense area, where data objects are “attracted” to each other.
• DHC effectively separates the co-expressed genes from noise, and is thus robust in noisy environments.
Summary of gene-based clustering
• Some conventional clustering algorithms, such as K-means, SOM and hierarchical approaches (UPGMA), were applied in the early stage and proved to be useful.
• Several new clustering algorithms, such as CLICK, CAST and DHC, have been proposed specifically for gene expression data.
• The performance of each clustering algorithm may vary greatly with different data sets, and there is no absolute “winner” among the clustering algorithms.
Sample-based clustering
• The goal of sample-based clustering is to find the phenotype structures or substructures of the samples.
• Applying the conventional clustering methods to cluster samples using all the genes as features may seriously degrade the quality and reliability of clustering results.
Clustering based on supervised informative gene selection
• Training sample selection: a subset of samples (fewer than 100) is selected to form the training set.
• Informative gene selection: pick out those genes whose expression patterns can distinguish different phenotypes of samples.
• Sample clustering and classification: the whole set of samples is clustered using only the informative genes as features. Conventional clustering algorithms are usually applied to cluster the samples.
Unsupervised clustering and informative gene selection
• Unsupervised sample-based clustering assumes that no phenotype information is assigned to any sample.
• Unsupervised gene selection: first the gene (feature) dimension is reduced, then the conventional clustering algorithms are applied. Examples: PCA (Alter et al., 2000) and F-statistic (Ding et al., 2002)
• Interrelated clustering: the relationship between the genes and samples is dynamically maintained, and a clustering process and a gene selection process are iteratively combined. Example: CLIFF (Xing et al., 2001)
Summary of sample-based clustering
• Supervised informative gene selection techniques are widely applied, and with them it is relatively easy to obtain a high clustering accuracy.
• Unsupervised sample-based clustering converges to an accurate partition of the samples and a set of informative genes as well. One drawback of these approaches is that the gene filtering process is non-invertible.
• Two more issues regarding the quality of sample-based clustering techniques: the number of clusters K and the time complexity of the techniques.
Subspace clustering
• Only a small subset of the genes participates in any cellular process of interest, and any cellular process takes place only in a subset of the samples.
• A single gene may participate in multiple pathways that may or may not be coactive under all conditions.
• Subspace clustering methods capture coherence exhibited by the “blocks” within gene expression matrices. A “block” is a sub-matrix defined by a subset of genes on a subset of samples.
Coupled two-way clustering (Getz et al., 2000)
• CTWC provides a heuristic to avoid brute-force enumeration of all possible combinations. Only subsets of genes or samples that are identified as stable clusters in previous iterations are candidates for the next iteration. The iteration continues until no new clusters are found which satisfy some criterion, such as stability or critical size.
• CTWC searches for blocks in a deterministic manner and the clustering results are therefore sensitive to initial clustering settings.
Plaid model (Lazzeroni et al., 2002)
• The plaid model regards gene expression data as a sum of multiple “layers”, where each layer may represent the presence of a particular biological process with only a subset of genes and a subset of samples involved.
• The plaid model is based on the questionable assumption that, if a gene participates in several cellular processes, then its expression level is the aggregation (sum) of the terms involved in the individual processes.
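The “sum of layers” idea is usually written in the following form. The notation is not given in the slides and is introduced here for illustration: $a_{ij}$ is the expression level of gene $i$ in sample $j$, $\mu_0$ is a background level, and $\rho_{ik}$ and $\kappa_{jk}$ are 0/1 indicators of whether gene $i$ and sample $j$ belong to layer $k$.

```latex
a_{ij} = \mu_0 + \sum_{k=1}^{K} \theta_{ijk}\, \rho_{ik}\, \kappa_{jk},
\qquad \theta_{ijk} = \mu_k + \alpha_{ik} + \beta_{jk}
```

A gene that belongs to two layers therefore receives the sum of both layer terms, which is the additivity assumption questioned above.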
Biclustering (Cheng et al., 2000) and δ-Clusters (Yang et al., 2002)
• Biclustering finds a block together with a score, called the mean squared residue, that measures the coherence of genes and conditions in the block. A low mean squared residue together with a large variation from the constant suggests a good criterion for identifying a block.
• δ-Clusters use the average residue across every entry in the sub-matrix to measure the coherence within a sub-matrix. A heuristic move-based method called FLOC (Flexible Overlapped Clustering) is applied to search for K embedded subspace clusters.
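The mean squared residue can be sketched as follows, assuming the usual definition in which the residue of an entry is the entry minus its row mean, minus its column mean, plus the overall mean of the block. The function name and the index-list interface are hypothetical, and the δ-cluster variant (average residue, with support for missing values) is not covered.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the block defined by row and column index lists."""
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)
    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = sub[i][j] - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)
```

A block whose rows differ only by a constant shift, such as `[[1, 2, 3], [2, 3, 4], [5, 6, 7]]`, scores zero, which is why a low score indicates coherent expression. A constant block also scores zero, hence the extra requirement of a large variation from the constant.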
Summary of subspace clustering
• The genes in a “block” exhibit coherent expression patterns under the conditions within the same “block”.
• Different approaches adopt different greedy heuristics to approximate the optimal solution and make the problem tractable.
Cluster validation
• Different clustering algorithms, or even a single clustering algorithm using different parameters, generally result in different sets of clusters. Therefore, it is important to compare various clustering results and select the one that best fits the “true” data distribution.
• Three aspects: the quality of the clusters, the agreement with a given “ground truth” partition, and the reliability of the clusters.
Homogeneity and separation
• The homogeneity of cluster C can be measured by the average pairwise object similarity within C:
$$H_1(C) = \frac{\sum_{O_i, O_j \in C,\; O_i \ne O_j} \mathrm{Similarity}(O_i, O_j)}{|C| \cdot (|C| - 1)}$$
• The homogeneity with respect to the “centroid” of the cluster C:
$$H_2(C) = \frac{1}{|C|} \sum_{O_i \in C} \mathrm{Similarity}(O_i, \bar{O})$$
where $\bar{O}$ is the “centroid” of C.
• Cluster separation is analogously defined from various perspectives to measure the dissimilarity between two clusters $C_1$ and $C_2$:
$$S_1(C_1, C_2) = \frac{\sum_{O_i \in C_1,\; O_j \in C_2} \mathrm{Similarity}(O_i, O_j)}{|C_1| \cdot |C_2|} \quad \text{and} \quad S_2(C_1, C_2) = \mathrm{Similarity}(\bar{O}_1, \bar{O}_2)$$
where $\bar{O}_1$ and $\bar{O}_2$ are the centroids of $C_1$ and $C_2$.
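These four measures can be sketched in a few lines. The similarity function is left open in the slides, so the one used here, 1 / (1 + Euclidean distance), is only a hypothetical placeholder; any similarity measure, such as Pearson's correlation, could be passed instead.

```python
import math
from itertools import permutations

def centroid(cluster):
    """Component-wise mean of the objects in a cluster."""
    n = len(cluster)
    return [sum(col) / n for col in zip(*cluster)]

def similarity(a, b):
    # Hypothetical placeholder similarity: 1 / (1 + Euclidean distance).
    return 1.0 / (1.0 + math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))))

def h1(cluster, sim=similarity):
    """Average pairwise similarity over all ordered pairs of distinct objects."""
    n = len(cluster)
    if n < 2:
        raise ValueError("H1 needs at least two objects")
    return sum(sim(a, b) for a, b in permutations(cluster, 2)) / (n * (n - 1))

def h2(cluster, sim=similarity):
    """Average similarity between each object and the cluster centroid."""
    c = centroid(cluster)
    return sum(sim(o, c) for o in cluster) / len(cluster)

def s1(c1, c2, sim=similarity):
    """Average similarity between objects of two different clusters."""
    return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def s2(c1, c2, sim=similarity):
    """Similarity between the centroids of two clusters."""
    return sim(centroid(c1), centroid(c2))
```

With a similarity measure, a good partition shows high values of H1 and H2 within clusters and low values of S1 and S2 between clusters.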
Agreement with reference partition (Halkidi et al., 2001)
• For clustering results, a matrix C can be constructed, where $C_{ij} = 1$ if $O_i$ and $O_j$ belong to the same cluster and $C_{ij} = 0$ otherwise. Given a “ground truth” matrix P defined in the same way:
  n11 is the number of object pairs ($O_i$, $O_j$) where $C_{ij} = 1$ and $P_{ij} = 1$
  n10 is the number of object pairs ($O_i$, $O_j$) where $C_{ij} = 1$ and $P_{ij} = 0$
  n01 is the number of object pairs ($O_i$, $O_j$) where $C_{ij} = 0$ and $P_{ij} = 1$
  n00 is the number of object pairs ($O_i$, $O_j$) where $C_{ij} = 0$ and $P_{ij} = 0$
• Rand index:
$$\mathrm{Rand} = \frac{n_{11} + n_{00}}{n_{11} + n_{10} + n_{01} + n_{00}}$$
• Jaccard coefficient:
$$JC = \frac{n_{11}}{n_{11} + n_{10} + n_{01}}$$
• Minkowski measure:
$$\mathrm{Minkowski} = \sqrt{\frac{n_{10} + n_{01}}{n_{11} + n_{01}}}$$
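The pair counts and the three indices can be sketched as below, assuming both the clustering result and the ground truth are given as one label per object and that each unordered pair of distinct objects is counted once. The function names are chosen for illustration.

```python
import math
from itertools import combinations

def pair_counts(labels, truth):
    """Return (n11, n10, n01, n00) over all unordered pairs of distinct objects."""
    n11 = n10 = n01 = n00 = 0
    for i, j in combinations(range(len(labels)), 2):
        same_c = labels[i] == labels[j]  # C_ij = 1
        same_p = truth[i] == truth[j]    # P_ij = 1
        if same_c and same_p:
            n11 += 1
        elif same_c:
            n10 += 1
        elif same_p:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00

def rand_index(labels, truth):
    n11, n10, n01, n00 = pair_counts(labels, truth)
    return (n11 + n00) / (n11 + n10 + n01 + n00)

def jaccard(labels, truth):
    n11, n10, n01, _ = pair_counts(labels, truth)
    return n11 / (n11 + n10 + n01)

def minkowski(labels, truth):
    n11, n10, n01, _ = pair_counts(labels, truth)
    return math.sqrt((n10 + n01) / (n11 + n01))
```

Higher Rand and Jaccard values mean better agreement with the reference partition, while a lower Minkowski measure is better; a perfect match gives 1, 1 and 0 respectively.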
Reliability of clusters
• P-value:
$$P = 1 - \sum_{i=0}^{k-1} \frac{\binom{f}{i} \binom{g - f}{n - i}}{\binom{g}{n}}$$
where $f$ is the total number of genes within a functional category and $g$ is the total number of genes; $n$ is the size of the cluster and $k$ is the number of genes in the cluster that belong to the category.
• Prediction strength (Yeung et al., 2001): the generated clusters are assessed by repeatedly measuring the prediction strength with one or a few of the data objects left out in turn as “test samples” while the remaining data objects are used for clustering.
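The P-value above is the upper tail of a hypergeometric distribution and can be computed with exact integer binomials. A minimal sketch, assuming Python 3.8 or later for `math.comb`; the function name and argument order are hypothetical, with `n` read as the cluster size and `k` as the number of cluster genes in the category.

```python
from math import comb

def enrichment_p_value(k, n, f, g):
    """Probability of seeing at least k category genes in a cluster of size n.

    f: number of genes in the functional category
    g: total number of genes
    """
    below_k = sum(comb(f, i) * comb(g - f, n - i) for i in range(k))
    return 1.0 - below_k / comb(g, n)
```

For instance, with 10 genes in total, 3 of them in the category, and a cluster of 2 genes that are both in the category, the P-value is 3/45, or about 0.067. A small P-value suggests that the cluster is enriched for the category beyond what chance would explain.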
Current and future research directions
• The performance of different clustering algorithms and different validation approaches is strongly dependent on both data distribution and application requirements.
• The gene expression profile is very noisy. How to deal with or remove this noise is an important challenge.
• Integrating different kinds of biological knowledge (e.g., pathways, GO) into the analysis process.
References
• Alter O., Brown P.O. and Botstein D. Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl. Acad. Sci. USA, Vol. 97(18):10101–10106, August 2000.
• Ben-Dor A., Shamir R. and Yakhini Z. Clustering gene expression patterns. Journal of Computational Biology, 6(3/4):281–297, 1999.
• Cheng Y. and Church G.M. Biclustering of expression data. Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB), 8:93–103, 2000.
• Ding, Chris. Analysis of gene expression profiles: class discovery and leaf ordering. In Proc. of International Conference on Computational Molecular Biology (RECOMB), pages 127–136, Washington, DC, April 2002.
• Eisen, Michael B., Spellman, Paul T., Brown, Patrick O. and Botstein, David. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 95(25):14863–14868, December 1998.
• Getz G., Levine E. and Domany E. Coupled two-way clustering analysis of gene microarray data. Proc. Natl. Acad. Sci. USA, Vol. 97(22):12079–12084, October 2000.
• Halkidi, M., Batistakis, Y. and Vazirgiannis, M. On Clustering Validation Techniques. Intelligent Information Systems Journal, 2001.
• Jiang, D., Pei, J. and Zhang, A. DHC: A Density-based Hierarchical Clustering Method for Time Series Gene Expression Data. In Proceedings of BIBE2003: 3rd IEEE International Symposium on Bioinformatics and Bioengineering, Bethesda, Maryland, March 10-12 2003.
• Lazzeroni, L. and Owen A. Plaid models for gene expression data. Statistica Sinica, 12(1):61–86, 2002.
• Shamir R. and Sharan R. Click: A clustering algorithm for gene expression analysis. In Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB ’00). AAAI Press, 2000.
• Tamayo P., Slonim D., Mesirov J., Zhu Q., Kitareewan S., Dmitrovsky E., Lander E.S. and Golub T.R. Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA, Vol. 96(6):2907–2912, March 1999.
• Tang C., Zhang L., Zhang A. and Ramanathan M. Interrelated two-way clustering: An unsupervised approach for gene expression data analysis. In Proceedings of BIBE2001: 2nd IEEE International Symposium on Bioinformatics and Bioengineering, pages 41–48, Bethesda, Maryland, November 4-5 2001.
• Tang, Chun and Zhang, Aidong. An iterative strategy for pattern discovery in high-dimensional data sets. In Proceedings of 11th International Conference on Information and Knowledge Management (CIKM 02), McLean, VA, November 4-9 2002.
• Tavazoie, S., Hughes, D., Campbell, M.J., Cho, R.J. and Church, G.M. Systematic determination of genetic network architecture. Nature Genet, pages 281–285, 1999.
• Wang, Haixun, Wang, Wei, Yang, Jiong and Yu, Philip S. Clustering by Pattern Similarity in Large Data Sets. In SIGMOD 2002, Proceedings ACM SIGMOD International Conference on Management of Data, pages 394–405, 2002.
• Xing, E.P. and Karp, R.M. Cliff: Clustering of high-dimensional microarray data via iterative feature filtering using normalized cuts. Bioinformatics, Vol. 17(1):306–315, 2001.
• Yang, Jiong, Wang, Wei, Wang, Haixun and Yu, Philip S. δ-cluster: Capturing Subspace Correlation in a Large Data Set. In Proceedings of 18th International Conference on Data Engineering (ICDE 2002), pages 517–528, 2002.
• Yeung, K.Y., Fraley, C., Murua, A., Raftery, A.E. and Ruzzo, W.L. Model-based clustering and data transformations for gene expression data. Bioinformatics, 17:977–987, 2001.
• Yeung, K.Y., Haynor, D.R. and Ruzzo, W.L. Validating Clustering for Gene Expression Data. Bioinformatics, Vol. 17(4):309–318, 2001.
Thanks!
