Clustering 2
Clustering 2
This exercise introduces some clustering methods available in R and Bioconductor. For this
exercise, you’ll need the kidney dataset: Go to menu File, and select Change Dir. The kidney
dataset is under data-folder on your desktop.
We want only the cdc15 data, so take only those columns from the data:
> names(d)
> da<-data.frame(d[26:49])
> dat<-na.omit(da)
Select only the genes that are among the 0.3% of the highest standard deviations.
> library(genefilter)
> percentage<-c(0.997)
> # This calculates a genewise standard deviation
> sds<-rowSds(dat)
> # Select 0.3% of the genes having the highest SD
> sel<-(sds>quantile(sds,percentage))
> set<-dat[sel, ]
Visualize the dependencies between samples using a hierarchical clustering method (euclidian
distances and average linkage). The first step is to calculate the distances between samples using a
selected distance method. A tree is drawn from these distances using a selected drawing method.
Draw a tree using single and complete linkage. Do the trees differ from each other?
5. Visualizing a hierarchical gene tree using euclidian distance
Visualize the dependencies between samples using a hierarchical clustering method (euclidian
distances and average linkage). Here gene names are used as labels on the leaves.
> distmeth<-c("euclidian")
> D<-dist(set, method=distmeth)
> treemeth<-c("average")
> hc<-hclust(D, method=treemeth)
> plot(hc, labels=row.names(set))
We used euclidian distance in the examples above. More often gene expression profiles are
clustered using correlation coefficients. To do this you need to calculate correlation between the
genes, and reformat the corralation matrix into an object containing distances. After that the tree
drawing proceeds normally.
These images give you a view to the distance matrix even without the dendrogram. If you look at
the image generated from samples, you’ll notice that there are some clusters of highly correlated
samples, mostly near the diagonal line running from lower left-hand corner to the upper right-hand
corner. In other words, the time points following each other are closer to each other than to other
time points (what a surprise!).
8. Visualizing a heatmap
> library(RColorBrewer)
> Uses 256 shades of Red and Blue
> heatcol<-colorRampPalette(brewer.pal(10, "RdBu"))(256)
> heatmap(as.matrix(set), col=heatcol)
Often the heatmaps are represented using red and green colors. To get this kind of an image, use:
For further modifications, the heatmap might need to be saved in a file. This is accomplished with:
> cwd=getwd()
> bmp(file.path(cwd, "heatmap.bmp"), width=1800, height=1800)
> heatmap(as.matrix(set), col=heatcol)
> dev.off()
This results into about 6*6 inch print quality bitmap image in your data folder. Some papers might
want to get a postscript image, and this is accomplished as:
> cwd=getwd()
> postscript(file.path(cwd, "heatmap.ps"), width=1800,
height=1800)
> heatmap(as.matrix(set), col=heatcol)
> dev.off()
10. K-means clustering of genes
In K-means clustering you need to pick an artificial number, the number of clusters (K).
> k<-c(5)
> km<-kmeans(set, k, iter.max=1000)
> km
Cluster means:
cdc15_10 cdc15_30 cdc15_50 cdc15_70 cdc15_80 cdc15_90 cdc15_100
1 -2.320 -2.813333 -3.423333 -0.9033333 -0.420 0.6666667 0.8366667
2 2.350 2.395000 -0.145000 -1.8700000 -2.230 -2.2400000 -1.9650000 -
3 -1.984 0.254000 -0.290000 -1.6460000 -1.318 -1.5540000 -1.5520000 -
4 -1.170 -2.620000 -2.870000 -1.0550000 -0.235 0.6000000 1.9600000
5 0.520 0.800000 0.975000 -1.8450000 0.700 -3.0300000 1.1600000 -
cdc15_150 cdc15_160 cdc15_170 cdc15_180 cdc15_190 cdc15_200 cdc15_210
1 -0.09666667 -0.300 -0.1933333 0.150 0.006666667 0.960 1.106667
2 0.42000000 -0.015 -0.7800000 -1.185 -0.655000000 -0.620 -0.030000
3 1.48800000 1.376 0.5340000 0.118 -0.806000000 -0.428 -0.754000 -
4 -1.20000000 -0.855 -1.2400000 0.550 0.315000000 1.280 1.150000
5 0.79500000 0.830 0.8400000 0.825 -1.615000000 0.385 -2.365000
cdc15_270 cdc15_290
1 0.2066667 0.640
2 0.7300000 0.175
3 1.8680000 1.976
4 -0.7250000 -0.610
5 0.6850000 0.780
Clustering vector:
YBL051C YBR092C YBR110W YER124C YGL055W YGL089C YHR143W YJR004C YKL164C YLR286C
5 4 5 3 2 3 3 3 2 3
Available components:
[1] "cluster" "centers" "withinss" "size"
Calculate an average withinness of the results. This is a measure of how close together genes lie
inside the clusters.
> mean(km$withinss)
[1] 21.1838
Run the same K-means analysis several times (save the result into a new object every time). Select
the K-means clustering giving the smallest withinness score as the best result.
Visualize the K-means clustering as follows. Save the cluster membership as a new variable, and
use it for coloring the data points. Last, add the cluster centers to the image.
> cl<-km$cluster
> plot(set[,1], set[,2], col=cl)
> points(km$centers, col = 1:5, pch = 8)
To visualize one cluster, and its expression, select the genes that belong to the same cluster. Draw
the expression profile of these genes into the same image using different colors.
How could you plot the genes in the same image with different line types (hint: ?par)?
Next, initiate a 2*2 image area, and draw the expression profiles. We need to apply a for-loop here:
> par(mfrow=c(2,2))
> for(i in 1:4) {
> matplot(t(set[km$cluster==i,]), type="l",
main=paste(“cluster:”, i), ylab=”log expression”, xlab=”time”)
> }
11. Annotata the results
Now that you have successfully found some interesting clusters, you should check what kind of
genes are in the clusters. Load the yeast annotation package YEAST, and extract the gene names
and descriptions. This annotates all the genes retained after filtering:
> library(YEAST)
> ls(package: YEAST)
> genes<-as.vector(row.names(set))
> annot1<-unlist(mget(genes, YEASTGENENAME))
> annot2<-unlist(mget(genes, YEASTDESCRIPTION))
> annot<-data.frame(rbind(annot1,annot2))
> write.table(annot, “annot.txt”)
> annot[,1] # prints gene names
> annot[,2] # prints gene descriptions
> cl1.annot<-data.frame(rbind(unlist(mget(row.names(cl1),
YEASTGENENAME)), unlist(mget(row.names(cl1),
YEASTDESCRIPTION))))
> cl1.annot