Clustering 2

Uploaded by

rodney ortiz

Clustering – Exercises

This exercise introduces some clustering methods available in R and Bioconductor. For this
exercise, you’ll need the kidney dataset: go to the File menu and select Change dir. The kidney
dataset is in the data folder on your desktop.

1. Reading the prenormalized data

Read in the prenormalized Spellman’s yeast dataset:

> d<-read.table("combined.txt", sep="\t", header=T, row.names=1)

We want only the cdc15 data, so take only those columns from the data:

> names(d)
> da<-data.frame(d[26:49])

Remove missing values from the data:

> dat<-na.omit(da)
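
As a minimal sketch of what na.omit() does to a table of expression values, the toy data frame below (made-up values and gene names, not the yeast data) has two genes with one missing time point each. It illustrates that a single NA is enough for the whole row to be dropped.

```r
# Toy data frame standing in for the expression table (values are made up)
toy <- data.frame(t1 = c(0.5, NA, -1.2, 0.3),
                  t2 = c(1.1, 0.4, NA, -0.7),
                  row.names = c("geneA", "geneB", "geneC", "geneD"))

# na.omit() drops every row that contains at least one NA
toy.complete <- na.omit(toy)

nrow(toy)                # number of genes before
nrow(toy.complete)       # number of genes after
rownames(toy.complete)   # the genes that survived
```

Comparing nrow(da) with nrow(dat) in the same way shows how many genes the real dataset loses at this step.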

2. Make some normalization checks:

Make a boxplot of normalized chips. Note the effect of normalization:

> boxplot(dat, outline=F, las=2, cex.axis=0.7)
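
To see what a normalization problem would look like in such a boxplot, here is a small sketch on simulated data. The chip names and the shift of 1.5 are assumptions made for illustration; they are not taken from the yeast dataset.

```r
set.seed(1)
# Simulated log ratios: 200 genes x 4 chips, all centered on zero
chips <- as.data.frame(matrix(rnorm(800), nrow = 200,
                              dimnames = list(NULL, paste0("chip", 1:4))))
# Shift one chip to mimic a chip that was not normalized
chips$chip3 <- chips$chip3 + 1.5

# The per-chip median is the thick line drawn inside each box
chip.medians <- sapply(chips, median)
round(chip.medians, 2)

pdf(NULL)   # null device, so that no file is written in this sketch
boxplot(chips, outline = FALSE, las = 2, cex.axis = 0.7)
invisible(dev.off())
```

In well-normalized data the boxes line up at roughly the same level; a chip like chip3 here would stand out clearly above the others.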

3. Filter the genelist by standard deviation

Select only the genes whose standard deviation is among the highest 0.3%.

> library(genefilter)
> percentage<-c(0.997)
> # This calculates a genewise standard deviation
> sds<-rowSds(dat)
> # Select 0.3% of the genes having the highest SD
> sel<-(sds>quantile(sds,percentage))
> set<-dat[sel, ]

How many genes are left after filtering?
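
The same filtering logic can be tried without the genefilter package. The sketch below uses simulated data and base R's apply() with sd() as a stand-in for rowSds(); the matrix size and the object names expr and filtered are assumptions for illustration only.

```r
set.seed(42)
# Simulated expression matrix: 1000 genes x 10 samples (made-up data)
expr <- matrix(rnorm(1000 * 10), nrow = 1000,
               dimnames = list(paste0("gene", 1:1000), paste0("s", 1:10)))

# Genewise standard deviation (base R stand-in for genefilter's rowSds)
sds <- apply(expr, 1, sd)

# Keep the genes whose SD lies above the 99.7th percentile
percentage <- 0.997
sel <- sds > quantile(sds, percentage)
filtered <- expr[sel, , drop = FALSE]

nrow(filtered)   # number of genes left after filtering
```

With 1000 simulated genes, the top 0.3% corresponds to 3 genes. On the real data, nrow(set) or sum(sel) answers the question in the same way.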

4. Visualizing a hierarchical sample tree using Euclidean distance

Visualize the dependencies between samples using a hierarchical clustering method (Euclidean
distances and average linkage). The first step is to calculate the distances between samples using a
selected distance method. A tree is then built from these distances using a selected linkage method.

> # Distances between observations are calculated using Euclidean distances
> # (the method name must be spelled "euclidean" for dist() to accept it)
> distmeth<-c("euclidean")
> # To calculate a tree for the samples, the data matrix needs to be transposed
> D<-dist(t(set), method=distmeth)
> # The tree is formed using the average distance between clusters
> treemeth<-c("average")
> hc<-hclust(D, method=treemeth)
> plot(hc)

Draw a tree using single and complete linkage. Do the trees differ from each other?
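
One way to compare linkage methods side by side is sketched here on simulated data with two well-separated sample groups. The data, the group shift of 3 and the names expr, trees and memberships are assumptions for illustration. The trees are compared by cutting each of them into two clusters with cutree().

```r
set.seed(7)
# Simulated data: 50 genes x 12 samples, samples 7-12 shifted upwards by 3
grp <- rep(c(0, 3), each = 6)
expr <- matrix(rnorm(50 * 12, mean = rep(grp, each = 50)), nrow = 50,
               dimnames = list(NULL, paste0("sample", 1:12)))

# Distances between samples: transpose so that samples are in rows
D <- dist(t(expr), method = "euclidean")

# Build one tree per linkage method
methods <- c("single", "complete", "average")
trees <- lapply(methods, function(m) hclust(D, method = m))
names(trees) <- methods

# Cut every tree into two clusters and compare the memberships
memberships <- sapply(trees, cutree, k = 2)
memberships

pdf(NULL)   # null device, so that no file is written in this sketch
par(mfrow = c(1, 3))
for (m in methods) plot(trees[[m]], main = m)
invisible(dev.off())
```

With groups this clearly separated, all three methods should recover the same two clusters, and only the branch heights and the shape of the trees differ. On real data the memberships themselves may disagree.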

5. Visualizing a hierarchical gene tree using Euclidean distance

Visualize the dependencies between genes using a hierarchical clustering method (Euclidean
distances and average linkage). Here gene names are used as labels on the leaves.

> distmeth<-c("euclidean")
> D<-dist(set, method=distmeth)
> treemeth<-c("average")
> hc<-hclust(D, method=treemeth)
> plot(hc, labels=row.names(set))

6. Visualizing a hierarchical gene tree using correlation

We used Euclidean distance in the examples above. More often, gene expression profiles are
clustered using correlation coefficients. To do this, you need to calculate the correlations between the
genes and convert the correlation matrix into an object containing distances (here, 1 - correlation).
After that, the tree drawing proceeds normally.

> cor.pe<-cor(t(set), method=c("pearson"))
> cor.sp<-cor(t(set), method=c("spearman"))
> dist.pe<- as.dist(1-cor.pe)
> dist.sp<- as.dist(1-cor.sp)
> hc<-hclust(dist.pe, method=treemeth)
> plot(hc, labels=row.names(set), main="Cluster Dendrogram, Pearson correlation")

Draw the same tree using Spearman correlation. Do the results differ?
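
To make the difference between correlation-based and Euclidean distance concrete, the sketch below uses three made-up gene profiles (the values and gene names are illustrative assumptions). Two profiles share the same shape at different magnitudes, and the third is a mirror image of the first.

```r
# Three made-up gene profiles over six time points
profiles <- rbind(
  geneA = c(1, 2, 3, 4, 5, 6),
  geneB = c(10, 20, 30, 40, 50, 60),   # same shape as geneA, 10 times larger
  geneC = c(6, 5, 4, 3, 2, 1)          # mirror image of geneA
)

# cor() correlates columns, so the genes must be in the columns: transpose
cor.demo <- cor(t(profiles), method = "pearson")
dist.cor <- as.matrix(as.dist(1 - cor.demo))
round(dist.cor, 3)

# Euclidean distance between the same profiles, for comparison
dist.euc <- as.matrix(dist(profiles, method = "euclidean"))
round(dist.euc, 1)
```

The distance 1 - correlation ranges from 0 (perfectly correlated) to 2 (perfectly anti-correlated). Here geneA and geneB are at distance 0 although their magnitudes differ, whereas Euclidean distance places geneA closer to geneC than to geneB.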

7. Visualizing the correlation matrix

Visualize the correlation-based distance matrix between the genes.

> # Dists are converted to a matrix, otherwise this doesn’t work
> image(as.matrix(dist.pe))

Similarly, visualize the distance matrix between samples:

> D<-dist(t(set), method=distmeth)

> image(as.matrix(D))

These images give you a view of the distance matrix even without the dendrogram. If you look at
the image generated from samples, you’ll notice that there are some clusters of mutually similar
samples, mostly near the diagonal line running from the lower left-hand corner to the upper right-hand
corner. In other words, the time points following each other are closer to each other than to other
time points (what a surprise!).

8. Visualizing a heatmap

By heatmap we mean a colored figure that often appears in articles. It is basically a hierarchical
clustering result in which every gene is represented by a colored bar.

A heatmap can be produced simply by:

> # heatmap() needs to get the data as a matrix. set is a data
> # frame, and is first converted into a matrix.
> heatmap(as.matrix(set))

To get other colors in the heatmap, you first need to generate a sequence of colors, and then plot the
heatmap using these colors:

> library(RColorBrewer)
> # Uses 256 shades of Red and Blue
> heatcol<-colorRampPalette(brewer.pal(10, "RdBu"))(256)
> heatmap(as.matrix(set), col=heatcol)

Often the heatmaps are represented using red and green colors. To get this kind of an image, use:

> heatcol<-colorRampPalette(c("Red", "Green"))(32)
> heatmap(as.matrix(set), col=heatcol)
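
The following sketch shows how colorRampPalette() and heatmap() fit together, using a small simulated matrix. The data are made up, and the black midpoint in the palette is an assumption added for illustration; the commands above go directly from red to green.

```r
set.seed(3)
# Made-up matrix: 10 genes x 6 samples
m <- matrix(rnorm(60), nrow = 10,
            dimnames = list(paste0("gene", 1:10), paste0("s", 1:6)))

# colorRampPalette() returns a function; calling it with n gives n colors
palette.fun <- colorRampPalette(c("red", "black", "green"))
heatcol.demo <- palette.fun(33)
head(heatcol.demo, 3)

pdf(NULL)   # null device, so that no file is written in this sketch
hm <- heatmap(m, col = heatcol.demo)
invisible(dev.off())

# heatmap() invisibly returns the row and column order used in the plot
rownames(m)[hm$rowInd]
```

The returned rowInd and colInd vectors are handy when you want to list the genes in the same order as they appear in the figure.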

9. Saving the heatmap into a file

For further modifications, the heatmap might need to be saved in a file. This is accomplished with:

> cwd=getwd()
> bmp(file.path(cwd, "heatmap.bmp"), width=1800, height=1800)
> heatmap(as.matrix(set), col=heatcol)
> dev.off()

This results in an approximately 6*6 inch print-quality bitmap image (1800 pixels at 300 dpi) in your
data folder. Some journals might want a PostScript image instead, which is produced as follows:

> cwd=getwd()
> # For postscript(), width and height are given in inches, not pixels
> postscript(file.path(cwd, "heatmap.ps"), width=6, height=6)
> heatmap(as.matrix(set), col=heatcol)
> dev.off()
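
The open device, plot, close device pattern is the same for every file format. As a sketch, the example below writes a PDF into R's temporary directory; the file name heatmap_sketch.pdf and the simulated matrix are assumptions for illustration. Like postscript(), pdf() takes its width and height in inches.

```r
set.seed(5)
m <- matrix(rnorm(60), nrow = 10)   # made-up data: 10 genes x 6 samples

# Hypothetical output file in R's temporary directory
outfile <- file.path(tempdir(), "heatmap_sketch.pdf")

pdf(outfile, width = 6, height = 6)   # sizes are in inches
heatmap(m)
invisible(dev.off())                  # the file is complete only after dev.off()

file.exists(outfile)
file.info(outfile)$size
```

Forgetting dev.off() is a common reason for empty or unreadable image files, because the plot is written out fully only when the device is closed.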

10. K-means clustering of genes

In K-means clustering you need to choose the number of clusters (K) in advance; this choice is
somewhat arbitrary.

To produce a K-means clustering with 5 clusters, type:

> k<-c(5)
> km<-kmeans(set, k, iter.max=1000)
> km

K-means clustering with 5 clusters of sizes 3, 2, 5, 2, 2

Cluster means:
cdc15_10 cdc15_30 cdc15_50 cdc15_70 cdc15_80 cdc15_90 cdc15_100
1 -2.320 -2.813333 -3.423333 -0.9033333 -0.420 0.6666667 0.8366667
2 2.350 2.395000 -0.145000 -1.8700000 -2.230 -2.2400000 -1.9650000 -
3 -1.984 0.254000 -0.290000 -1.6460000 -1.318 -1.5540000 -1.5520000 -
4 -1.170 -2.620000 -2.870000 -1.0550000 -0.235 0.6000000 1.9600000
5 0.520 0.800000 0.975000 -1.8450000 0.700 -3.0300000 1.1600000 -
cdc15_150 cdc15_160 cdc15_170 cdc15_180 cdc15_190 cdc15_200 cdc15_210
1 -0.09666667 -0.300 -0.1933333 0.150 0.006666667 0.960 1.106667
2 0.42000000 -0.015 -0.7800000 -1.185 -0.655000000 -0.620 -0.030000
3 1.48800000 1.376 0.5340000 0.118 -0.806000000 -0.428 -0.754000 -
4 -1.20000000 -0.855 -1.2400000 0.550 0.315000000 1.280 1.150000
5 0.79500000 0.830 0.8400000 0.825 -1.615000000 0.385 -2.365000
cdc15_270 cdc15_290
1 0.2066667 0.640
2 0.7300000 0.175
3 1.8680000 1.976
4 -0.7250000 -0.610
5 0.6850000 0.780

Clustering vector:
YBL051C YBR092C YBR110W YER124C YGL055W YGL089C YHR143W YJR004C YKL164C YLR286C
5 4 5 3 2 3 3 3 2 3

Within cluster sum of squares by cluster:

[1] 8.585133 25.912400 58.457560 6.067450 6.896450

Available components:
[1] "cluster" "centers" "withinss" "size"

Calculate the average within-cluster sum of squares (“withinness”) of the result. This is a measure
of how close together the genes lie inside the clusters; smaller values mean tighter clusters.

> mean(km$withinss)
[1] 21.1838

Run the same K-means analysis several times (save the result into a new object every time). Because
the starting centers are chosen at random, the results can differ between runs. Select the K-means
clustering giving the smallest withinness score as the best result.
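
Repeating the analysis by hand can be automated with a short loop. The sketch below runs on simulated data with three groups; the data and the object names genes, runs, scores and best are assumptions for illustration. It also shows the nstart argument of kmeans(), which performs the restarts internally and keeps the run with the smallest total within-cluster sum of squares.

```r
set.seed(11)
# Made-up data: 60 genes x 8 time points, three groups with different means
grp.mean <- rep(c(-2, 0, 2), each = 20)
genes <- matrix(rnorm(60 * 8, mean = grp.mean), nrow = 60)

# Run K-means 10 times from different random starts and keep every result
runs <- lapply(1:10, function(i) kmeans(genes, centers = 3, iter.max = 1000))

# Score each run by its mean within-cluster sum of squares
scores <- sapply(runs, function(r) mean(r$withinss))
round(scores, 2)

# The run with the smallest score is taken as the best result
best <- runs[[which.min(scores)]]
best$size

# kmeans() can do the restarts itself: nstart = 10 keeps the best of 10 starts
km.multi <- kmeans(genes, centers = 3, iter.max = 1000, nstart = 10)
km.multi$size
```

For a fixed K, ranking runs by the mean or by the sum of the withinss values gives the same order, so the manual loop and nstart follow the same criterion.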
Visualize the K-means clustering as follows. Save the cluster membership as a new variable, and
use it for coloring the data points. Last, add the cluster centers to the image.

> cl<-km$cluster
> plot(set[,1], set[,2], col=cl)
> points(km$centers, col = 1:5, pch = 8)

To visualize one cluster and its expression, select the genes that belong to the same cluster. Draw
the expression profiles of these genes into the same image using different colors.

> set3<-data.frame(set, cl)

> cl1<- set[which(cl==1),] # Here we have 3 genes
> plotcol<-colorRampPalette(c("Grey", "Black"))(5)
> # ylim covers all three genes, so that no profile is cut off
> plot(t(cl1[1,]), type="l", col=plotcol[1], ylim=range(cl1))
> lines(t(cl1[2,]), type="l", col=plotcol[3])
> lines(t(cl1[3,]), type="l", col=plotcol[5])

How could you plot the genes in the same image with different line types (hint: ?par)?

Draw a similar image for the cluster 2.
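
One possible approach to the line-type question is the lty graphical parameter described in ?par. The sketch below uses a made-up cluster of three genes; the data and the name cl.demo are assumptions for illustration, and this is not the only way to solve the exercise.

```r
set.seed(2)
# Made-up cluster of three gene profiles over eight time points
cl.demo <- matrix(rnorm(24), nrow = 3,
                  dimnames = list(c("geneA", "geneB", "geneC"), NULL))

pdf(NULL)   # null device, so that no file is written in this sketch
# lty selects the line type: 1 = solid, 2 = dashed, 3 = dotted
plot(cl.demo[1, ], type = "l", lty = 1, ylim = range(cl.demo),
     xlab = "time point", ylab = "log expression")
lines(cl.demo[2, ], lty = 2)
lines(cl.demo[3, ], lty = 3)
legend("topright", legend = rownames(cl.demo), lty = 1:3)
invisible(dev.off())
```

Line types remain distinguishable in black-and-white printouts, which is a practical advantage over colors alone.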

11. Drawing the whole K-means clustering

Let’s produce a new K-means clustering result using four clusters:

> km<-kmeans(set, 4, iter.max=1000)

Next, initiate a 2*2 image area, and draw the expression profiles. We need to apply a for-loop here:

> par(mfrow=c(2,2))
> for(i in 1:4) {
    matplot(t(set[km$cluster==i,]), type="l",
      main=paste("cluster:", i), ylab="log expression", xlab="time")
  }

12. Annotate the results

Now that you have successfully found some interesting clusters, you should check what kind of
genes are in the clusters. Load the yeast annotation package YEAST, and extract the gene names
and descriptions. This annotates all the genes retained after filtering:

> library(YEAST)
> ls("package:YEAST")
> genes<-as.vector(row.names(set))
> annot1<-unlist(mget(genes, YEASTGENENAME))
> annot2<-unlist(mget(genes, YEASTDESCRIPTION))
> # cbind() puts the names and descriptions into columns, one row per gene
> annot<-data.frame(cbind(annot1,annot2))
> write.table(annot, "annot.txt")
> annot[,1] # prints gene names
> annot[,2] # prints gene descriptions

To get the descriptions of the genes in cluster 1:

> cl1.annot<-data.frame(rbind(unlist(mget(row.names(cl1),
YEASTGENENAME)), unlist(mget(row.names(cl1),
YEASTDESCRIPTION))))
> cl1.annot
