Unit 3 - Clustering: K Means Algorithm

The document describes the k-means clustering algorithm and its implementation in R. It covers the steps of k-means: choose the number of clusters k, initialize the centroids, assign points to the closest centroid, recompute the centroids, and repeat until convergence. It also discusses determining the number of clusters using the within sum of squares (WSS) and performs k-means clustering on sample data in R. Finally, it provides an overview of decision trees for classification problems and demonstrates building a decision tree model in R using the rpart package.


Unit 3 - Clustering

K Means Algorithm
Algorithm Steps
1. Choose the value of k and the k initial guesses for the centroids.
2. Assign each point to the closest centroid.
3. Compute the centroid, the center of mass, of each newly defined cluster from Step 2.
4. Repeat Steps 2 and 3 until the algorithm converges to an answer.
(Figures: initial starting points for the centroids; points assigned to the closest centroid; the mean of each cluster recomputed.)
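The steps above can be sketched in code. The following is a minimal, language-neutral Python illustration of one assign-and-recompute loop (the notes themselves use R's built-in kmeans(); the data and starting centroids here are made up for illustration):

```python
def closest_centroid(point, centroids):
    """Step 2: index of the centroid nearest to a point (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in centroids]
    return dists.index(min(dists))

def kmeans_once(points, centroids):
    """One pass of Steps 2-3: assign each point, then recompute each centroid."""
    clusters = [[] for _ in centroids]
    for pt in points:
        clusters[closest_centroid(pt, centroids)].append(pt)
    new_centroids = []
    for i, cluster in enumerate(clusters):
        if cluster:  # new centroid = mean of the points assigned to it
            dim = len(cluster[0])
            new_centroids.append(tuple(sum(p[d] for p in cluster) / len(cluster)
                                       for d in range(dim)))
        else:        # empty cluster: keep the old centroid
            new_centroids.append(centroids[i])
    return new_centroids

# Toy data: two obvious groups, with deliberately poor initial guesses
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
# Step 4: repeat Steps 2-3 until the centroids stop moving
while True:
    updated = kmeans_once(points, centroids)
    if updated == centroids:
        break
    centroids = updated
print(centroids)  # -> [(1.25, 1.5), (8.5, 8.75)]
```

Convergence here is detected when an iteration leaves every centroid unchanged; production implementations typically also cap the number of iterations.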
Determining the Number of Clusters
• WSS is the sum of the squares of the distances between each data
point and the closest centroid:
WSS = Σᵢ ||pᵢ − qᵢ||²
• The term qᵢ denotes the closest centroid associated with
the i-th point pᵢ.
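This definition translates directly into code. A short Python sketch of WSS on hypothetical toy data (the course examples compute the same quantity via the $withinss component of R's kmeans()):

```python
def wss(points, centroids):
    """Within sum of squares: for each point, the squared distance
    to its closest centroid, summed over all points."""
    total = 0.0
    for pt in points:
        # min over centroids = squared distance to the closest centroid q_i
        total += min(sum((p - c) ** 2 for p, c in zip(pt, cen))
                     for cen in centroids)
    return total

# Hypothetical data: two points near the first centroid, one exactly on the second
points = [(1.0, 1.0), (2.0, 2.0), (9.0, 9.0)]
centroids = [(1.5, 1.5), (9.0, 9.0)]
print(wss(points, centroids))  # 0.5 + 0.5 + 0.0 = 1.0
```

Plotting WSS against k and looking for the "elbow" where further increases in k stop reducing WSS substantially is exactly what the R code below does.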
Using R to Perform a K-means Analysis
• Use WSS to determine an appropriate number of clusters, k.
library(plyr)
library(ggplot2)
library(cluster)
library(lattice)
library(graphics)
library(grid)
library(gridExtra)
grade_input <- as.data.frame(read.csv("c:/data/grades_km_input.csv"))
kmdata_orig <- as.matrix(grade_input[, c("Student", "English", "Math", "Science")])
kmdata <- kmdata_orig[, 2:4]   # drop the Student ID column
kmdata[1:10, ]

# Compute WSS for k = 1 to 15 to build an elbow plot
wss <- numeric(15)
for (k in 1:15)
  wss[k] <- sum(kmeans(kmdata, centers = k, nstart = 25)$withinss)

plot(1:15, wss, type = "b",
     xlab = "Number of Clusters", ylab = "Within Sum of Squares")

# Fit the final model with the chosen k = 3
km <- kmeans(kmdata, 3, nstart = 25)
km
Classification
• Classification is used for prediction purposes.
• Given a set of input variables, the goal is to predict a response or
output variable Y. Each member of the set is called an input
variable.
• The input values of a decision tree can be categorical or
continuous.
• A decision tree consists of test points (called nodes) and branches,
which represent the decisions being made.
• A node without further branches is called a leaf node.
Root Node: Represents the entire population or sample; it gets divided further into two or
more homogeneous sets.
Splitting: The process of dividing a node into two or more sub-nodes.
Decision Node: A sub-node that splits into further sub-nodes.
Leaf/Terminal Node: A node that does not split.
Pruning: Removing sub-nodes of a decision node; the opposite of splitting.
Branch/Sub-Tree: A subsection of the entire tree.
Parent and Child Node: A node that is divided into sub-nodes is called the parent node of
those sub-nodes, and the sub-nodes are the children of the parent node.
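How a decision node is chosen matters: the rpart call later in these notes passes split='information', i.e. entropy-based splitting. A minimal Python sketch of entropy and information gain, on made-up labels (a simplification of what rpart computes internally):

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    ent = 0.0
    for cls in set(labels):
        p = labels.count(cls) / total
        ent -= p * math.log2(p)
    return ent

def information_gain(parent, children):
    """Entropy reduction when a parent node's labels are split
    across child nodes (weighted by child size)."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

# Toy example: a perfectly pure split of 2 "yes" / 2 "no"
# removes all uncertainty, giving the maximum gain of 1 bit
parent = ["yes", "yes", "no", "no"]
print(information_gain(parent, [["yes", "yes"], ["no", "no"]]))  # 1.0
```

At each decision node the tree-growing algorithm evaluates candidate splits and picks the one with the highest gain; a split that leaves each child pure (entropy 0) is ideal.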
Car dataset description
Decision Trees in R
• The caret package holds tools for data splitting, pre-processing,
feature selection, and tuning.
• Install the packages:
install.packages("caret")
install.packages("rpart.plot")
library(caret)
library(rpart)
library(rpart.plot)
setwd("C:/Users/HP/Desktop/MKCE/even sem/big data lab")
play_decision<-read.table("DTdata.csv",header=TRUE,sep=',')
summary(play_decision)
fit<-rpart(Play~Outlook+Temperature+Humidity+Wind,
method="class",
data=play_decision,
control=rpart.control(minsplit=1),
parms=list(split='information'))
summary(fit)
rpart.plot(fit,type=4,extra=1)
rpart.plot(fit, type=4, extra=2,clip.right.labs=FALSE,varlen=0,faclen=0)
newdata <- data.frame(Outlook="rainy", Temperature="mild",
                      Humidity="high", Wind=FALSE)

predict(fit,newdata=newdata,type="prob")
predict(fit,newdata=newdata,type="class")
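The two predict() calls differ only in what the leaf reports: type="prob" returns the class probabilities at the leaf the new record lands in, while type="class" returns the majority class. A hypothetical hand-built tree in Python illustrating that routing (this is not rpart's internal representation; the attribute names and probabilities are invented to mirror the play/weather example):

```python
# Hypothetical fitted tree: inner nodes test an attribute, leaves hold
# class probabilities (structure and numbers invented for illustration).
tree = {
    "attr": "Outlook",
    "branches": {
        "sunny":    {"leaf": {"Yes": 0.4, "No": 0.6}},
        "rainy":    {"attr": "Wind",
                     "branches": {False: {"leaf": {"Yes": 1.0, "No": 0.0}},
                                  True:  {"leaf": {"Yes": 0.0, "No": 1.0}}}},
        "overcast": {"leaf": {"Yes": 1.0, "No": 0.0}},
    },
}

def predict(node, record, kind="class"):
    """Route a record through test nodes to a leaf, then report it."""
    while "leaf" not in node:                 # follow the matching branch...
        node = node["branches"][record[node["attr"]]]
    probs = node["leaf"]                      # ...until we reach a leaf
    if kind == "prob":
        return probs                          # analogous to type="prob"
    return max(probs, key=probs.get)          # analogous to type="class"

record = {"Outlook": "rainy", "Wind": False}
print(predict(tree, record, "prob"))   # {'Yes': 1.0, 'No': 0.0}
print(predict(tree, record, "class"))  # Yes
```

Every new record follows exactly one root-to-leaf path, which is why decision-tree predictions are cheap and easy to explain.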
Naive Bayes
