Unit 3 - Clustering
K Means Algorithm
Algorithm Steps
• Step 1: Choose the value of k and the k initial guesses for the centroids.
• Step 2: Assign each point to the closest centroid.
• Step 3: Compute the centroid, the center of mass, of each newly defined cluster from Step 2.
• Step 4: Repeat Steps 2 and 3 until the algorithm converges to an answer.
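The loop above can be sketched in a few lines of base R. This is a toy illustration (the random 2-D data, seed, and iteration cap are assumptions, not part of the course material), showing the assign/recompute cycle explicitly:

```r
# Toy K-means sketch: assign points to the nearest centroid, then recompute means
set.seed(42)
x <- matrix(rnorm(100), ncol = 2)      # 50 points in 2-D
k <- 3
centroids <- x[sample(nrow(x), k), ]   # Step 1: k initial guesses

for (iter in 1:20) {
  # Step 2: distance from every point to every centroid; assign to the closest
  d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
  cluster <- apply(d, 1, which.min)
  # Step 3: recompute each centroid as the mean of its cluster
  new_centroids <- t(sapply(1:k, function(j)
    colMeans(x[cluster == j, , drop = FALSE])))
  # Step 4: stop once the centroids no longer move
  if (all(abs(new_centroids - centroids) < 1e-8)) break
  centroids <- new_centroids
}
```

In practice R's built-in `kmeans()` (used later in this unit) does this with better initialization and convergence handling.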
[Figures: initial starting points for the centroids; points assigned to the closest centroid; the mean of each cluster computed]
Determining the Number of Clusters
• WSS (Within Sum of Squares) is the sum of the squares of the distances between each data
point and the closest centroid:
WSS = Σᵢ₌₁ᴹ ‖pᵢ − q⁽ⁱ⁾‖²
• The term q⁽ⁱ⁾ indicates the closest centroid that is associated with
the ith point pᵢ.
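The WSS definition can be verified directly against `kmeans()`: summing the squared distance from each point to its assigned (closest) centroid reproduces the value the fit reports. The random data below is illustrative only:

```r
# Manual WSS: sum of squared distances from each point to its closest centroid
set.seed(1)
x <- matrix(rnorm(60), ncol = 2)          # 30 illustrative points
km <- kmeans(x, centers = 3, nstart = 25)

# km$centers[km$cluster, ] gives q^(i), the centroid assigned to each point p_i
wss_manual <- sum(rowSums((x - km$centers[km$cluster, ])^2))

# Matches the total within-cluster sum of squares reported by kmeans
all.equal(wss_manual, km$tot.withinss)
```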
Using R to Perform a K-means Analysis
• The following R code uses WSS to determine an appropriate number, k, of clusters.
library(plyr)
library(ggplot2)
library(cluster)
library(lattice)
library(graphics)
library(grid)
library(gridExtra)

grade_input <- as.data.frame(read.csv("c:/data/grades_km_input.csv"))
kmdata_orig <- as.matrix(grade_input[, c("Student", "English", "Math", "Science")])
kmdata <- kmdata_orig[, 2:4]   # drop the Student ID column
kmdata[1:10, ]

# Compute WSS for k = 1..15 and look for the "elbow" in the plot
wss <- numeric(15)
for (k in 1:15)
  wss[k] <- sum(kmeans(kmdata, centers = k, nstart = 25)$withinss)
plot(1:15, wss, type = "b",
     xlab = "Number of Clusters", ylab = "Within Sum of Squares")

# Fit the final model at the chosen k
km <- kmeans(kmdata, 3, nstart = 25)
km
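The object returned by `kmeans()` exposes the fitted centroids and assignments directly. Since `grades_km_input.csv` is not reproduced here, the sketch below substitutes random score data with the same three column names:

```r
# Illustrative stand-in for the grades data (random scores in 0..100)
set.seed(2)
kmdata <- matrix(runif(90, 0, 100), ncol = 3,
                 dimnames = list(NULL, c("English", "Math", "Science")))
km <- kmeans(kmdata, centers = 3, nstart = 25)

km$centers        # k x 3 matrix of cluster centroids
head(km$cluster)  # cluster assignment (1..k) for each student
km$size           # number of points in each cluster
km$tot.withinss   # total WSS at the chosen k
```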
Classification
• Classification is used for prediction purposes.
• Given a set of input variables, the goal is to predict a response or output
variable Y. Each member of the set is called an input variable.
• The input values of a decision tree can be categorical or
continuous.
• A decision tree consists of test points (called nodes) and branches, which represent the decisions
being made.
• A node without further branches is called a leaf node.
Root Node: It represents the entire population or sample, which further gets divided into two or
more homogeneous sets.
Splitting: It is the process of dividing a node into two or more sub-nodes.
Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
Leaf/Terminal Node: Nodes that do not split are called leaf or terminal nodes.
Pruning: When we remove sub-nodes of a decision node, the process is called pruning. It can be
seen as the opposite of splitting.
Branch/Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
Parent and Child Node: A node that is divided into sub-nodes is called the parent node of the
sub-nodes, whereas the sub-nodes are the children of the parent node.
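The terminology above can be seen on a small printed tree. The built-in `iris` data is used here purely as a stand-in for illustration:

```r
library(rpart)

# A small classification tree: the printed output shows node 1) as the root,
# internal splits as decision nodes, and lines ending in '*' as leaf nodes
fit <- rpart(Species ~ Petal.Length + Petal.Width,
             data = iris, method = "class")
print(fit)
```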
Car dataset description
R in Decision Tree
• The caret package holds tools for data splitting, pre-processing, feature selection,
and tuning.
• Install Package
install.packages("caret")
install.packages("rpart.plot")
library(caret)
library(rpart)
library(rpart.plot)
setwd("C:/Users/HP/Desktop/MKCE/even sem/big data lab")
play_decision<-read.table("DTdata.csv",header=TRUE,sep=',')
summary(play_decision)
fit<-rpart(Play~Outlook+Temperature+Humidity+Wind,
method="class",
data=play_decision,
control=rpart.control(minsplit=1),
parms=list(split='information'))
summary(fit)
rpart.plot(fit,type=4,extra=1)
rpart.plot(fit, type=4, extra=2,clip.right.labs=FALSE,varlen=0,faclen=0)
newdata <- data.frame(Outlook = "rainy", Temperature = "mild",
                      Humidity = "high", Wind = FALSE)
predict(fit,newdata=newdata,type="prob")
predict(fit,newdata=newdata,type="class")
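The predicted classes can be compared against the actual labels with a confusion table. Because `DTdata.csv` is not reproduced here, the sketch below recreates a tiny play/weather data frame with the same column names (the six rows are illustrative, not the real dataset):

```r
library(rpart)

# Tiny stand-in for DTdata.csv with the same columns used above
play_decision <- data.frame(
  Play        = c("yes", "no", "yes", "no", "yes", "no"),
  Outlook     = c("sunny", "rainy", "overcast", "rainy", "sunny", "rainy"),
  Temperature = c("hot", "mild", "cool", "mild", "hot", "cool"),
  Humidity    = c("high", "high", "normal", "high", "normal", "high"),
  Wind        = c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
)
fit <- rpart(Play ~ Outlook + Temperature + Humidity + Wind,
             method = "class", data = play_decision,
             control = rpart.control(minsplit = 1),
             parms = list(split = "information"))

# Confusion table on the training data: rows = actual, columns = predicted
pred <- predict(fit, play_decision, type = "class")
table(actual = play_decision$Play, predicted = pred)
```

On real data the table should be built from a held-out test set rather than the training rows, since a deep tree can memorize its training data.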
Naive Bayes