Storage Technologies: Digital Assignment 1
STORAGE
TECHNOLOGIES
ITE2009
DIGITAL ASSIGNMENT 1
SUMIT PATIL
SLOT: D1+TD2
FACULTY BHAVANI S
PAPER 1
A Modified K-Means Algorithm for Big Data Clustering
SK Ahammad Fahad, IBAIS University, Dhaka, Bangladesh
Md. Mahbub Alam, DUET, Dhaka, Bangladesh
MODIFIED ALGORITHM
RESULTS
PAPER 2
The k-means algorithm is one of the most popular algorithms for data clustering. It tries to group data of similar types together out of a large data set using a brute-force strategy of repeated calculations, so its computational complexity is very high. Several studies have been carried out to reduce this complexity. This paper presents the result of our research, which proposes a modified version of the k-means algorithm with an improved technique to divide the data set into a specified number of clusters with the help of several check-point values. It requires less computation and has better accuracy than the traditional k-means algorithm, as well as some modified variants of traditional k-means.
MODIFIED ALGORITHM
Step 1:
Find the Euclidean distance of each data object from the origin (0, 0, ..., 0).
Here we take the origin as the initial reference point for the N data objects, and then find the Euclidean distance of each data object with respect to it.
Step 2:
Sort the N data objects in ascending order according to the distances found in the previous step.
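Steps 1 and 2 can be sketched in a few lines of Python (the data values are illustrative, not from the paper):

```python
import numpy as np

# Illustrative 2-D data set (not from the paper).
data = np.array([[3.0, 4.0], [1.0, 1.0], [6.0, 8.0], [0.0, 2.0]])

# Step 1: Euclidean distance of each data object from the origin.
dist = np.linalg.norm(data, axis=1)

# Step 2: sort the objects in ascending order of that distance.
order = np.argsort(dist)
sorted_data = data[order]
```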
Step 3:
Divide the data set into K equal clusters. K is determined by the user requirement or by the type of the data set. These act as the primary clusters.
This step is necessary to set up the initial clusters. Depending on the number of clusters needed, we divide the whole data set into equal portions. The portions may not always be exactly equal, since the number of objects may not divide evenly. For example, if we have 1000 data objects and have to divide them into 3 clusters, the clusters may contain 333, 333, and 334 objects.
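The 1000-object example can be reproduced with NumPy's `array_split`, which splits as evenly as possible (it yields portions of 334, 333 and 333, handing the remainder to the leading portions, but the idea is the same):

```python
import numpy as np

objects = np.arange(1000)            # stand-ins for 1000 data objects
clusters = np.array_split(objects, 3)
sizes = [len(c) for c in clusters]   # three nearly equal portions
```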
Step 4:
For each cluster, consider the middle object as the primary cluster center. That is, if there are N data objects and K clusters, the primary cluster center will be the ((N/K)/2)-th object of each cluster.
Since this ordering was obtained from the distances to the initial origin, the center points will be the most significant points in each cluster, from which all the data objects will have a roughly uniform distance.
Step 5:
Find the distance between the cluster centers. If there are K clusters, there will be K distances. Divide each distance by 2 and store the value in Dij (i, j = 0, 1, ..., k). Here Dij denotes the middle point of the distance from cluster center i to cluster center j. This Dij will be used as a check-point value.
For example, if clusters A and B have cluster centers Ai and Bi, then the middle point of the distance between Ai and Bi denotes the point beyond which an object is closer to the other cluster's center.
Step 6:
Find the Euclidean distance of each data object di (i = 1, ..., N) from the cluster center it is assigned to.
Step 7:
If the distance is less than or equal to Dij, the object stays in its previous cluster.
That is, the distance from the current cluster center is less than the distance to the middle point between the two cluster centers, so we can conclude that the object is closer to its current cluster. Hence we do not need to calculate the distances to the other cluster centers. This check-point value ensures that less computation is needed.
Otherwise, calculate the Euclidean distance of the data object with respect to the center whose check-point value was crossed. That is, if Dij is exceeded and the object was previously in the cluster with center i, then compute its distance with respect to cluster center j.
This means the object may be closer to the other cluster center; to be sure, we have to calculate its distance to that center. Now compare the distances and assign the data object to the cluster whose center is nearer.
Recalculate the cluster centers by taking the mean of all objects currently present in each cluster. This point may be an imaginary point that does not exist in the current data set, or it may coincide with an actual object of the data set; either way, it does not affect the outcome of the algorithm.
Go back to Step 4 and repeat until the convergence criterion is met, i.e. no data object moves from one cluster to another after the cluster centers are updated. The membership of each cluster then remains the same, so the centers also remain unchanged. At that point we can conclude that we have reached the final clusters: similar objects are grouped together within each cluster and differ from those in other clusters.
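The whole procedure above can be sketched in Python. This is a minimal sketch, not the authors' implementation: it simplifies Step 7 by testing each point against the tightest check-point of its own center (half the distance to the nearest other center) and, when that is crossed, comparing against all centers rather than only the crossed one. All names and values are illustrative.

```python
import numpy as np

def modified_kmeans(data, k, max_iter=100):
    """Minimal sketch of the check-point based k-means summarized above.
    `data` is an (N, d) float array; names and values are illustrative."""
    # Steps 1-4: sort objects by distance from the origin, split into k
    # primary clusters, and take each cluster's middle object as its center.
    order = np.argsort(np.linalg.norm(data, axis=1))
    parts = np.array_split(order, k)
    centers = np.array([data[p[len(p) // 2]] for p in parts])
    labels = np.empty(len(data), dtype=int)
    for lab, p in enumerate(parts):
        labels[p] = lab

    for _ in range(max_iter):
        # Step 5: check-point Dij = half the distance between center pairs;
        # keep the tightest one per center (a simplification of the paper).
        pair = np.linalg.norm(centers[:, None] - centers[None, :], axis=2) / 2
        np.fill_diagonal(pair, np.inf)
        checkpoint = pair.min(axis=1)

        # Steps 6-7: a point stays if it lies within its center's check-point;
        # otherwise compare it against all centers and reassign.
        new_labels = labels.copy()
        for i, x in enumerate(data):
            if np.linalg.norm(x - centers[labels[i]]) > checkpoint[labels[i]]:
                new_labels[i] = np.argmin(np.linalg.norm(centers - x, axis=1))
        moved = not np.array_equal(new_labels, labels)
        labels = new_labels

        # Recompute each center as the mean of its current members.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = data[labels == j].mean(axis=0)
        if not moved:            # convergence: no object changed cluster
            break
    return labels, centers
```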
PAPER 3
An Improvement in K-mean Clustering Algorithm Using
Better Time and Accuracy
Er. Nikhil Chaturvedi and Er. Anand Rajavat
Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). K-means is one of the simplest unsupervised learning algorithms that solves the well-known clustering problem. In the k-means process, the data are partitioned into K clusters and are initially assigned to the clusters at random. This paper proposes a new k-means clustering algorithm in which the initial centroids are calculated systematically instead of being assigned at random, which improves both accuracy and running time.
MODIFIED ALGORITHM
Phase 1: For the initial centroids
Steps:
1. Set p = 1;
2. Compute the distance between each datum and all the other data in the set D;
3. Find the closest pair of data in D and form a data set Ap (1 <= p <= k) containing these two data; delete them from the set D;
4. Find the datum in D that is closest to the data set Ap; add it to Ap and delete it from D;
6. For each Ap (1 <= p <= k), find the mean of the data in Ap. These means will be the initial centroids.
Phase 2: Assign data points to clusters
Steps:
2. For each datum di, find the closest centroid ci and assign di to that cluster j;
6. Repeat:
7.1 Compute the distance from the centroid of its closest cluster;
7.2 If this distance is less than or equal to the present closest distance, the data point stays in its cluster; else assign it to its nearest centroid;
8. For each cluster j (1 <= j <= k), recalculate the centroids; until the convergence criterion is met.
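Phase 1 above can be sketched in Python. The summary skips step 5, so the stopping rule for each seed set is an assumption here (each set grows to roughly n/k members); the function name and data are illustrative, not the authors' code.

```python
import numpy as np

def seed_centroids(data, k):
    """Sketch of Phase 1: grow k seed sets starting from the closest pairs.
    The per-set size of n/k is an assumption; the paper's exact stopping
    rule per set is not given in the summary above."""
    remaining = list(range(len(data)))
    target = max(2, len(data) // k)
    centroids = []
    for _ in range(k):
        pts = data[remaining]
        # Step 3: find the closest pair among the remaining data.
        dists = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
        np.fill_diagonal(dists, np.inf)
        i, j = np.unravel_index(np.argmin(dists), dists.shape)
        group = [remaining[i], remaining[j]]
        for idx in sorted((i, j), reverse=True):
            remaining.pop(idx)
        # Step 4: repeatedly pull in the datum closest to the growing set.
        while len(group) < target and remaining:
            pts = data[remaining]
            d = np.linalg.norm(pts[:, None] - data[group][None, :],
                               axis=2).min(axis=1)
            group.append(remaining.pop(int(np.argmin(d))))
        # Step 6: the mean of each seed set is an initial centroid.
        centroids.append(data[group].mean(axis=0))
    return np.array(centroids)
```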
RESULTS
PAPER 4
Clustering techniques are among the most important parts of data analysis, and k-means is the oldest and most popular clustering technique in use. The paper discusses the traditional k-means algorithm with its advantages and disadvantages. It also surveys enhanced k-means variants proposed by various authors and presents techniques to improve traditional k-means for better accuracy and efficiency. There are two areas of concern for improving k-means: 1) selecting the initial centroids, and 2) assigning data points to the nearest cluster using equations for calculating the mean and the distance between two data points. The time complexity of the proposed k-means technique is expected to be lower than that of the traditional one, with an increase in accuracy and efficiency. The main purpose of the article is to propose techniques for deriving the initial centroids and for assigning data points to their nearest clusters. The clustering technique proposed in this paper improves accuracy and time complexity, but it still needs further improvement; in future work it would also be viable to include efficient techniques for selecting the number of initial clusters (k). Experimental results show that the improved method can effectively improve the speed and accuracy of clustering, reducing the computational complexity of k-means.
MODIFIED ALGORITHM
Part 1: Determine initial centroids
Step 1.4: Find the distance of each data point from the mean value using Equation (Equ).
IF
	the distance from the mean value is minimal, then the data point is stored where it is, and the data points divided into the k clusters do not need to move to other clusters.
ELSE
	recalculate the distance of each data point from the mean value using Equation (Equ), until the data set is divided into k clusters.
Step 2.1: Calculate the distance from each data point to the centroids, assign each data point to its nearest centroid to form clusters, and store the value for each data point.
Step 2.3: Calculate the distance from all centroids to each data point, for all data points.
IF
	the newly calculated distance is less than or equal to the stored distance, the data point stays in its cluster.
ELSE
	from the distances calculated, assign the data point to its nearest centroid by comparing its distances to the different centroids.
Step 2.5: Recalculate the centroids for these new clusters, until the convergence criterion is met.
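Steps 2.1 and 2.3 boil down to a point-to-centroid distance matrix and a nearest-centroid assignment; a minimal sketch with illustrative values (not the paper's data):

```python
import numpy as np

# Illustrative data points and centroids (not from the paper).
data = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0]])
centroids = np.array([[0.5, 0.5], [9.0, 9.0]])

# Distance from every data point to every centroid (Steps 2.1 / 2.3).
dists = np.linalg.norm(data[:, None] - centroids[None, :], axis=2)

# Assign each point to its nearest centroid by comparing those distances.
labels = dists.argmin(axis=1)
```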
RESULTS
PAPER 5
MODIFIED ALGORITHM
Step 1:
Take a reference point: (0, 0) if the data contain 2 attributes, (0, 0, 0) if they contain 3 attributes; for n attributes we have to take an n-dimensional point.
Step 2:
Calculate the distance of all the points from the reference point you have taken. The distance can be calculated by the Euclidean distance formula.
Step 3:
Calculate the mean of the distances computed in Step 2: M = (Σ di) / N, where di = sqrt((X - Xi)^2 + (Y - Yi)^2) is the distance of the i-th point and N is the total number of points.
Step 4:
E=D/N
Step 5:
Similarly, we continue in this way up to N.
Step 6:
X = (Σ xi) / n
Y = (Σ yi) / n
Step 7:
Now that we have obtained the centroid, we repeat the steps of the traditional method.
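Steps 2, 3 and 6 above can be checked with a few lines of Python (the points are illustrative, not from the paper):

```python
import numpy as np

pts = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # illustrative points

# Step 2: Euclidean distance of every point from the reference origin.
d = np.linalg.norm(pts, axis=1)

# Step 3: mean of those distances, M = (sum of di) / N.
M = d.sum() / len(pts)

# Step 6: X = (sum of xi)/n, Y = (sum of yi)/n -- the mean point.
X = pts[:, 0].sum() / len(pts)
Y = pts[:, 1].sum() / len(pts)
```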
RESULTS