Machine Learning Block 4
So, if we know the complexity of an algorithm and the space it occupies in memory, we can decide how to improve our hardware. For example, if an algorithm is O(n) and we double the CPU speed, we solve the problem in roughly half the time.
04 UNSUPERVISED LEARNING
Let Cj be the set of observations assigned to cluster j. The set of clusters satisfies C1 ∪ C2 ∪ ⋯ ∪ CK = {1, 2, …, n} and Ci ∩ Cj = ∅ for i ≠ j. That is, every observation belongs to some cluster, and no observation belongs to two different clusters. How can we know whether one clustering is better or worse than another? Which is the best K for k-means?
• By business needs – we need to classify our customers into K clusters for some purpose.
• By the number of observations – we could estimate K as sqrt(n/2), for example.
• Using the within-cluster variation (WCV) – we define a distance between observations and assign each one to its nearest cluster.
Within-cluster variation W(Ck): intuitively, if this measure is small, the observations in the cluster are very close together and our clustering is good. Therefore, we want to minimize the total within-cluster variation over all clusters, i.e. minimize Σk W(Ck). Note that increasing K (the number of clusters) also reduces the WCV, but that makes no sense for our goal (getting the data classified). So how should we choose K using the WCV?
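A common heuristic (the elbow method) is to compute the total WCV for several values of K and pick the K after which it stops dropping sharply. A minimal sketch with scikit-learn, where the synthetic blobs stand in for any (n, p) feature matrix:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data standing in for any (n, p) feature matrix.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in (0.0, 3.0, 6.0)])

# Total within-cluster variation (sklearn calls it inertia_) for K = 1..9.
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
# The "elbow" is the K after which the inertia stops dropping sharply.
```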
There are many options for defining W. We are going to use the Euclidean distance, so that our clustering problem is as follows: for K clusters and n observations there are roughly K^n possible assignments; for 150 observations and 3 clusters that is on the order of 10^71 different partitions. We cannot enumerate them all, even for such a simple case. Instead, we use the following iterative algorithm:
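A minimal NumPy sketch of such an iterative algorithm, in the spirit of Lloyd's k-means described next (assign all points, then recompute the centroids); it is only illustrative, not the scikit-learn implementation:

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means: assign all points, then recompute the centroids."""
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen observations as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every observation to its nearest centroid (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of the points assigned to it
        # (keeping the old centroid if a cluster happens to become empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # assignments are stable: we have reached a (local) minimum
        centroids = new_centroids
    return labels, centroids
```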
As with any other minimization problem, it can happen that the algorithm finds only a local minimum. The solution is to run the algorithm several times with different initial assignments of the data and keep the run that achieves the lowest value of ΣW.
In the description of the previous algorithm, we recalculated the centroids only once all the data had been assigned. This is known as Lloyd's algorithm (Lloyd's k-means). However, we would be more precise if we recalculated the centroids before assigning the next point (MacQueen's algorithm). If we have a lot of data, continuously calculating the distances from the data to the centroids can be exhausting for the CPU. Elkan's algorithm uses the triangle inequality between a point and the different centroids to avoid performing many of those calculations. We will use KMeans from sklearn: after initializing KMeans, we invoke fit_predict to get the results.
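A minimal sketch with scikit-learn's KMeans; n_init handles the multiple random restarts mentioned above, and algorithm="elkan" selects the triangle-inequality variant (the blob data is just a stand-in for a real feature matrix):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for our data: 150 observations in 3 natural groups.
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

km = KMeans(
    n_clusters=3,       # the K we decided on
    n_init=10,          # 10 different initializations; the best ΣW is kept
    algorithm="elkan",  # use the triangle inequality to skip distance computations
    random_state=0,
)
labels = km.fit_predict(X)   # fit the model and return the cluster index of each point

print(labels[:10])           # cluster assigned to the first 10 observations
print(km.inertia_)           # total within-cluster sum of squares (ΣW)
```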
SILHOUETTE METHOD
Suppose we have clustered a dataset with k-means. We define the average distance, or dissimilarity, a(i) of a point i belonging to cluster Ck as the average distance between that point and the rest of the points of its own cluster:
Next we calculate, for each of the other clusters, the average distance, or dissimilarity, between that point and the points of that cluster. Then b(i) is the minimum of those averages, that is, the average distance from i to the nearest cluster other than its own. The cluster L for which that minimum is attained would be the second-best classification option for i. The silhouette of point i combines both quantities, s(i) = (b(i) − a(i)) / max(a(i), b(i)), with s(i) = 0 if Ck has only one element. From this formula we can see that −1 ≤ s(i) ≤ 1: values close to 1 mean the point sits well inside its own cluster, and values close to −1 mean it would fit better in the neighbouring cluster.
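A minimal sketch using scikit-learn's silhouette_score, which averages s(i) over all points, to compare several values of K (the synthetic data stands in for any feature matrix):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# The K with the highest mean silhouette gives the best-separated clustering.
for k in range(2, 7):   # the silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```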
HIERARCHICAL CLUSTERING
There are many ways to define a distance between observations. Perhaps the simplest and most widely used is the Euclidean distance, but it is not the only one. In fact, we can use the following:
We pick the first cluster (point A) and calculate its distance to all other clusters to find the nearest one. Once found (suppose it is CB), we merge both clusters into a single cluster CAB.
Wow! We have just created a cluster, and our initial dendrogram will have the following shape:
We continue with the next cluster (point C, for instance). But now we need to compute its distance to the other one-point clusters and to one two-point cluster (AB). How should this be done? To obtain the dissimilarity between clusters we cannot simply use the Euclidean distance; instead, we define the linkage.
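A minimal sketch with SciPy, which implements the usual linkage choices (single, complete, average, ward); the five 1-D points are just an illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five 1-D observations standing in for points A..E.
X = np.array([[1.0], [2.0], [6.0], [8.0], [15.0]])

# 'complete' linkage: cluster distance = largest pairwise distance between members.
Z = linkage(X, method="complete")
print(Z)   # each row: the two clusters merged and the distance at which they merge

# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree (needs matplotlib).
```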
Once we have decided which type of distance (linkage) to use, we continue with the rest of the one-point clusters and repeat the process until only one cluster remains, and then we represent the result.
The way to generate a final clustering from the dendrogram is to decide the height at which to cut it. For example, if we cut at y = 9 we get two clusters, and if we cut at y = 5 we are left with 3 clusters. The term hierarchical refers to the fact that we can establish a hierarchy (an order) in the data. If our data refer to objects for which a measure of distance is unclear, hierarchical clustering is not the best option.
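Cutting the tree at a given height can be sketched with SciPy's fcluster; the cut heights 9 and 5 below simply echo the illustrative values in the text and happen to give 2 and 3 clusters for this particular toy data:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.array([[1.0], [2.0], [6.0], [8.0], [15.0]])   # same toy points as above
Z = linkage(X, method="complete")

# Cut the dendrogram at height 9: all merges made above that distance are undone.
print(fcluster(Z, t=9, criterion="distance"))   # -> 2 clusters for this toy data
# Cut at height 5 instead:
print(fcluster(Z, t=5, criterion="distance"))   # -> 3 clusters for this toy data
```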
One of the problems of clustering is that we necessarily assign every observation to a cluster, which is not always appropriate. If, for example, we have a lot of noise, outliers, or measurements for which it is not yet convenient to define a cluster, we will end up with an unrealistic classification. There are mixed classification methods, such as soft k-means, that can help with this problem (see the book by Hastie et al.). Another problem is stability with respect to changes in the observations: if we remove or add some observations, the clusters can undergo major alterations.
PRINCIPAL COMPONENT ANALYSIS (PCA)
However, there are (p choose 2) = p(p − 1)/2 possible scatterplots; for example, with p = 10 there are 45 possibilities! If p is large it will not be possible to look at them all; in addition, most likely none of them will be informative, since each contains only a small fraction of the total information present in the dataset. Clearly we need a better method to visualize n observations when p is large. In particular, we would like to find a low-dimensional representation of the data that captures as much information as possible.
PCA provides a tool to do just this: it finds a low-dimensional representation of a dataset that contains as much of the variability as possible. The underlying idea is that not all the dimensions of the data are equally interesting. PCA looks for a small number of dimensions that are as representative as possible, each of them a linear combination of the p features.
The first principal component of a set of features X1, X2, ..., Xp with zero mean is the normalized linear combination of the features (normalized means that the vector of coefficients has norm 1) that has the largest variance: Z1 = φ11 X1 + φ21 X2 + ⋯ + φp1 Xp, with Σj φj1² = 1.
So, our objective is to find the normalized linear combination of the features that accounts for the largest variance. That is, we want to find the coefficients φ11, ..., φp1 that maximize (1/n) Σi (Σj φj1 xij)² subject to Σj φj1² = 1. Since X has zero mean, Z also has zero mean, and the problem reduces to maximizing the variance.
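A minimal sketch with scikit-learn's PCA, standardizing the data first because the derivation above assumes zero-mean features (the iris data is only an example):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                        # 150 observations, p = 4 features
X_std = StandardScaler().fit_transform(X)   # zero mean (and unit variance) per feature

pca = PCA(n_components=2)
Z = pca.fit_transform(X_std)                # the scores: data expressed in the new 2-D basis

print(pca.components_)                # the coefficient vectors phi_1 and phi_2
print(pca.explained_variance_ratio_)  # fraction of the total variance captured by each PC
```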
We then construct three matrices: Similarity (S), Responsibility (R) and Availability (A).
Similarity matrix: it is constructed using the negative distance between two observations (k, l). For the diagonal we select the lowest value of the whole S matrix (to obtain fewer clusters) or the maximum (for the opposite effect); an intermediate value can also be chosen, and it is common to use the median or the mean of the column values. Here we fill the diagonal with the lowest value of the whole S matrix:
▪ Availability matrix: we begin with all elements of the matrix equal to zero, a(k, l) = 0.
▪ Responsibility matrix: we then evaluate the responsibility. Since a = 0, r(k, l) is s(k, l) minus the maximum of the remaining values in its row.
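These S, R and A matrices are the message-passing machinery of the affinity propagation algorithm (naming it is an assumption, since the slides do not). A minimal sketch with scikit-learn's AffinityPropagation, whose preference parameter plays the role of the S diagonal discussed above, so a lower preference yields fewer clusters:

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# preference = the value placed on the diagonal of S.
# None -> the median of the similarities (the "intermediate" choice above).
ap = AffinityPropagation(preference=None, random_state=0).fit(X)
print(len(ap.cluster_centers_indices_))    # number of clusters found

# A much lower preference (towards the minimum similarity) yields fewer clusters.
ap_low = AffinityPropagation(preference=-500, random_state=0).fit(X)
print(len(ap_low.cluster_centers_indices_))
```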
So, a CF (clustering feature) is a vector (N, LS, SS), and two CFs can be added together: (N1 + N2, LS1 + LS2, SS1 + SS2). A CF tree has a root node comprised of several CF entries, each pointing to an inner CF node, which is in turn built of CF entries that point to further CF nodes, and so on until we reach the leaves (which have no child CF nodes). The individual clustering features are stored in the leaf nodes; inner and root nodes hold only aggregates. The vector of any CF entry equals the sum of the vectors of its child nodes, all the way down to the leaves:
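This CF tree is the structure used by the BIRCH algorithm (naming it is an assumption; the text only describes the tree). A minimal sketch with scikit-learn's Birch, where threshold corresponds to the limit T and branching_factor to B used in the worked example below:

```python
import numpy as np
from sklearn.cluster import Birch

# The 1-D values inserted one by one in the worked example below.
X = np.array([22, 9, 12, 15, 18, 27, 11, 36], dtype=float).reshape(-1, 1)

# threshold plays the role of T (scikit-learn uses the subcluster radius),
# branching_factor plays the role of B; n_clusters=None skips the final global step.
birch = Birch(threshold=5.0, branching_factor=2, n_clusters=None)
labels = birch.fit_predict(X)

print(labels)                              # leaf subcluster assigned to each value
print(birch.subcluster_centers_.ravel())   # centroids of the leaf subclusters
```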
Step 1. Insert 22. This creates the first leaf cluster.
Step 2. Insert 9. We try to insert it in node 1 and evaluate the statistics: CF = (n, LS, SS) = (2, 22+9, 22²+9²) = (2, 31, 565).
Step 3. Insert 12. We look at the centroids: 12 is closest to the second node. We try to insert it there and evaluate the diameter to compare with T: CF = (n, LS, SS) = (2, 9+12, 9²+12²) = (2, 21, 225) and D = 3. We update the statistics.
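The diameters D quoted in these steps can be recomputed from the CF statistics alone. A small helper, assuming the standard BIRCH diameter formula D = sqrt((2·n·SS − 2·LS²) / (n·(n − 1))), reproduces the values quoted in the following steps:

```python
import math

def cf_diameter(n, ls, ss):
    """Cluster diameter computed from its CF = (n, LS, SS) alone
    (standard BIRCH definition; the raw points are not needed)."""
    if n < 2:
        return 0.0
    return math.sqrt((2 * n * ss - 2 * ls ** 2) / (n * (n - 1)))

print(cf_diameter(2, 21, 225))    # step 3 -> 3.0
print(cf_diameter(3, 36, 450))    # step 4 -> ~4.24
print(cf_diameter(3, 67, 1537))   # step 6 -> ~6.37 (too large, so a new node is needed)
```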
Step 4. Insert 15.
We look at the centroids: 15 is closest to 10.5, the centroid of the second node. We try to insert it there and evaluate the diameter to compare with T: CF = (n, LS, SS) = (3, 21+15, 225+15²) = (3, 36, 450) and D = 4.24. We update the statistics.
Step 5. Insert 18. We look at the centroids: 18 is closest to 22, the first node. We try to insert it there and evaluate the diameter to compare with T: CF = (n, LS, SS) = (2, 22+18, 22²+18²) = (2, 40, 808) and D = 4. We update the statistics.
Step 6. Insert 27. We look at the centroids: 27 is closest to 20, the first node. We try to insert it there and evaluate the diameter to compare with T: CF = (n, LS, SS) = (3, 40+27, 808+27²) = (3, 67, 1537) and D = 6.37. The diameter exceeds T, so we need to create a new node, but B = 2 and we cannot add another entry at the root level, so we must split the tree.
Step 7. Insert 11. We look at the centroids at the root level: 11 is closest to 12, the second root entry. We descend and select its unique leaf (3, 36, 450). We try to insert 11 there and evaluate the diameter to compare with T: CF = (n, LS, SS) = (4, 47, 571) and D = 3.53. We update the statistics.
Step 8. Insert 36. We look at the centroids at the root level: 36 is closest to 22.33, the first root entry. We descend and look at the centroids again: 36 is closest to 27. We try to insert it there and evaluate the diameter to compare with T: CF = (n, LS, SS) = (2, 63, 2025) and D = 9. There is no space to create a new leaf node (B = 2), so we would need to split the node and create a new branch. However, there is space in other same-level nodes, so we redistribute the nodes instead:
Now we look again at the centroid of the moved leaf cluster: CF = (n, LS, SS) = (2, 63, 2022) and D = 8.6. We add it to the free leaf and update the statistics of that leaf node and of all the upper nodes.
And what if we have multi-featured data? Instead of a single vector {N, LS, SS} we will have a multi-vector: {{N1, LS1, SS1}, {N2, LS2, SS2}, …, {Nn, LSn, SSn}}, with one triple per feature.
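A minimal sketch of that per-feature bookkeeping, assuming we simply keep one (N, LS, SS) triple per feature and add CFs element-wise:

```python
import numpy as np

def cf(points):
    """Per-feature clustering feature of an (n, p) array: one (N, LS, SS) row per feature."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    ls = points.sum(axis=0)           # per-feature linear sum
    ss = (points ** 2).sum(axis=0)    # per-feature sum of squares
    return np.column_stack([np.full(points.shape[1], n), ls, ss])

# CFs are additive: merging two subclusters is just an element-wise sum.
a = cf([[1.0, 10.0], [2.0, 20.0]])
b = cf([[3.0, 30.0]])
print(a + b)
print(cf([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]))   # same result
```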