Machine Learning Bloque 4


04 UNSUPERVISED LEARNING

4.0 COMPUTATIONAL COMPLEXITY


We don't have infinite resources to solve an ML problem, nor even infinite time to wait. That means that having the best algorithm is not useful if it will waste all of our resources or take years to execute. Computational complexity is the part of computer science that analyses the resources needed to solve a problem. We need to consider two types of resources: time (CPU cycles of our computer) and space (memory used in solving the problem).

Ex1: sum of natural numbers.

Let’s analyse the resources in terms of Limite:

− We need 3 variables (X, i, Limite)


− We need to perform 1 + Limite·2 + 1 operations, a number that grows with Limite

So, if we name the growing variable "n", our algorithm needs:

− 3+0·n → space in memory


− 2+2·n → CPU operations (time)

Our algorithm is therefore of order O(n) in time and needs O(1) space.
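A minimal Python sketch of this counting exercise (the variable names X, i and Limite follow the notes; the exact operation count is illustrative):

```python
def sum_naturals(Limite):
    """Sum the natural numbers 1..Limite, counting elementary operations."""
    X = 0                        # 1 operation: initialisation
    ops = 1
    for i in range(1, Limite + 1):
        X += i                   # each iteration: 1 addition + 1 loop update
        ops += 2
    ops += 1                     # final step (returning the result)
    return X, ops

total, ops = sum_naturals(1000)
print(total, ops)   # 500500, about 2 + 2*Limite operations -> O(n) time, O(1) space
```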

So, if we know the complexity of an algorithm and the space it occupies in memory, we can decide how to improve our computer. If an algorithm is O(n) and we double the CPU speed, we solve the problem in half the time.

Ex 2 – Finding integer roots of the equation 3·x + 7·y − 2·z = 134 by exhaustive search.

We need 1 + n·n·n·(3 + 3 + 1 + 1) = 1 + 8n³ operations, i.e. O(n³).

If in our problem n is 10,000 ➔ about 10¹² operations. If each operation lasts 1 μs and we double the CPU speed, instead of 277 hours we will need 139. Is that acceptable?
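A possible brute-force sketch matching this estimate, assuming we search integer solutions with three nested loops over a range of n candidate values per variable (the range and the call below are illustrative):

```python
def find_integer_roots(n=50):
    """Exhaustively search integer solutions of 3x + 7y - 2z = 134
    in the range [-n, n] for each variable: three nested loops -> O(n^3)."""
    solutions = []
    for x in range(-n, n + 1):
        for y in range(-n, n + 1):
            for z in range(-n, n + 1):
                # roughly 8 elementary operations per iteration (multiplications,
                # additions/subtractions and the comparison), hence ~8·n^3 in total
                if 3 * x + 7 * y - 2 * z == 134:
                    solutions.append((x, y, z))
    return solutions

print(len(find_integer_roots(50)))   # many solutions exist even in this small range
```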


4.1 K-MEANS CLUSTERING


CLUSTERING: We have seen in the introduction and in the examples of data pre-processing that normally the data are defined by a series of characteristics plus a value (the label), which is the result that we want the learning algorithm to compute. In the example of the UCI wine database, we saw that there were three types of wine (1, 2, 3). However, this is not always the case. Many times we have the data without any label that tells us what each observation corresponds to. Normally this happens in the early stages of a data analysis or data mining project, when we are not yet able to identify tags that group the data. For example, if we have a bunch of photographs from the Internet without any classification criteria, we can use unsupervised learning to try to group them in a coherent way.

• Clustering tries to look for homogeneous groupings in observations.


• One of the typical applications of clustering is customer segmentation in marketing. Almost
  all companies store data on the behaviour of their customers (location, use of products,
  spending, incidents, queries, payments, etc.), but they do not know very well what to do
  with them or whether lessons can be extracted from all that data. One of the first things we
  can do is clustering, to classify the data so that we can continue the analysis.
• There are many techniques in clustering, the two best known being K-means and
hierarchical clustering.

K-MEANS CLUSTERING: In K-means clustering we seek to partition the observations into a pre-specified number of clusters.

The operation of this technique is as follows: we create a number K of non-overlapping clusters and assign each observation to a single cluster. In case the result is not adequate, we can reassign observations to other clusters, until we get a classification that meets our criteria.

Let $C_j$ be the set of observations found in cluster $j$. Obviously, the set of clusters satisfies $C_1 \cup C_2 \cup \dots \cup C_K = \{1, 2, \dots, n\}$ and $C_i \cap C_j = \emptyset$ for $i \neq j$. That is, every observation is in some cluster and no observation belongs to two different clusters. How can we know if one clustering is better or worse than another? Which is the best k for k-Means?
• By business needs – we need to classify our customers into k clusters for some purpose.
• By the number of observations – we could estimate k as √(n/2), for example.
• Using the within-cluster variation (WCV) – we define a distance between observations and assign each one to the nearest cluster.

Within-cluster variation W(Ck): intuitively, if that measure is small, the observations in the cluster will be very "close" and our clustering will be good. Therefore, let's minimize the sum of W(Ck) over all k, that is:

$$\min_{C_1,\dots,C_K} \sum_{k=1}^{K} W(C_k)$$

Let's have a look at k (the number of clusters): when we increase k we also reduce the WCV, but an arbitrarily large k makes no sense for our goal (getting the data classified into useful groups). How, then, must we define k using the WCV?


There are many options for defining W. We are going to use the (squared) Euclidean distance, so that our clustering problem is as follows:

$$\min_{C_1,\dots,C_K} \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i,i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2$$

For K clusters and n observations there are $K^n$ possible assignments: for 150 observations and 3 clusters, that is about $10^{71}$ different possible clusterings. We cannot address this exhaustively, even for such a simple case. To fix it, let's use the following algorithm:

1. We choose a value for K (the number of clusters).
2. We assign each observation to a cluster (a value from 1 to K).
3. In each of the K clusters we calculate the centroid (the vector of means of the p characteristics).
4. We re-assign each observation to the cluster whose centroid is nearest, according to the Euclidean distance.
5. We repeat steps 3 and 4 until no more observations change cluster, or until we get an acceptable result according to some other criterion.

It can happen, as with any other minimization problem, that the algorithm finds a local minimum. The solution is to run the algorithm several times with different initial data assignments and keep the one that obtains the lowest value of ΣW(Ck).
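A minimal NumPy sketch of Lloyd's iteration with several random restarts (the data layout, the empty-cluster handling and the parameter values are illustrative assumptions):

```python
import numpy as np

def kmeans_lloyd(X, K, n_init=10, max_iter=100, seed=0):
    """Lloyd's k-means: alternate nearest-centroid assignment and centroid update.
    Run n_init times from random assignments and keep the lowest total WCV."""
    rng = np.random.default_rng(seed)
    best = (None, None, np.inf)
    for _ in range(n_init):
        labels = rng.integers(0, K, size=len(X))             # step 2: random assignment
        for _ in range(max_iter):
            centroids = np.array([
                X[labels == k].mean(axis=0) if np.any(labels == k)
                else X[rng.integers(len(X))]                  # re-seed an empty cluster
                for k in range(K)
            ])                                                # step 3: centroids
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)                 # step 4: nearest centroid
            if np.array_equal(new_labels, labels):            # step 5: no more changes
                break
            labels = new_labels
        wcv = sum(((X[labels == k] - centroids[k]) ** 2).sum() for k in range(K))
        if wcv < best[2]:                                     # keep the lowest ΣW
            best = (labels, centroids, wcv)
    return best
```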

In the description of the previous algorithm, we recalculated the centroids only once all the data had been assigned. This is known as Lloyd's algorithm (Lloyd's k-means). However, we would be more precise if, before assigning the next point, we recalculated the centroids (MacQueen's algorithm).

If we have a lot of data, continuously calculating the distances from the data to the centroids can be exhausting for the CPU. Elkan's algorithm makes use of the triangle inequality between a point and the different centroids to avoid performing some of those calculations. We will use KMeans from scikit-learn: after initializing KMeans, we invoke fit_predict to get the results.
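A minimal usage sketch with scikit-learn's KMeans (the synthetic data and parameter values are illustrative; n_init and algorithm="elkan" correspond to the points above):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative synthetic data with 3 natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3,        # K, chosen in advance
            n_init=10,           # several random initializations to avoid local minima
            algorithm="elkan",   # triangle-inequality speed-up
            random_state=0)
labels = km.fit_predict(X)       # cluster index assigned to each observation

print(km.cluster_centers_)       # final centroids
print(km.inertia_)               # total within-cluster sum of squares
```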

THE ELBOW METHOD


One of the drawbacks of the k-Means method is that we have to set k without knowing what the most appropriate value is. To help us at this point we can use the elbow method, which consists of running the algorithm for different values of k, calculating the internal distortion, defined as the sum of the squares of the distances from the data to their nearest centroid (the inertia), and representing it graphically. As we can see in the figure, k = 3 is a good value.
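A possible sketch of the elbow method (the data and the range of k are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, "o-")
plt.xlabel("k (number of clusters)")
plt.ylabel("inertia (sum of squared distances to nearest centroid)")
plt.show()   # look for the "elbow" where the curve stops dropping sharply
```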

SILHOUETTE METHOD
Suppose we have clustered a dataset with k-Means. We define the average distance, or dissimilarity, of a point i belonging to the cluster $C_K$ as the average distance between that point and the rest of the points of its own cluster:

$$a(i) = \frac{1}{|C_K| - 1} \sum_{j \in C_K,\; j \neq i} d(i, j)$$


Below we calculate the average distance, or dissimilarity, between that point and each of the other clusters:

$$d(i, C_L) = \frac{1}{|C_L|} \sum_{j \in C_L} d(i, j), \qquad L \neq K$$

And now we take the smallest of those numbers for each i:

$$b(i) = \min_{L \neq K} d(i, C_L)$$

That is, b(i) is the minimum average distance from i to a cluster other than its own. The cluster L for which b(i) is minimal would be the second-best classification option for i.

Finally, we define the silhouette of a point i as:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$

and s(i) = 0 if $C_K$ has only one element. From this formula we can see that $-1 \le s(i) \le 1$.

Let us now look at the extreme cases. If a(i) ≪ b(i), then s(i) is close to 1 and element i is well classified in its own cluster. If a(i) ≫ b(i), the opposite happens: s(i) is close to −1 and element i would be much better classified in cluster L. Finally, if both distances are equal, s(i) ≈ 0 and the point could belong to either cluster.
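A possible sketch using scikit-learn's silhouette utilities (data and range of k illustrative); the mean of s(i) over all points summarizes how well separated the clustering is:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # mean s(i): values near 1 -> well separated; near 0 -> overlapping clusters
    print(k, round(silhouette_score(X, labels), 3))
```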

4.2 HIERARCHICAL CLUSTERING


One of the disadvantages of k-Means clustering is that we have to define in advance the number
of clusters we want and perform various tests to see which is the number that best suits our
model.

Hierarchical clustering is another clustering method, in which the information is represented in a tree called a dendrogram. Let's first analyze the dendrogram generation algorithm and then its interpretation.

DENDROGRAM: A dendrogram is a representation of the observations (horizontal axis) and their groupings (vertical axis). The observations are represented in green, as if they were the endings of the filaments of the dendrites of a neuron, while the groupings of observations are represented on the vertical axis as the tree of dendrites, until they reach the nucleus of the neuron.


There are many ways to define a distance between observations. Perhaps the simplest and most used is the Euclidean distance, but it is not the only one; see the list of dissimilarities in the next subsection.

In the figure we see three observations measured on 20 variables. Observations 1 and 3 have similar values, and therefore a lower Euclidean distance between them than with observation 2; however, they are poorly correlated, so a correlation-based dissimilarity would group them differently.

CLUSTER LINKAGE: Suppose we have N observations. The first step is to define some kind of distance between observations (a dissimilarity), which is usually the Euclidean distance, but we can also use the squared Euclidean distance, the Manhattan distance, the Chebyshev distance, the Mahalanobis distance, the Hamming distance, or the Levenshtein distance.

At the beginning of the algorithm, each of the n observations is its own cluster. And this is our initial dendrogram.

We pick the first cluster (point A) and calculate its distance to all the other clusters to find which one is the nearest. Once found (suppose it is C_B), we merge both clusters into a single cluster C_AB.

We have just created a cluster, and our initial dendrogram will have the following shape:

We continue with the next cluster (point C, for instance). But now we need to compute its distance to the other one-point clusters and to one two-point cluster (AB). How should this be done?

To obtain the dissimilarity between clusters we cannot use the Euclidean distance directly. Instead, we define a linkage.


Once the type of inter-cluster distance (linkage) to be used has been decided, we continue with the rest of the one-point clusters and repeat until we get only one cluster.

In the following image we can see different types of linkage. Complete linkage, which measures the maximum distance between clusters, is the most balanced.

We repeat the algorithm until there is only one cluster left and represent the result.

As we can see, we have a representation of the observations (the final axons, in green) and of various unions by proximity between the different clusters (the dendrites).

The way to generate a final clustering from the dendrogram is to decide the height at which to cut it. For example, if we cut at y = 9 we obtain two clusters.

If we cut at y = 5 we are left with 3 clusters. The term hierarchical refers to the fact that we can establish a hierarchy (an order) in the data. If our data refer to objects for which a measure of distance is unclear, hierarchical clustering is not the best option.
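A possible sketch with SciPy's hierarchical clustering tools (the synthetic data, the linkage type and the cut heights are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(20, 2)) for c in (0, 4, 8)])

Z = linkage(X, method="complete", metric="euclidean")   # complete linkage

dendrogram(Z)      # observations on the horizontal axis, merge height on the vertical
plt.show()

labels_high = fcluster(Z, t=9, criterion="distance")    # cut high -> fewer clusters
labels_low = fcluster(Z, t=5, criterion="distance")     # cut lower -> more clusters
print(len(set(labels_high)), len(set(labels_low)))
```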

Decisions when using k-Means clustering


There are many decisions we have to make when applying k-Means (and clustering in general) to a set of observations. To illustrate this, we are going to assume that we are analyzing the shopping carts of some supermarkets and that we have the data on what each customer has bought. That is:


1. Standardization – When the data take very different ranges of values we may be interested in standardizing them. For example, if we are going to consider the expense per customer in our analysis, the fact that a customer buys a toothbrush can change everything.
2. Dissimilarity – Which metric do we have to use? If we use the number of items as an indicator, the distances between products will be very different from those obtained if we use total expenditure (e.g. Coca-Cola).
3. Linkage – To obtain the right linkage, we will have to perform many tests, or use test sets that allow us to determine the most appropriate method for the problem we want to solve.
4. Number of clusters – We have seen this before, both in k-Means and in the cut of the dendrogram.

One of the problems of clustering is that we necessarily assign every observation to a cluster, which is not always appropriate. If, for example, we have a lot of noise, outliers, or measurements for which it is not yet convenient to define a cluster, we will end up with an unrealistic classification. There are mixed (soft) classification methods, such as soft K-means, that can help with this problem; see the book by Hastie et al.

Another problem is stability in the face of changes in the observations: if we remove or add some observations, the clusters can undergo major alterations.

4.3 PCA ANALYSIS (images)


Principal component analysis (PCA) refers to the process by which principal components are computed, and to the subsequent use of these components in understanding the data. PCA is an unsupervised approach, since it involves only a set of features X1, X2, ..., Xp, and no associated response Y. Apart from producing derived variables for use in supervised learning problems, PCA also serves as a tool for data visualization (visualization of the observations or of the variables). It can also be used as a tool for data imputation, that is, for filling in missing values in a data matrix.

Suppose we want to visualize n observations with measurements on a set of p characteristics (X1, X2, ..., Xp) as part of an exploratory data analysis. We could do this by examining two-dimensional scatter plots of the data, each of which contains the measurements of the n observations on two of the features.

However, there are $\binom{p}{2} = p(p-1)/2$ possible charts; for example, with p = 10 there are 45 possibilities! If p is large, it will not be possible to look at them all; in addition, most likely none of them is informative on its own, since each contains only a small fraction of the total information present in the data set. Clearly we need a better method to visualize the n observations when p is large. In particular, we would like to find a low-dimensional representation of the data that captures as much information as possible.

PCA provides a tool to do just this: find a low-dimensional representation of a dataset that contains as much of the variability as possible. The underlying idea is that not all the dimensions we observe are equally interesting. PCA looks for a small number of dimensions that are as representative as possible, where each of the dimensions found is a linear combination of the p characteristics.

The first principal component of a set of features X1, X2, ..., Xp with zero mean is the normalized linear combination of the features (normalized means that the norm of the coefficient vector is 1) that has the largest variance:

$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \dots + \phi_{p1} X_p$$


So, our objective is to find the linear combination of features that accounts for the largest variance. That is, find the coefficients that solve:

$$\max_{\phi_{11},\dots,\phi_{p1}} \; \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{j1} x_{ij} \right)^2$$

subject to the normalization constraint:

$$\sum_{j=1}^{p} \phi_{j1}^2 = 1$$

Since X has zero mean, Z1 also has zero mean, and the problem reduces to maximizing the sample variance of Z1.
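A minimal sketch of the first principal components with scikit-learn (the data are illustrative; PCA centers the columns internally):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # illustrative correlated data

pca = PCA(n_components=2)
Z = pca.fit_transform(X)                       # scores Z1, Z2 for each observation

print(pca.components_[0])                      # loadings phi_11 ... phi_p1
print(np.linalg.norm(pca.components_[0]))      # = 1: the normalization constraint
print(pca.explained_variance_ratio_)           # fraction of variance captured
```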

PCA can also be used to compress images:

1) We split the image into channels, fit a PCA with 20 components, and apply the transformation to each channel.
2) Then we stack the transformed channel data back together to form the image and save it.
3) The figure shows the original image and the reconstructions for PCA = 20, 50, 100 and 200 components.
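A possible per-channel compression sketch (the file name, the use of Pillow and the component count are illustrative assumptions):

```python
import numpy as np
from PIL import Image
from sklearn.decomposition import PCA

def compress_channel(channel, n_components=20):
    """Fit PCA on the rows of one colour channel and reconstruct an approximation."""
    pca = PCA(n_components=n_components)
    reduced = pca.fit_transform(channel)       # compressed representation
    return pca.inverse_transform(reduced)      # approximate reconstruction

img = np.asarray(Image.open("photo.jpg"), dtype=float)   # hypothetical input file
channels = [compress_channel(img[:, :, c], 20) for c in range(img.shape[2])]
reconstructed = np.clip(np.stack(channels, axis=2), 0, 255).astype(np.uint8)
Image.fromarray(reconstructed).save("photo_pca20.jpg")
```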
4.4 AFFINITY PROPAGATION
This method has the advantage that no a priori decision on the number of clusters (k) is needed. It is based on the (negative) Euclidean distance among the different observations.

We then construct three matrices: Similarity (S), Responsibility (R) and Availability (A).

Similarity Matrix: it is constructed using the negative distance between two observations (k, l), s(k, l) = −d(k, l), where d is the Euclidean distance.

For the diagonal (the preference of each point to be an exemplar) we select the lowest value of the whole S matrix to obtain fewer clusters, or the maximum value for the opposite effect. An intermediate value can also be chosen; it is also common to use the median of the column values, the mean, etc.

We fill the diagonal with the lowest value of the whole S matrix.

▪ Availability Matrix: we begin with all the elements of the matrix equal to zero, a(a, b) = 0.

▪ Responsibility Matrix: then we evaluate the responsibilities,

$$r(a, b) = s(a, b) - \max_{b' \neq b} \{ a(a, b') + s(a, b') \}$$

Since a = 0 in the first iteration, r(a, b) is simply s(a, b) minus the maximum remaining value of its row.

Availability Matrix – We compute this matrix with the following formulas:

$$a(a, b) = \min\Big(0,\; r(b, b) + \sum_{a' \notin \{a, b\}} \max(0, r(a', b))\Big) \quad \text{for } a \neq b$$

$$a(b, b) = \sum_{a' \neq b} \max(0, r(a', b))$$


Now, we construct the Criterion Matrix, C = R + A. For each observation (row), the column with the largest criterion value identifies its exemplar, and observations that share the same exemplar form a cluster.
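A minimal usage sketch with scikit-learn's AffinityPropagation (the data are illustrative; the preference parameter plays the role of the diagonal of S):

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

ap = AffinityPropagation(affinity="euclidean",  # negative squared Euclidean similarities
                         preference=None,       # None -> median similarity on the diagonal
                         random_state=0)
labels = ap.fit_predict(X)

print(len(ap.cluster_centers_indices_))   # number of clusters found, not fixed beforehand
```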

4.5 BIRCH ANALYSIS


Balanced Iterative Reducing and Clustering using
Hierarchies (BIRCH) is a clustering algorithm
(unsupervised) very suitable for large amounts of
data.

The main difference with other algorithms is that it doesn't need to load all the data into memory. It is also used in data mining combined with other clustering methods.

It uses a tree-like structure called a Clustering Feature Tree (CF Tree), where each node of the tree contains several Clustering Features (CF), which are either leaves or inner nodes.

Each CF has the form (N, LS, SS):

▪ N – the number of data points (samples) summarized in the CF.
▪ LS – a vector with the sum of the features of the members of the cluster.
▪ SS – a vector with the sum of the squares of the features of the members of the cluster.

So, a CF is a multi-vector (N, LS, SS) and can be added to another CF: (N1 + N2, LS1 + LS2, SS1 + SS2).

A CF Tree has a root node comprised of several CFs, each of them pointing to an inner CF node, which is also built of several CFs, each pointing in turn to another CF node, until we reach the leaves (no more CF nodes). The data are summarized in the leaf nodes; inner and root nodes hold only aggregates. The vector of any CF node equals the sum of the vectors of its child nodes, down to the leaves:

CF1 = CF7 + ··· + CF12

CF7 = CF90 + ··· + CF94

Three important parameters:

• B = branching factor – the maximum number of CFs in each internal node.
• L = the maximum number of CFs in each leaf node.
• T = the maximum sample diameter (threshold) of each CF in a node.


Also, we can define some statistics for every cluster (CF): the centroid $X_0 = LS/N$, the radius $R = \sqrt{SS/N - (LS/N)^2}$ and the diameter $D = \sqrt{\dfrac{2N \cdot SS - 2\,LS^2}{N(N-1)}}$, i.e. the average pairwise distance inside the cluster.

HOW TO CREATE A TREE?

1. Define the initial parameters: B, L and T.
2. Populate the first CF (leaf node) with the first sample, and update all its statistics (N, LS, SS, X0, R, D).
3. Try to add the next sample:
4. Look at the root-level nodes and select the node whose centroid is closest to the sample value.
5. Descend through the tree and select the appropriate leaf (the one whose centroid is closest to the sample).
6. Evaluate the diameter of the resulting CF.
7. If it is less than T, add the sample to that CF and update the statistics (N, LS, SS, X0) of the whole branch.
8. If it is greater than T, create a new CF at the same level and update.
9. If there are already B CFs and it is not possible to add a new one, split the node and update the statistics.
10. To avoid the tree becoming unbalanced, redistribute the nodes.
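A minimal sketch of the CF bookkeeping used in the worked example below (scalar samples as in the notes; the tree structure and threshold logic are omitted):

```python
import math

def cf_add(cf, x):
    """Add a scalar sample x to a clustering feature (N, LS, SS)."""
    N, LS, SS = cf
    return (N + 1, LS + x, SS + x * x)

def cf_merge(cf1, cf2):
    """CFs are additive: (N1+N2, LS1+LS2, SS1+SS2)."""
    return tuple(a + b for a, b in zip(cf1, cf2))

def centroid(cf):
    N, LS, _ = cf
    return LS / N

def diameter(cf):
    """Average pairwise distance inside the cluster."""
    N, LS, SS = cf
    if N < 2:
        return 0.0
    return math.sqrt((2 * N * SS - 2 * LS * LS) / (N * (N - 1)))

cf = cf_add((1, 22, 484), 9)    # try to merge samples 22 and 9 (step 2 of the example)
print(cf, diameter(cf))         # (2, 31, 565), D = 13.0 > T = 5 -> a new leaf is needed
```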

EXAMPLE: Consider the following set: {22, 9, 12, 15, 18, 27, 11, 36, 10, 3, 14, 32}, with B = 2, L = 5, T = 5.

Step 1. 22

Step 2. 9
We try to insert it into the first node and evaluate the statistics:
CF = (N, LS, SS) = (2, 22 + 9, 22² + 9²) = (2, 31, 565), which gives D = 13.

Since 13 > T, we cannot insert 9 into the first node, so we create a new same-level node (leaf node).

Step 3. 12
We look at the centroids. 12 is closest to the second node. We try to insert it there and evaluate the diameter to compare with T: CF = (N, LS, SS) = (2, 9 + 12, 9² + 12²) = (2, 21, 225) and D = 3. Since D < T, we insert it and update the statistics.

Step 4. 15


We look at the centroids. 15 is closest to 10.5, the second node. We try to insert it there and evaluate the diameter to compare with T: CF = (N, LS, SS) = (3, 21 + 15, 225 + 15²) = (3, 36, 450) and D = 4.24. Since D < T, we insert it and update the statistics.

Step 5. 18
We look at the centroids. 18 is closest to 22, the first node. We try to insert it there and evaluate the diameter to compare with T: CF = (N, LS, SS) = (2, 22 + 18, 22² + 18²) = (2, 40, 808) and D = 4. Since D < T, we insert it and update the statistics.

Step 6. 27
We look at the centroids. 27 is closest to 20, the first node. We try to insert it there and evaluate the diameter to compare with T: CF = (N, LS, SS) = (3, 40 + 27, 808 + 27²) = (3, 67, 1537) and D = 6.37.
Since D > T we need to create a new node, but B = 2 and we cannot add another CF at the root level ➔ we split the tree.

Step 7. 11
We look at the centroids of the root level. 11 is closest to 12, the second node. We descend and select its only leaf, (3, 36, 450). We try to insert 11 there and evaluate the diameter to compare with T: CF = (N, LS, SS) = (4, 47, 571) and D = 3.53. Since D < T, we insert it and update the statistics.

Step 8. 36

We look at the centroids of the root level. 36 is closest to 22.33, the first root node. We descend and look at the centroids again: 36 is closest to 27.
We try to insert it there and evaluate the diameter to compare with T: CF = (N, LS, SS) = (2, 63, 2025) and D = 9.
Since D > T we need a new leaf node, but there is no space to create one in this branch (B = 2), so we would have to split the node and create a new branch. However, there is space in the other same-level node, so we redistribute the nodes instead:


Now, we look again at the centroid of the moved leaf cluster: CF = (N, LS, SS) = (2, 63, 2022) and D = 8.6.
We add it to the free leaf and update the statistics of that leaf node and of all the upper nodes.

We can continue with this procedure until we obtain the complete tree:

- The clusters are the leaf nodes of the tree.
- After creating the tree, we already have all the clusters.
- To add new data, we don't need to look at the whole tree.
- Nor do we need to re-evaluate the parameters of the whole tree, only those of the affected nodes.
- The complexity of this method is O(n).
- With BIRCH we can categorize data streams in a continuous (online) way.

And what if we have multi-featured data? Instead of a single triple {N, LS, SS} we will have a multi-vector: {{N1, LS1, SS1}, {N2, LS2, SS2}, … {Nn, LSn, SSn}}.
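A minimal usage sketch with scikit-learn's Birch (the data and parameter values are illustrative; threshold and branching_factor correspond to T and B above):

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=4, random_state=42)

birch = Birch(threshold=0.5,        # T: maximum diameter of a subcluster
              branching_factor=50,  # B: maximum number of CF subclusters per node
              n_clusters=4)         # optional global clustering of the CF leaves
labels = birch.fit_predict(X)

# New data can be added incrementally, without rebuilding the whole tree
X_new, _ = make_blobs(n_samples=100, centers=4, random_state=1)
birch.partial_fit(X_new)
```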

