K-means & GMM
Supervised learning
• Supervised learning, as the name indicates, involves the presence of a supervisor acting as a teacher.
• Basically, supervised learning means we teach or train the machine using data that is well labeled, i.e., some data is already tagged with the correct answer. The machine is then provided with a new set of examples (data) so that the supervised learning algorithm analyses the training data (set of training examples) and produces a correct outcome from the labeled data.
• For instance, suppose you are given a basket filled with different kinds of
fruits. Now the first step is to train the machine with all the different fruits
one by one like this:
• If the shape of the object is rounded with a depression at the top and it is red in color, then it will be labeled as Apple.
• If the shape of the object is a long curved cylinder with a green-yellow color, then it will be labeled as Banana.
• Now suppose that, after training, the machine is given a new fruit from the basket, say a banana, and is asked to identify it.
• Since the machine has already learned from the previous data, it now has to use that knowledge wisely. It will first classify the fruit by its shape and color, confirm the fruit name as BANANA, and put it in the Banana category. Thus the machine learns from the training data (the basket of fruits) and then applies that knowledge to the test data (the new fruit).
• Supervised learning is classified into two categories of algorithms:
• Classification: A classification problem is when the output variable is a category, such as “red” or “blue”, or “disease” or “no disease”.
• Regression: A regression problem is when the output variable is a real value, such as “dollars” or “weight”.
• Supervised learning deals with or learns from “labeled” data, which implies that some data is already tagged with the correct answer.
Types:-
• Regression
• Logistic Regression
• Classification
• Naive Bayes Classifiers
• K-NN (k nearest neighbors)
• Decision Trees
• Support Vector Machine
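As an illustration of the supervised workflow above (train on labeled data, then predict on new, unseen examples), here is a minimal sketch using one of the listed algorithms, K-NN, on scikit-learn's built-in Iris dataset; the dataset choice and the value of k are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# labeled data: features X and correct answers y
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)   # K-NN, one of the algorithms listed above
knn.fit(X_train, y_train)                   # learn from the labeled training data
print(knn.score(X_test, y_test))            # accuracy on new, unseen examples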
• Advantages:-
• Supervised learning allows the collection of data and produces outputs based on previous experience.
• It helps to optimize performance criteria with the help of experience.
• Supervised machine learning helps to solve various types of real-world computation problems.
Disadvantages:-
• Classifying big data can be challenging.
• Training a supervised learning model needs a lot of computation time, so it requires a lot of time.
Unsupervised Learning:
• Unsupervised learning is the training of a machine using information that
is neither classified nor labeled and allowing the algorithm to act on that
information without guidance. Here the task of the machine is to group
unsorted information according to similarities, patterns, and differences
without any prior training of data.
• Unlike supervised learning, no teacher is provided, which means no training will be given to the machine. Therefore the machine is restricted to finding the hidden structure in unlabeled data by itself.
• For instance, suppose the machine is given an image containing both dogs and cats, which it has never seen before.
• The machine has no idea about the features of dogs and cats, so we cannot categorize the images as ‘dogs’ and ‘cats’. But it can categorize them according to their similarities, patterns, and differences, i.e., we can easily divide the picture above into two parts. The first part may contain all pictures having dogs in them, and the second part may contain all pictures having cats in them. Here the machine has not learned anything beforehand, which means there is no training data or examples.
• It allows the model to work on its own to discover patterns and information that were previously undetected. It mainly deals with unlabelled data.
• Unsupervised learning is classified into two categories of algorithms:
• Clustering: A clustering problem is where you want to discover the inherent
groupings in the data, such as grouping customers by purchasing behavior.
• Association: An association rule learning problem is where you want to
discover rules that describe large portions of your data, such as people that
buy X also tend to buy Y.
Types of Unsupervised Learning:-
• Clustering Types:-
• Hierarchical clustering
• K-means clustering
• Principal Component Analysis
• Singular Value Decomposition
• Independent Component Analysis
Clustering
• Clustering is the method of dividing objects into sets such that objects in the same set are similar to each other and dissimilar to objects belonging to other sets. There are two different types of clustering, each divisible into two subsets.
Types of Clustering
Clustering is a type of unsupervised learning wherein data points are grouped
into different sets based on their degree of similarity.
The various types of clustering are:
• Hierarchical clustering
• Partitioning clustering
Hierarchical clustering is further subdivided into:
• Agglomerative clustering
• Divisive clustering
Partitioning clustering is further subdivided into:
• K-Means clustering
• Fuzzy C-Means clustering
Hierarchical Clustering.
Step 4:
• Keep repeating step 2 and step 3 until convergence is achieved.
Elbow Method
• The Elbow method is one of the most popular ways to find the optimal number
of clusters. This method uses the concept of WCSS value. WCSS stands
for Within Cluster Sum of Squares, which defines the total variations within a
cluster. The formula to calculate the value of WCSS (for 3 clusters) is given
below:
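For three clusters with centroids C1, C2, and C3, the standard form of this expression is:

WCSS = Σ(Pi in Cluster1) distance(Pi, C1)² + Σ(Pi in Cluster2) distance(Pi, C2)² + Σ(Pi in Cluster3) distance(Pi, C3)²

where each term sums the squared distance between every point Pi in a cluster and that cluster's centroid.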
To find the optimal value of clusters, the elbow method follows the below steps:
• It executes the K-means clustering on a given dataset for different K values
(ranges from 1-10).
• For each value of K, calculates the WCSS value.
• Plots a curve between calculated WCSS values and the number of clusters K.
• The sharp point of the bend, where the plot looks like an arm, is considered the best value of K.
• Because the graph shows a sharp bend that looks like an elbow, this technique is known as the elbow method. The graph for the elbow method looks like the image below:
Note: We can choose the number of clusters equal to the number of data points. If we do so, the value of WCSS becomes zero, and that will be the endpoint of the plot.
Python Implementation of the K-means Clustering Algorithm
• Before the implementation, let's understand what type of problem we will solve here. We have a Mall_Customers dataset, which contains data about customers who visit the mall and spend there.
• In the given dataset, we have Customer_Id, Gender, Age, Annual Income ($), and Spending Score (a calculated value of how much a customer has spent in the mall; the higher the value, the more the customer has spent). From this dataset, we need to find some patterns; as this is an unsupervised method, we don't know exactly what to look for.
The steps to be followed for the implementation are given below:
• Data Pre-processing
• Finding the optimal number of clusters using the elbow method
• Training the K-means algorithm on the training dataset
• Visualizing the clusters
• Step-1: Data pre-processing Step
• Importing Libraries
Firstly, we will import the libraries for our model, which is part of data pre-
processing. The code is given below:
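A minimal sketch of these imports, using the alias mtp for matplotlib since that alias is used later in this section:

import numpy as np                 # numpy for mathematical calculations
import matplotlib.pyplot as mtp    # matplotlib for plotting graphs
import pandas as pd                # pandas for managing the dataset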
• numpy is for performing mathematical calculations.
• matplotlib is for plotting graphs, and pandas is for managing the dataset.
• Importing the dataset:
Next, we will import the dataset that we need to use. Here, we are using the Mall_Customer_data.csv dataset. It can be imported using the code below:
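A one-line sketch, assuming the file name given in the text above:

dataset = pd.read_csv('Mall_Customer_data.csv')   # load the mall customers data into a DataFrame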
From the above dataset, we need to find some patterns in it.
Extracting Independent Variables
• Here we do not need any dependent variable for the data pre-processing step, as this is a clustering problem and we have no idea what to determine. So we will just add a line of code for the matrix of features, shown below.
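A minimal sketch, assuming the Annual Income and Spending Score columns (indices 3 and 4) are the two features to cluster on, as suggested by the visualization later in this section:

x = dataset.iloc[:, [3, 4]].values   # matrix of features: Annual Income and Spending Score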
Step-2: Finding the optimal number of clusters using the elbow method
• In the second step, we will try to find the optimal number of clusters for our
clustering problem. So, as discussed above, here we are going to use the elbow
method for this purpose.
• As we know, the elbow method uses the WCSS concept to draw the plot by
plotting WCSS values on the Y-axis and the number of clusters on the X-axis. So
we are going to calculate the value for WCSS for different k values ranging from 1
to 10. Below is the code for it:
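A sketch of the elbow-method loop, consistent with the description that follows and assuming the feature matrix x from the previous step; the init and random_state settings are illustrative:

from sklearn.cluster import KMeans

wcss_list = []                       # holds the WCSS value for each k
for i in range(1, 11):               # k from 1 to 10; range() excludes the upper bound
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)   # inertia_ is the within-cluster sum of squares
mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('wcss_list')
mtp.show()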
• As we can see in the above code, we have used the KMeans class of the sklearn.cluster library to form the clusters.
• Next, we have created the wcss_list variable, initialized to an empty list, which is used to hold the WCSS value computed for different values of k ranging from 1 to 10.
• After that, we have initialized a for loop to iterate over the different values of k from 1 to 10; since Python's range excludes the upper bound, it is written as range(1, 11) so that the 10th value is included.
• The rest of the code is similar to what we did in earlier topics: we fit the model on the matrix of features and then plot the graph between the number of clusters and WCSS.
From the above plot, we can see that the elbow point is at 5 clusters.
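Step-3: Training the K-means algorithm on the training dataset
A two-line sketch of this step, assuming the KMeans import and the feature matrix x from above; the random_state is illustrative:

kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)   # 5 clusters, from the elbow plot
y_predict = kmeans.fit_predict(x)    # fit the model and predict a cluster index for each customer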
• The first line is the same as above: it creates an object of the KMeans class, this time with 5 clusters.
• In the second line of code, we have created the variable y_predict, which fits the model and predicts the cluster for each data point.
• By executing the above lines of code, we will get the y_predict variable. We can
check it under the variable explorer option in the Spyder IDE. We can now
compare the values of y_predict with our original dataset.
• From the above image, we can now see that CustomerID 1 belongs to cluster 3 (since the index starts from 0, the predicted label 2 corresponds to cluster 3), CustomerID 2 belongs to cluster 4, and so on.
Step-4: Visualizing the Clusters
• The last step is to visualize the clusters. As we have 5 clusters for our model, we will visualize each cluster one by one.
• To visualize the clusters, we will use a scatter plot drawn with the mtp.scatter() function of matplotlib, as sketched below.
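A sketch of the plotting code described below; the colors and labels are illustrative choices:

mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s=100, c='blue', label='Cluster 1')
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s=100, c='green', label='Cluster 2')
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s=100, c='red', label='Cluster 3')
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s=100, c='cyan', label='Cluster 4')
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s=100, c='magenta', label='Cluster 5')
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroids')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income ($)')
mtp.ylabel('Spending Score')
mtp.legend()
mtp.show()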
• In the above lines of code, we have written one scatter call for each cluster, ranging from 1 to 5. The first argument of mtp.scatter, e.g., x[y_predict == 0, 0], selects the rows of the feature matrix assigned to a given cluster and takes their first feature as the x value, while the second argument, x[y_predict == 0, 1], takes their second feature as the y value.
Output:
• The output image clearly shows the five different clusters in different colors. The clusters are formed between two parameters of the dataset: the annual income of the customer and the spending score. We can change the colors and labels as per requirement or choice. We can also observe some points from the above pattern, which are given below:
• Cluster1 shows the customers with average income and average spending, so we can categorize these customers as standard.
• Cluster2 shows the customers with high income but low spending, so we can categorize them as careful.
• Cluster3 shows customers with low income and also low spending, so they can be categorized as sensible.
• Cluster4 shows customers with low income but very high spending, so they can be categorized as careless.
• Cluster5 shows the customers with high income and high spending, so they can be categorized as target customers, and these can be the most profitable customers for the mall owner.
GMMs
• Gaussian mixture models (GMMs) are a type of machine learning algorithm. They are used to group data into different categories based on probability distributions.
What are Gaussian mixture models (GMM)?
• Gaussian mixture models are also relatively robust to outliers, meaning that they can still yield accurate results even if some data points do not fit neatly into any of the clusters. This makes GMMs a flexible and powerful tool for clustering data. A GMM can be understood as a probabilistic model in which a Gaussian distribution is assumed for each group, and the mean and covariance of that distribution define its parameters.
• Each GMM component consists of two parts, a mean vector (μ) and a covariance matrix (Σ), combined with a mixing weight, so the overall density is p(x) = Σ_k π_k · N(x | μ_k, Σ_k).
• In general, K-means will be faster and more accurate when the data set is large
and the clusters are well-separated. Gaussian mixture models will be more
accurate when the data set is small or the clusters are not well-separated.
• Gaussian mixture models take into account the variance of the data, whereas
K-means does not.
• Gaussian mixture models are more flexible in terms of the shape of the
clusters, whereas K-means is limited to spherical clusters.
• Gaussian mixture models can handle missing data, whereas K-means cannot.
This difference can make Gaussian mixture models more effective in certain
applications, such as data with a lot of noise or data that is not well-defined.
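To make the contrast with K-means concrete, here is a minimal sketch on synthetic, elongated (non-spherical) clusters; the data, component count, and random seeds are illustrative assumptions:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# two elongated Gaussian blobs that spherical K-means tends to split poorly
a = rng.multivariate_normal([0, 0], [[4.0, 3.0], [3.0, 4.0]], size=200)
b = rng.multivariate_normal([6, 0], [[4.0, -3.0], [-3.0, 4.0]], size=200)
X = np.vstack([a, b])

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0).fit(X)
gmm_labels = gmm.predict(X)        # hard cluster assignments
gmm_probs = gmm.predict_proba(X)   # soft, probabilistic assignments per component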
What are the scenarios when Gaussian mixture models can be used?
• Gaussian mixture models can be used for anomaly detection: by fitting a model to a dataset and then scoring new data points, it is possible to flag points that are significantly different from the rest of the data (i.e., outliers). This can be useful for identifying fraud or detecting errors in data collection (see the sketch after this list).
• In the case of time series analysis, GMMs can be used to discover how volatility is related to trends and noise, which can help predict future stock prices. One cluster could consist of a trend in the time series, while another could capture noise and volatility from other factors such as seasonality or external events which affect the stock price. GMMs can separate out these clusters because they provide a probability for each category instead of simply dividing the data into two parts, as K-means does.
What are the scenarios when Gaussian mixture models can be used?
• Another example is when there are different groups in a dataset and it’s hard
to label them as belonging to one group or another which makes it difficult for
other machine learning algorithms such as the K-means clustering algorithm to
separate out the data. GMMs can be used in this case because they find
Gaussian mixture models that best describe each group and provide a
probability for each cluster which is helpful when labeling clusters.
• Because Gaussian mixture models can generate synthetic data points that are similar to the original data, they can also be used for data augmentation (see the sketch below).
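A short sketch of these two uses, anomaly scoring and synthetic-data generation, assuming a GaussianMixture model gmm already fitted to data X as in the earlier sketch; the 2% threshold is an illustrative choice:

import numpy as np

log_density = gmm.score_samples(X)            # log-likelihood of each point under the mixture
threshold = np.percentile(log_density, 2)     # treat the least likely 2% of points as outliers
outliers = X[log_density < threshold]

X_synthetic, component_labels = gmm.sample(100)   # 100 synthetic points for data augmentation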
What are some real-world examples where Gaussian mixture models can be used?
• Gaussian mixture models are very useful when there are large datasets and it is difficult to find clusters. They are able to find clusters of Gaussians more efficiently than other clustering algorithms such as k-means.
• Finding patterns in medical datasets: GMMs can be used for segmenting
images into multiple categories based on their content or finding specific
patterns in medical datasets. They can be used to find clusters of patients with
similar symptoms, identify disease subtypes, and even predict outcomes. In
one recent study, a Gaussian mixture model was used to analyze a dataset of
over 700,000 patient records. The model was able to identify previously
unknown patterns in the data, which could lead to better treatment for
patients with cancer.
Why do we need Gaussian Mixture Models?
• The two most common areas of machine learning are supervised learning and unsupervised learning. We can easily distinguish between these two types based on the nature of the data they use and the approaches they take toward solving problems. To cluster points based on similar characteristics, we make use of clustering algorithms. Let's assume that we have the following dataset:
Why do we need Gaussian Mixture Models?
• Our goal is to find the groups of points that are close to each other. There are two different groups, which we will color blue and red.
• One of the most popular clustering techniques is the K-means clustering algorithm, which follows an iterative approach to update the parameters of each cluster. We compute the mean of each cluster and then calculate the distance of every data point to each of those means. The algorithm then labels each data point by assigning it to its closest centroid. The process is repeated until some convergence criterion is achieved.
PCA
• Data lies at the heart of Machine Learning and Data Science. During data collection, we generally record multiple attributes so as not to lose any critical information. For example, if there is a requirement to collect weather data, we record various attributes like humidity, atmospheric pressure, temperature, wind speed, etc. It may be possible that we will not use all of them in the model-building process. Still, we collect all of them, because if a project has some special requirements and we don't have that attribute, we would have to start the collection process from scratch.
PCA
• If we want to reduce the dimensionality, we can remove some of the last PCs. Some information will be lost, but this loss will be insignificant, as most of the variation is retained by the PCs at the start. That's why we used the term "much" earlier. Also, note that we equate the term "variation" with the term "information", as attributes having maximum variation contain more information than others. So, to summarize, we can say that PCA is an unsupervised, non-parametric, statistical machine learning approach used to reduce dimensionality.
• PCA was first introduced by Pearson in 1901 and later developed independently by Hotelling in 1933. Although it originated from multivariate data problems, it now has a broader range of applications, such as denoising signals, blind source separation, and data compression.
https://pub.towardsai.net/basic-linear-algebra-for-deep-learning-and-machine-learning-ml-python-tutorial-444e23db3e9e
• This is the scenario when the rotating line meets the purple lines (the fourth image, the most suitable case), which we call the case of maximum variance. So our first PC will lie along the line which passes through the two purple lines.
• Similarly, the same procedure is followed for the second PC to find the maximum variance, but with the additional constraint that it should be perpendicular to the first PC. Eigenvalues and eigenvectors help us find these directions. Eigenvectors are the unit vectors from the origin which correspond to the directions along which the PCs will lie. Eigenvalues are the coefficients attached to the eigenvectors, and they tell us how much variance is carried along the corresponding eigenvectors.
• For our dataset, the eigenvalues and eigenvectors are given in the images below. Every row in the eigenvector matrix corresponds to one eigenvector. Please note that the magnitude of every eigenvector is 1 (unit vector), and they are perpendicular to each other. To check their orthogonality, we can compute the dot product of vector_1 = [-0.73517866, -0.6778734] and vector_2 = [0.6778734, -0.73517866]. If this product is zero, then the vectors are orthogonal.
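A small numpy check of the two properties mentioned above (unit length and orthogonality), using the eigenvector values quoted in the text:

import numpy as np

vector_1 = np.array([-0.73517866, -0.6778734])
vector_2 = np.array([0.6778734, -0.73517866])

print(np.dot(vector_1, vector_2))                           # approximately 0, so they are orthogonal
print(np.linalg.norm(vector_1), np.linalg.norm(vector_2))   # both approximately 1 (unit vectors)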
Note: We can choose to retain both dimensions as well. But to demonstrate the reduction, we have decided to ignore the second PC.
Step 5: Forming the newer feature set along the principal component axes
• Once we have decided how many components we want to keep, we need to form the matrix (also known as the feature vector) with the selected eigenvectors as column entries. As we already know, every row of the eigenvector matrix corresponds to one eigenvector. To form the feature matrix, we place the transpose of the selected eigenvectors as column entries, e.g., [eigenvector_1.T, eigenvector_2.T, ..., eigenvector_N.T]. Please note that we will use eigenvectors corresponding only to those PCs that we have selected to keep. In our case, we decided to keep only 1 PC to realize the reduction. Hence:
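A minimal numpy sketch of this projection step, assuming the data has already been standardized; the sample values are illustrative, while the eigenvector is the one quoted earlier:

import numpy as np

# illustrative standardized data: rows are samples, columns are the two original features
standardized_data = np.array([[ 0.5,  0.3],
                              [-1.2, -0.9],
                              [ 0.7,  0.6]])

# selected eigenvector (first PC) placed as a column of the feature vector
eigenvector_1 = np.array([-0.73517866, -0.6778734])
feature_vector = eigenvector_1.reshape(-1, 1)

# project every sample onto the retained principal component
reduced_data = standardized_data @ feature_vector   # shape: (n_samples, 1)
print(reduced_data)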
PCA
• The number of samples: Ideally, the number of samples in the data should be five times the number of features present.
• Features are correlated: We assume that the features present in the original set are correlated, so that the components formed after PCA preserve the original information.
• No outliers are present: Outliers can affect the overall variance of the data, and we know PCA gives importance to features having high variance. Standardization generally mitigates the effect of outliers in the data, but removing outliers before applying standardization is more helpful.
Linear Discriminant Analysis (LDA)
Example:
• Let's assume we have to classify two different classes having two sets of data points in a 2-dimensional plane, as shown in the image below:
Linear Discriminant Analysis (LDA)
To create a new axis, Linear Discriminant Analysis uses the following criteria:
• It maximizes the distance between the means of the two classes.
• It minimizes the variation (scatter) within each class.
Why LDA?
• Logistic regression is one of the most popular classification algorithms; it performs well for binary classification but falls short in the case of multi-class classification problems with well-separated classes, whereas LDA handles these quite efficiently.
• LDA can also be used in data pre-processing to reduce the number of features, just like PCA, which reduces the computing cost significantly.
• LDA is also used in face detection algorithms. In Fisherfaces, LDA is used to extract useful data from different faces. Coupled with eigenfaces, it produces effective results. A small usage sketch follows below.
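A minimal sketch of LDA used for supervised dimensionality reduction and classification, using scikit-learn's built-in Iris dataset; the dataset and the number of components are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)   # at most (number of classes - 1) components
X_reduced = lda.fit_transform(X, y)                # project the features onto the discriminant axes
print(X_reduced.shape)                             # (150, 2)
print(lda.score(X, y))                             # accuracy when LDA is used as a classifier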
Applications of LDA
• Face recognition is a popular application of computer vision, where each face is represented as a combination of a number of pixel values. In this case, LDA is used to reduce the number of features to a manageable number before going through the classification process. It generates a new template in which each dimension consists of a linear combination of pixel values. If the linear combination is generated using Fisher's linear discriminant, then it is called a Fisher face.
• In the medical field, LDA has a great application in classifying patient disease on the basis of various parameters of patient health and the ongoing medical treatment. On such parameters, it classifies disease as mild, moderate, or severe. This classification helps the doctors in either increasing or decreasing the pace of the treatment.
Linear Discriminant Analysis (LDA)
• Step-1: Compute the global mean (M) using the samples from patients and non-patients.
• Step-2: Compute the statistics for the patients, as sketched below.
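A small numpy sketch of these first two steps; the patient and non-patient measurements are purely illustrative made-up values:

import numpy as np

# illustrative 2-feature measurements for each class
patients = np.array([[2.0, 3.1], [2.3, 3.4], [1.9, 2.9]])
non_patients = np.array([[5.1, 6.0], [4.8, 5.7], [5.3, 6.2]])

# Step-1: global mean M over all samples (patients and non-patients together)
all_samples = np.vstack([patients, non_patients])
M = all_samples.mean(axis=0)

# Step-2: class statistics for the patients (mean vector and covariance matrix)
mean_patients = patients.mean(axis=0)
cov_patients = np.cov(patients, rowvar=False)

print(M, mean_patients, cov_patients, sep='\n')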