K-means & GMM
Supervised learning
• Supervised learning, as the name indicates, involves the presence of a supervisor acting as a teacher.
• Basically, supervised learning means we teach or train the machine using data that is well labeled, i.e., some data is already tagged with the correct answer. The machine is then provided with a new set of examples (data) so that the supervised learning algorithm analyses the training data (set of training examples) and produces a correct outcome from the labeled data.
• For instance, suppose you are given a basket filled with different kinds of
fruits. Now the first step is to train the machine with all the different fruits
one by one like this:
• If the shape of the object is rounded with a depression at the top and it is red in color, then it will be labeled as Apple.
• If the shape of the object is a long curved cylinder with a green-yellow color, then it will be labeled as Banana.
• Now suppose that, after training, the machine is given a new fruit from the basket, say a banana, and is asked to identify it.
• Since the machine has already learned from the previous data, it now has to use that knowledge wisely. It will first classify the fruit by its shape and color, confirm the fruit name as BANANA, and put it in the Banana category. Thus the machine learns from the training data (the basket of fruits) and then applies that knowledge to the test data (the new fruit).
• Supervised learning is classified into two categories of algorithms:
• Classification: A classification problem is when the output variable is a category, such as “red” or “blue”, or “disease” or “no disease”.
• Regression: A regression problem is when the output variable is a real value, such as “dollars” or “weight”.
• Supervised learning deals with or learns from “labeled” data, which implies that some data is already tagged with the correct answer.
Types:-
• Regression
• Logistic Regression
• Classification
• Naive Bayes Classifiers
• K-NN (k nearest neighbors)
• Decision Trees
• Support Vector Machine
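As an illustration of the supervised workflow above (train on labeled data, then predict on new, unseen examples), here is a minimal sketch using one of the listed algorithms, K-NN, on scikit-learn's built-in Iris dataset; the dataset choice and the value of k are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# labeled data: features X and correct answers y
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)   # K-NN, one of the algorithms listed above
knn.fit(X_train, y_train)                   # learn from the labeled training data
print(knn.score(X_test, y_test))            # accuracy on new, unseen examples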
• Advantages:-
• Supervised learning allows the collection of data and produces outputs based on previous experience.
• It helps to optimize performance criteria with the help of experience.
• Supervised machine learning helps to solve various types of real-world computation problems.
Disadvantages:-
• Classifying big data can be challenging.
• Training a supervised learning model needs a lot of computation time, so it requires a lot of time.
Unsupervised Learning:
• Unsupervised learning is the training of a machine using information that
is neither classified nor labeled and allowing the algorithm to act on that
information without guidance. Here the task of the machine is to group
unsorted information according to similarities, patterns, and differences
without any prior training of data.
• Unlike supervised learning, no teacher is provided, which means no training will be given to the machine. Therefore the machine is restricted to finding the hidden structure in unlabeled data by itself.
• For instance, suppose the machine is given an image containing both dogs and cats, which it has never seen before.
• The machine has no idea about the features of dogs and cats, so we cannot categorize the images as ‘dogs’ and ‘cats’. But it can categorize them according to their similarities, patterns, and differences, i.e., we can easily divide the picture above into two parts. The first part may contain all pictures having dogs in them, and the second part may contain all pictures having cats in them. Here the machine has not learned anything beforehand, which means there is no training data or examples.
• It allows the model to work on its own to discover patterns and information that were previously undetected. It mainly deals with unlabelled data.
• Unsupervised learning is classified into two categories of algorithms:
• Clustering: A clustering problem is where you want to discover the inherent
groupings in the data, such as grouping customers by purchasing behavior.
• Association: An association rule learning problem is where you want to
discover rules that describe large portions of your data, such as people that
buy X also tend to buy Y.
Types of Unsupervised Learning:-
• Clustering Types:-
• Hierarchical clustering
• K-means clustering
• Principal Component Analysis
• Singular Value Decomposition
• Independent Component Analysis
Clustering
• Clustering is the method of dividing objects into sets such that objects in the same set are similar to each other and dissimilar to objects belonging to other sets. There are two different types of clustering, each divisible into two subsets.
Types of Clustering
Clustering is a type of unsupervised learning wherein data points are grouped
into different sets based on their degree of similarity.
The various types of clustering are:
• Hierarchical clustering
• Partitioning clustering
Hierarchical clustering is further subdivided into:
• Agglomerative clustering
• Divisive clustering
Partitioning clustering is further subdivided into:
• K-Means clustering
• Fuzzy C-Means clustering
Hierarchical Clustering.
Step 4:
• Keep repeating step 2 and step 3 until convergence is achieved.
Elbow Method
• The Elbow method is one of the most popular ways to find the optimal number
of clusters. This method uses the concept of WCSS value. WCSS stands
for Within Cluster Sum of Squares, which defines the total variations within a
cluster. The formula to calculate the value of WCSS (for 3 clusters) is given
below:
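For three clusters with centroids C1, C2, and C3, the standard form of this expression is:

WCSS = Σ(Pi in Cluster1) distance(Pi, C1)² + Σ(Pi in Cluster2) distance(Pi, C2)² + Σ(Pi in Cluster3) distance(Pi, C3)²

where each term sums the squared distance between every point Pi in a cluster and that cluster's centroid.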
To find the optimal value of clusters, the elbow method follows the below steps:
• It executes the K-means clustering on a given dataset for different K values
(ranges from 1-10).
• For each value of K, calculates the WCSS value.
• Plots a curve between calculated WCSS values and the number of clusters K.
• The sharp point of the bend, where the plot looks like an arm, is considered the best value of K.
• Because the graph shows a sharp bend that looks like an elbow, this technique is known as the elbow method. The graph for the elbow method looks like the image below:
Note: We can choose the number of clusters equal to the number of data points. If we do so, the value of WCSS becomes zero, and that will be the endpoint of the plot.
Python Implementation of the K-means Clustering Algorithm
• Before the implementation, let's understand what type of problem we will solve here. We have a Mall_Customers dataset, which contains data about customers who visit the mall and spend there.
• In the given dataset, we have Customer_Id, Gender, Age, Annual Income ($), and Spending Score (a calculated value of how much a customer has spent in the mall; the higher the value, the more the customer has spent). From this dataset, we need to find some patterns; as this is an unsupervised method, we don't know exactly what to look for.
The steps to be followed for the implementation are given below:
• Data Pre-processing
• Finding the optimal number of clusters using the elbow method
• Training the K-means algorithm on the training dataset
• Visualizing the clusters
• Step-1: Data pre-processing Step
• Importing Libraries
Firstly, we will import the libraries for our model, which is part of data pre-
processing. The code is given below:
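A minimal sketch of these imports, using the alias mtp for matplotlib since that alias is used later in this section:

import numpy as np                 # numpy for mathematical calculations
import matplotlib.pyplot as mtp    # matplotlib for plotting graphs
import pandas as pd                # pandas for managing the dataset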
• numpy is for performing mathematical calculations.
• matplotlib is for plotting graphs, and pandas is for managing the dataset.
• Importing the dataset:
Next, we will import the dataset that we need to use. Here, we are using the Mall_Customer_data.csv dataset. It can be imported using the code below:
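A one-line sketch, assuming the file name given in the text above:

dataset = pd.read_csv('Mall_Customer_data.csv')   # load the mall customers data into a DataFrame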
From the above dataset, we need to find some patterns in it.
Extracting Independent Variables
• Here we do not need any dependent variable for the data pre-processing step, as this is a clustering problem and we have no idea what to determine. So we will just add a line of code for the matrix of features, shown below.
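A minimal sketch, assuming the Annual Income and Spending Score columns (indices 3 and 4) are the two features to cluster on, as suggested by the visualization later in this section:

x = dataset.iloc[:, [3, 4]].values   # matrix of features: Annual Income and Spending Score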
Step-2: Finding the optimal number of clusters using the elbow method
• In the second step, we will try to find the optimal number of clusters for our
clustering problem. So, as discussed above, here we are going to use the elbow
method for this purpose.
• As we know, the elbow method uses the WCSS concept to draw the plot by
plotting WCSS values on the Y-axis and the number of clusters on the X-axis. So
we are going to calculate the value for WCSS for different k values ranging from 1
to 10. Below is the code for it:
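A sketch of the elbow-method loop, consistent with the description that follows and assuming the feature matrix x from the previous step; the init and random_state settings are illustrative:

from sklearn.cluster import KMeans

wcss_list = []                       # holds the WCSS value for each k
for i in range(1, 11):               # k from 1 to 10; range() excludes the upper bound
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)   # inertia_ is the within-cluster sum of squares
mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('wcss_list')
mtp.show()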
• As we can see in the above code, we have used the KMeans class of the sklearn.cluster library to form the clusters.
• Next, we have created the wcss_list variable, initialized to an empty list, which is used to hold the WCSS value computed for different values of k ranging from 1 to 10.
• After that, we have initialized a for loop to iterate over the different values of k from 1 to 10; since Python's range excludes the upper bound, it is written as range(1, 11) so that the 10th value is included.
• The rest of the code is similar to what we did in earlier topics: we fit the model on the matrix of features and then plot the graph between the number of clusters and WCSS.
From the above plot, we can see that the elbow point is at 5 clusters.
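Step-3: Training the K-means algorithm on the training dataset
A two-line sketch of this step, assuming the KMeans import and the feature matrix x from above; the random_state is illustrative:

kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)   # 5 clusters, from the elbow plot
y_predict = kmeans.fit_predict(x)    # fit the model and predict a cluster index for each customer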
• The first line is the same as above: it creates an object of the KMeans class, this time with 5 clusters.
• In the second line of code, we have created the variable y_predict, which fits the model and predicts the cluster for each data point.
• By executing the above lines of code, we will get the y_predict variable. We can
check it under the variable explorer option in the Spyder IDE. We can now
compare the values of y_predict with our original dataset.
• From the above image, we can now see that CustomerID 1 belongs to cluster 3 (since the index starts from 0, the predicted label 2 corresponds to cluster 3), CustomerID 2 belongs to cluster 4, and so on.
Step-4: Visualizing the Clusters
• The last step is to visualize the clusters. As we have 5 clusters for our model, we will visualize each cluster one by one.
• To visualize the clusters, we will use a scatter plot drawn with the mtp.scatter() function of matplotlib, as sketched below.
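A sketch of the plotting code described below; the colors and labels are illustrative choices:

mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s=100, c='blue', label='Cluster 1')
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s=100, c='green', label='Cluster 2')
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s=100, c='red', label='Cluster 3')
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s=100, c='cyan', label='Cluster 4')
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s=100, c='magenta', label='Cluster 5')
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroids')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income ($)')
mtp.ylabel('Spending Score')
mtp.legend()
mtp.show()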
• In the above lines of code, we have written one scatter call for each cluster, ranging from 1 to 5. The first argument of mtp.scatter, e.g., x[y_predict == 0, 0], selects the rows of the feature matrix assigned to a given cluster and takes their first feature as the x value, while the second argument, x[y_predict == 0, 1], takes their second feature as the y value.
Output:
• The output image clearly shows the five different clusters in different colors. The clusters are formed between two parameters of the dataset: the annual income of the customer and the spending score. We can change the colors and labels as per requirement or choice. We can also observe some points from the above pattern, which are given below:
• Cluster1 shows the customers with average income and average spending, so we can categorize these customers as standard.
• Cluster2 shows the customers with high income but low spending, so we can categorize them as careful.
• Cluster3 shows customers with low income and also low spending, so they can be categorized as sensible.
• Cluster4 shows customers with low income but very high spending, so they can be categorized as careless.
• Cluster5 shows the customers with high income and high spending, so they can be categorized as target customers, and these can be the most profitable customers for the mall owner.
GMMs
• Gaussian mixture models (GMMs) are a type of machine learning algorithm. They are used to group data into different categories based on probability distributions.
What are Gaussian mixture models (GMM)?
• Gaussian mixture models are also relatively robust to outliers, meaning that they can still yield accurate results even if some data points do not fit neatly into any of the clusters. This makes GMMs a flexible and powerful tool for clustering data. A GMM can be understood as a probabilistic model in which a Gaussian distribution is assumed for each group, and the mean and covariance of that distribution define its parameters.
• Each GMM component consists of two parts, a mean vector (μ) and a covariance matrix (Σ), combined with a mixing weight, so the overall density is p(x) = Σ_k π_k · N(x | μ_k, Σ_k).
• In general, K-means will be faster and more accurate when the data set is large
and the clusters are well-separated. Gaussian mixture models will be more
accurate when the data set is small or the clusters are not well-separated.
• Gaussian mixture models take into account the variance of the data, whereas
K-means does not.
• Gaussian mixture models are more flexible in terms of the shape of the
clusters, whereas K-means is limited to spherical clusters.
• Gaussian mixture models can handle missing data, whereas K-means cannot.
This difference can make Gaussian mixture models more effective in certain
applications, such as data with a lot of noise or data that is not well-defined.
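To make the contrast with K-means concrete, here is a minimal sketch on synthetic, elongated (non-spherical) clusters; the data, component count, and random seeds are illustrative assumptions:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# two elongated Gaussian blobs that spherical K-means tends to split poorly
a = rng.multivariate_normal([0, 0], [[4.0, 3.0], [3.0, 4.0]], size=200)
b = rng.multivariate_normal([6, 0], [[4.0, -3.0], [-3.0, 4.0]], size=200)
X = np.vstack([a, b])

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0).fit(X)
gmm_labels = gmm.predict(X)        # hard cluster assignments
gmm_probs = gmm.predict_proba(X)   # soft, probabilistic assignments per component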
What are the scenarios when Gaussian mixture models can be used?
• Gaussian mixture models can be used for anomaly detection: by fitting a model to a dataset and then scoring new data points, it is possible to flag points that are significantly different from the rest of the data (i.e., outliers). This can be useful for identifying fraud or detecting errors in data collection (see the sketch after this list).
• In the case of time series analysis, GMMs can be used to discover how volatility is related to trends and noise, which can help predict future stock prices. One cluster could consist of a trend in the time series, while another could capture noise and volatility from other factors such as seasonality or external events which affect the stock price. GMMs can separate out these clusters because they provide a probability for each category instead of simply dividing the data into two parts, as K-means does.
What are the scenarios when Gaussian mixture models can be used?
• Another example is when there are different groups in a dataset and it’s hard
to label them as belonging to one group or another which makes it difficult for
other machine learning algorithms such as the K-means clustering algorithm to
separate out the data. GMMs can be used in this case because they find
Gaussian mixture models that best describe each group and provide a
probability for each cluster which is helpful when labeling clusters.
• Because Gaussian mixture models can generate synthetic data points that are similar to the original data, they can also be used for data augmentation (see the sketch below).
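A short sketch of these two uses, anomaly scoring and synthetic-data generation, assuming a GaussianMixture model gmm already fitted to data X as in the earlier sketch; the 2% threshold is an illustrative choice:

import numpy as np

log_density = gmm.score_samples(X)            # log-likelihood of each point under the mixture
threshold = np.percentile(log_density, 2)     # treat the least likely 2% of points as outliers
outliers = X[log_density < threshold]

X_synthetic, component_labels = gmm.sample(100)   # 100 synthetic points for data augmentation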
What are some real-world examples where Gaussian mixture models can be used?
• Gaussian mixture models are very useful when there are large datasets and it is difficult to find clusters. They are able to find clusters of Gaussians more efficiently than other clustering algorithms such as k-means.
• Finding patterns in medical datasets: GMMs can be used for segmenting
images into multiple categories based on their content or finding specific
patterns in medical datasets. They can be used to find clusters of patients with
similar symptoms, identify disease subtypes, and even predict outcomes. In
one recent study, a Gaussian mixture model was used to analyze a dataset of
over 700,000 patient records. The model was able to identify previously
unknown patterns in the data, which could lead to better treatment for
patients with cancer.
Why do we need Gaussian Mixture Models?
• The two most common areas of machine learning are supervised learning and unsupervised learning. We can easily distinguish between these two types based on the nature of the data they use and the approaches they take toward solving problems. To cluster points based on similar characteristics, we make use of clustering algorithms. Let's assume that we have the following dataset:
Why do we need Gaussian Mixture Models?
• Our goal is to find the groups of points that are close to each other. There are two different groups, which we will color blue and red.
• One of the most popular clustering techniques is the K-means clustering algorithm, which follows an iterative approach to update the parameters of each cluster. We compute the mean of each cluster and then calculate the distance of every data point to each of those means. The algorithm then labels each data point by assigning it to its closest centroid. The process is repeated until some convergence criterion is achieved.
PCA
• Data lies at the heart of Machine Learning and Data Science. During data collection, we generally record multiple attributes so as not to lose any critical information. For example, if there is a requirement to collect weather data, we record various attributes like humidity, atmospheric pressure, temperature, wind speed, etc. It may be possible that we will not use all of them in the model-building process. Still, we collect all of them, because if a project has some special requirements and we don't have that attribute, we would have to start the collection process from scratch.
PCA
• If we want to reduce the dimensionality, we can remove some of the last PCs. Some information will be lost, but this loss will be insignificant, as most of the variation is retained by the PCs at the start. That's why we used the term "much" earlier. Also, note that we equate the term "variation" with the term "information", as attributes having maximum variation contain more information than others. So, to summarize, we can say that PCA is an unsupervised, non-parametric, statistical machine learning approach used to reduce dimensionality.
• PCA was first introduced by Pearson in 1901 and later developed independently by Hotelling in 1933. Although it originated from multivariate data problems, it now has a broader range of applications, such as denoising signals, blind source separation, and data compression.
https://pub.towardsai.net/basic-linear-algebra-for-deep-learning-and-machine-learning-ml-python-tutorial-444e23db3e9e
• This is the scenario when the rotating line meets the purple lines (the fourth image, the most suitable case), which we call the case of maximum variance. So our first PC will lie along the line which passes through the two purple lines.
• Similarly, the same procedure is followed for the second PC to find the maximum variance, but with the additional constraint that it should be perpendicular to the first PC. Eigenvalues and eigenvectors help us find these directions. Eigenvectors are the unit vectors from the origin which correspond to the directions along which the PCs will lie. Eigenvalues are the coefficients attached to the eigenvectors, and they tell us how much variance is carried along the corresponding eigenvectors.
• For our dataset, the eigenvalues and eigenvectors are given in the images below. Every row in the eigenvector matrix corresponds to one eigenvector. Please note that the magnitude of every eigenvector is 1 (unit vector), and they are perpendicular to each other. To check their orthogonality, we can compute the dot product of vector_1 = [-0.73517866, -0.6778734] and vector_2 = [0.6778734, -0.73517866]. If this product is zero, then the vectors are orthogonal.
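A small numpy check of the two properties mentioned above (unit length and orthogonality), using the eigenvector values quoted in the text:

import numpy as np

vector_1 = np.array([-0.73517866, -0.6778734])
vector_2 = np.array([0.6778734, -0.73517866])

print(np.dot(vector_1, vector_2))                           # approximately 0, so they are orthogonal
print(np.linalg.norm(vector_1), np.linalg.norm(vector_2))   # both approximately 1 (unit vectors)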
Note: We can choose to retain both dimensions as well. But to demonstrate the reduction, we have decided to ignore the second PC.
Step 5: Forming the newer feature set along the principal component axes
• Once we have decided how many components we want to keep, we need to form the matrix (also known as the feature vector) with the selected eigenvectors as column entries. As we already know, every row of the eigenvector matrix corresponds to one eigenvector. To form the feature matrix, we place the transpose of the selected eigenvectors as column entries, e.g., [eigenvector_1.T, eigenvector_2.T, ..., eigenvector_N.T]. Please note that we will use eigenvectors corresponding only to those PCs that we have selected to keep. In our case, we decided to keep only 1 PC to realize the reduction. Hence:
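A minimal numpy sketch of this projection step, assuming the data has already been standardized; the sample values are illustrative, while the eigenvector is the one quoted earlier:

import numpy as np

# illustrative standardized data: rows are samples, columns are the two original features
standardized_data = np.array([[ 0.5,  0.3],
                              [-1.2, -0.9],
                              [ 0.7,  0.6]])

# selected eigenvector (first PC) placed as a column of the feature vector
eigenvector_1 = np.array([-0.73517866, -0.6778734])
feature_vector = eigenvector_1.reshape(-1, 1)

# project every sample onto the retained principal component
reduced_data = standardized_data @ feature_vector   # shape: (n_samples, 1)
print(reduced_data)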
PCA
• The number of samples: Ideally, the number of samples in the data should be five times the number of features present.
• Features are correlated: We assume that the features present in the original set are correlated, so that the components formed after PCA preserve the original information.
• No outliers are present: Outliers can affect the overall variance of the data, and we know PCA gives importance to features having high variance. Standardization generally mitigates the effect of outliers in the data, but removing outliers before applying standardization is more helpful.
Linear Discriminant Analysis (LDA)
Example:
• Let's assume we have to classify two different classes having two sets of data points in a 2-dimensional plane, as shown in the image below:
Linear Discriminant Analysis (LDA)
To create a new axis, Linear Discriminant Analysis uses the following criteria:
• It maximizes the distance between the means of the two classes.
• It minimizes the variation (scatter) within each class.
Why LDA?
• Logistic regression is one of the most popular classification algorithms; it performs well for binary classification but falls short in the case of multi-class classification problems with well-separated classes, whereas LDA handles these quite efficiently.
• LDA can also be used in data pre-processing to reduce the number of features, just like PCA, which reduces the computing cost significantly.
• LDA is also used in face detection algorithms. In Fisherfaces, LDA is used to extract useful data from different faces. Coupled with eigenfaces, it produces effective results. A small usage sketch follows below.
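A minimal sketch of LDA used for supervised dimensionality reduction and classification, using scikit-learn's built-in Iris dataset; the dataset and the number of components are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)   # at most (number of classes - 1) components
X_reduced = lda.fit_transform(X, y)                # project the features onto the discriminant axes
print(X_reduced.shape)                             # (150, 2)
print(lda.score(X, y))                             # accuracy when LDA is used as a classifier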
Applications of LDA
• Face recognition is a popular application of computer vision, where each face is represented as a combination of a number of pixel values. In this case, LDA is used to reduce the number of features to a manageable number before going through the classification process. It generates a new template in which each dimension consists of a linear combination of pixel values. If the linear combination is generated using Fisher's linear discriminant, then it is called a Fisher face.
• In the medical field, LDA has a great application in classifying patient disease on the basis of various parameters of patient health and the ongoing medical treatment. On such parameters, it classifies disease as mild, moderate, or severe. This classification helps the doctors in either increasing or decreasing the pace of the treatment.
Linear Discriminant Analysis (LDA)
• Step-1: Compute the global mean (M) using the samples from patients and non-patients.
• Step-2: Compute the statistics for the patients, as sketched below.
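A small numpy sketch of these first two steps; the patient and non-patient measurements are purely illustrative made-up values:

import numpy as np

# illustrative 2-feature measurements for each class
patients = np.array([[2.0, 3.1], [2.3, 3.4], [1.9, 2.9]])
non_patients = np.array([[5.1, 6.0], [4.8, 5.7], [5.3, 6.2]])

# Step-1: global mean M over all samples (patients and non-patients together)
all_samples = np.vstack([patients, non_patients])
M = all_samples.mean(axis=0)

# Step-2: class statistics for the patients (mean vector and covariance matrix)
mean_patients = patients.mean(axis=0)
cov_patients = np.cov(patients, rowvar=False)

print(M, mean_patients, cov_patients, sep='\n')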