Unit-5
K-means clustering is an unsupervised machine learning algorithm that creates groups made up of similar
data points. It has various applications, including customer segmentation, anomaly detection, and sentiment
analysis.
K-means clustering assigns data points to one of K clusters depending on their distance from the centres of those clusters. It starts by placing the cluster centroids randomly in the space. Each data point is then assigned to the cluster whose centroid is nearest. After every point has been assigned, new cluster centroids are computed as the means of the assigned points. This process repeats iteratively until the clusters stop changing. In this analysis we assume that the number of clusters K is given in advance and that every point must be placed in one of the groups.
K-means performs best when the data are well separated; when data points overlap, this clustering is not suitable. K-means is fast compared with many other clustering techniques and produces tightly coupled clusters of data points. However, K-means does not provide clear information about the quality of the clusters, different initial placements of the cluster centroids may lead to different clusterings, the algorithm is sensitive to noise and outliers, and it may get stuck in a local minimum.
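The assign/update loop described above can be sketched in a few lines of NumPy. This is only a minimal illustration (the random data, K = 3, and the stopping rule are arbitrary choices, and empty clusters are not handled), not a production implementation.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means (Lloyd's algorithm): assign each point to the nearest
    centroid, then recompute each centroid as the mean of its points."""
    rng = np.random.default_rng(seed)
    # Initialise centroids by picking k distinct points from the data.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: distance from every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (empty clusters are not handled in this sketch).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids have stabilised
        centroids = new_centroids
    return centroids, labels

# Toy example: 300 random 2-D points grouped into K = 3 clusters.
X = np.random.default_rng(1).normal(size=(300, 2))
centroids, labels = kmeans(X, k=3)
print(centroids)
```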
K-means vs K-medoids
• K-means takes the mean of the data points in a cluster to create new points called centroids, whereas K-medoids uses existing points from the data to serve as cluster representatives, called medoids.
• Centroids are new points not previously found in the data; medoids are existing points from the data.
• K-means can only be used for numerical data, whereas K-medoids can be used for both numerical and categorical data.
• K-means focuses on reducing the sum of squared distances, also known as the sum of squared error (SSE), whereas K-medoids focuses on reducing the dissimilarities between the points of a cluster and their medoid.
• K-means typically uses Euclidean distance, whereas K-medoids typically uses Manhattan distance.
• K-means is sensitive to outliers in the data, whereas K-medoids is outlier resistant and can reduce the effect of outliers; it is more useful when the clusters have irregular shapes or different sizes.
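A tiny numerical sketch of the outlier point from the comparison above: with one extreme value in the data, the mean (centroid) is dragged toward the outlier, while the medoid, being an actual data point that minimises the total distance to the others, barely moves. The data values below are made up purely for illustration.

```python
import numpy as np

# 1-D toy data with one extreme outlier (values chosen only for illustration).
x = np.array([1.0, 2.0, 2.5, 3.0, 100.0])

# Centroid: the mean, which K-means uses as the cluster representative.
centroid = x.mean()

# Medoid: the existing data point with the smallest total distance to all others.
pairwise = np.abs(x[:, None] - x[None, :])   # Manhattan distances in 1-D
medoid = x[pairwise.sum(axis=1).argmin()]

print(f"centroid = {centroid}")   # 21.7, pulled toward the outlier
print(f"medoid   = {medoid}")     # 2.5, robust to the outlier
```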
5.3 Gaussian Mixture
A Gaussian mixture is a function that is composed of several Gaussians, each identified by k ∈ {1,…, K}, where K is the number of clusters of our data set. Each Gaussian k in the mixture is comprised of the following parameters: a mean μ_k that defines its centre, a covariance Σ_k that defines its width, and a mixing coefficient π_k that defines how big or small the Gaussian function will be.
For example, if there are three Gaussian functions, then K = 3, and each Gaussian explains the data contained in one of the three clusters. The mixing coefficients are themselves probabilities and must meet this condition:
π_1 + π_2 + … + π_K = 1
In addition, we must ensure that each Gaussian fits the data points belonging to each cluster.
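As a rough illustration, scikit-learn's GaussianMixture can fit such a mixture and expose exactly these parameters: the means, the covariances, and the mixing coefficients, which sum to 1. The synthetic blobs and the choice of n_components=3 below are assumptions made only for demonstration.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data with three clusters (for illustration only).
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# Fit a Gaussian mixture with K = 3 components.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

print(gmm.means_)          # mu_k: one mean vector per Gaussian
print(gmm.covariances_)    # Sigma_k: one covariance matrix per Gaussian
print(gmm.weights_)        # pi_k: the mixing coefficients
print(gmm.weights_.sum())  # the mixing coefficients sum to 1
```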
5.5.1 Supervised Learning is a type of machine learning where a model is trained on labelled data—
meaning each input is paired with the correct output. The model learns by comparing its predictions with the
actual answers provided in the training data. Over time, it adjusts itself to minimize errors and improve
accuracy. The goal of supervised learning is to make accurate predictions when given new, unseen data. For
example, if a model is trained to recognize handwritten digits, it will use what it learned to correctly identify
new numbers it hasn’t seen before.
Even though classification and regression are both from the category of supervised learning, they are not the
same.
• The prediction task is a classification when the target variable is discrete. An application is the
identification of the underlying sentiment of a piece of text.
• The prediction task is a regression when the target variable is continuous. An example can be the
prediction of the salary of a person given their education degree, previous work experience,
geographical location, and level of seniority.
1. Decision tree: A popular algorithm that builds classification or regression models in the form of a tree
structure. It can handle both categorical and continuous dependent variables.
2. Logistic regression: A statistical method used for binary classification problems, where the outcome is
dichotomous, meaning it can take on one of two possible values.
3. Random forest: A versatile technique that improves accuracy by combining multiple independent decision
trees.
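A brief sketch of these three supervised classifiers in scikit-learn, trained on the bundled breast-cancer dataset purely as an illustration; the train/test split and default hyperparameters are arbitrary choices, not a recommendation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Binary labelled data: each input (tumour measurements) is paired with a 0/1 label.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    # Logistic regression benefits from feature scaling, hence the pipeline.
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "random forest": RandomForestClassifier(random_state=0),
}

# Train each model on the labelled examples, then score it on unseen test data.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```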
5.5.2 Semi-supervised learning is a type of machine learning that falls in between supervised and
unsupervised learning. It is a method that uses a small amount of labelled data and a large amount of
unlabeled data to train a model. The goal of semi-supervised learning is to learn a function that can
accurately predict the output variable based on the input variables, similar to supervised learning. However,
unlike supervised learning, the algorithm is trained on a dataset that contains both labelled and unlabelled
data. Semi-supervised learning is particularly useful when there is a large amount of unlabeled data
available, but it’s too expensive or difficult to label all of it.
• Hierarchical clustering: An unsupervised learning method that builds clusters by measuring the
dissimilarities between data points.
• ISODATA: An unsupervised classification technique that calculates class means and iteratively
clusters the remaining pixels.
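A small sketch of the semi-supervised setting from 5.5.2 using scikit-learn, where unlabelled points are marked with -1: a LabelSpreading model learns from a handful of labelled points plus many unlabelled ones. The Iris data and the fraction of hidden labels are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelSpreading

X, y = load_iris(return_X_y=True)

# Pretend most labels are unknown: keep only ~10% and mark the rest as -1.
rng = np.random.default_rng(0)
y_partial = np.where(rng.random(len(y)) < 0.1, y, -1)
print((y_partial == -1).sum(), "unlabelled points out of", len(y))

# Fit on the mix of labelled and unlabelled data, then check against the true labels.
model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y_partial)
print("accuracy on all points:", (model.transduction_ == y).mean())
```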
5.5.3 Bayes’ Theorem
For two events A and B, suppose an experiment has N equally likely outcomes, of which N(A ∩ B) are favourable to both A and B and N(B) are favourable to B. The probability of occurrence of A when B has already occurred is then P(A|B) = N(A ∩ B)/N(B). Dividing the numerator and denominator by N, N(A ∩ B)/N can be written as P(A ∩ B) and N(B)/N as P(B) ⇒ P(A|B) = P(A ∩ B)/P(B)
Similarly, the probability of occurrence of B when A has already occurred is given by,
P(B|A) = P(B ∩ A)/P(A)
Bayes’ theorem defines the probability of occurrence of an event associated with any condition; it is considered for the case of conditional probability. Since P(A ∩ B) = P(B|A) P(A), substituting into the expression above gives
P(A|B) = P(B|A) P(A) / P(B)
This is also known as the formula for the likelihood of “causes”.
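The relation can be checked numerically. The probability values below are invented purely to illustrate the arithmetic; only the formula itself comes from the text.

```python
# Hypothetical probabilities, chosen only to illustrate Bayes' theorem.
p_a = 0.01              # P(A): prior probability of the "cause" A
p_b_given_a = 0.95      # P(B|A): probability of observing B when A holds
p_b_given_not_a = 0.05  # P(B|not A)

# Total probability of B over both cases.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # ~0.161: the posterior probability of A given B
```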
5.6 KNN v/s ANN
5.7 LDA
Linear Discriminant Analysis (LDA) is used as a dimensionality reduction technique in machine learning, with which we can easily project 2-D or 3-D data onto a single 1-dimensional axis.
Let's consider an example where we have two classes in a 2-D plane with an X–Y axis, and we need to
classify them efficiently. LDA enables us to draw a straight line that can completely separate the two classes
of data points. Here, LDA uses the X–Y axes to create a new axis, separating the classes with a straight line
and projecting the data onto this new axis.
To create a new axis, Linear Discriminant Analysis uses the following criteria:
• Maximize the distance between the means of the two classes.
• Minimize the variation (scatter) within each class.
Using the above two conditions, LDA generates a new axis in such a way that it maximizes the distance
between the means of the two classes while minimizing the variation within each class.
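A small sketch of this projection using scikit-learn's LinearDiscriminantAnalysis: two synthetic 2-D classes (the locations and sizes are made up for illustration) are projected onto the single new axis that best separates them.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two synthetic 2-D classes (made-up data, purely for illustration).
rng = np.random.default_rng(0)
class_0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
class_1 = rng.normal(loc=[4.0, 4.0], scale=1.0, size=(100, 2))
X = np.vstack([class_0, class_1])
y = np.array([0] * 100 + [1] * 100)

# Project the 2-D data onto a single discriminant axis.
lda = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda.fit_transform(X, y)

print(X_1d.shape)                                # (200, 1): one coordinate per point
print(X_1d[y == 0].mean(), X_1d[y == 1].mean())  # well-separated class means
```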
5.8 PCA
1. Standardize the range of the initial variables so that each one contributes equally to the analysis.
Mathematically, this can be done by subtracting the mean and dividing by the standard deviation for
each value of each variable.
Once the standardization is done, all the variables will be transformed to the same scale.
2. Compute the covariance matrix to identify correlations between the variables.
The covariance matrix is a p × p symmetric matrix (where p is the number of dimensions) that has as entries
the covariances associated with all possible pairs of the initial variables. For example, for a 3-dimensional
data set with 3 variables x, y, and z, the covariance matrix is a 3×3 matrix of this form:
Cov(x,x)  Cov(x,y)  Cov(x,z)
Cov(y,x)  Cov(y,y)  Cov(y,z)
Cov(z,x)  Cov(z,y)  Cov(z,z)
What do the covariances that we have as entries of the matrix tell us about the correlations between
the variables? It is the sign that matters: if a covariance is positive, the two variables increase or decrease
together (they are correlated); if it is negative, one increases when the other decreases (they are inversely
correlated).
3. Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal
components
4. Create a feature vector to decide which principal components to keep
5. Recast the data along the principal components axes
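The five steps above can be followed almost literally with NumPy. This is a bare-bones sketch on random data; the variable names and the choice to keep two components are illustrative assumptions, and in practice a library routine such as scikit-learn's PCA does the same job.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # illustrative data: 200 samples, p = 5 variables

# 1. Standardize: subtract the mean and divide by the standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Compute the p x p covariance matrix.
C = np.cov(Z, rowvar=False)

# 3. Compute the eigenvectors and eigenvalues of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(C)  # eigh: C is symmetric
order = np.argsort(eigvals)[::-1]     # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Create a feature vector: keep, say, the eigenvectors of the top 2 components.
feature_vector = eigvecs[:, :2]

# 5. Recast the data along the principal component axes.
X_pca = Z @ feature_vector
print(X_pca.shape)                    # (200, 2)
```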
5.9 PCA vs LDA
Here are some key differences between PCA and LDA:
Objective: PCA is an unsupervised technique that aims to maximize the variance of the data along the
principal components. The goal is to identify the directions that capture the most variation in the data. LDA,
on the other hand, is a supervised technique that aims to maximize the separation between different classes
in the data. The goal is to identify the directions that capture the most separation between the classes.
Supervision: PCA does not require any knowledge of the class labels of the data, while LDA requires
labeled data in order to learn the separation between the classes.
Dimensionality Reduction: PCA reduces the dimensionality of the data by projecting it onto a lower-
dimensional space, while LDA reduces the dimensionality of the data by creating a linear combination of the
features that maximizes the separation between the classes.
Output: PCA outputs principal components, which are linear combinations of the original features. These
principal components are orthogonal to each other and capture the most variation in the data. LDA outputs
discriminant functions, which are linear combinations of the original features that maximize the separation
between the classes.
Interpretation: PCA is often used for exploratory data analysis, as the principal components can be used to
visualize the data and identify patterns. LDA is often used for classification tasks, as the discriminant
functions can be used to separate the classes.
Performance: PCA is generally faster and more computationally efficient than LDA, as it does not require
labeled data. However, LDA may be more effective at capturing the most important information in the data
when class labels are available.
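The difference in supervision can be seen directly in the fitting calls: PCA is fit on the features alone, while LDA also needs the class labels. A brief sketch on the bundled Iris data, used here only as an example:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA: unsupervised, fit on X only; components capture the most variance.
X_pca = PCA(n_components=2).fit_transform(X)

# LDA: supervised, needs the labels y; directions maximize class separation.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)  # both project the 4 features down to 2
```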