
UNIT 5

5.1 K-Means Clustering

K-means clustering is an unsupervised machine learning algorithm that creates groups made up of similar
data points. It has various applications, including customer segmentation, anomaly detection, and sentiment
analysis.

K-means clustering assigns each data point to one of K clusters depending on its distance from the cluster centres. It starts by placing the K cluster centroids randomly in the feature space. Each data point is then assigned to the cluster whose centroid is closest to it. After all points have been assigned, new cluster centroids are computed as the means of the assigned points. This process repeats iteratively until the assignments stop changing (or another stopping criterion is met). Throughout the analysis we assume that the number of clusters K is given in advance and that every point must be placed in exactly one group.
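
The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the toy data, the value of K, the random initialization from data points, and the convergence check are assumptions made for the example (empty-cluster handling is omitted).

import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """Minimal K-means on an (n_samples, n_features) array X."""
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of the points assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # assignments have stabilized
            break
        centroids = new_centroids
    return labels, centroids

# Toy usage: two well-separated blobs, K = 2.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(5, 1, size=(50, 2))])
labels, centroids = k_means(X, k=2)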

K-means performs best when the data are well separated; when data points overlap heavily it is not suitable. It is faster than many other clustering techniques and produces tight groupings of closely related points. However, K-means gives no direct information about the quality of the clusters, different initial placements of the centroids may lead to different final clusters, the algorithm is sensitive to noise and outliers, and it may get stuck in a local minimum.

5.2 K-Means v/s K-Mediods

• K-means takes the mean of the data points in a cluster to create new points called centroids; K-medoids uses actual points from the data, called medoids, as cluster centres.
• Centroids are new points not previously found in the data; medoids are existing data points.
• K-means can only be used for numerical data; K-medoids can be used for both numerical and categorical data.
• K-means focuses on reducing the sum of squared distances, also known as the sum of squared error (SSE); K-medoids focuses on reducing the sum of dissimilarities between points and their cluster medoid.
• K-means typically uses Euclidean distance; K-medoids typically uses Manhattan distance.
• K-means is sensitive to outliers in the data; K-medoids is outlier-resistant and is more useful when the clusters have irregular shapes or different sizes.
5.3 Gaussian Mixture
A Gaussian mixture is a function that is composed of several Gaussians, each identified by k ∈ {1,…, K},
where K is the number of clusters of our data set. Each Gaussian k in the mixture is comprised of the
following parameters:

• A mean μ that defines its center.


• A covariance Σ that defines its width. This would be equivalent to the dimensions of an ellipsoid in a
multivariate scenario.
• A mixing probability π that defines how big or small the Gaussian function will be.

As an illustration, suppose there are three Gaussian functions, so K = 3. Each Gaussian explains the data contained in one of the three clusters. The mixing coefficients are themselves probabilities and must satisfy

π1 + π2 + … + πK = 1,

and we must ensure that each Gaussian fits the data points belonging to its cluster.
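
A mixture like this can be fitted with the expectation-maximization routine in scikit-learn. The sketch below is illustrative only: the three blob centres, the number of components, and the random seed are assumptions chosen to mirror the K = 3 example above.

import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ([0, 0], [5, 0], [0, 5])])

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

print(gmm.weights_)       # mixing probabilities pi_k (they sum to 1)
print(gmm.means_)         # the mean mu_k of each Gaussian
print(gmm.covariances_)   # the covariance Sigma_k of each Gaussian
labels = gmm.predict(X)   # cluster assignment for each point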

5.4 Linear Discriminant Function


A linear discriminant function is a statistical method that classifies data into groups by finding a linear combination of features that best separates them. Geometrically, it finds a straight line (or, in higher dimensions, a plane or hyperplane) that separates the classes while minimizing overlap within each class. It is used in machine learning for pattern recognition, image retrieval, and face detection, and is often chosen because it is simple, requires little training time, and can cope with imbalanced training sets.
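
In the two-class case the discriminant takes the form g(x) = wᵀx + b, and a point is assigned to one class or the other according to the sign of g(x). A minimal sketch follows; the weight vector and bias are illustrative values, since in practice they would be learned from training data (for example by LDA or a perceptron).

import numpy as np

def linear_discriminant(x, w, b):
    """g(x) = w.x + b: non-negative -> class 1, negative -> class 2."""
    return np.dot(w, x) + b

# Hypothetical learned parameters (assumed values for illustration).
w = np.array([0.8, -0.5])
b = 0.1
x = np.array([1.0, 2.0])
label = 1 if linear_discriminant(x, w, b) >= 0 else 2
print(label)
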
5.5 Forms of Learning
There are four main categories of Machine Learning algorithms: supervised, unsupervised, semi-supervised,
and reinforcement learning.

5.5.1 Supervised Learning is a type of machine learning where a model is trained on labelled data, meaning each input is paired with the correct output. The model learns by comparing its predictions with the actual answers provided in the training data. Over time, it adjusts itself to minimize errors and improve accuracy. The goal of supervised learning is to make accurate predictions when given new, unseen data. For example, if a model is trained to recognize handwritten digits, it will use what it learned to correctly identify new digits it hasn’t seen before.
Even though classification and regression are both from the category of supervised learning, they are not the
same.

• The prediction task is a classification when the target variable is discrete. An application is the
identification of the underlying sentiment of a piece of text.
• The prediction task is a regression when the target variable is continuous. An example can be the
prediction of the salary of a person given their education degree, previous work experience,
geographical location, and level of seniority.

Here are some examples of supervised classification:

• Email filtering: classifying incoming emails as spam or legitimate
• Credit scoring: predicting whether a loan applicant is likely to default, based on labelled historical records
• Voice recognition: identifying a speaker or a spoken command from labelled audio samples

Some algorithms used in supervised classification include:

1. Decision tree: A popular algorithm that builds classification or regression models in the form of a tree
structure. It can classify both categorical and continuous dependent variables.
2. Logistic regression: A statistical method used for binary classification problems, where the outcome is
dichotomous, meaning it can take on one of two possible values.
3. Random forest: A versatile technique that improves accuracy by combining multiple independent decision
trees.
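
As a concrete illustration of supervised classification, the following sketch trains one of the algorithms listed above (a decision tree) on a synthetic labelled dataset with scikit-learn and then checks its accuracy on held-out data. The dataset and hyperparameters are assumptions made for the example.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic labelled data standing in for, e.g., email features with a spam/legitimate label.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit the decision tree on the labelled training set.
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

# Evaluate on unseen data, mirroring the goal of generalizing beyond the training set.
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))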

5.5.2 Semi-supervised learning is a type of machine learning that falls in between supervised and
unsupervised learning. It is a method that uses a small amount of labelled data and a large amount of
unlabeled data to train a model. The goal of semi-supervised learning is to learn a function that can
accurately predict the output variable based on the input variables, similar to supervised learning. However,
unlike supervised learning, the algorithm is trained on a dataset that contains both labelled and unlabelled
data. Semi-supervised learning is particularly useful when there is a large amount of unlabeled data
available, but it’s too expensive or difficult to label all of it.

Semi-Supervised Learning Flow Chart
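
A small sketch of the idea, using scikit-learn's LabelSpreading (a graph-based semi-supervised method chosen here for illustration): only a handful of points keep their labels, the rest are marked as unlabelled with -1, and the algorithm propagates labels to them. The dataset size and the fraction left unlabelled are assumptions for the example.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelSpreading

# Labelled data, most of which we will pretend is unlabelled.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
y_partial = y.copy()

rng = np.random.default_rng(0)
unlabelled = rng.choice(len(y), size=270, replace=False)  # ~90% of points
y_partial[unlabelled] = -1                                # -1 marks "no label"

model = LabelSpreading().fit(X, y_partial)                # learns from both parts

# How well were the originally unlabelled points recovered?
print((model.transduction_[unlabelled] == y[unlabelled]).mean())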


5.5.3 Unsupervised classification is a machine learning technique that uses software to automatically group
pixels or data points with similar characteristics without the user providing sample classes. The software
uses statistical routines called "clustering" to analyze an image and group pixels into classes based on their
shared characteristics.
Unsupervised classification can be helpful when data science teams don't know what they're looking for in
data. For example, it can be used to categorize users based on their social media activity.
The software generally produces more categories than the user is interested in, so the user then needs to decide which categories can be merged.

Some examples of unsupervised classification techniques include:

• Hierarchical clustering: An unsupervised learning method that builds clusters by measuring the
dissimilarities between data points.
• ISODATA: An unsupervised classification technique that calculates class means and iteratively re-clusters the remaining pixels.
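
One of the techniques named above, hierarchical (agglomerative) clustering, can be sketched with SciPy: the tree of merges is built from pairwise dissimilarities and then cut into a chosen number of flat clusters. The toy data and the choice of Ward linkage are assumptions for the example.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Unlabelled toy data: two groups of points with similar characteristics.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(30, 2)), rng.normal(6, 1, size=(30, 2))])

# Build the hierarchy of merges from pairwise dissimilarities (Ward linkage),
# then cut the tree into two flat clusters.
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # a cluster id for each point, assigned without any class labels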

5.6 Bayes’ theorem


When two events A and B can occur together, the conditional probability of one event given the other is

P(A|B) = N(A ∩ B) / N(B)    ...(1)
P(B|A) = N(A ∩ B) / N(A)    ...(2)
Where, P(A|B) represents the probability of occurrence of A given B has occurred.
N(A ∩ B) is the number of elements common to both A and B.
N(B) is the number of elements in B, and it cannot be equal to zero.

Let N represent the total number of elements in the sample space.

Therefore, N(A ∩ B)/N can be written as P(A ∩ B) and N(B)/N as P(B) ⇒ P(A|B) = P(A ∩ B)/P(B)

Therefore, P(A ∩ B) = P(B) P(A|B) if P(B) ≠ 0, and similarly P(A ∩ B) = P(A) P(B|A) if P(A) ≠ 0.

Similarly, the probability of occurrence of B when A has already occurred is given by,
P(B|A) = P(B ∩ A)/P(A)

Equating the two expressions for P(A ∩ B) and dividing by P(B) gives Bayes’ theorem:

P(A|B) = P(B|A) P(A) / P(B), provided P(B) ≠ 0.

Bayes’ theorem thus gives the probability of occurrence of an event when a related condition is known to have occurred; it is a statement about conditional probability and is also known as the formula for the likelihood of “causes”.
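
A short worked example with assumed numbers (a screening test for a condition that affects 1% of a population) shows how the formula is applied:

# Assumed, illustrative probabilities.
p_disease = 0.01                # P(A): prior probability of the condition
p_pos_given_disease = 0.95      # P(B|A): probability of a positive test if diseased
p_pos_given_healthy = 0.05      # P(B|not A): false-positive rate

# Total probability of a positive test, P(B).
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B).
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # about 0.161, despite the accurate test
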
5.7 KNN v/s ANN (Approximate Nearest Neighbour Search)

Accuracy: KNN is known for precision; it finds the 'k' closest neighbours exactly, giving high accuracy. ANN prioritizes speed over exact precision and approximates the nearest neighbours with a slight accuracy compromise.

Speed: KNN is slower because of its exhaustive search, especially with large datasets. ANN achieves faster searches through approximation techniques and is suitable for real-time applications.

Computational resources: KNN requires significant computational resources, especially with large datasets. ANN is designed for efficiency and reduces the computational load through indexing and approximation.

Scalability: KNN is hard to scale because of its high computational demands. ANN handles larger datasets effectively and is scalable.

Use cases: KNN is best for tasks requiring high accuracy (e.g., medical diagnosis, financial forecasting). ANN is ideal for scenarios needing speed and scalability (e.g., search engines, recommendation systems).

Learning approach: KNN is instance-based learning; it makes predictions based on proximity without building a model. ANN is not a learning method in the traditional sense; it is an optimization technique for speeding up the neighbour search itself.

Handling of non-linearity: KNN handles non-linearity naturally, based on local data patterns. ANN only changes how the neighbours are found, not how predictions are made, so it inherits the same behaviour. (The ability to capture complex non-linear relationships is more a property of artificial neural networks than of approximate nearest neighbour search.)
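
The exact, exhaustive side of this comparison is easy to sketch: brute-force KNN ranks every training point by distance and takes a majority vote among the k closest. ANN libraries replace the exhaustive distance scan with index structures and approximation, trading a little accuracy for much faster queries. The points and labels below are made up for illustration.

import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Exact (brute-force) k-NN classification for a single query point."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # exhaustive distance scan
    nearest = np.argsort(dists)[:k]                    # indices of the k closest points
    values, counts = np.unique(y_train[nearest], return_counts=True)
    return values[counts.argmax()]                     # majority vote among neighbours

# Toy training set with two classes.
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([4.8, 5.1]), k=3))  # -> 1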

5.8 Linear Discriminant Analysis (LDA)


Linear Discriminant Analysis (LDA), also known as Normal Discriminant Analysis or Discriminant
Function Analysis, is a dimensionality reduction technique primarily utilized in supervised classification
problems. It facilitates the modelling of distinctions between groups, effectively separating two or more
classes. LDA operates by projecting features from a higher-dimensional space into a lower-dimensional one.
In machine learning, LDA serves as a supervised learning algorithm specifically designed for classification
tasks, aiming to identify a linear combination of features that optimally segregates classes within a dataset.
For example, suppose we have two classes and we need to separate them efficiently. The classes can have multiple features. Using only a single feature to classify them may leave considerable overlap between the classes, so we combine more features to achieve a proper separation.

Linear discriminant analysis is used as a dimensionality reduction technique in machine learning: it allows us to project 2-D or 3-D data onto a single new axis (a one-dimensional line).

Let's consider an example where we have two classes in a 2-D plane with an X-Y axis, and we need to classify them efficiently. LDA enables us to draw a straight line that can completely separate the two classes of data points. It does this by using the X-Y data to create a new axis and projecting the data onto that new axis.

To create a new axis, Linear Discriminant Analysis uses the following criteria:

• It maximizes the distance between means of two classes.


• It minimizes the variance within the individual class.

Using the above two conditions, LDA generates a new axis that maximizes the distance between the means of the two classes while minimizing the variation within each class.
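
For two classes, these two criteria lead to Fisher's solution w ∝ Sw⁻¹(μ1 − μ2), where Sw is the pooled within-class scatter matrix. A minimal sketch, with toy data assumed for the example:

import numpy as np

def lda_direction(X1, X2):
    """Fisher's two-class LDA axis: w proportional to Sw^-1 (mu1 - mu2)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Pooled within-class scatter matrix.
    Sw = np.cov(X1, rowvar=False) * (len(X1) - 1) + np.cov(X2, rowvar=False) * (len(X2) - 1)
    w = np.linalg.solve(Sw, mu1 - mu2)
    return w / np.linalg.norm(w)

# Toy 2-D data for two classes; projecting onto w reduces them to 1-D.
rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], 1.0, size=(50, 2))
X2 = rng.normal([3.0, 3.0], 1.0, size=(50, 2))
w = lda_direction(X1, X2)
z1, z2 = X1 @ w, X2 @ w  # 1-D projections of each class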

5.9 PCA (Principal Component Analysis)

PCA reduces the dimensionality of a dataset in five steps:

1. Standardize the range of continuous initial variables

Mathematically, this can be done by subtracting the mean and dividing by the standard deviation for
each value of each variable.

Once the standardization is done, all the variables will be transformed to the same scale.

2. Compute the covariance matrix to identify correlations

The covariance matrix is a p × p symmetric matrix (where p is the number of dimensions) whose entries are the covariances of all possible pairs of the initial variables. For example, for a 3-dimensional data set with three variables x, y, and z, the covariance matrix is a 3×3 matrix of this form:

Cov(x,x)  Cov(x,y)  Cov(x,z)
Cov(y,x)  Cov(y,y)  Cov(y,z)
Cov(z,x)  Cov(z,y)  Cov(z,z)


Since the covariance of a variable with itself is its variance (Cov(a,a)=Var(a)), in the main diagonal (Top left
to bottom right) we actually have the variances of each initial variable. And since the covariance is
commutative (Cov(a,b)=Cov(b,a)), the entries of the covariance matrix are symmetric with respect to the
main diagonal, which means that the upper and the lower triangular portions are equal.

What do the covariances that we have as entries of the matrix tell us about the correlations between
the variables?

It’s actually the sign of the covariance that matters:

• If positive then: the two variables increase or decrease together (correlated)


• If negative then: one increases when the other decreases (Inversely correlated)

3. Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal
components
4. Create a feature vector to decide which principal components to keep
5. Recast the data along the principal components axes
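
The five steps map almost one-to-one onto NumPy operations. The sketch below is illustrative: the toy data and the choice to keep two components are assumptions for the example.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # toy data: 100 samples, 3 variables

# 1. Standardize each variable (subtract the mean, divide by the standard deviation).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix (3 x 3, symmetric).
C = np.cov(Z, rowvar=False)

# 3. Eigenvectors and eigenvalues of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]    # sort components by the variance they capture
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Feature vector: keep, say, the top 2 principal components.
W = eigvecs[:, :2]

# 5. Recast the data along the principal component axes.
X_pca = Z @ W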

5.10 PCA v/s LDA

Here are some key differences between PCA and LDA:
Objective: PCA is an unsupervised technique that aims to maximize the variance of the data along the
principal components. The goal is to identify the directions that capture the most variation in the data. LDA,
on the other hand, is a supervised technique that aims to maximize the separation between different classes
in the data. The goal is to identify the directions that capture the most separation between the classes.

Supervision: PCA does not require any knowledge of the class labels of the data, while LDA requires
labeled data in order to learn the separation between the classes.

Dimensionality Reduction: PCA reduces the dimensionality of the data by projecting it onto a lower-
dimensional space, while LDA reduces the dimensionality of the data by creating a linear combination of the
features that maximizes the separation between the classes.

Output: PCA outputs principal components, which are linear combinations of the original features. These
principal components are orthogonal to each other and capture the most variation in the data. LDA outputs
discriminant functions, which are linear combinations of the original features that maximize the separation
between the classes.

Interpretation: PCA is often used for exploratory data analysis, as the principal components can be used to
visualize the data and identify patterns. LDA is often used for classification tasks, as the discriminant
functions can be used to separate the classes.

Performance: PCA is generally faster and more computationally efficient than LDA, as it does not require
labeled data. However, LDA may be more effective at capturing the most important information in the data
when class labels are available.
