FML CIE2
PART-A
Multidimensional Scaling
A Simple Example
Multidimensional scaling uses a square, symmetric matrix for input. The matrix shows relationships
between items. For a simple example, let’s say you had a set of cities in Florida and their distances:
The scaling produces a graph like the one below.
The very simple example above shows cities and distances, which are easy to visualize as a map.
However, multidimensional scaling can work on “theoretically” mapped data as well. For example,
Kruskal and Wish (1978) outlined how the method could be used to uncover the answers to a variety
of questions about people’s viewpoints on political candidates. This could be achieved by reducing the
data and issues (say, partisanship and ideology) to a two-dimensional map.
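Since the original distance table and map are not reproduced here, the following is a minimal sketch of multidimensional scaling with scikit-learn on a small, made-up symmetric distance matrix (the four cities and distances are hypothetical):
• Python3
import numpy as np
from sklearn.manifold import MDS

# hypothetical symmetric distance matrix between four cities (in miles)
D = np.array([[  0, 120, 200, 260],
              [120,   0, 150, 210],
              [200, 150,   0, 100],
              [260, 210, 100,   0]])
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)   # 2-D coordinates whose pairwise distances approximate D
print(coords.round(1))
Plotting these coordinates gives the kind of map-like graph described above.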
In binary classification in particular, for instance with (k = 1, l = 2), the decision boundary takes the linear form a0 + a1x1 + ... + apxp = 0, where the constant is
a0 = log(π1/π2) − (1/2)(μ1 + μ2)^T Σ^−1 (μ1 − μ2),
π1 and π2 are the prior probabilities of the two classes, and μ1 and μ2 are their mean vectors.
o Define (a1, a2, ..., ap)^T = Σ^−1 (μ1 − μ2)
o An example (verified numerically in the sketch below):
∗ π1 = π2 = 0.5
∗ μ1 = (0, 0)^T, μ2 = (2, −2)^T
∗ Σ = (1.0, 0.0; 0.0, 0.5625)
∗ Decision boundary: 5.56 − 2.00x1 + 3.56x2 = 0
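The example can be checked numerically; the sketch below (not part of the original notes) simply evaluates the definitions of a0 and (a1, ..., ap) given above:
• Python3
import numpy as np

pi1, pi2 = 0.5, 0.5
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, -2.0])
Sigma = np.array([[1.0, 0.0], [0.0, 0.5625]])
Sigma_inv = np.linalg.inv(Sigma)

a = Sigma_inv @ (mu1 - mu2)                                          # (a1, a2)
a0 = np.log(pi1 / pi2) - 0.5 * (mu1 + mu2) @ Sigma_inv @ (mu1 - mu2)
print(round(a0, 2), a.round(2))                                      # 5.56 [-2.    3.56]
So the boundary is 5.56 − 2.00x1 + 3.56x2 = 0, as stated.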
We have two classes and we know the within-class density. The marginal density
is simply the weighted sum of the within-class densities, where the weights are
the prior probabilities. Because we have equal weights and because the covariance matrices of the two classes are identical, we get these symmetric lines in the contour plot. The black diagonal line is the decision boundary for the two classes. Basically, if x falls above the line, we classify it into the first class; if it is below the line, we classify it into the second class.
For all of the discussion above, we assumed that we were given the prior probabilities for the classes and the within-class densities. Of course, in practice you don't have these; what you have is only a set of training data.
Factor Analysis
Factor analysis is a technique used to reduce a large number of variables into a smaller number of factors. It extracts the maximum common variance from all variables and puts it into a common score, which can be used as an index of all the variables for further analysis. Factor analysis is part of the general linear model (GLM) and rests on several assumptions: the relationships are linear, there is no multicollinearity, the relevant variables are included in the analysis, and there is a true correlation between the variables and the factors. Several extraction methods are available, but principal component analysis is used most commonly.
Types of factoring:
There are different methods used to extract factors from a data set (principal component analysis being the most common); two further options are:
1. Maximum likelihood method: this also works on the correlation matrix, but uses maximum likelihood estimation to extract the factors.
2. Other methods of factor analysis: alpha factoring and weighted least squares, a regression-based method also used for factoring.
Factor loading:
Factor loading is basically the correlation coefficient between a variable and a factor. It shows the variance explained by the variable on that particular factor. In the SEM approach, as a rule of thumb, a factor loading of 0.7 or higher indicates that the factor extracts sufficient variance from that variable.
Eigenvalues: Eigenvalues are also called characteristic roots. An eigenvalue shows the variance explained by that particular factor out of the total variance. From the communality column, we can see how much variance is explained by the first factor out of the total variance. For example, if our first factor explains 68% of the variance, this means that 32% of the variance will be explained by the other factors.
Factor score: The factor score, also called the component score, is a score computed over all rows and columns, which can be used as an index of all variables for further analysis. We can standardize this score by multiplying it by a common term. In whatever analysis we then do with these factor scores, we assume that all variables behave as their factor scores do.
Criteria for determining the number of factors: According to the Kaiser criterion, the eigenvalue is a good criterion for determining a factor: if the eigenvalue is greater than one, we should consider it a factor, and if it is less than one, we should not. According to the variance extraction rule, the extracted variance should be more than 0.7; if it is less than 0.7, we should not consider it a factor.
Assumptions:
Adequate sample size: the number of cases must be greater than the number of factors.
A “factor” is a set of observed variables that have similar response patterns; they are associated with a hidden (latent) variable that isn’t directly measured. Factors are listed according to their factor loadings, i.e., how much of the variation in the data they can explain.
10. Illustrate the computer program that does this for different values of k and c. In image compression, k-means can be used as follows: the image is divided into non-overlapping c × c windows, and these c²-dimensional vectors make up the sample. For a given k, which is generally a power of two, we do k-means clustering. The reference vectors and the index for each window are sent over the communication line. At the receiving end, the image is then reconstructed by reading from the table of reference vectors using the indices. For each case, calculate the reconstruction error and the compression rate.
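A possible sketch of such a program (the function name, the grayscale-image assumption, and the bit accounting used for the compression rate are illustrative choices, not part of the exercise statement):
• Python3
import numpy as np
from sklearn.cluster import KMeans

def compress(image, k=16, c=4):
    """Vector-quantize a grayscale image (2-D numpy array) with k-means on c x c windows."""
    h, w = image.shape
    h, w = h - h % c, w - w % c                              # crop so the image tiles into c x c windows
    blocks = (image[:h, :w]
              .reshape(h // c, c, w // c, c)
              .transpose(0, 2, 1, 3)
              .reshape(-1, c * c))                           # the c*c-dimensional sample vectors
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(blocks)
    indices, codebook = km.labels_, km.cluster_centers_      # what would be sent over the line
    recon = (codebook[indices]                               # receiver: table lookup by index
             .reshape(h // c, w // c, c, c)
             .transpose(0, 2, 1, 3)
             .reshape(h, w))
    error = np.mean((image[:h, :w] - recon) ** 2)            # reconstruction error (mean squared error)
    bits_sent = indices.size * np.log2(k) + codebook.size * 8
    rate = bits_sent / (h * w * 8)                           # compression rate vs. 8-bit pixels
    return recon, error, rate
Running compress(img, k, c) for several values of k and c shows the usual trade-off: larger k and smaller c reduce the reconstruction error but raise the bit rate.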
PART-B
This creates a need for feature selection in a model. PCA aims to capture the valuable information that explains high variance, which helps provide the best accuracy. It makes data visualizations easier to handle, decreases the complexity of the model, and increases computational efficiency.
Step 3: Calculate the eigenvalues and eigenvectors for the covariance matrix.
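A minimal numpy sketch of this step, assuming the data matrix has already been mean-centred/standardized in the earlier steps (the random data below is only a placeholder):
• Python3
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 3))   # placeholder data, shape (n_samples, n_features)
X = X - X.mean(axis=0)                               # centre the data
cov = np.cov(X, rowvar=False)                        # covariance matrix (features x features)
eigvals, eigvecs = np.linalg.eigh(cov)               # eigh suits symmetric matrices
order = np.argsort(eigvals)[::-1]                    # sort components by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
print(eigvals)                                       # variance explained by each principal component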
K Means Clustering:
First, what is clustering? Clustering is an unsupervised learning approach used to identify groups of similar objects in multivariate datasets collected from various industries. There are many clustering algorithms; one of the most popular is the k-means clustering algorithm.
K-Means Clustering:
It is an iterative algorithm that tries to partition the dataset into K pre-defined distinct
non-overlapping subgroups(clusters) where each data point belongs to only one group.
Both are used for dimensionality reduction and for visualizing patterns in data described by many parameters (variables). PCA in conjunction with K-means is a powerful method for visualizing high-dimensional data.
Step 3: Visualize and interpret the clusters (K-means clusters built on the PCA components (variances)), as in the sketch below.
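A minimal sketch of these steps, using the Iris data purely as a stand-in dataset:
• Python3
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = StandardScaler().fit_transform(load_iris().data)      # standardize the variables
pcs = PCA(n_components=2).fit_transform(X)                 # project onto the first two PCA components
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(pcs)   # cluster in PC space
plt.scatter(pcs[:, 0], pcs[:, 1], c=labels)                # visualize and interpret the clusters
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()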
ANSWER:
We will start by explaining how this method works and then we will
discuss its advantages and limitations.
So let’s get started!
1. How to identify the best model of each size (2nd step in the
figure above)
2. How to identify the best overall model (3rd step in the figure
above)
1. Determine the best model of each size
• The one that predicts the outcome with fewer errors, i.e. the one with the lowest RSS (Residual Sum of Squares) or, equivalently, the highest R².
Note that in both cases the selected model will be the same.
Why not simply pick the overall model with the lowest RSS or the highest R²?
Because the model with the most variables is always going to have the lowest sum of squared errors and the highest R². So we cannot lose by adding more predictors, but something worse can happen: choosing a model that overfits the data, that is, one that closely fits the sample data while having low out-of-sample accuracy.
In order to deal with this problem, one solution is to choose the best
overall model according to a statistic that imposes some sort of penalty
on bigger models, especially when they contain additional variables
that barely provide any improvement.
This is where the AIC, BIC and adjusted R² come into play.
What these will do, in simple terms, is estimate the out-of-sample
accuracy.
You can leave out part of your dataset to test these models and get a
direct measure of the out-of-sample accuracy. But this can only be done
in special cases where data collection is not expensive and time
consuming (which is almost never the case, especially in medical
research).
BOTTOM LINE:
Depending on which criterion you choose, the best overall model will be the one that has the lowest AIC, the lowest BIC, or the highest adjusted R² (see the sketch below).
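A hedged sketch of the whole procedure with statsmodels on made-up data (the variable names and data are purely illustrative):
• Python3
import itertools
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 4)), columns=["x1", "x2", "x3", "x4"])
y = 2 * X["x1"] - X["x3"] + rng.normal(size=100)           # only x1 and x3 actually matter here

best_by_size = {}
for k in range(1, X.shape[1] + 1):
    # step 1: the best model of each size is the one with the lowest RSS
    fits = [sm.OLS(y, sm.add_constant(X[list(c)])).fit()
            for c in itertools.combinations(X.columns, k)]
    best_by_size[k] = min(fits, key=lambda f: f.ssr)

# step 2: compare the size-k winners with penalized criteria
for k, f in best_by_size.items():
    print(k, round(f.aic, 1), round(f.bic, 1), round(f.rsquared_adj, 3))
The lowest AIC/BIC (or highest adjusted R²) among these size-k winners picks the best overall model.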
1. Computational limitation:
The number of models the best subset algorithm has to consider grows exponentially with the number of predictors under consideration.
For example: with p candidate predictors there are 2^p possible models, so 10 predictors already give 1,024 models to compare.
2. Theoretical limitation:
o K-Means clustering needs advance knowledge of K, i.e., the number of clusters you want to divide your data into, whereas in hierarchical clustering one can stop at any number of clusters, finding an appropriate number by interpreting the dendrogram.
o In K-Means, one can use the median or mean as a cluster centre to represent each cluster, whereas agglomerative (hierarchical) methods begin with ‘n’ clusters and sequentially combine similar clusters until only one cluster is obtained.
In many cases a dataset contains a huge number of input features, which makes the predictive modeling task more complicated. Because it is very difficult to visualize or make predictions for a training dataset with a high number of features, dimensionality reduction techniques are required in such cases.
A dimensionality reduction technique can be defined as "a way of converting a higher-dimensional dataset into a lower-dimensional dataset while ensuring that it provides similar information." These techniques are widely used in machine learning for obtaining a better-fitting predictive model while solving classification and regression problems.
It is commonly used in the fields that deal with high-dimensional data, such as speech
recognition, signal processing, bioinformatics, etc. It can also be used for data
visualization, noise reduction, cluster analysis, etc.
The Curse of Dimensionality
Handling high-dimensional data is very difficult in practice, a problem commonly known as the curse of dimensionality. As the dimensionality of the input dataset increases, any machine learning algorithm and model becomes more complex. As the number of features increases, the number of samples needed to cover the feature space grows rapidly, and the chance of overfitting also increases. A machine learning model trained on high-dimensional data therefore tends to overfit and gives poor performance.
Hence, it is often required to reduce the number of features, which can be done with
dimensionality reduction.
o By reducing the dimensions of the features, the space required to store the dataset
also gets reduced.
o Reduced dimensions of features of the dataset help in visualizing the data quickly.
Feature Selection
Feature selection is the process of selecting the subset of the relevant features and leaving out
the irrelevant features present in a dataset to build a model of high accuracy. In other words, it
is a way of selecting the optimal features from the input dataset.
1. Filter Methods
In this method, the dataset is filtered and a subset containing only the relevant features is taken. Some common techniques of the filter method are:
o Correlation
o Chi-Square Test
o ANOVA
2. Wrapper Methods
The wrapper method has the same goal as the filter method, but it uses a machine learning model for its evaluation. In this method, some features are fed to the ML model and the performance is evaluated; the performance decides whether to add or remove those features to increase the accuracy of the model. This method is more accurate than the filter method but more complex to work with. Some common techniques of wrapper methods are:
o Forward Selection
o Backward Selection
o Bi-directional Elimination
3. Embedded Methods: Embedded methods check the different training iterations of the
machine learning model and evaluate the importance of each feature. Some common
techniques of Embedded methods are:
o LASSO
o Elastic Net
Feature Extraction:
Feature extraction is the process of transforming the space containing many dimensions into
space with fewer dimensions. This approach is useful when we want to keep the whole
information but use fewer resources while processing the information.
o Kernel PCA
o Backward Elimination
o Forward Selection
o Score comparison
o Random Forest
o Factor Analysis
o Auto-Encoder
PCA works by considering the variance of each attribute, because a high-variance attribute shows a good split between the classes, and hence it reduces the dimensionality. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels.
o In this technique, firstly, all the n variables of the given dataset are taken to train the
model.
o Now we remove one feature at a time and train the model on the remaining n−1 features, doing this n times, and compute the performance of the model each time.
o We check which variable has made the smallest (or no) change in the performance of the model and drop that variable; after that, we are left with n−1 features.
In this technique, by specifying the desired performance of the model and the maximum tolerable error rate, we can define the optimal number of features required for the machine learning algorithm.
o The process will be repeated until we get a significant increase in the performance of
the model.
Random Forest
Random Forest is a popular and very useful feature selection algorithm in machine learning.
This algorithm contains an in-built feature importance package, so we do not need to program
it separately. In this technique, we need to generate a large set of trees against the target
variable, and with the help of usage statistics of each attribute, we need to find the subset of
features.
The random forest algorithm takes only numerical variables, so we need to convert the input data into numeric form using one-hot encoding.
Factor Analysis
Factor analysis is a technique in which each variable is kept within a group according to its correlation with other variables; variables within a group can have a high correlation between themselves but a low correlation with the variables of other groups.
K-Means clustering is an unsupervised learning algorithm which groups the unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process: if K=2, there will be two clusters, for K=3 there will be three clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group, whose members have similar properties.
It allows us to cluster the data into different groups and provides a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.
o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center; the data points near a particular k-center create a cluster.
Hence each cluster has data points with some commonalities and is away from other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids (they need not come from the input dataset).
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of its cluster. (A code sketch of these steps follows below.)
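The code sketch referenced above: a from-scratch version of these steps (for simplicity, the initial centroids are drawn from the dataset itself, and empty clusters are not handled):
• Python3
import numpy as np

def kmeans(X, k=2, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]          # Step-2: pick K initial centroids
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)                                      # Step-3: assign points to closest centroid
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])  # Step-4: new centroids
        if np.allclose(new_centroids, centroids):                      # stop once the centroids no longer move
            break
        centroids = new_centroids                                      # Step-5: repeat with the new centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])  # two well-separated blobs
labels, centroids = kmeans(X, k=2)
print(centroids.round(2))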
o Let's take the number of clusters k, i.e., K=2, to identify the dataset and put the points into different clusters. It means here we will try to group the dataset into two different clusters.
o We need to choose k random points or centroids to form the clusters. These points can be either points from the dataset or any other points. So here we are selecting the below two points as k points, which are not part of our dataset. Consider the below image:
o Now we will assign each data point of the scatter plot to its closest K-point or centroid.
We will compute it by applying some mathematics that we have studied to calculate
the distance between two points. So, we will draw a median between both the
centroids. Consider the below image:
From the above image, it is clear that the points on the left side of the line are near the K1 (blue) centroid, and the points to the right of the line are close to the yellow centroid. Let's color them blue and yellow for clear visualization.
o As we need to find the closest clusters, we will repeat the process by choosing new centroids. To choose the new centroids, we compute the center of gravity of each cluster and place the new centroids there, as below:
o Next, we will reassign each data point to the new closest centroid. For this, we repeat the same process of finding a median line. The median will look like the image below:
From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So these three points will be assigned to the new centroids.
As reassignment has taken place, we will again go back to step-4, which is finding new centroids or K-points.
o We will repeat the process by finding the center of gravity of each cluster, so the new centroids will be as shown in the below image:
o As we have the new centroids, we will again draw the median line and reassign the data points, giving the image below:
o We can see in the above image that there are no dissimilar data points on either side of the line, which means our model is formed. Consider the below image:
As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the below image:
How to choose the value of "K number of clusters"
in K-means Clustering?
The performance of the K-means clustering algorithm depends upon highly efficient
clusters that it forms. But choosing the optimal number of clusters is a big task. There
are some different ways to find the optimal number of clusters, but here we are
discussing the most appropriate method to find the number of clusters or value of K.
The method is given below:
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of
clusters. This method uses the concept of WCSS value. WCSS stands for Within
Cluster Sum of Squares, which defines the total variations within a cluster. The
formula to calculate the value of WCSS (for 3 clusters) is given below:
WCSS = Σ(Pi in Cluster1) distance(Pi, C1)² + Σ(Pi in Cluster2) distance(Pi, C2)² + Σ(Pi in Cluster3) distance(Pi, C3)²
Here, Σ(Pi in Cluster1) distance(Pi, C1)² is the sum of the squared distances between each data point and its centroid within Cluster 1, and the same applies to the other two terms.
To measure the distance between data points and centroid, we can use any method
such as Euclidean distance or Manhattan distance.
To find the optimal value of clusters, the elbow method follows the below steps:
o It executes K-means clustering on a given dataset for different K values (e.g., ranging from 1 to 10).
o It plots a curve between the calculated WCSS values and the number of clusters K.
o The sharp point of bend, where the plot looks like an arm, is considered the best value of K.
Since the graph shows a sharp bend that looks like an elbow, this is known as the elbow method. The graph for the elbow method looks like the below image:
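A minimal sketch that produces such a curve with scikit-learn (the blob data is illustrative); KMeans' inertia_ attribute is exactly the WCSS value defined above:
• Python3
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
wcss = []
for k in range(1, 11):                                         # run K-means for K = 1..10
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)                                   # inertia_ is the WCSS for this K
plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.show()                                                     # pick K at the elbow (sharp bend) of this curve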
PCA generally tries to find the lower-dimensional surface to project the high-
dimensional data.
PCA works by considering the variance of each attribute, because a high-variance attribute shows a good split between the classes, and hence it reduces the dimensionality. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels. It is a feature extraction technique, so it retains the important variables and drops the least important ones.
o Correlation: It signifies how strongly two variables are related to each other; if one changes, the other variable also changes. The correlation value ranges from -1 to +1: -1 occurs if the variables are inversely proportional to each other, and +1 indicates that the variables are directly proportional to each other.
o Orthogonal: It defines that variables are not correlated to each other, and hence the
correlation between the pair of variables is zero.
o The principal component must be the linear combination of the original features.
o These components are orthogonal, i.e., the correlation between a pair of variables is
zero.
o It can also be used for finding hidden patterns if data has high dimensions. Some fields
where PCA is used are Finance, data mining, Psychology, etc.
ANSWER:
In data mining and statistics, hierarchical clustering analysis is a method of
cluster analysis that seeks to build a hierarchy of clusters i.e. tree-type
structure based on the hierarchy.
Basically, there are two types of hierarchical cluster analysis strategies: agglomerative (bottom-up), in which each point starts in its own cluster and the closest clusters are merged step by step, and divisive (top-down), in which all points start in one cluster that is split recursively.
• Python3
import numpy as np
from sklearn.cluster import AgglomerativeClustering
# hypothetical sample data; any small 2-D dataset works here
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
clustering = AgglomerativeClustering(n_clusters=2).fit(X)
print(clustering.labels_)
Output: the cluster label assigned to each sample, e.g. [1 1 1 0 0 0] (the label numbering may differ).
Algorithm:
1. Start with each data point in its own singleton cluster.
2. Compute the distance matrix between all clusters.
3. Repeatedly merge the two closest clusters and update the distance matrix.
4. Stop when only a single cluster remains.
Implementation of Dendrograms:
(Figure: left, the sample dataset; right, the same dataset separated into 3 clusters.)
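The dendrogram code itself is not reproduced in these notes; a minimal sketch with scipy on a small placeholder dataset would look like this:
• Python3
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.random.default_rng(0).normal(size=(20, 2))   # placeholder sample dataset
Z = linkage(X, method="ward")                        # agglomerative merge history
dendrogram(Z)
plt.ylabel("Merge distance")
plt.show()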
For the above sample dataset, it is observed that the optimal number of clusters would be 3. But for a high-dimensional dataset, where visualization of the dataset is not possible, dendrograms play an important role in finding the optimal number of clusters.
How to find the optimal number of clusters by observing the dendrograms:
(Figure: left, separating into 2 clusters; right, separating into 3 clusters.)
Introduction:
Clustering is an unsupervised learning method whose task is to divide the
population or data points into a number of groups, such that data points in a
group are more similar to other data points in the same group and dissimilar
to the data points in other groups. It is basically a collection of objects based
on similarity and dissimilarity between them.
I hope you got the basic idea of the KModes algorithm by now. So let us
quickly take an example to illustrate the working step by step.
Example: Imagine we have a dataset that has the information about hair
color, eye color, and skin color of persons. We aim to group them based on
the available information(maybe we want to suggest some styling ideas)
Hair color, eye color, and skin color are all categorical variables. Below 👇 is what our dataset looks like.
(Figure: the sample dataset.)
Alright, we have the sample data now. Let us proceed by defining the number of clusters (K) = 3.
Likewise, calculate all the dissimilarities and put them in a matrix as shown
below and assign the observations to their closest cluster(cluster that has the
least dissimilarity)
After step 2, the observations P1, P2, P5 are assigned to cluster 1; P3, P7 are
assigned to Cluster 2; and P4, P6, P8 are assigned to cluster 3.
Note: If all the clusters have the same dissimilarity with an observation, assign it to any cluster randomly. In our case, the observation P2 has a dissimilarity of 3 with all the leaders, so I randomly assigned it to Cluster 1.
Mark the observations according to the cluster they belong to. Observations
of Cluster 1 are marked in Yellow, Cluster 2 are marked in Brick red, and
Cluster 3 are marked in Purple.
(Figure: looking for the modes of each cluster.)
Considering one cluster at a time, for each feature, look for the Mode and
update the new leaders.
Note: If you observe the same number of occurrences of two values, take the mode randomly. In our case, the observations of Cluster 2 (P3, P7) have one occurrence each of brown and fair skin color. I randomly chose brown as the mode.
Likewise, calculate all the dissimilarities and put them in a matrix. Assign each
observation to its closest cluster.
The observations P1, P2, P5 are assigned to Cluster 1; P3, P7 are assigned to
Cluster 2; and P4, P6, P8 are assigned to Cluster 3.
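A sketch of the same workflow with the kmodes package; the category values below are placeholders, since the original data table is not reproduced here:
• Python3
import pandas as pd
from kmodes.kmodes import KModes       # pip install kmodes

df = pd.DataFrame({                     # placeholder categorical data for P1-P8
    "hair": ["blonde", "brunette", "red", "black", "brunette", "black", "red", "black"],
    "eye":  ["amber", "gray", "green", "hazel", "amber", "gray", "green", "hazel"],
    "skin": ["fair", "brown", "brown", "brown", "fair", "brown", "fair", "fair"],
})
km = KModes(n_clusters=3, init="Huang", n_init=5, verbose=0)
clusters = km.fit_predict(df)
print(clusters)                         # cluster index for each person
print(km.cluster_centroids_)            # the modes (leaders) of each cluster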
12. Explain about Self Organizing Maps (SOM)
ANSWER:
A self-organizing map (SOM) is a type of artificial neural
network (ANN) that is trained using unsupervised learning to
produce a low-dimensional (typically two-dimensional), discretized
representation of the input space of the training samples, called
a map, and is therefore a method to do dimensionality
reduction. Self-organizing maps differ from other artificial neural
networks as they apply competitive learning as opposed to error-
correction learning (such as backpropagation with gradient descent),
and in the sense that they use a neighborhood function to preserve
the topological properties of the input space.
Reference: Applications of the growing self-organizing map, Th. Villmann, H.-U. Bauer, May 1998.
The Algorithm:
Implementation:
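The implementation is not reproduced in these notes; a minimal sketch with the MiniSom package, producing the distance map ("U-matrix") discussed under Inference below, could look like this (the input vectors here are random placeholders):
• Python3
import numpy as np
from minisom import MiniSom             # pip install minisom

data = np.random.default_rng(0).random((150, 4))        # placeholder input vectors
som = MiniSom(6, 6, input_len=4, sigma=1.0, learning_rate=0.5, random_seed=0)
som.random_weights_init(data)
som.train_random(data, 1000)                             # competitive learning, no error backpropagation
u_matrix = som.distance_map()                            # average distance of each weight to its neighbours
# low values (dark cells) = dense regions; high values (light cells) = boundaries between clusters
print(u_matrix.round(2))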
Inference:
If the average distance is high, then the surrounding weights are very
different and a light color is assigned to the location of the weight. If
the average distance is low, a darker color is assigned. The resulting
maps show that the concentration of different clusters of species is more predominant in three zones. The first figure tells us only where the density of species is greater (darker regions) or lower (lighter regions). The second visualisation tells us how they are specifically clustered.
1. It does not build a generative model for the data, i.e., the model does not understand how the data is created.
ANSWER:
Gaussian Mixture Models (GMMs) assume that there are a certain number of
Gaussian distributions, and each of these distributions represent a cluster.
Hence, a Gaussian Mixture Model tends to group the data points belonging
to a single distribution together.
Let’s say we have three Gaussian distributions (more on that in the next
section) – GD1, GD2, and GD3. These have certain mean (μ1, μ2, μ3) and variance (σ1², σ2², σ3²) values respectively. For a given set of data points, our
GMM would identify the probability of each data point belonging to each of
these distributions.
Wait, probability?
You read that right! Gaussian Mixture Models are probabilistic models and
use the soft clustering approach for distributing the points in different
clusters. I’ll take another example that will make it easier to understand.
Here, we have three clusters that are denoted by three colors – Blue, Green,
and Cyan. Let’s take the data point highlighted in red. The probability of this
point being a part of the blue cluster is 1, while the probability of it being a
part of the green or cyan clusters is 0.
Now, consider another point – somewhere in between the blue and cyan
(highlighted in the below figure). The probability that this point is a part of
cluster green is 0, right? And the probability that this belongs to blue and cyan
is 0.2 and 0.8 respectively.
Gaussian Mixture Models use the soft clustering technique for assigning data
points to Gaussian distributions. I’m sure you’re wondering what these
distributions are so let me explain that in the next section.
I’m sure you’re familiar with Gaussian Distributions (or the Normal
Distribution). It has a bell-shaped curve, with the data points symmetrically
distributed around the mean value.
The below image has a few Gaussian distributions with different means (μ) and variances (σ²). Remember that the higher the σ value, the greater the spread:
(Figure: Gaussian distributions with different μ and σ². Source: Wikipedia.)
In a one-dimensional space, the probability density function of a Gaussian distribution is given by:
f(x | μ, σ²) = (1 / √(2πσ²)) · exp(−(x − μ)² / (2σ²))
But this would only be true for a single variable. In the case of two variables, instead of a 2D bell-shaped curve, we will have a 3D bell surface, with density
f(x | μ, Σ) = (1 / (2π√|Σ|)) · exp(−(1/2)(x − μ)^T Σ^−1 (x − μ))
where x is the input vector, μ is the 2D mean vector, and Σ is the 2×2 covariance matrix. The covariance now defines the shape of this curve.
We can generalize the same for d-dimensions.
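A minimal scikit-learn sketch of this kind of soft clustering (the blob data is illustrative):
• Python3
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)
print(gmm.means_.round(2))                   # fitted mean vectors (one per Gaussian)
print(gmm.predict_proba(X[:3]).round(3))     # soft clustering: probability of each point per component
print(gmm.predict(X[:3]))                    # hard assignment, if needed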
ANSWER:
The EM algorithm is a latent-variable method for finding the local maximum likelihood parameters of a statistical model; it was proposed by Arthur Dempster, Nan Laird, and Donald Rubin in 1977. The EM (Expectation-Maximization) algorithm is one of the most commonly used approaches in machine learning for obtaining maximum likelihood estimates for variables that are sometimes observable and sometimes not. It is also applicable to unobserved data, sometimes called latent data. It has various real-world applications in statistics, including obtaining the mode of the posterior marginal distribution of parameters in machine learning and data mining applications.
In most real-life applications of machine learning, it is found that several relevant learning features are available, but very few of them are observable and the rest are unobservable. If a variable is observable, its value can be predicted from the training instances. For variables that are latent, i.e., not directly observable, the Expectation-Maximization (EM) algorithm plays a vital role in predicting their values, provided that the general form of the probability distribution governing those latent variables is known to us. In this topic, we will discuss a basic introduction to the EM algorithm, a flow chart of the EM algorithm, its applications, and the advantages and disadvantages of the EM algorithm.
What is an EM algorithm?
The Expectation-Maximization (EM) algorithm is defined as the combination of various
unsupervised machine learning algorithms, which is used to determine the local
maximum likelihood estimates (MLE) or maximum a posteriori estimates
(MAP) for unobservable variables in statistical models. Further, it is a technique to find
maximum likelihood estimation when the latent variables are present. It is also referred
to as the latent variable model.
A latent variable model consists of both observable and unobservable variables where
observable can be predicted while unobserved are inferred from the observed variable.
These unobservable variables are known as latent variables.
Key Points:
o It is known as the latent variable model to determine MLE and MAP parameters for
latent variables.
o It is used to predict values of parameters in instances where data is missing or
unobservable for learning, and this is done until convergence of the values occurs.
EM Algorithm
The EM algorithm is the combination of various unsupervised ML algorithms, such as
the k-means clustering algorithm. Being an iterative approach, it consists of two
modes. In the first mode, we estimate the missing or latent variables. Hence it is
referred to as the Expectation/estimation step (E-step). Further, the other mode is
used to optimize the parameters of the models so that it can explain the data more
clearly. The second mode is known as the maximization-step or M-step.
o Expectation step (E - step): It involves the estimation (guess) of all missing values in
the dataset so that after completing this step, there should not be any missing value.
o Maximization step (M - step): This step involves the use of estimated data in the E-
step and updating the parameters.
o Repeat E-step and M-step until the convergence of the values occurs.
The primary goal of the EM algorithm is to use the available observed data of the
dataset to estimate the missing data of the latent variables and then use that data to
update the values of the parameters in the M-step.
Steps in EM Algorithm
The EM algorithm is completed mainly in 4 steps: the Initialization Step, Expectation Step, Maximization Step, and Convergence Step. These steps are explained as follows:
o 1st Step: The very first step is to initialize the parameter values. Further, the system is
provided with incomplete observed data with the assumption that data is obtained
from a specific model.
o 2nd Step: This step is known as Expectation or E-Step, which is used to estimate or
guess the values of the missing or incomplete data using the observed data. Further,
E-step primarily updates the variables.
o 3rd Step: This step is known as Maximization or M-step, where we use complete data
obtained from the 2nd step to update the parameter values. Further, M-step primarily
updates the hypothesis.
o 4th step: The last step is to check whether the values of the latent variables are converging or not. If yes, stop the process; else, repeat from step 2 until convergence occurs. (A small numerical sketch of these four steps follows below.)
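A small numerical sketch of these four steps for a two-component, one-dimensional Gaussian mixture (a hand-rolled illustration, not code from the notes):
• Python3
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])   # data from two hidden Gaussians

# 1st step: initialize the parameter values
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

for _ in range(50):
    # 2nd step (E-step): estimate the latent responsibilities of each component for each point
    r = pi * normal_pdf(x[:, None], mu, sigma)
    r /= r.sum(axis=1, keepdims=True)
    # 3rd step (M-step): update the parameters using the completed data
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    # 4th step: a fuller version would stop once the parameters (or log-likelihood) converge

print(mu.round(2), sigma.round(2), pi.round(2))   # the estimates approach the true means 0 and 5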
15. Explain in detail about Supervised Learning and
Clustering
ANSWER:
In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data. Once the training process is completed, the model is tested on held-out test data, and then it predicts the output.
The working of Supervised learning can be easily understood by the below example
and diagram:
o If the given shape has four sides, and all the sides are equal, then it will be labelled as
a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides then it will be labelled as hexagon.
Now, after training, we test our model using the test set, and the task of the model is
to identify the shape.
The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of its number of sides and predicts the output.
o Split the dataset into a training dataset, a test dataset, and a validation dataset.
o Determine the input features of the training dataset, which should have enough
knowledge so that the model can accurately predict the output.
o Determine the suitable algorithm for the model, such as support vector machine,
decision tree, etc.
o Execute the algorithm on the training dataset. Sometimes we need validation sets as
the control parameters, which are the subset of training datasets.
o Evaluate the accuracy of the model by providing the test set. If the model predicts the correct output, then our model is accurate.
CLUSTERING:
It is basically a type of unsupervised learning method. An unsupervised
learning method is a method in which we draw references from datasets
consisting of input data without labeled responses. Generally, it is used as a
process to find meaningful structure, explanatory underlying processes,
generative features, and groupings inherent in a set of examples.
Clustering is the task of dividing the population or data points into a number
of groups such that data points in the same groups are more similar to other
data points in the same group and dissimilar to the data points in other groups.
It is basically a collection of objects on the basis of similarity and dissimilarity
between them.
For ex– The data points in the graph below clustered together can be
classified into one single group. We can distinguish the clusters, and we can
identify that there are 3 clusters in the below picture.
Why Clustering?
Clustering is very much important as it determines the intrinsic grouping
among the unlabelled data present. There is no single criterion for good clustering; it depends on the user and on which criteria satisfy their need. For instance, we could be interested in finding representatives for
homogeneous groups (data reduction), in finding “natural clusters” and
describe their unknown properties (“natural” data types), in finding useful and
suitable groupings (“useful” data classes) or in finding unusual data objects
(outlier detection). The algorithm must make some assumptions about what constitutes the similarity of points, and each assumption leads to different but equally valid clusters.
Clustering Methods :
Existing Approaches
In this section you can find the two most common approaches for
choosing the number of clusters. Each has its own advantages and
limitations.
Silhouette Analysis
(See the scikit-learn example "Selecting the number of clusters with silhouette analysis on KMeans clustering" at scikit-learn.org. Silhouette analysis can be used to study the separation distance between the resulting clusters.)
The scikit-learn Python code in that example produces a visualization that helps us understand whether the chosen number of clusters is a good choice or not.
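A minimal sketch of using the average silhouette score to compare candidate values of K (the data is illustrative):
• Python3
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))   # higher average silhouette = better-separated clusters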
The two approaches above require a manual decision on the number of clusters. Based on what I have learned from these approaches, I have developed an automatic process for choosing K, the number of clusters.
The suggested approach takes into account the inertia value for each
possible K and weights it by a penalty parameter. This parameter
represents the trade-off between the inertia and the number of
clusters.
Alpha is manually tuned because as I see it, the penalty for the
number of clusters is a business decision that should be incorporated
into the analysis.
Using the Scaled Inertia, the chosen K is obvious and can be done
automatically. In the above case K=9.
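A hedged sketch of this idea; the exact penalty form below (inertia normalized by the K=1 inertia, plus alpha times K) is an assumption consistent with the description above, not a quoted formula:
• Python3
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
alpha = 0.02                                                         # manually tuned penalty (a business decision)
inertia_1 = KMeans(n_clusters=1, n_init=10, random_state=0).fit(X).inertia_
scaled = {}
for k in range(2, 11):
    inertia_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    scaled[k] = inertia_k / inertia_1 + alpha * k                    # scaled inertia for this K
best_k = min(scaled, key=scaled.get)
print(best_k)                                                        # K chosen automatically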
ANSWER:
(Figure: Isomap within the family of Machine Learning algorithms.)
For our example, let’s create a 3D object known as a Swiss roll. The object is made up of 2,000 individual data points.
(Figure: interactive 3D Swiss roll.)
(Figure: Euclidean vs. geodesic distances between points on a 3D Swiss roll.)
We can see that these two points are relatively close to each other
within the 3D space. If we used a linear dimensionality reduction
approach such as PCA, then the Euclidean distance between these two
points would remain somewhat similar in lower dimensions. See PCA
transformation chart below:
(Figure: 3D Swiss roll reduced to 2 dimensions using PCA.)
Note, the shape of a 2D object in PCA looks like a picture taken of the
same 3D object but from a specific angle. This is a feature of linear
transformation.
(Figure: finding the position of points in the new lower-dimensional embedding.)
LLE variants
You should be aware of a few LLE variants, which improve upon the
original setup. However, note that these improvements come at the
cost of efficiency, making the algorithm slower. Here is how scikit-
learn describes these variants:
Similar to LLE, Isomap also uses KNN to find the nearest neighbors
in the first step. However, the second step constructs neighborhood
graphs instead of describing each point as a linear combination of its
neighbors. Then it uses these graphs to compute the shortest path
between every pair of points.
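A minimal scikit-learn sketch of Isomap applied to the Swiss roll described earlier:
• Python3
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=2000, random_state=0)   # the 3D Swiss roll with 2,000 points
iso = Isomap(n_neighbors=10, n_components=2)             # KNN graph, then shortest (geodesic) paths
X_2d = iso.fit_transform(X)
print(X_2d.shape)                                         # (2000, 2): the "unrolled" 2-D embedding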
Reduces dimensions.
Finding PC1: the first principal component (PC1) is the direction of maximum variance in the data, given by the eigenvector of the covariance matrix with the largest eigenvalue.