Mid2 Answers

The document discusses ensemble learning techniques, focusing on the Voting Classifier, Random Forest Classifier, bagging, pasting, and boosting methods like AdaBoost. It explains how a Voting Classifier aggregates predictions from multiple classifiers to improve accuracy, while Random Forests utilize decision trees with added randomness for better performance. Additionally, it distinguishes between bagging and pasting sampling methods, and introduces boosting as a technique to sequentially improve weak learners, with AdaBoost as a key example.


1. Write about Voting classifier.

Introduction
Ensemble Learning: A group of predictors is called an ensemble; thus, this technique is called
Ensemble Learning, and an Ensemble Learning algorithm is called an Ensemble method.

For example, you can train a group of Decision Tree classifiers, each on a different random
subset of the training set. To make predictions, you just obtain the predictions of all individual
trees, then predict the class that gets the most votes.
Voting Classifier
Suppose you have trained a few classifiers, each one achieving about 80% accuracy. You may
have a Logistic Regression classifier, an SVM classifier, a Random Forest classifier, a K-Nearest
Neighbors classifier, and perhaps a few more.

A very simple way to create an even better classifier is to aggregate the predictions of each
classifier and predict the class that gets the most votes. This majority-vote classifier is called a
hard voting classifier.

Somewhat surprisingly, this voting classifier often achieves a higher accuracy than the best
classifier in the ensemble. In fact, even if each classifier is a weak learner (meaning it does only
slightly better than random guessing), the ensemble can still be a strong learner (achieving high
accuracy), provided there are a sufficient number of weak learners and they are sufficiently
diverse.
How is this possible? The following analogy can help shed some light on this mystery. Suppose
you have a slightly biased coin that has a 51% chance of coming up heads, and 49% chance of
coming up tails. If you toss it 1,000 times, you will generally get more or less 510 heads and 490
tails, and hence a majority of heads. If you do the math, you will find that the probability of
obtaining a majority of heads after 1,000 tosses is close to 75%. The more you toss the coin, the
higher the probability (e.g., with 10,000 tosses, the probability climbs over 97%). This is due to
the law of large numbers: as you keep tossing the coin, the ratio of heads gets closer and closer
to the probability of heads (51%). Figure 7-3 shows 10 series of biased coin tosses. You can see
that as the number of tosses increases, the ratio of heads approaches 51%. Eventually all 10
series end up so close to 51% that they are consistently above 50%.

Similarly, suppose you build an ensemble containing 1,000 classifiers that are individually
correct only 51% of the time (barely better than random guessing). If you predict the majority
voted class, you can hope for up to 75% accuracy! However, this is only true if all classifiers are
perfectly independent, making uncorrelated errors, which is clearly not the case since they are
trained on the same data.
Note: Ensemble methods work best when the predictors are as independent from one another as
possible.
#Implementation Dataset - Moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
-----
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')
voting_clf.fit(X_train, y_train)
Let’s look at each classifier’s accuracy on the test set:
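A minimal sketch of that check (assuming X_train, X_test, y_train, y_test come from a moons train/test split, as the dataset comment above suggests):
from sklearn.metrics import accuracy_score
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))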

There you have it! The voting classifier slightly outperforms all the individual classifiers.
If all classifiers are able to estimate class probabilities (i.e., they have a predict_proba()
method), then you can tell Scikit-Learn to predict the class with the highest class probability,
averaged over all the individual classifiers. This is called soft voting. It often achieves higher
performance than hard voting because it gives more weight to highly confident votes. All you
need to do is replace voting="hard" with voting="soft" and ensure that all classifiers can
estimate class probabilities.
This is not the case for the SVC class by default, so you need to set its probability
hyperparameter to True (this will make the SVC class use cross-validation to estimate class
probabilities, slowing down training, and it will add a predict_proba() method).
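A minimal sketch of the soft-voting variant (reusing the estimators above, with probability=True on the SVC as just described):
svm_clf = SVC(probability=True)
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft')
voting_clf.fit(X_train, y_train)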
2. Discuss RandomForest classifier.

A Random Forest is an ensemble of Decision Trees, generally trained via the bagging method
(or sometimes pasting), typically with max_samples set to the size of the training set. Instead of
building a BaggingClassifier and passing it a DecisionTreeClassifier, you can instead use the
RandomForestClassifier class, which is more convenient and optimized for Decision Trees
(similarly, there is a RandomForestRegressor class for regression tasks). The following code
trains a Random Forest classifier with 500 trees (each limited to maximum 16 nodes), using all
available CPU cores:
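A sketch of such a call (max_leaf_nodes=16 mirrors the "maximum 16 nodes" above; X_train and y_train are assumed from the earlier split):
from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)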

With a few exceptions, a RandomForestClassifier has all the hyperparameters of a
DecisionTreeClassifier (to control how trees are grown), plus all the hyperparameters of a
BaggingClassifier to control the ensemble itself. The Random Forest algorithm introduces extra
randomness when growing trees; instead of searching for the very best feature when splitting a
node, it searches for the best feature among a random subset of features. This results in a greater
tree diversity, which (once again) trades a higher bias for a lower variance, generally yielding an
overall better model. The following BaggingClassifier is roughly equivalent to the previous
RandomForestClassifier:
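A sketch of the roughly equivalent BaggingClassifier (splitter="random" supplies the extra feature randomness described above; an illustration under the same assumptions as before):
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(splitter="random", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)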

Extra-Trees When you are growing a tree in a Random Forest, at each node only a random
subset of the features is considered for splitting. It is possible to make trees even more random
by also using random thresholds for each feature rather than searching for the best possible
thresholds (like regular Decision Trees do). A forest of such extremely random trees is simply
called an Extremely Randomized Trees ensemble (or Extra-Trees for short). Once again, this
trades more bias for a lower variance. It also makes Extra-Trees much faster to train than regular
Random Forests since finding the best possible threshold for each feature at every node is one of
the most time-consuming tasks of growing a tree. An Extra-Trees classifier can be created using
Scikit-Learn's ExtraTreesClassifier class. Its API is identical to the RandomForestClassifier
class. Similarly, the ExtraTreesRegressor class has the same API as the RandomForestRegressor
class.
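A minimal usage sketch (same hyperparameter values as the Random Forest example above; an illustrative assumption):
from sklearn.ensemble import ExtraTreesClassifier
ext_clf = ExtraTreesClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
ext_clf.fit(X_train, y_train)
y_pred_ext = ext_clf.predict(X_test)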
Note: It is hard to tell in advance whether a RandomForestClassifier will perform better or worse
than an ExtraTreesClassifier. Generally, the only way to know is to try both and compare them
using cross-validation (and tuning the hyperparameters using grid search).
Feature Importance Lastly, if you look at a single Decision Tree, important features are likely to
appear closer to the root of the tree, while unimportant features will often appear closer to the
leaves (or not at all). It is therefore possible to get an estimate of a feature’s importance by
computing the average depth at which it appears across all trees in the forest. Scikit-Learn
computes this automatically for every feature after training. You can access the result using the
feature_importances_ variable. For example, the following code trains a RandomForestClassifier
on the iris dataset and outputs each feature’s importance. It seems that the most important
features are the petal length (44%) and width (42%), while sepal length and width are rather
unimportant in comparison (11% and 2%, respectively):
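A sketch of that code (using Scikit-Learn's bundled iris dataset, as stated above):
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)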

-Training a Random Forest classifier on the MNIST dataset and plot each pixel’s importance
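A hedged sketch of that experiment (assuming MNIST is fetched via fetch_openml; training a forest on the full 70,000-image set can take a while):
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
rnd_clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
rnd_clf.fit(mnist["data"], mnist["target"])
# reshape the 784 per-pixel importances into a 28x28 image and plot it
plt.imshow(rnd_clf.feature_importances_.reshape(28, 28), cmap="hot")
plt.axis("off")
plt.show()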

Random Forests are very handy to get a quick understanding of what features actually matter, in
particular if we need to perform feature selection.

3. What is bagging and pasting? Explain.


One way to get a diverse set of classifiers is to use very different training algorithms. Another
approach is to use the same training algorithm for every predictor, but to train them on different
random subsets of the training set. When sampling is performed with replacement, this method is
called bagging (short for bootstrap aggregating). When sampling is performed without
replacement, it is called pasting.
In other words, both bagging and pasting allow training instances to be sampled several times
across multiple predictors, but only bagging allows training instances to be sampled several
times for the same predictor. This sampling and training process is represented in Figure 7-4.
Once all predictors are trained, the ensemble can make a prediction for a new instance by simply
aggregating the predictions of all predictors. The aggregation function is typically the statistical
mode (i.e., the most frequent prediction, just like a hard voting classifier) for classification, or
the average for regression. Each individual predictor has a higher bias than if it were trained on
the original training set, but aggregation reduces both bias and variance.
Generally, the net result is that the ensemble has a similar bias but a lower variance than a single
predictor trained on the original training set.
As you can see in Figure 7-4, predictors can all be trained in parallel, via different CPU cores or
even different servers. Similarly, predictions can be made in parallel. This is one of the reasons
why bagging and pasting are such popular methods: they scale very well.
Bagging and Pasting in Scikit-Learn
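A minimal sketch of the kind of code this section is based on (500 Decision Trees, each trained on 100 instances sampled with replacement, using all CPU cores; X_train and y_train are assumed from the earlier moons split):
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)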

The BaggingClassifier automatically performs soft voting instead of hard voting if the base
classifier can estimate class probabilities (i.e., if it has a predict_proba() method), which is the
case with Decision Trees classifiers.
Figure 7-5 compares the decision boundary of a single Decision Tree with the decision boundary
of a bagging ensemble of 500 trees (from the preceding code), both trained on the moons dataset.
As you can see, the ensemble’s predictions will likely generalize much better than the single
Decision Tree’s predictions: the ensemble has a comparable bias but a smaller variance (it makes
roughly the same number of errors on the training set, but the decision boundary is less
irregular).
Out-of-Bag Evaluation With bagging, some instances may be sampled several times for any
given predictor, while others may not be sampled at all. By default a BaggingClassifier samples
m training instances with replacement (bootstrap=True), where m is the size of the training set.
This means that only about 63% of the training instances are sampled on average for each
predictor. The remaining 37% of the training instances that are not sampled are called out-of-
bag (oob) instances. Note that they are not the same 37% for all predictors. Since a predictor
never sees the oob instances during training, it can be evaluated on these instances, without the
need for a separate validation set or cross-validation. You can evaluate the ensemble itself by
averaging out the oob evaluations of each predictor. In Scikit-Learn, you can set
oob_score=True when creating a BaggingClassifier to request an automatic oob evaluation after
training. The following code demonstrates this. The resulting evaluation score is available
through the oob_score_ variable:
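A sketch of that code (same BaggingClassifier imports and moons split as above; the score quoted in the comment is from the text, not a guaranteed output):
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=True, n_jobs=-1, oob_score=True)
bag_clf.fit(X_train, y_train)
print(bag_clf.oob_score_)    # about 0.931 according to the text below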

According to this oob evaluation, this BaggingClassifier is likely to achieve about 93.1%
accuracy on the test set. Let’s verify this:
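A sketch of the verification step (standard accuracy_score on the held-out test set):
from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))    # about 0.936 according to the text below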

We get 93.6% accuracy on the test set—close enough! The oob decision function for each
training instance is also available through the oob_decision_function_ variable. In this case
(since the base estimator has a predict_proba() method) the decision function returns the class
probabilities for each training instance. For example, the oob evaluation estimates that the
second training instance has a 60.6% probability of belonging to the positive class (and 39.4% of
belonging to the negative class):

4. Distinguish between Random Patches and Random Subspaces.

The BaggingClassifier class supports sampling the features as well. This is controlled by two
hyperparameters: max_features and bootstrap_features. They work the same way as
max_samples and bootstrap, but for feature sampling instead of instance sampling. Thus, each
predictor will be trained on a random subset of the input features. This is particularly useful
when you are dealing with high-dimensional inputs (such as images). Sampling both training
instances and features is called the Random Patches method. Keeping all training instances (i.e.,
bootstrap=False and max_samples=1.0) but sampling features (i.e., bootstrap_features=True
and/or max_features smaller than 1.0) is called the Random Subspaces method.
Sampling features results in even more predictor diversity, trading a bit more bias for a lower
variance.
Random Patches samples both the training instances and the features; setting the corresponding
parameters in BaggingClassifier() performs this for us. Random Subspaces keeps all the
instances but samples the features.
With Random Subspaces, estimators differ because each is trained on a random subset of the features.
Again, such a solution is achievable by tuning the parameters of BaggingClassifier and
BaggingRegressor, by setting max_features to a number less than 1.0, representing the
percentage of features to be chosen randomly for each model of the ensemble.
Instead, in Random Patches, estimators are built on subsets of both samples and features.
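A minimal sketch of both settings with BaggingClassifier (imports as in the bagging example earlier; the specific fractions are illustrative assumptions):
# Random Subspaces: keep every training instance, sample only the features
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500, bootstrap=False, max_samples=1.0,
    bootstrap_features=True, max_features=0.5, n_jobs=-1)

# Random Patches: sample both the training instances and the features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500, bootstrap=True, max_samples=0.7,
    bootstrap_features=True, max_features=0.5, n_jobs=-1)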
5. What is Boosting? Explain the AdaBoost algorithm.

Boosting (originally called hypothesis boosting) refers to any Ensemble method that can
combine several weak learners into a strong learner. The general idea of most boosting methods
is to train predictors sequentially, each trying to correct its predecessor. There are many boosting
methods available, but by far the most popular are AdaBoost (short for Adaptive Boosting) and
Gradient Boosting. Let’s start with AdaBoost.
AdaBoost One way for a new predictor to correct its predecessor is to pay a bit more attention to
the training instances that the predecessor underfitted. This results in new predictors focusing
more and more on the hard cases. This is the technique used by AdaBoost. For example, to build
an AdaBoost classifier, a first base classifier (such as a Decision Tree) is trained and used to
make predictions on the training set. The relative weight of misclassified training instances is
then increased. A second classifier is trained using the updated weights and again it makes
predictions on the training set, weights are updated, and so on.
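A minimal sketch with Scikit-Learn's AdaBoostClassifier (200 decision stumps; the learning_rate value and the moons training split are assumptions consistent with the surrounding text):
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    learning_rate=0.5)
ada_clf.fit(X_train, y_train)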

Figure 7-8 shows the decision boundaries of five consecutive predictors on the moons dataset (in
this example, each predictor is a highly regularized SVM classifier with an RBF kernel). The
first classifier gets many instances wrong, so their weights get boosted. The second classifier
therefore does a better job on these instances, and so on. The plot on the right represents the
same sequence of predictors except that the learning rate is halved (i.e., the misclassified
instance weights are boosted half as much at every iteration). As you can see, this sequential
learning technique has some similarities with Gradient Descent, except that instead of tweaking
a single predictor’s parameters to minimize a cost function, AdaBoost adds predictors to the
ensemble, gradually making it better.

Once all predictors are trained, the ensemble makes predictions very much like bagging or
pasting, except that predictors have different weights depending on their overall accuracy on the
weighted training set.
Note: There is one important drawback to this sequential learning technique: it cannot be
parallelized (or only partially), since each predictor can only be trained after the previous
predictor has been trained and evaluated. As a result, it does not scale as well as bagging or
pasting.
Note: If your AdaBoost ensemble is overfitting the training set, you can try reducing the number
of estimators or more strongly regularizing the base estimator.
Gradient Boosting Another very popular Boosting algorithm is Gradient Boosting. Just like
AdaBoost, Gradient Boosting works by sequentially adding predictors to an ensemble, each one
correcting its predecessor. However, instead of tweaking the instance weights at every iteration
like AdaBoost does, this method tries to fit the new predictor to the residual errors made by the
previous predictor. Let’s go through a simple regression example using Decision Trees as the
base predictors (of course Gradient Boosting also works great with regression tasks). This is
called Gradient Tree Boosting, or Gradient Boosted Regression Trees (GBRT). First, let’s fit a
DecisionTreeRegressor to the training set (for example, a noisy quadratic training set):
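A minimal sketch of the manual procedure described above (assuming X and y are NumPy arrays for a noisy quadratic training set, and X_new holds the new instances to predict; an illustration, not code from the original notes):
from sklearn.tree import DecisionTreeRegressor
# first tree fits the raw targets
tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)
# second tree fits the residual errors made by the first tree
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)
# third tree fits the residual errors made by the second tree
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)
# the ensemble's prediction for new instances is the sum of the trees' predictions: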

y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))


Figure 7-9 represents the predictions of these three trees in the left column, and the ensemble’s
predictions in the right column. In the first row, the ensemble has just one tree, so its predictions
are exactly the same as the first tree’s predictions. In the second row, a new tree is trained on the
residual errors of the first tree. On the right you can see that the ensemble’s predictions are equal
to the sum of the predictions of the first two trees. Similarly, in the third row another tree is
trained on the residual errors of the second tree. You can see that the ensemble’s predictions
gradually get better as trees are added to the ensemble.
Figure 7-10 shows two GBRT ensembles trained with a low learning rate: the one on the left
does not have enough trees to fit the training set, while the one on the right has too many trees
and overfits the training set.
6. Explain DBSCAN algorithm with example.

Clustering
Clustering is a data science technique in machine learning that groups similar rows in a data set.
After running a clustering technique, a new column appears in the data set to indicate the group
each row of data fits into best.

DBSCAN Clustering in ML | Density based clustering


Clustering analysis or simply Clustering is basically an Unsupervised learning method that
divides the data points into a number of specific batches or groups, such that the data points in
the same groups have similar properties and data points in different groups have different
properties in some sense. It comprises many different methods, which differ mainly in how they measure similarity between data points.

E.g. K-Means (distance between points), Affinity propagation (graph distance), Mean-shift
(distance between points), DBSCAN (distance between nearest points), Gaussian mixtures
(Mahalanobis distance to centers), Spectral clustering (graph distance), etc.
Fundamentally, all clustering methods use the same approach i.e. first we calculate similarities
and then we use it to cluster the data points into groups or batches. Here we will focus on
the Density-based spatial clustering of applications with noise (DBSCAN) clustering
method.
Density-Based Spatial Clustering Of Applications With Noise (DBSCAN)
Clusters are dense regions in the data space, separated by regions of the lower density of points.
The DBSCAN algorithm is based on this intuitive notion of “clusters” and “noise”. The key idea
is that for each point of a cluster, the neighborhood of a given radius has to contain at least a
minimum number of points.

Why DBSCAN?
Partitioning methods (K-means, PAM clustering) and hierarchical clustering work for finding
spherical-shaped clusters or convex clusters. In other words, they are suitable only for compact
and well-separated clusters. Moreover, they are also severely affected by the presence of noise
and outliers in the data.
There are three types of algorithms for K-Medoids Clustering:
PAM (Partitioning Around Medoids)
CLARA (Clustering Large Applications)
CLARANS (Randomized Clustering Large Applications)

Real-life data may contain irregularities, like:


1. Clusters can be of arbitrary shape such as those shown in the figure below.
2. Data may contain noise.

The figure above shows a data set containing non-convex shape clusters and outliers. Given such
data, the k-means algorithm has difficulties in identifying these clusters with arbitrary shapes.

Parameters Required For DBSCAN Algorithm

1. eps: It defines the neighborhood around a data point i.e. if the distance between two points is
lower or equal to ‘eps’ then they are considered neighbors. If the eps value is chosen too small
then a large part of the data will be considered as an outlier. If it is chosen very large then the
clusters will merge and the majority of the data points will be in the same clusters. One way to
find the eps value is based on the k-distance graph.
2. MinPts: Minimum number of neighbors (data points) within eps radius. The larger the dataset,
the larger value of MinPts must be chosen. As a general rule, the minimum MinPts can be
derived from the number of dimensions D in the dataset as MinPts >= D+1. The minimum value
of MinPts must be at least 3.
In this algorithm, we have 3 types of data points.
Core Point: A point is a core point if it has more than MinPts points within eps.
Border Point: A point which has fewer than MinPts within eps but it is in the neighborhood of a
core point.
Noise or outlier: A point which is not a core point or border point.

Steps Used In DBSCAN Algorithm

1. Find all the neighboring points within eps and identify the core points, i.e., points with more than
MinPts neighbors.
2. For each core point if it is not already assigned to a cluster, create a new cluster.
3. Find recursively all its density-connected points and assign them to the same cluster as the core
point.
Points a and b are said to be density connected if there exists a point c which has a sufficient
number of points in its neighborhood and both a and b are within the eps distance of it. This is a
chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which
in turn is a neighbor of a, then b is a neighbor of a.
4. Iterate through the remaining unvisited points in the dataset. Those points that do not belong to
any cluster are noise.
Pseudocode For DBSCAN Clustering Algorithm
DBSCAN(dataset, eps, MinPts) {
    # cluster index
    C = 1
    for each unvisited point p in dataset {
        mark p as visited
        # find neighbors
        N = find the neighboring points of p within eps
        if |N| < MinPts:
            mark p as noise
        else:
            add p to cluster C
            for each point p' in N {
                if p' is not visited:
                    mark p' as visited
                    N' = find the neighboring points of p' within eps
                    if |N'| >= MinPts:
                        N = N U N'
                if p' is not a member of any cluster:
                    add p' to cluster C
            }
            C = C + 1
    }
}
Implementation Of DBSCAN Algorithm Using Machine Learning In Python
Here, we’ll use the Python library sklearn to compute DBSCAN. We’ll also use the
matplotlib.pyplot library for visualizing clusters.
Import Libraries

import matplotlib.pyplot as plt


import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn import datasets

Prepare dataset
We will create a dataset using sklearn for modeling. We use make_blobs to create the dataset.

# Load data in X
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.50, random_state=0)

Modeling The Data Using DBSCAN



db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.


n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

# Plot result

# Black removed and is used for noise instead.


unique_labels = set(labels)
colors = ['y', 'b', 'g', 'r']
print(colors)
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = 'k'

    class_member_mask = (labels == k)

    # plot the core samples of this cluster
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=6)

    # plot the non-core (border) samples of this cluster
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=6)

plt.title('number of clusters: %d' % n_clusters_)
plt.show()

Output:
Cluster of dataset

Evaluation Metrics For DBSCAN Algorithm In Machine Learning

We will use the Silhouette score and the Adjusted Rand score for evaluating clustering algorithms.
The Silhouette score is in the range of -1 to 1. A score near 1 is best, meaning that the data
point i is very compact within the cluster to which it belongs and far away from the other
clusters. The worst value is -1. Values near 0 denote overlapping clusters.
The Adjusted Rand score is in the range of 0 to 1. More than 0.9 denotes excellent cluster recovery,
and above 0.8 is a good recovery. Less than 0.5 is considered to be poor recovery.


# evaluation metrics
sc = metrics.silhouette_score(X, labels)
print("Silhouette Coefficient: %0.2f" % sc)
ari = metrics.adjusted_rand_score(y_true, labels)
print("Adjusted Rand Index: %0.2f" % ari)

Output:
Silhouette Coefficient: 0.13
Adjusted Rand Index: 0.31
Black points represent outliers. By changing the eps and the MinPts, we can change the cluster
configuration.
Now the question that should be raised is –
When Should We Use DBSCAN Over K-Means In Clustering Analysis?
DBSCAN(Density-Based Spatial Clustering of Applications with Noise) and K-Means are both
clustering algorithms that group together data that have the same characteristics. However, they
work on different principles and are suitable for different types of data. We prefer to use
DBSCAN when the data is not spherical in shape or the number of classes is not known
beforehand.
Difference Between DBSCAN and K-Means.

1. DBSCAN: we need not specify the number of clusters. K-Means: it is very sensitive to the number of clusters, which needs to be specified.
2. DBSCAN: clusters formed can be of any arbitrary shape. K-Means: clusters formed are spherical or convex in shape.
3. DBSCAN: works well with datasets having noise and outliers. K-Means: does not work well with outlier data; outliers can skew the clusters to a very large extent.
4. DBSCAN: two parameters are required for training the model. K-Means: only one parameter is required for training the model.

Clusters formed in K-means and DBSCAN

Outlier influence on DBSCAN

7. What is Dimensionality reduction? Describe PCA.

If we feed our model with an excessively large dataset (with a large no. of features/columns),
it gives rise to the problem of overfitting, wherein the model starts getting influenced by
outlier values and noise. This is called the Curse of Dimensionality.
The following graph represents the change in model performance with the increase in the
number of dimensions of the dataset. It can be observed that the model performance is best
only at an optimal number of dimensions, beyond which it starts decreasing.
Model Performance Vs No. of Dimensions (Features) – Curse of Dimensionality

Dimensionality Reduction is a statistical/ML-based technique wherein we try to reduce the
number of features in our dataset and obtain a dataset with an optimal number of dimensions.
One of the most common ways to accomplish Dimensionality Reduction is Feature Extraction,
wherein we reduce the number of dimensions by mapping a higher dimensional feature space
to a lower-dimensional feature space. The most popular technique of Feature Extraction is
Principal Component Analysis (PCA).
As the number of features or dimensions in a dataset increases, the amount of data required to
obtain a statistically significant result increases exponentially. This can lead to issues such as
overfitting, increased computation time, and reduced accuracy of machine learning models this
is known as the curse of dimensionality problems that arise while working with high-
dimensional data.

Principal Component Analysis (PCA)


As stated earlier, Principal Component Analysis is a technique of feature extraction that maps
a higher dimensional feature space to a lower-dimensional feature space. While reducing the
number of dimensions, PCA ensures that maximum information of the original dataset is
retained in the dataset with the reduced no. of dimensions and the correlation between the
newly obtained Principal Components is minimum. The new features obtained after applying
PCA are called Principal Components and are denoted as PCi (i=1,2,3…n). Here, (Principal
Component-1) PC1 captures the maximum information of the original dataset, followed by
PC2, then PC3 and so on.
The following bar graph depicts the amount of Explained Variance captured by various
Principal Components. (The Explained Variance defines the amount of information captured
by the Principal Components).

Explained Variance Vs Principal Components


Using PCA with Python for Dimensionality Reduction:
What is Principal Component Analysis (PCA)?
The Principal Component Analysis (PCA) technique was introduced by the mathematician Karl
Pearson in 1901. It works on the condition that while the data in a higher dimensional space is
mapped to data in a lower dimension space, the variance of the data in the lower dimensional
space should be maximum.
• Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal
transformation to convert a set of correlated variables to a set of uncorrelated variables. PCA
is the most widely used tool in exploratory data analysis and in machine learning for predictive
models.
• Principal Component Analysis (PCA) is an unsupervised learning technique used to
examine the interrelations among a set of variables. It is also known as a general factor analysis
where regression determines a line of best fit.
• The main goal of Principal Component Analysis (PCA) is to reduce the dimensionality of a
dataset while preserving the most important patterns or relationships between the variables
without any prior knowledge of the target variables.
Principal Component Analysis (PCA) is used to reduce the dimensionality of a data set by
finding a new set of variables, smaller than the original set of variables, retaining most of the
sample’s information, and useful for the regression and classification of data.

HOW DO YOU DO A PRINCIPAL COMPONENT ANALYSIS?


1. Standardize the range of continuous initial variables
2. Compute the covariance matrix to identify correlations
3. Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal
components
4. Create a feature vector to decide which principal components to keep
5. Recast the data along the principal components axes
STEP 1: STANDARDIZATION
The aim of this step is to standardize the range of the continuous initial variables so that each
one of them contributes equally to the analysis.
Mathematically, this can be done by subtracting the mean and dividing by the standard deviation
for each value of each variable.
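In symbols, each value x of a variable with mean μ and standard deviation σ is replaced by z = (x − μ) / σ (the standard z-score; the notation is added here for clarity).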

Once the standardization is done, all the variables will be transformed to the same scale.

STEP 2: COVARIANCE MATRIX COMPUTATION

The aim of this step is to understand how the variables of the input data set are varying from the
mean with respect to each other, or in other words, to see if there is any relationship between
them. Because sometimes, variables are highly correlated in such a way that they contain
redundant information. So, in order to identify these correlations, we compute the covariance
matrix.
STEP 3: COMPUTE THE EIGENVECTORS AND EIGENVALUES OF THE COVARIANCE
MATRIX TO IDENTIFY THE PRINCIPAL COMPONENTS

Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute from the
covariance matrix in order to determine the principal components of the data.

What you first need to know about eigenvectors and eigenvalues is that they always come in
pairs, so that every eigenvector has an eigenvalue. Also, their number is equal to the number of
dimensions of the data. For example, for a 3-dimensional data set, there are 3 variables, therefore
there are 3 eigenvectors with 3 corresponding eigenvalues.

It is the eigenvectors and eigenvalues that are behind all the magic of principal components, because
the eigenvectors of the Covariance matrix are actually the directions of the axes where there is
the most variance (most information) and that we call Principal Components. And eigenvalues
are simply the coefficients attached to eigenvectors, which give the amount of variance carried
in each Principal Component.

Principal Component Analysis Example:

Let’s suppose that our data set is 2-dimensional with 2 variables x,y and that the eigenvectors
and eigenvalues of the covariance matrix are as follows:

If we rank the eigenvalues in descending order, we get λ1>λ2, which means that the eigenvector
that corresponds to the first principal component (PC1) is v1 and the one that corresponds to the
second principal component (PC2) is v2.

After having the principal components, to compute the percentage of variance (information)
accounted for by each component, we divide the eigenvalue of each component by the sum of
eigenvalues. If we apply this on the example above, we find that PC1 and PC2 carry respectively
96 percent and 4 percent of the variance of the data.
STEP 4: CREATE A FEATURE VECTOR

As we saw in the previous step, computing the eigenvectors and ordering them by their
eigenvalues in descending order, allow us to find the principal components in order of
significance. In this step, what we do is, to choose whether to keep all these components or
discard those of lesser significance (of low eigenvalues), and form with the remaining ones a
matrix of vectors that we call Feature vector.

Principal Component Analysis Example:

Continuing with the example from the previous step, we can either form a feature vector with
both of the eigenvectors v1 and v2:

Or discard the eigenvector v2, which is the one of lesser significance, and form a feature vector
with v1 only:

Discarding the eigenvector v2 will reduce dimensionality by 1, and will consequently cause a
loss of information in the final data set. But given that v2 was carrying only 4 percent of the
information, the loss will be therefore not important and we will still have 96 percent of the
information that is carried by v1.

STEP 5: RECAST THE DATA ALONG THE PRINCIPAL COMPONENTS AXES

In the previous steps, apart from standardization, you do not make any changes on the data, you
just select the principal components and form the feature vector, but the input data set remains
always in terms of the original axes (i.e, in terms of the initial variables).

In this step, which is the last one, the aim is to use the feature vector formed using the
eigenvectors of the covariance matrix, to reorient the data from the original axes to the ones
represented by the principal components (hence the name Principal Components Analysis). This
can be done by multiplying the transpose of the original data set by the transpose of the feature
vector.
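In matrix notation, as this projection step is usually written: FinalDataSet = FeatureVector^T × StandardizedOriginalDataSet^T.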
Steps to Apply PCA in Python for Dimensionality Reduction

We will understand the step by step approach of applying Principal Component Analysis in
Python with an example. In this example, we will use the iris dataset, which is already present
in the sklearn library of Python.
Step-1: Import necessary libraries
All the necessary libraries required to load the dataset, pre-process it and then apply PCA on it
are mentioned below:
# Import necessary libraries
from sklearn import datasets # to retrieve the iris Dataset
import pandas as pd # to load the dataframe
from sklearn.preprocessing import StandardScaler # to standardize the features
from sklearn.decomposition import PCA # to apply PCA
import seaborn as sns # to plot the heat maps
Step-2: Load the dataset
After importing all the necessary libraries, we need to load the dataset. Now, the iris dataset is
already present in sklearn. First, we will load it and then convert it into a pandas data frame for
ease of use.
#Load the Dataset
iris = datasets.load_iris()
#convert the dataset into a pandas data frame
df = pd.DataFrame(iris['data'], columns = iris['feature_names'])
#display the head (first 5 rows) of the dataset
df.head()

Step-3: Standardize the features


Before applying PCA or any other Machine Learning technique it is always considered good
practice to standardize the data. For this, Standard Scalar is the most commonly used
scalar. Standard Scalar is already present in sklearn. So, now we will standardize the feature
set using Standard Scalar and store the scaled feature set as a pandas data frame.
#Standardize the features
#Create an object of StandardScaler which is present in sklearn.preprocessing
scalar = StandardScaler()
scaled_data = pd.DataFrame(scalar.fit_transform(df)) #scaling the data
scaled_data
Step-4: Check the correlation between features without PCA (Optional)

Now, we will check the correlation between the features of our scaled dataset using a heat map. For this, we
have already imported the seaborn library in Step-1. The correlation between various features
is given by the corr() function and then the heat map is plotted by the heatmap() function. The
colour scale on the side of the heatmap helps determine the magnitude of the correlation. In
our example, we can clearly see that a darker shade represents less correlation while a lighter
shade represents more correlation. The diagonal of the heatmap represents the correlation of a
feature with itself – which is always 1.0, thus, the diagonal of the heatmap is of the highest
shade.
#Check the correlation between features without PCA
sns.heatmap(scaled_data.corr())

We can observe from the above heatmap that sepal length & petal length and petal length &
petal width have high correlation. Thus, we evidently need to apply dimensionality reduction.
If you are already aware that your dataset needs dimensionality reduction – you can skip this
step.
Step-5: Applying Principal Component Analysis
We will apply PCA on the scaled dataset. For this Python offers yet another in-built class
called PCA which is present in sklearn.decomposition, which we have already imported in
step-1. We need to create an object of PCA and while doing so we also need to initialize
n_components – which is the number of principal components we want in our final dataset.
Here, we have taken n_components = 3, which means our final feature set will have 3
columns. We fit our scaled data to the PCA object which gives us our reduced dataset.

#Applying PCA
#Taking no. of Principal Components as 3
pca = PCA(n_components = 3)
pca.fit(scaled_data)
data_pca = pca.transform(scaled_data)
data_pca = pd.DataFrame(data_pca,columns=['PC1','PC2','PC3'])
data_pca.head()

Step-6: Checking correlation between features after PCA


Now that we have applied PCA and obtained the reduced feature set, we will check the
correlation between the various Principal Components, again by using a heatmap.
#Checking correlation between features after PCA
sns.heatmap(data_pca.corr())

The above heatmap clearly depicts that there is no correlation between the obtained
principal components (PC1, PC2, and PC3). Thus, we have moved from a higher-dimensional
feature space to a lower-dimensional feature space while ensuring that the correlation
between the obtained PCs is minimal. Hence, we have accomplished the objectives of
PCA.
Advantages of Principal Component Analysis (PCA):
1. For efficient working of ML models, our feature set needs to have features with no correlation.
After implementing the PCA on our dataset, all the Principal Components are independent –
there is no correlation among them.
2. A large number of features leads to the issue of overfitting in models. PCA reduces the
dimensions of the feature set – thereby reducing the chances of overfitting.
3. PCA helps us reduce the dimensions of our feature set; thus, the newly formed dataset
comprising Principal Components needs less disk/cloud space for storage while retaining
maximum information.

6. Define ANN. Draw and explain the Neural Network architecture.


An Artificial Neural Network (ANN) is an efficient computing system whose central theme is borrowed from the analogy of
biological neural networks. ANNs are also named as “artificial neural systems,” or “parallel
distributed processing systems,” or “connectionist systems.” ANN acquires a large collection of
units that are interconnected in some pattern to allow communication between the units. These
units, also referred to as nodes or neurons, are simple processors which operate in parallel.
Every neuron is connected to other neurons through connection links. Each connection link is
associated with a weight that has information about the input signal. This is the most useful
information for neurons to solve a particular problem because the weight usually excites or
inhibits the signal that is being communicated. Each neuron has an internal state, which is called
an activation signal. Output signals, which are produced after combining the input signals and
activation rule, may be sent to other units.

Biological Neuron

A nerve cell (neuron) is a special biological cell that processes information.

According to an estimation, there are a huge number of neurons, approximately 10^11, with
numerous interconnections, approximately 10^15.
Schematic Diagram

Working of a Biological Neuron

As shown in the above diagram, a typical neuron consists of the following four parts with the
help of which we can explain its working −

• Dendrites − They are tree-like branches, responsible for receiving information from the other
neurons the neuron is connected to. In another sense, we can say that they are like the ears of the neuron.
• Soma − It is the cell body of the neuron and is responsible for processing the information
received from the dendrites.
• Axon − It is just like a cable through which the neuron sends the information.
• Synapses − They are the connections between the axon and other neurons' dendrites.
ANN versus BNN
Before taking a look at the differences between Artificial Neural Network ANN and Biological
Neural Network BNN, let us take a look at the similarities based on the terminology between
these two.
Biological Neural Network (BNN) → Artificial Neural Network (ANN)
Soma → Node
Dendrites → Input
Synapse → Weights or Interconnections
Axon → Output

Model of Artificial Neural Network

The following diagram represents the general model of ANN followed by its processing.

For the above general model of artificial neural network, the net input can be calculated as
follows −

The output can be calculated by applying the activation function over the net input.
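In the usual formulation (assuming inputs x1, x2, …, xm with associated weights w1, w2, …, wm), the net input is y_in = x1·w1 + x2·w2 + … + xm·wm = Σ xi·wi, and the output is Y = F(y_in), where F denotes the activation function.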

7. What is TensorFlow? Explain how preprocessing is done using TensorFlow.


TensorFlow Transform is a library for preprocessing input data for TensorFlow, including
creating features that require a full pass over the training dataset. For example, using
TensorFlow Transform you could: Normalize an input value by using the mean and standard
deviation.
TensorFlow is an open-source library developed by Google primarily for deep learning
applications. It also supports traditional machine learning. TensorFlow was originally developed
for large numerical computations without keeping deep learning in mind.
Tensors are multi-dimensional arrays with a uniform type (called a dtype). You can see all
supported dtypes at tf.dtypes. If you're familiar with NumPy, tensors are (kind of) like np.arrays.
An Operation is a node in a tf.Graph that takes zero or more Tensor objects as input, and
produces zero or more Tensor objects as output. Objects of type Operation are created by calling
a Python op constructor (such as tf.matmul) within a tf.Graph context.
TensorFlow allows you to define and build machine learning and deep learning models using a
high-level API known as Keras (integrated into TensorFlow as tf.keras). Keras offers a simple
and user-friendly way to create models by stacking layers.
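A minimal sketch of preprocessing with a built-in Keras layer (this assumes TensorFlow 2.6+ and a small hypothetical data array; it is an illustration, not the only way to preprocess in TensorFlow):
import numpy as np
import tensorflow as tf

# hypothetical numeric training data: 3 samples, 2 features
data = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]], dtype="float32")

# the Normalization layer learns each feature's mean and variance from the data
normalizer = tf.keras.layers.Normalization()
normalizer.adapt(data)        # full pass over the data to compute the statistics
print(normalizer(data))       # inputs rescaled to roughly zero mean and unit variance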

Preprocessing data for ML


Data preprocessing is an important step in the data mining process that involves cleaning and
transforming raw data to make it suitable for analysis. Some common steps in data
preprocessing include:
Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data,
such as missing values, outliers, and duplicates. Various techniques can be used for data
cleaning, such as imputation, removal, and transformation.
Data Integration: This involves combining data from multiple sources to create a unified
dataset. Data integration can be challenging as it requires handling data with different formats,
structures, and semantics. Techniques such as record linkage and data fusion can be used for
data integration.
Data Transformation: This involves converting the data into a suitable format for analysis.
Common techniques used in data transformation include normalization, standardization, and
discretization. Normalization is used to scale the data to a common range, while
standardization is used to transform the data to have zero mean and unit variance.
Discretization is used to convert continuous data into discrete categories.
Data Reduction: This involves reducing the size of the dataset while preserving the important
information. Data reduction can be achieved through techniques such as feature selection and
feature extraction. Feature selection involves selecting a subset of relevant features from the
dataset, while feature extraction involves transforming the data into a lower-dimensional space
while preserving the important information.
Data Discretization: This involves dividing continuous data into discrete categories or
intervals. Discretization is often used in data mining and machine learning algorithms that
require categorical data. Discretization can be achieved through techniques such as equal
width binning, equal frequency binning, and clustering.
Data Normalization: This involves scaling the data to a common range, such as between 0
and 1 or -1 and 1. Normalization is often used to handle data with different units and scales.
Common normalization techniques include min-max normalization, z-score normalization, and
decimal scaling.
Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy of the
analysis results. The specific steps involved in data preprocessing may vary depending on the
nature of the data and the analysis goals.
By performing these steps, the data mining process becomes more efficient and the results
become more accurate.
Preprocessing in Data Mining:
Data preprocessing is a data mining technique which is used to transform the raw data in a
useful and efficient format.
Steps Involved in Data Preprocessing:
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data cleaning is
done. It involves handling of missing data, noisy data etc.

(a). Missing Data:


This situation arises when some data is missing in the data. It can be handled in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple values are
missing within a tuple.

2. Fill the Missing values:


There are various ways to do this task. You can choose to fill the missing values manually, by
attribute mean or the most probable value.

(b). Noisy Data:


Noisy data is meaningless data that can't be interpreted by machines. It can be generated due
to faulty data collection, data entry errors, etc. It can be handled in the following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into
segments of equal size and then various methods are performed to complete the task. Each
segment is handled separately. One can replace all data in a segment by its mean, or
boundary values can be used to complete the task.
2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may
be linear (having one independent variable) or multiple (having multiple independent
variables).

3. Clustering:
This approach groups the similar data in a cluster. The outliers may be undetected or they will fall
outside the clusters.
2. Data Transformation:
This step is taken in order to transform the data in appropriate forms suitable for mining
process. This involves following ways:
1. Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to 1.0)

2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the
mining process.
3. Discretization:
This is done to replace the raw values of numeric attribute by interval levels or conceptual
levels.

4. Concept Hierarchy Generation:


Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the
attribute “city” can be converted to “country”.

3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves reducing the size of the
dataset while preserving the important information. This is done to improve the efficiency of
data analysis and to avoid overfitting of the model. Some common steps involved in data
reduction are:
Feature Selection: This involves selecting a subset of relevant features from the dataset.
Feature selection is often performed to remove irrelevant or redundant features from the
dataset. It can be done using various techniques such as correlation analysis, mutual
information, and principal component analysis (PCA).
Feature Extraction: This involves transforming the data into a lower-dimensional space
while preserving the important information. Feature extraction is often used when the original
features are high-dimensional and complex. It can be done using techniques such as PCA,
linear discriminant analysis (LDA), and non-negative matrix factorization (NMF).
Sampling: This involves selecting a subset of data points from the dataset. Sampling is often
used to reduce the size of the dataset while preserving the important information. It can be
done using techniques such as random sampling, stratified sampling, and systematic sampling.
Clustering: This involves grouping similar data points together into clusters. Clustering is
often used to reduce the size of the dataset by replacing similar data points with a
representative centroid. It can be done using techniques such as k-means, hierarchical
clustering, and density-based clustering.
Compression: This involves compressing the dataset while preserving the important
information. Compression is often used to reduce the size of the dataset for storage and
transmission purposes. It can be done using techniques such as wavelet compression, JPEG
compression, and gzip compression.

Need of Data Preprocessing


• For achieving better results from the applied model in Machine Learning projects, the format of
the data has to be proper.
Steps in Data Preprocessing
Step 1: Import the necessary libraries
# importing libraries
import pandas as pd
import scipy
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
import matplotlib.pyplot as plt
Step 2: Load the dataset
Dataset link: [https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database]
# Load the dataset
df = pd.read_csv('Geeksforgeeks/Data/diabetes.csv')
print(df.head())
Check the data info
df.info()

As we can see from the above info, our dataset has 9 columns and each column has
768 values. There are no null values in the dataset.
We can also check the null values using df.isnull()
df.isnull().sum()
Output:
Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
Step 3: Statistical Analysis
In statistical analysis, first, we use the df.describe() which will give a descriptive overview of
the dataset.
df.describe()


Output:

Data summary

The above table shows the count, mean, standard deviation, min, 25%, 50%, 75%, and max
values for each column. When we carefully observe the table, we will find that the Insulin,
Pregnancies, BMI, and BloodPressure columns have outliers.
Let's plot the boxplot for each column for easy understanding.
Step 4: Check the outliers:


# Box Plots
fig, axs = plt.subplots(9, 1, dpi=95, figsize=(7, 17))
i = 0
for col in df.columns:
    axs[i].boxplot(df[col], vert=False)
    axs[i].set_ylabel(col)
    i += 1
plt.show()

Output:
Boxplots

From the above boxplot, we can clearly see that almost every column has some amount of
outliers.
Drop the outliers


# Identify the quartiles for Insulin
q1, q3 = np.percentile(df['Insulin'], [25, 75])
# Calculate the interquartile range
iqr = q3 - q1
# Calculate the lower and upper bounds
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)
# Drop the outliers
clean_data = df[(df['Insulin'] >= lower_bound)
                & (df['Insulin'] <= upper_bound)]

# Identify the quartiles for Pregnancies
q1, q3 = np.percentile(clean_data['Pregnancies'], [25, 75])
iqr = q3 - q1
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)
clean_data = clean_data[(clean_data['Pregnancies'] >= lower_bound)
                        & (clean_data['Pregnancies'] <= upper_bound)]

# Identify the quartiles for Age
q1, q3 = np.percentile(clean_data['Age'], [25, 75])
iqr = q3 - q1
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)
clean_data = clean_data[(clean_data['Age'] >= lower_bound)
                        & (clean_data['Age'] <= upper_bound)]

# Identify the quartiles for Glucose
q1, q3 = np.percentile(clean_data['Glucose'], [25, 75])
iqr = q3 - q1
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)
clean_data = clean_data[(clean_data['Glucose'] >= lower_bound)
                        & (clean_data['Glucose'] <= upper_bound)]

# Identify the quartiles for BloodPressure
# (note: a tighter 0.75 * IQR bound is used for this column)
q1, q3 = np.percentile(clean_data['BloodPressure'], [25, 75])
iqr = q3 - q1
lower_bound = q1 - (0.75 * iqr)
upper_bound = q3 + (0.75 * iqr)
clean_data = clean_data[(clean_data['BloodPressure'] >= lower_bound)
                        & (clean_data['BloodPressure'] <= upper_bound)]

# Identify the quartiles for BMI
q1, q3 = np.percentile(clean_data['BMI'], [25, 75])
iqr = q3 - q1
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)
clean_data = clean_data[(clean_data['BMI'] >= lower_bound)
                        & (clean_data['BMI'] <= upper_bound)]

# Identify the quartiles for DiabetesPedigreeFunction
q1, q3 = np.percentile(clean_data['DiabetesPedigreeFunction'], [25, 75])
iqr = q3 - q1
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)
clean_data = clean_data[(clean_data['DiabetesPedigreeFunction'] >= lower_bound)
                        & (clean_data['DiabetesPedigreeFunction'] <= upper_bound)]
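Since the same IQR rule is applied column by column, the block above can be condensed into a loop. Below is a minimal, behavior-equivalent sketch (the helper dictionary iqr_factors and its name are my own; the 0.75 factor for BloodPressure simply mirrors the code above):

# Compact sketch of the IQR-based outlier removal above.
# Assumes df and np (numpy) are already defined as in the preceding steps.
iqr_factors = {
    'Insulin': 1.5, 'Pregnancies': 1.5, 'Age': 1.5, 'Glucose': 1.5,
    'BloodPressure': 0.75,  # tighter bound, as in the code above
    'BMI': 1.5, 'DiabetesPedigreeFunction': 1.5,
}

clean_data = df.copy()
for col, factor in iqr_factors.items():
    q1, q3 = np.percentile(clean_data[col], [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - factor * iqr
    upper_bound = q3 + factor * iqr
    clean_data = clean_data[(clean_data[col] >= lower_bound)
                            & (clean_data[col] <= upper_bound)]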

Step 5: Correlation

 Python3

# Correlation heatmap
corr = df.corr()
plt.figure(dpi=130)
sns.heatmap(corr, annot=True, fmt='.2f')
plt.show()

Output:
Correlation

We can also compare the correlation of each column with the Outcome, sorted in descending order.

 Python3

corr['Outcome'].sort_values(ascending = False)

Output:
Outcome 1.000000
Glucose 0.466581
BMI 0.292695
Age 0.238356
Pregnancies 0.221898
DiabetesPedigreeFunction 0.173844
Insulin 0.130548
SkinThickness 0.074752
BloodPressure 0.0
Check Outcomes Proportionality

 Python3

plt.pie(df.Outcome.value_counts(),
        # value_counts() lists the majority class (Outcome = 0, no diabetes) first
        labels=['Not Diabetes', 'Diabetes'],
        autopct='%.f', shadow=True)
plt.title('Outcome Proportionality')
plt.show()

Output:
Outcome Proportionality

Step 6: Separate the independent features and the target variable

 Python3

# separate array into input and output components


X = df.drop(columns =['Outcome'])
Y = df.Outcome

Step 7: Normalization or Standardization


Normalization
 MinMaxScaler scales the data so that each feature is in the range [0, 1].
 It works well when the features have different scales and the algorithm being used is sensitive
to the scale of the features, such as k-nearest neighbors or neural networks.
 Rescale your data with scikit-learn using the MinMaxScaler class.
 Python3

# Initialising the MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))

# Learning the statistical parameters for each feature and transforming the data
rescaledX = scaler.fit_transform(X)
rescaledX[:5]

Output:
array([[0.353, 0.744, 0.59 , 0.354, 0. , 0.501, 0.234, 0.483],
[0.059, 0.427, 0.541, 0.293, 0. , 0.396, 0.117, 0.167],
[0.471, 0.92 , 0.525, 0. , 0. , 0.347, 0.254, 0.183],
[0.059, 0.447, 0.541, 0.232, 0.111, 0.419, 0.038, 0. ],
[0. , 0.688, 0.328, 0.354, 0.199, 0.642, 0.944, 0.2 ]])
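As a quick sanity check (assuming the standard Pima Indians Diabetes dataset, where Pregnancies ranges from 0 to 17): the first row has 6 pregnancies, and (6 - 0) / (17 - 0) ≈ 0.353, which matches the first value in the rescaled output above.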
Standardization
 Standardization is a useful technique to transform attributes with a Gaussian distribution and
differing means and standard deviations to a standard Gaussian distribution with a mean of 0
and a standard deviation of 1.
 We can standardize data using scikit-learn with the StandardScaler class.
 It works well when the features have a roughly normal distribution, or when the algorithm being used is sensitive to the scale of the features.
 Python3
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
rescaledX[:5]

Output:
array([[ 0.64 ,  0.848,  0.15 ,  0.907, -0.693,  0.204,  0.468,  1.426],
       [-0.845, -1.123, -0.161,  0.531, -0.693, -0.684, -0.365, -0.191],
       [ 1.234,  1.944, -0.264, -1.288, -0.693, -1.103,  0.604, -0.106],
       [-0.845, -0.998, -0.161,  0.155,  0.123, -0.494, -0.921, -1.042],
       [-1.142,  0.504, -1.505,  0.907,  0.766,  1.41 ,  5.485,  ...]])
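Again as a rough check (assuming the standard Pima dataset, where Pregnancies has a mean of about 3.85 and a standard deviation of about 3.37): z = (x - mean) / std = (6 - 3.85) / 3.37 ≈ 0.64, which matches the first value of the standardized output above.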

8. Discuss how k-means clustering is used for image segmentation.
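K-Means can segment an image by treating every pixel as a data point (e.g., its RGB values), clustering the pixels into K colour groups, and replacing each pixel with the centroid of its cluster, so the image is partitioned into K homogeneous regions. A minimal sketch using scikit-learn is given below (the file name sample.jpg and the choice of K = 4 are placeholders):

# K-Means colour segmentation sketch (assumes an RGB image file is available)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

image = plt.imread('sample.jpg')              # shape: (height, width, 3); hypothetical file
pixels = image.reshape(-1, 3).astype(float)   # one row per pixel, columns = R, G, B

kmeans = KMeans(n_clusters=4, random_state=0).fit(pixels)   # K = 4 segments
segmented = kmeans.cluster_centers_[kmeans.labels_]         # replace each pixel by its cluster centroid
segmented = segmented.reshape(image.shape).astype(image.dtype)

plt.imshow(segmented)
plt.axis('off')
plt.show()

Increasing K gives finer segments; decreasing it merges regions into fewer dominant colours. The notes below on distance measures and the step-by-step K-Means walkthrough explain the clustering machinery this relies on.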



Distance based models


https://machinelearningmastery.com/distance-measures-for-machine-learning/

from math import sqrt

# Calculate the Minkowski distance between two vectors
def minkowski_distance(a, b, p):
    return sum(abs(e1 - e2)**p for e1, e2 in zip(a, b))**(1 / p)

# Define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# Calculate distance (p=1, equivalent to the Manhattan distance)
dist = minkowski_distance(row1, row2, 1)
print(dist)   # 13.0
# Calculate distance (p=2, equivalent to the Euclidean distance)
dist = minkowski_distance(row1, row2, 2)
print(dist)   # 6.082762530298219

2. Manhattan distance
# Calculating the Manhattan (city block) distance between vectors
from scipy.spatial.distance import cityblock
# Define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# Calculate distance
dist = cityblock(row1, row2)
print(dist)   # 13
3. Euclidean distance
# Calculating the Euclidean distance between vectors
from scipy.spatial.distance import euclidean
# Define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# Calculate distance
dist = euclidean(row1, row2)
print(dist)   # 6.082762530298219

4. Hamming distance
# Calculating the Hamming distance between bit strings using SciPy
from scipy.spatial.distance import hamming
# Define data
row1 = [0, 0, 0, 0, 0, 1]
row2 = [0, 0, 0, 0, 1, 0]
# Calculate distance
dist = hamming(row1, row2)
print(dist)

0.3333333333333333

# Calculating the Hamming distance between bit strings manually
def hamming_distance(a, b):
    return sum(abs(e1 - e2) for e1, e2 in zip(a, b)) / len(a)

# Define data
row1 = [0, 0, 0, 0, 0, 1]
row2 = [0, 0, 0, 0, 1, 0]
# Calculate distance
dist = hamming_distance(row1, row2)
print(dist)
0.3333333333333333

https://www.w3schools.com/python/python_ml_k-means.asp
---------------------------------------------------------------
K means
https://blog.floydhub.com/introduction-to-k-means-clustering-in-python-with-scikit-learn/
We will use make_blobs from the sklearn.datasets module for this.
# Imports
# (sklearn.datasets.samples_generator was removed in newer scikit-learn versions)
from sklearn.datasets import make_blobs

# Generate 2D data points
X, _ = make_blobs(n_samples=10, centers=3, n_features=2, cluster_std=0.2, random_state=0)

# Convert the data points into a pandas DataFrame
import pandas as pd

# Generate indicators (names) for the data points
obj_names = []
for i in range(1, 11):
    obj = "Object " + str(i)
    obj_names.append(obj)

# Create a pandas DataFrame with the names and (x, y) coordinates
data = pd.DataFrame({
    'Object': obj_names,
    'X_value': X[:, 0],
    'Y_value': X[:, -1]
})

# Preview the data
print(data.head())
 Initialize random centroids
You start the process by taking three (since we chose K to be 3) random points (in the form of (x,
y)). These points are called centroids, which is just a fancy name for cluster centers. Let's
name these three points C1, C2, and C3 so that you can refer to them later.

Step 1 in K-Means: Random centroids


 Calculate distances between the centroids and the data points
Next, you measure the distances of the data points from these three randomly chosen
points. A very popular choice of distance measurement function, in this case, is
the Euclidean distance.
Briefly, if there are n points in a 2D space (just like the above figure) and their
coordinates are denoted by (x_i, y_i), then the Euclidean distance between any two points
(x1, y1) and (x2, y2) in this space is given by:

d = sqrt((x2 - x1)^2 + (y2 - y1)^2)   (Equation for Euclidean distance)


Suppose the coordinates of C1, C2 and C3 are - (-1, 4), (-0.2, 1.5) and (2, 2.5)
respectively. Let’s now write a few lines of Python code which will calculate the
Euclidean distances between the data-points and these randomly chosen centroids.
We start by initializing the centroids.
# Initialize the centroids
c1 = (-1, 4)
c2 = (-0.2, 1.5)
c3 = (2, 2.5)
Next, we write a small helper function to calculate the Euclidean distances between the data
points and centroids.
# A helper function to calculate the Euclidean distance between the data
# points and the centroids
import numpy as np

def calculate_distance(centroid, X, Y):
    distances = []

    # Unpack the x and y coordinates of the centroid
    c_x, c_y = centroid

    # Iterate over the data points and calculate the distance using the
    # given formula
    for x, y in list(zip(X, Y)):
        root_diff_x = (x - c_x) ** 2
        root_diff_y = (y - c_y) ** 2
        distance = np.sqrt(root_diff_x + root_diff_y)
        distances.append(distance)

    return distances
We can now apply this function to the data points and assign the results to the DataFrame
accordingly.
# Calculate the distance and assign them to the DataFrame accordingly
data['C1_Distance'] = calculate_distance(c1, data.X_value, data.Y_value)
data['C2_Distance'] = calculate_distance(c2, data.X_value, data.Y_value)
data['C3_Distance'] = calculate_distance(c3, data.X_value, data.Y_value)

# Preview the data


print(data.head())
The output should look like the following:

Snapshot of the Euclidean distances between the data points and the centroids
Time to study the next step in the algorithm.
 Compare, assign, mean and repeat
This is fundamentally the last step of the K-Means clustering algorithm. Once you have
the distances between the data points and the centroids, you compare them and take the
smallest one for each data point: each data point is assigned to the cluster of the centroid
it is closest to. Let's do this programmatically.
# For each data point, find the column (centroid) with the minimum distance
# (idxmin is used so that column labels are returned, which the map below relies on)
data['Cluster'] = data[['C1_Distance', 'C2_Distance', 'C3_Distance']].idxmin(axis=1)

# Map the centroid column names to cluster labels
data['Cluster'] = data['Cluster'].map({'C1_Distance': 'C1', 'C2_Distance': 'C2',
                                       'C3_Distance': 'C3'})

# Get a preview of the data


print(data.head(10))

You get a nicely formatted output:

Clusters after one iteration of K-means


With this step, we complete one iteration of the K-Means clustering algorithm. Take a
closer look at the output - there's no C2 in there.
Now comes the most interesting part: updating the centroids by determining
the mean values of the coordinates of the data points (each of which should belong
to some centroid by now). Hence the name K-Means. The mean calculation for a cluster is:

new centroid = ( (1/n) * sum of the x values, (1/n) * sum of the y values )

Mean update in K-Means (n denotes the number of data points belonging to a cluster)
The following lines of code do this for you:
# Calculate the coordinates of the new centroid from cluster 1
x_new_centroid1 = data[data['Cluster']=='C1']['X_value'].mean()
y_new_centroid1 = data[data['Cluster']=='C1']['Y_value'].mean()

# Calculate the coordinates of the new centroid from cluster 2


x_new_centroid2 = data[data['Cluster']=='C3']['X_value'].mean()
y_new_centroid2 = data[data['Cluster']=='C3']['Y_value'].mean()

# Print the coordinates of the new centroids


print('Centroid 1 ({}, {})'.format(x_new_centroid1, y_new_centroid1))
print('Centroid 2 ({}, {})'.format(x_new_centroid2, y_new_centroid2))
You get:
Coordinates of the new centroids
Notice that the algorithm, in its first iteration, grouped the data points into two clusters
although we specified this number to be 3. The following animation gives you a pretty good
overview of how centroid updates take place in the K-Means algorithm.

Centroid updates in K-Means
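In practice, this assign-and-update loop is repeated until the centroids stop moving (or a maximum number of iterations is reached). A minimal sketch that runs the full algorithm on the same blob data with scikit-learn is given below (the variable name kmeans is illustrative):

# Run the full K-Means loop on the same generated data using scikit-learn
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, init='random', n_init=10, max_iter=300, random_state=0)
kmeans.fit(X)                       # X is the array returned by make_blobs above

print(kmeans.cluster_centers_)      # final centroid coordinates after convergence
print(kmeans.labels_)               # cluster assignment for each of the 10 data points
print(kmeans.inertia_)              # sum of squared distances to the nearest centroid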
