Mid2 Answers
Introduction
Ensemble Learning: A group of predictors is called an ensemble; thus, this technique is called
Ensemble Learning, and an Ensemble Learning algorithm is called an Ensemble method.
For example, you can train a group of Decision Tree classifiers, each on a different random
subset of the training set. To make predictions, you just obtain the predictions of all individual
trees, then predict the class that gets the most votes.
Voting Classifier
Suppose you have trained a few classifiers, each one achieving about 80% accuracy. You may
have a Logistic Regression classifier, an SVM classifier, a Random Forest classifier, a K-Nearest
Neighbors classifier, and perhaps a few more.
A very simple way to create an even better classifier is to aggregate the predictions of each
classifier and predict the class that gets the most votes. This majority-vote classifier is called a
hard voting classifier.
Somewhat surprisingly, this voting classifier often achieves a higher accuracy than the best
classifier in the ensemble. In fact, even if each classifier is a weak learner (meaning it does only
slightly better than random guessing), the ensemble can still be a strong learner (achieving high
accuracy), provided there are a sufficient number of weak learners and they are sufficiently
diverse.
How is this possible? The following analogy can help shed some light on this mystery. Suppose
you have a slightly biased coin that has a 51% chance of coming up heads, and 49% chance of
coming up tails. If you toss it 1,000 times, you will generally get more or less 510 heads and 490
tails, and hence a majority of heads. If you do the math, you will find that the probability of
obtaining a majority of heads after 1,000 tosses is close to 75%. The more you toss the coin, the
higher the probability (e.g., with 10,000 tosses, the probability climbs over 97%). This is due to
the law of large numbers: as you keep tossing the coin, the ratio of heads gets closer and closer
to the probability of heads (51%). Figure 7-3 shows 10 series of biased coin tosses. You can see
that as the number of tosses increases, the ratio of heads approaches 51%. Eventually all 10
series end up so close to 51% that they are consistently above 50%.
Similarly, suppose you build an ensemble containing 1,000 classifiers that are individually
correct only 51% of the time (barely better than random guessing). If you predict the majority
voted class, you can hope for up to 75% accuracy! However, this is only true if all classifiers are
perfectly independent, making uncorrelated errors, which is clearly not the case since they are
trained on the same data.
Note: Ensemble methods work best when the predictors are as independent from one another as
possible.
#Implementation Dataset - Moons
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# build the moons dataset and split it into a training set and a test set
# (the size/noise values are illustrative)
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')
voting_clf.fit(X_train, y_train)
Let’s look at each classifier’s accuracy on the test set:
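The accuracy-comparison code is not reproduced in these notes; a minimal version, assuming the train/test split created above, could look like this:

from sklearn.metrics import accuracy_score

# compare each individual classifier with the ensemble on the test set
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))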
There you have it! The voting classifier slightly outperforms all the individual classifiers.
If all classifiers are able to estimate class probabilities (i.e., they have a predict_proba()
method), then you can tell Scikit-Learn to predict the class with the highest class probability,
averaged over all the individual classifiers. This is called soft voting. It often achieves higher
performance than hard voting because it gives more weight to highly confident votes. All you
need to do is replace voting="hard" with voting="soft" and ensure that all classifiers can
estimate class probabilities.
This is not the case for the SVC class by default, so you need to set its probability
hyperparameter to True (this will make the SVC class use cross-validation to estimate class
probabilities, slowing down training, and it will add a predict_proba() method).
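For reference, a soft-voting version of the ensemble above might look like the following (note probability=True for the SVC so that predict_proba() is available):

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC(probability=True)   # enables predict_proba() via cross-validation
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft')
voting_clf.fit(X_train, y_train)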
2. Discuss RandomForest classifier.
A Random Forest is an ensemble of Decision Trees, generally trained via the bagging method
(or sometimes pasting), typically with max_samples set to the size of the training set. Instead of
building a BaggingClassifier and passing it a DecisionTreeClassifier, you can instead use the
RandomForestClassifier class, which is more convenient and optimized for Decision Trees
(similarly, there is a RandomForestRegressor class for regression tasks). The following code
trains a Random Forest classifier with 500 trees (each limited to maximum 16 nodes), using all
available CPU cores:
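The code itself is not shown here; a sketch matching that description (using max_leaf_nodes to cap the size of each tree) would be:

from sklearn.ensemble import RandomForestClassifier

# 500 trees, each limited to at most 16 leaf nodes, using all available CPU cores
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)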
Extra-Trees When you are growing a tree in a Random Forest, at each node only a random
subset of the features is considered for splitting. It is possible to make trees even more random
by also using random thresholds for each feature rather than searching for the best possible
thresholds (like regular Decision Trees do). A forest of such extremely random trees is simply
called an Extremely Randomized Trees ensemble (or Extra-Trees for short). Once again, this
trades more bias for a lower variance. It also makes Extra-Trees much faster to train than regular
Random Forests since finding the best possible threshold for each feature at every node is one of
the most time-consuming tasks of growing a tree. An Extra-Trees classifier can be created using
Scikit-Learn’s ExtraTreesClassifier class. Its API is identical to the RandomForestClassifier
class. Similarly, the ExtraTreesRegressor class has the same API as the RandomForestRegressor
class.
Note: It is hard to tell in advance whether a RandomForestClassifier will perform better or worse
than an ExtraTreesClassifier. Generally, the only way to know is to try both and compare them
using cross-validation (and tuning the hyperparameters using grid search).
Feature Importance Lastly, if you look at a single Decision Tree, important features are likely to
appear closer to the root of the tree, while unimportant features will often appear closer to the
leaves (or not at all). It is therefore possible to get an estimate of a feature’s importance by
computing the average depth at which it appears across all trees in the forest. Scikit-Learn
computes this automatically for every feature after training. You can access the result using the
feature_importances_ variable. For example, the following code trains a RandomForestClassifier
on the iris dataset and outputs each feature’s importance. It seems that the most important
features are the petal length (44%) and width (42%), while sepal length and width are rather
unimportant in comparison (11% and 2%, respectively):
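The referenced snippet is not included in these notes; a minimal version might be:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])
# print each feature name alongside its estimated importance
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)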
Training a Random Forest classifier on the MNIST dataset and plotting each pixel’s importance:
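A possible sketch (assuming MNIST is fetched with fetch_openml and the 784 pixel importances are reshaped to the 28x28 image grid):

from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

mnist = fetch_openml('mnist_784', version=1, as_frame=False)
rnd_clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
rnd_clf.fit(mnist["data"], mnist["target"])
# each of the 784 pixels is a feature; reshape the importances into a 28x28 image
plt.imshow(rnd_clf.feature_importances_.reshape(28, 28), cmap='hot')
plt.colorbar()
plt.show()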
Random Forests are very handy to get a quick understanding of what features actually matter, in
particular if we need to perform feature selection.
The BaggingClassifier automatically performs soft voting instead of hard voting if the base
classifier can estimate class probabilities (i.e., if it has a predict_proba() method), which is the
case with Decision Tree classifiers.
Figure 7-5 compares the decision boundary of a single Decision Tree with the decision boundary
of a bagging ensemble of 500 trees (from the preceding code), both trained on the moons dataset.
As you can see, the ensemble’s predictions will likely generalize much better than the single
Decision Tree’s predictions: the ensemble has a comparable bias but a smaller variance (it makes
roughly the same number of errors on the training set, but the decision boundary is less
irregular).
Out-of-Bag Evaluation With bagging, some instances may be sampled several times for any
given predictor, while others may not be sampled at all. By default a BaggingClassifier samples
m training instances with replacement (bootstrap=True), where m is the size of the training set.
This means that only about 63% of the training instances are sampled on average for each
predictor. The remaining 37% of the training instances that are not sampled are called out-of-
bag (oob) instances. Note that they are not the same 37% for all predictors. Since a predictor
never sees the oob instances during training, it can be evaluated on these instances, without the
need for a separate validation set or cross-validation. You can evaluate the ensemble itself by
averaging out the oob evaluations of each predictor. In Scikit-Learn, you can set
oob_score=True when creating a BaggingClassifier to request an automatic oob evaluation after
training. The following code demonstrates this. The resulting evaluation score is available
through the oob_score_ variable:
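The snippet is not reproduced here; a minimal version on the moons training set from earlier might be:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=True, n_jobs=-1, oob_score=True)   # request an automatic oob evaluation
bag_clf.fit(X_train, y_train)
print(bag_clf.oob_score_)   # oob accuracy estimate (about 0.93 in the text)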
According to this oob evaluation, this BaggingClassifier is likely to achieve about 93.1%
accuracy on the test set. Let’s verify this:
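A possible verification step, assuming the bag_clf ensemble and the test split from above:

from sklearn.metrics import accuracy_score

y_pred = bag_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))   # about 0.936 according to the text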
We get 93.6% accuracy on the test set—close enough! The oob decision function for each
training instance is also available through the oob_decision_function_ variable. In this case
(since the base estimator has a predict_proba() method) the decision function returns the class
probabilities for each training instance. For example, the oob evaluation estimates that the
second training instance has a 60.6% probability of belonging to the positive class (and 39.4% of
belonging to the negative class):
The BaggingClassifier class supports sampling the features as well. This is controlled by two
hyperparameters: max_features and bootstrap_features. They work the same way as
max_samples and bootstrap, but for feature sampling instead of instance sampling. Thus, each
predictor will be trained on a random subset of the input features. This is particularly useful
when you are dealing with high-dimensional inputs (such as images). Sampling both training
instances and features is called the Random Patches method. Keeping all training instances (i.e.,
bootstrap=False and max_samples=1.0) but sampling features (i.e., bootstrap_features=True
and/or max_features smaller than 1.0) is called the Random Subspaces method.
Sampling features results in even more predictor diversity, trading a bit more bias for a lower
variance.
Random Patches samples both the training instances and the features. Setting the corresponding
parameters of BaggingClassifier performs this for us. Random Subspaces keeps all the instances
but samples the features.
With Random Subspaces, the estimators differ because each one is trained on a random subset of
the features. Again, such a solution is achievable by tuning the parameters of BaggingClassifier
and BaggingRegressor, setting max_features to a number less than 1.0, representing the fraction
of features to be chosen randomly for each model of the ensemble.
In Random Patches, by contrast, estimators are built on subsets of both the samples and the features.
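A sketch of both settings with BaggingClassifier (the 0.7 sampling ratios are illustrative values, not prescribed ones):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Random Patches: sample both training instances and features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=0.7, bootstrap=True,            # instance sampling
    max_features=0.7, bootstrap_features=True,  # feature sampling
    n_jobs=-1)

# Random Subspaces: keep all training instances, sample only the features
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=1.0, bootstrap=False,           # keep all instances
    max_features=0.7, bootstrap_features=True,  # feature sampling
    n_jobs=-1)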
5. What is Boosting? Explain the AdaBoost algorithm.
Boosting (originally called hypothesis boosting) refers to any Ensemble method that can
combine several weak learners into a strong learner. The general idea of most boosting methods
is to train predictors sequentially, each trying to correct its predecessor. There are many boosting
methods available, but by far the most popular are AdaBoost (short for Adaptive Boosting) and
Gradient Boosting. Let’s start with AdaBoost.
AdaBoost One way for a new predictor to correct its predecessor is to pay a bit more attention to
the training instances that the predecessor underfitted. This results in new predictors focusing
more and more on the hard cases. This is the technique used by AdaBoost. For example, to build
an AdaBoost classifier, a first base classifier (such as a Decision Tree) is trained and used to
make predictions on the training set. The relative weight of misclassified training instances is
then increased. A second classifier is trained using the updated weights and again it makes
predictions on the training set, weights are updated, and so on.
Figure 7-8 shows the decision boundaries of five consecutive predictors on the moons dataset (in
this example, each predictor is a highly regularized SVM classifier with an RBF kernel). The
first classifier gets many instances wrong, so their weights get boosted. The second classifier
therefore does a better job on these instances, and so on. The plot on the right represents the
same sequence of predictors except that the learning rate is halved (i.e., the misclassified
instance weights are boosted half as much at every iteration). As you can see, this sequential
learning technique has some similarities with Gradient Descent, except that instead of tweaking
a single predictor’s parameters to minimize a cost function, AdaBoost adds predictors to the
ensemble, gradually making it better.
Once all predictors are trained, the ensemble makes predictions very much like bagging or
pasting, except that predictors have different weights depending on their overall accuracy on the
weighted training set.
Note: There is one important drawback to this sequential learning technique: it cannot be
parallelized (or only partially), since each predictor can only be trained after the previous
predictor has been trained and evaluated. As a result, it does not scale as well as bagging or
pasting.
Note: If your AdaBoost ensemble is overfitting the training set, you can try reducing the number
of estimators or more strongly regularizing the base estimator.
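A possible AdaBoost sketch with Scikit-Learn's AdaBoostClassifier, using decision stumps as base estimators (the hyperparameter values are illustrative):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# 200 decision stumps (depth-1 trees) trained sequentially on reweighted data
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=200, learning_rate=0.5)
ada_clf.fit(X_train, y_train)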
Gradient Boosting Another very popular Boosting algorithm is Gradient Boosting. Just like
AdaBoost, Gradient Boosting works by sequentially adding predictors to an ensemble, each one
correcting its predecessor. However, instead of tweaking the instance weights at every iteration
like AdaBoost does, this method tries to fit the new predictor to the residual errors made by the
previous predictor. Let’s go through a simple regression example using Decision Trees as the
base predictors (of course Gradient Boosting also works great with regression tasks). This is
called Gradient Tree Boosting, or Gradient Boosted Regression Trees (GBRT). First, let’s fit a
DecisionTreeRegressor to the training set (for example, a noisy quadratic training set):
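The code is not reproduced in these notes; a sketch of the idea, generating a noisy quadratic dataset and fitting three trees in sequence on the residuals (parameter values are illustrative), might be:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# a noisy quadratic training set
np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(100)

# first tree fits the raw targets
tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)

# second tree fits the residual errors left by the first tree
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

# third tree fits the residuals left by the second tree
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)

# the ensemble predicts by summing the predictions of all the trees
X_new = np.array([[0.8]])
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))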
Clustering
Clustering is a data science technique in machine learning that groups similar rows in a data set.
After running a clustering technique, a new column appears in the data set to indicate the group
each row of data fits into best.
Why DBSCAN?
Partitioning methods (K-means, PAM clustering) and hierarchical clustering work for finding
spherical-shaped clusters or convex clusters. In other words, they are suitable only for compact
and well-separated clusters. Moreover, they are also severely affected by the presence of noise
and outliers in the data.
There are three types of algorithms for K-Medoids Clustering:
PAM (Partitioning Around Medoids)
CLARA (Clustering Large Applications)
CLARANS (Clustering Large Applications based upon Randomized Search)
Consider a data set containing non-convex clusters and outliers. Given such data, the k-means
algorithm has difficulties in identifying these clusters with arbitrary shapes.
1. eps: It defines the neighborhood around a data point, i.e., if the distance between two points is
lower than or equal to eps, they are considered neighbors. If the eps value is chosen too small,
a large part of the data will be considered outliers. If it is chosen very large, the clusters will
merge and the majority of the data points will end up in the same cluster. One way to find a
suitable eps value is based on the k-distance graph.
2. MinPts: The minimum number of neighbors (data points) within the eps radius. The larger the
dataset, the larger the value of MinPts that should be chosen. As a general rule, the minimum
MinPts can be derived from the number of dimensions D in the dataset as MinPts >= D+1, and
MinPts should be chosen to be at least 3.
In this algorithm, we have 3 types of data points.
Core Point: A point is a core point if it has more than MinPts points within eps.
Border Point: A point which has fewer than MinPts within eps but it is in the neighborhood of a
core point.
Noise or outlier: A point which is not a core point or border point.
1. Find all the neighboring points within eps and identify the core points, i.e., the points with more
than MinPts neighbors.
2. For each core point if it is not already assigned to a cluster, create a new cluster.
3. Find recursively all its density-connected points and assign them to the same cluster as the core
point.
Two points a and b are said to be density-connected if there exists a point c that has a sufficient
number of points in its neighborhood and both a and b are within eps distance of c. This is a
chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in
turn is a neighbor of a, then b is density-connected to a.
4. Iterate through the remaining unvisited points in the dataset. Those points that do not belong to
any cluster are noise.
Pseudocode For DBSCAN Clustering Algorithm
DBSCAN(dataset, eps, MinPts){
   # cluster index
   C = 0
   for each unvisited point p in dataset {
      mark p as visited
      # find neighbors of p
      N = find the neighboring points of p within eps
      if |N| < MinPts:
         mark p as noise
      else:
         C = C + 1
         add p to cluster C
         # expand the cluster through the neighborhood
         for each point p' in N {
            if p' is not visited:
               mark p' as visited
               N' = find the neighboring points of p' within eps
               if |N'| >= MinPts:
                  N = N U N'
            if p' is not a member of any cluster:
               add p' to cluster C
         }
   }
}
Implementation Of DBSCAN Algorithm Using Machine Learning In Python
Here, we’ll use the Python library sklearn to compute DBSCAN. We’ll also use the
matplotlib.pyplot library for visualizing clusters.
Import Libraries
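The import cell is not shown in these notes; a minimal set of imports consistent with the snippets below would be:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn import metrics
from sklearn.metrics import adjusted_rand_score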
Prepare dataset
We will create a dataset using sklearn for modeling; we use make_blobs to create the dataset.
# Load data in X
X, y_true = make_blobs(n_samples=300, centers=4,
cluster_std=0.50, random_state=0)
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Plot result: one colour per cluster, black for noise (label -1)
for k in set(labels):
    class_member_mask = (labels == k)
    xy = X[class_member_mask]
    if k == -1:
        plt.plot(xy[:, 0], xy[:, 1], 'o', color='black')  # noise points
    else:
        plt.plot(xy[:, 0], xy[:, 1], 'o')  # points of cluster k
plt.title('Clusters found by DBSCAN')
plt.show()
Output:
Cluster of dataset
We will use the Silhouette score and Adjusted rand score for evaluating clustering algorithms.
Silhouette’s score is in the range of -1 to 1. A score near 1 denotes the best meaning that the data
point i is very compact within the cluster to which it belongs and far away from the other
clusters. The worst value is -1. Values near 0 denote overlapping clusters.
The Adjusted Rand Score has a maximum value of 1. More than 0.9 denotes excellent cluster recovery,
and above 0.8 is a good recovery. Less than 0.5 is considered to be poor recovery.
# evaluation metrics
sc = metrics.silhouette_score(X, labels)
print("Silhouette Coefficient:%0.2f" % sc)
ari = adjusted_rand_score(y_true, labels)
print("Adjusted Rand Index: %0.2f" % ari)
Output:
Silhouette Coefficient:0.13
Adjusted Rand Index: 0.31
Black points represent outliers. By changing the eps and the MinPts, we can change the cluster
configuration.
Now the question that should be raised is –
When Should We Use DBSCAN Over K-Means In Clustering Analysis?
DBSCAN(Density-Based Spatial Clustering of Applications with Noise) and K-Means are both
clustering algorithms that group together data that have the same characteristics. However, they
work on different principles and are suitable for different types of data. We prefer to use
DBSCAN when the data is not spherical in shape or the number of classes is not known
beforehand.
Difference Between DBSCAN and K-Means.
DBSCAN: Clusters formed in DBSCAN can be of any arbitrary shape.
K-Means: Clusters formed in K-Means are spherical or convex in shape.
If we feed our model with an excessively large dataset (with a large no. of features/columns),
it gives rise to the problem of overfitting, wherein the model starts getting influenced by
outlier values and noise. This is called the Curse of Dimensionality.
The following graph represents the change in model performance with the increase in the
number of dimensions of the dataset. It can be observed that the model performance is best
only at an optimal number of dimensions, beyond which it starts decreasing.
Model Performance Vs No. of Dimensions (Features) – Curse of Dimensionality
Once the standardization is done, all the variables will be transformed to the same scale.
STEP 2: COMPUTE THE COVARIANCE MATRIX
The aim of this step is to understand how the variables of the input data set are varying from the
mean with respect to each other, or in other words, to see if there is any relationship between
them. Because sometimes, variables are highly correlated in such a way that they contain
redundant information. So, in order to identify these correlations, we compute the covariance
matrix.
STEP 3: COMPUTE THE EIGENVECTORS AND EIGENVALUES OF THE COVARIANCE
MATRIX TO IDENTIFY THE PRINCIPAL COMPONENTS
Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute from the
covariance matrix in order to determine the principal components of the data.
What you first need to know about eigenvectors and eigenvalues is that they always come in
pairs, so that every eigenvector has an eigenvalue. Also, their number is equal to the number of
dimensions of the data. For example, for a 3-dimensional data set, there are 3 variables, therefore
there are 3 eigenvectors with 3 corresponding eigenvalues.
Eigenvectors and eigenvalues are behind all the magic of principal components, because the
eigenvectors of the Covariance matrix are actually the directions of the axes along which there is
the most variance (most information), and these are what we call Principal Components. Eigenvalues
are simply the coefficients attached to the eigenvectors, giving the amount of variance carried
in each Principal Component.
Let’s suppose that our data set is 2-dimensional with 2 variables x,y and that the eigenvectors
and eigenvalues of the covariance matrix are as follows:
If we rank the eigenvalues in descending order, we get λ1>λ2, which means that the eigenvector
that corresponds to the first principal component (PC1) is v1 and the one that corresponds to the
second principal component (PC2) is v2.
After having the principal components, to compute the percentage of variance (information)
accounted for by each component, we divide the eigenvalue of each component by the sum of
eigenvalues. If we apply this on the example above, we find that PC1 and PC2 carry respectively
96 percent and 4 percent of the variance of the data.
STEP 4: CREATE A FEATURE VECTOR
As we saw in the previous step, computing the eigenvectors and ordering them by their
eigenvalues in descending order, allow us to find the principal components in order of
significance. In this step, what we do is, to choose whether to keep all these components or
discard those of lesser significance (of low eigenvalues), and form with the remaining ones a
matrix of vectors that we call Feature vector.
Continuing with the example from the previous step, we can either form a feature vector with
both of the eigenvectors v1 and v2:
Or discard the eigenvector v2, which is the one of lesser significance, and form a feature vector
with v1 only:
Discarding the eigenvector v2 will reduce dimensionality by 1, and will consequently cause a
loss of information in the final data set. But given that v2 was carrying only 4 percent of the
information, the loss will be therefore not important and we will still have 96 percent of the
information that is carried by v1.
In the previous steps, apart from standardization, you do not make any changes on the data, you
just select the principal components and form the feature vector, but the input data set remains
always in terms of the original axes (i.e, in terms of the initial variables).
In this step, which is the last one, the aim is to use the feature vector formed using the
eigenvectors of the covariance matrix, to reorient the data from the original axes to the ones
represented by the principal components (hence the name Principal Components Analysis). This
can be done by multiplying the transpose of the original data set by the transpose of the feature
vector.
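To make Steps 2 to 5 concrete, here is a small NumPy sketch (the variable names and toy data are illustrative) that standardizes a data matrix, computes the covariance matrix and its eigendecomposition, sorts the components by explained variance, and projects the data:

import numpy as np

X = np.random.rand(100, 3)                       # toy data: 100 samples, 3 variables

# Step 1: standardize each variable (zero mean, unit variance)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized variables
cov = np.cov(Z, rowvar=False)

# Step 3: eigenvectors and eigenvalues of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)           # eigh: the covariance matrix is symmetric

# sort by eigenvalue in descending order; fraction of variance per component
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
explained = eigvals / eigvals.sum()

# Step 4: keep the top-k eigenvectors as the feature vector
k = 2
feature_vector = eigvecs[:, :k]

# Step 5: project the data onto the principal components
projected = Z @ feature_vector                   # shape (100, k)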
Steps to Apply PCA in Python for Dimensionality Reduction
We will understand the step by step approach of applying Principal Component Analysis in
Python with an example. In this example, we will use the iris dataset, which is already present
in the sklearn library of Python.
Step-1: Import necessary libraries
All the necessary libraries required to load the dataset, pre-process it and then apply PCA on it
are mentioned below:
# Import necessary libraries
from sklearn import datasets # to retrieve the iris Dataset
import pandas as pd # to load the dataframe
from sklearn.preprocessing import StandardScaler # to standardize the features
from sklearn.decomposition import PCA # to apply PCA
import seaborn as sns # to plot the heat maps
Step-2: Load the dataset
After importing all the necessary libraries, we need to load the dataset. Now, the iris dataset is
already present in sklearn. First, we will load it and then convert it into a pandas data frame for
ease of use.
#Load the Dataset
iris = datasets.load_iris()
#convert the dataset into a pandas data frame
df = pd.DataFrame(iris['data'], columns = iris['feature_names'])
#display the head (first 5 rows) of the dataset
df.head()
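The standardization step itself is not shown in these notes; a minimal version, consistent with the scaled_data variable used below, would be:

#Standardize the features so that each has zero mean and unit variance
scaler = StandardScaler()
scaled_data = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
scaled_data.head()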
Now, we will check the correlation between the features of our scaled dataset using a heat map.
For this, we have already imported the seaborn library in Step-1. The correlation between various
features is given by the corr() function, and then the heat map is plotted by the heatmap()
function. The colour scale on the side of the heatmap helps determine the magnitude of the
correlation. In our example, we can clearly see that a darker shade represents less correlation
while a lighter shade represents more correlation. The diagonal of the heatmap represents the
correlation of a feature with itself, which is always 1.0; thus, the diagonal of the heatmap is of
the highest shade.
#Check the correlation between features without PCA
sns.heatmap(scaled_data.corr())
We can observe from the above heatmap that sepal length & petal length and petal length &
petal width have a high correlation. Thus, we evidently need to apply dimensionality reduction.
If you are already aware that your dataset needs dimensionality reduction – you can skip this
step.
Step-4: Applying Principal Component Analysis
We will apply PCA on the scaled dataset. For this Python offers yet another in-built class
called PCA which is present in sklearn.decomposition, which we have already imported in
step-1. We need to create an object of PCA and while doing so we also need to initialize
n_components – which is the number of principal components we want in our final dataset.
Here, we have taken n_components = 3, which means our final feature set will have 3
columns. We fit our scaled data to the PCA object which gives us our reduced dataset.
#Applying PCA
#Taking no. of Principal Components as 3
pca = PCA(n_components = 3)
pca.fit(scaled_data)
data_pca = pca.transform(scaled_data)
data_pca = pd.DataFrame(data_pca,columns=['PC1','PC2','PC3'])
data_pca.head()
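The code that produces the heatmap of the principal components referred to below is not shown; it could be as simple as:

#Check the correlation between the obtained Principal Components
sns.heatmap(data_pca.corr(), annot=True)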
The above heatmap clearly depicts that there is no correlation between the obtained principal
components (PC1, PC2, and PC3). Thus, we have moved from a higher-dimensional feature space
to a lower-dimensional feature space while ensuring that the correlation between the obtained
PCs is minimal. Hence, we have accomplished the objectives of PCA.
Advantages of Principal Component Analysis (PCA):
1. For efficient working of ML models, our feature set needs to have features with no correlation.
After implementing PCA on our dataset, all the Principal Components are independent:
there is no correlation among them.
2. A large number of features leads to the issue of overfitting in models. PCA reduces the
dimensions of the feature set, thereby reducing the chances of overfitting.
3. PCA helps us reduce the dimensions of our feature set; thus, the newly formed dataset
comprising Principal Components need less disk/cloud space for storage while retaining
maximum information.
Biological Neuron
As shown in the above diagram, a typical neuron consists of the following four parts with the
help of which we can explain its working −
Dendrites − They are tree-like branches, responsible for receiving information from the other
neurons a neuron is connected to. In a sense, they are like the ears of the neuron.
Soma − It is the cell body of the neuron and is responsible for processing the information
received from the dendrites.
Axon − It is just like a cable through which neurons send the information.
Synapses − It is the connection between the axon and other neuron dendrites.
ANN versus BNN
Before taking a look at the differences between Artificial Neural Network ANN and Biological
Neural Network BNN, let us take a look at the similarities based on the terminology between
these two.
Biological Neural Network (BNN)    Artificial Neural Network (ANN)
Soma                               Node
Dendrites                          Input
The following diagram represents the general model of ANN followed by its processing.
For the above general model of artificial neural network, the net input can be calculated as
follows −
The output can be calculated by applying the activation function over the net input.
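In symbols (a standard formulation, not reproduced in these notes), for inputs $x_1, \dots, x_m$ with weights $w_1, \dots, w_m$ and activation function $f$:

$$ y_{in} = \sum_{i=1}^{m} x_i\, w_i, \qquad y = f(y_{in}) $$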
Data preprocessing is an important step in the data mining process that involves cleaning and
transforming raw data to make it suitable for analysis. Some common steps in data
preprocessing include:
Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data,
such as missing values, outliers, and duplicates. Various techniques can be used for data
cleaning, such as imputation, removal, and transformation.
Data Integration: This involves combining data from multiple sources to create a unified
dataset. Data integration can be challenging as it requires handling data with different formats,
structures, and semantics. Techniques such as record linkage and data fusion can be used for
data integration.
Data Transformation: This involves converting the data into a suitable format for analysis.
Common techniques used in data transformation include normalization, standardization, and
discretization. Normalization is used to scale the data to a common range, while
standardization is used to transform the data to have zero mean and unit variance.
Discretization is used to convert continuous data into discrete categories.
Data Reduction: This involves reducing the size of the dataset while preserving the important
information. Data reduction can be achieved through techniques such as feature selection and
feature extraction. Feature selection involves selecting a subset of relevant features from the
dataset, while feature extraction involves transforming the data into a lower-dimensional space
while preserving the important information.
Data Discretization: This involves dividing continuous data into discrete categories or
intervals. Discretization is often used in data mining and machine learning algorithms that
require categorical data. Discretization can be achieved through techniques such as equal
width binning, equal frequency binning, and clustering.
Data Normalization: This involves scaling the data to a common range, such as between 0
and 1 or -1 and 1. Normalization is often used to handle data with different units and scales.
Common normalization techniques include min-max normalization, z-score normalization, and
decimal scaling.
Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy of the
analysis results. The specific steps involved in data preprocessing may vary depending on the
nature of the data and the analysis goals.
By performing these steps, the data mining process becomes more efficient and the results
become more accurate.
Preprocessing in Data Mining:
Data preprocessing is a data mining technique which is used to transform the raw data in a
useful and efficient format.
Steps Involved in Data Preprocessing:
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning is
done. It involves handling missing data, noisy data, etc.
Clustering:
This approach groups the similar data in a cluster. The outliers may be undetected or it will fall
outside the clusters.
2. Data Transformation:
This step is taken in order to transform the data in appropriate forms suitable for mining
process. This involves following ways:
1. Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to 1.0)
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the
mining process.
3. Discretization:
This is done to replace the raw values of numeric attribute by interval levels or conceptual
levels.
3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves reducing the size of the
dataset while preserving the important information. This is done to improve the efficiency of
data analysis and to avoid overfitting of the model. Some common steps involved in data
reduction are:
Feature Selection: This involves selecting a subset of relevant features from the dataset.
Feature selection is often performed to remove irrelevant or redundant features from the
dataset. It can be done using various techniques such as correlation analysis, mutual
information, and principal component analysis (PCA).
Feature Extraction: This involves transforming the data into a lower-dimensional space
while preserving the important information. Feature extraction is often used when the original
features are high-dimensional and complex. It can be done using techniques such as PCA,
linear discriminant analysis (LDA), and non-negative matrix factorization (NMF).
Sampling: This involves selecting a subset of data points from the dataset. Sampling is often
used to reduce the size of the dataset while preserving the important information. It can be
done using techniques such as random sampling, stratified sampling, and systematic sampling.
Clustering: This involves grouping similar data points together into clusters. Clustering is
often used to reduce the size of the dataset by replacing similar data points with a
representative centroid. It can be done using techniques such as k-means, hierarchical
clustering, and density-based clustering.
Compression: This involves compressing the dataset while preserving the important
information. Compression is often used to reduce the size of the dataset for storage and
transmission purposes. It can be done using techniques such as wavelet compression, JPEG
compression, and gzip compression.
As we can see from the above info, our dataset has 9 columns and each column has
768 values. There are no null values in the dataset.
We can also check the null values using df.isnull()
df.isnull().sum()
Output:
Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
Step 3: Statistical Analysis
In statistical analysis, we first use df.describe(), which gives a descriptive overview of the dataset.
df.describe()
Output:
Data summary
The above table shows the count, mean, standard deviation, min, 25%, 50%, 75%, and max
values for each column. When we carefully observe the table, we find that the Insulin,
Pregnancies, BMI, and BloodPressure columns have outliers.
Let’s plot the boxplot for each column for easy understanding.
Step 4: Check the outliers:
# Box Plots
fig, axs = plt.subplots(9,1,dpi=95, figsize=(7,17))
i=0
for col in df.columns:
axs[i].boxplot(df[col], vert=False)
axs[i].set_ylabel(col)
i+=1
plt.show()
Output:
Boxplots
From the above boxplots, we can clearly see that almost every column has some outliers.
Drop the outliers
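The outlier-removal cell is not reproduced here; one common approach (an illustrative sketch, not necessarily the exact code used) is IQR-based filtering:

# Drop rows that fall outside 1.5*IQR of any column (illustrative IQR rule)
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
df = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]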
Step 5: Correlation
#correlation
corr = df.corr()
plt.figure(dpi=130)
sns.heatmap(df.corr(), annot=True, fmt= '.2f')
plt.show()
Output:
Correlation
corr['Outcome'].sort_values(ascending = False)
Output:
Outcome 1.000000
Glucose 0.466581
BMI 0.292695
Age 0.238356
Pregnancies 0.221898
DiabetesPedigreeFunction 0.173844
Insulin 0.130548
SkinThickness 0.074752
BloodPressure 0.0
Check Outcomes Proportionality
# Outcome 0 (no diabetes) is the majority class, so it comes first in value_counts()
plt.pie(df.Outcome.value_counts(),
        labels=['Not Diabetes', 'Diabetes'],
        autopct='%.f', shadow=True)
plt.title('Outcome Proportionality')
plt.show()
Output:
Outcome Proportionality
# Min-Max normalization to [0, 1]; assumes X is the feature matrix (e.g., df.drop(columns=['Outcome']))
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
# learning the statistical parameters for each of the data and transforming
rescaledX = scaler.fit_transform(X)
rescaledX[:5]
Output:
array([[0.353, 0.744, 0.59 , 0.354, 0. , 0.501, 0.234, 0.483],
[0.059, 0.427, 0.541, 0.293, 0. , 0.396, 0.117, 0.167],
[0.471, 0.92 , 0.525, 0. , 0. , 0.347, 0.254, 0.183],
[0.059, 0.447, 0.541, 0.232, 0.111, 0.419, 0.038, 0. ],
[0. , 0.688, 0.328, 0.354, 0.199, 0.642, 0.944, 0.2 ]])
Standardization
Standardization is a useful technique to transform attributes with a Gaussian distribution and
differing means and standard deviations to a standard Gaussian distribution with a mean of 0
and a standard deviation of 1.
We can standardize data using scikit-learn with the StandardScaler class.
It works well when the features have a normal distribution or when the algorithm being used is
not sensitive to the scale of the features.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
rescaledX[:5]
Output:
array([[ 0.64 , 0.848, 0.15 , 0.907, -0.693, 0.204, 0.468, 1.426],
[-0.845, -1.123, -0.161, 0.531, -0.693, -0.684, -0.365, -0.191],
[ 1.234, 1.944, -0.264, -1.288, -0.693, -1.103, 0.604, -0.106],
[-0.845, -0.998, -0.161, 0.155, 0.123, -0.494, -0.921, -1.042],
[-1.142, 0.504, -1.505, 0.907, 0.766, 1.41 , 5.485, ...]])
# calculating hamming distance between bit strings
def hamming_distance(a, b):
    # fraction of positions at which the two bit strings differ
    return sum(abs(e1 - e2) for e1, e2 in zip(a, b)) / len(a)

# define data
row1 = [0, 0, 0, 0, 0, 1]
row2 = [0, 0, 0, 0, 1, 0]
# calculate distance
dist = hamming_distance(row1, row2)
print(dist)
0.3333333333333333
https://www.w3schools.com/python/python_ml_k-means.asp
---------------------------------------------------------------
K means
https://blog.floydhub.com/introduction-to-k-means-clustering-in-python-with-scikit-learn/
We can use make_blobs from the sklearn.datasets module for doing this.
# Imports
from sklearn.datasets import make_blobs
# Euclidean distance between a centroid and every data point
def calculate_distance(centroid, X, Y):
    distances = []
    c_x, c_y = centroid
    # Iterate over the data points and calculate the distance using the
    # given formula
    for x, y in list(zip(X, Y)):
        root_diff_x = (x - c_x) ** 2
        root_diff_y = (y - c_y) ** 2
        distance = np.sqrt(root_diff_x + root_diff_y)
        distances.append(distance)
    return distances
We can now apply this function to the data points and assign the results in the DataFrame
accordingly.
# Calculate the distance and assign them to the DataFrame accordingly
data['C1_Distance'] = calculate_distance(c1, data.X_value, data.Y_value)
data['C2_Distance'] = calculate_distance(c2, data.X_value, data.Y_value)
data['C3_Distance'] = calculate_distance(c3, data.X_value, data.Y_value)
Snapshot of the Euclidean distances between the data points and the centroids
Time to study the next step in the algorithm.
Compare, assign, mean and repeat
This is fundamentally the last step of the K-Means clustering algorithm. Once you have
the distances between the data points and the centroids, you compare the distances and take
the smallest ones. The centroid to which the distance for a particular data point is the
smallest, that centroid gets assigned as the cluster for that particular data point. Let’s do this
programmatically.
# Assign each point to the centroid with the minimum distance
data['Cluster'] = data[['C1_Distance', 'C2_Distance', 'C3_Distance']].idxmin(axis=1)
data['Cluster'] = data['Cluster'].map(
    {'C1_Distance': 'C1', 'C2_Distance': 'C2', 'C3_Distance': 'C3'})
Mean update in K-Means (n denotes the number of data points belonging in a cluster)
The following lines of code does this for you:
# Calculate the coordinates of the new centroid from cluster 1
x_new_centroid1 = data[data['Cluster']=='C1']['X_value'].mean()
y_new_centroid1 = data[data['Cluster']=='C1']['Y_value'].mean()