MC4301 - ML Unit 2 (Model Evaluation and Feature Engineering)
Model Selection
The challenge of applied machine learning, therefore, becomes how to choose among
a range of different models that you can use for your problem.
Naively, you might believe that model performance is sufficient, but you should also
consider other concerns, such as how long the model takes to train or how easy it is to
explain to project stakeholders. These concerns become more pressing if a chosen
model must be used operationally for months or years.
Model selection is the process of selecting one final machine learning model from
among a collection of candidate machine learning models for a training dataset.
Model selection is a process that can be applied both across different types of models
(e.g. logistic regression, SVM, KNN, etc.) and across models of the same type
configured with different model hyperparameters (e.g. different kernels in an SVM).
For example, we may have a dataset for which we are interested in developing a
classification or regression predictive model. We do not know beforehand which
model will perform best on this problem, as this cannot be known in advance.
Therefore, we fit and evaluate a suite of different models on the problem.
Model selection is the process of choosing one of the models as the final model that
addresses the problem.
For example, we evaluate or assess candidate models in order to choose the best one,
and this is model selection. Whereas once a model is chosen, it can be evaluated in
order to communicate how well it is expected to perform in general; this is model
assessment.
Considerations for Model Selection
Fitting models is relatively straightforward, although selecting among them is the true
challenge of applied machine learning.
All models have some predictive error, given the statistical noise in the data, the
incompleteness of the data sample, and the limitations of each different model type.
Therefore, the notion of a perfect or best model is not useful. Instead, we must seek a
model that is “good enough.”
The project stakeholders may have specific requirements, such as maintainability and
limited model complexity. As such, a model that has lower skill but is simpler and
easier to understand may be preferred.
Alternately, if model skill is prized above all other concerns, then the ability of the
model to perform well on out-of-sample data will be preferred regardless of the
computational complexity involved.
Therefore, a “good enough” model may refer to many things and is specific to your
project.
Note that we are not selecting a fitted model, as all candidate models will eventually be
discarded. This is because once we choose a model, we will fit a new final model on all
available data and start using it to make predictions.
Therefore, are we choosing among algorithms used to fit the models on the training
dataset?
Some algorithms require specialized data preparation in order to best expose the
structure of the problem to the learning algorithm. Therefore, we must go one step
further and consider model selection as the process of selecting among model
development pipelines.
Each pipeline may take in the same raw training dataset and output a model that can
be evaluated in the same manner, but may require different or overlapping
computational steps, such as:
Data filtering.
Data transformation.
Feature selection.
Feature engineering.
And more…
The best approach to model selection requires “sufficient” data, which may be nearly
infinite depending on the complexity of the problem.
In this ideal situation, we would split the data into training, validation, and test sets,
then fit candidate models on the training set, evaluate and select them on the
validation set, and report the performance of the final model on the test set.
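A rough sketch of this ideal split, assuming Python with scikit-learn and an arbitrary 60/20/20 ratio (the dataset and ratio below are illustrative, not part of the original notes):

# Illustrative sketch: split data into training, validation and test sets (assumed 60/20/20)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# First split off the test set (20% of the data)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Then split the remainder into training (60%) and validation (20%) sets
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))

Candidate models would be fit on X_train, compared on X_val, and the final chosen model reported on X_test.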
This is impractical on most predictive modeling problems given that we rarely have
sufficient data, or are able to even judge what would be sufficient.
Instead, there are two main classes of techniques to approximate the ideal case of
model selection: probabilistic measures and resampling methods.
Probabilistic Measures
Probabilistic measures involve analytically scoring a candidate model using both its
performance on the training dataset and the complexity of the model.
It is known that training error is optimistically biased, and therefore is not a good
basis for choosing a model. The performance can be penalized based on how
optimistic the training error is believed to be. This is typically achieved using
algorithm-specific methods, often linear, that penalize the score based on the
complexity of the model.
These methods involve the addition of a penalty term to compensate for the over-fitting
of more complex models.
A model with fewer parameters is less complex, and because of this, is preferred
because it is likely to generalize better on average.
Probabilistic measures are appropriate when using simpler linear models like linear
regression or logistic regression, where the calculation of the model complexity penalty
(e.g. in-sample bias) is known and tractable.
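For example, information criteria such as AIC and BIC score a model by combining its training fit with a complexity penalty. A minimal sketch, assuming statsmodels and a synthetic dataset (lower AIC/BIC is better):

# Sketch of a probabilistic measure: compare candidate linear models by AIC/BIC
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=200)   # third column is pure noise

for k in (1, 2, 3):                      # candidate models using the first k features
    model = sm.OLS(y, sm.add_constant(X[:, :k])).fit()
    print(f"features={k}  AIC={model.aic:.1f}  BIC={model.bic:.1f}")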
Resampling Methods
Resampling methods seek to estimate the performance of a model (or more precisely,
the model development process) on out-of-sample data.
This is achieved by splitting the training dataset into a sub-training set and a test set,
fitting a model on the sub-training set, and evaluating it on the test set. This process
may then be repeated multiple times, and the mean performance across the trials is reported.
Most of the time probabilistic measures are not available, therefore resampling
methods are used.
By far the most popular is the cross-validation family of methods that includes many
subtypes.
Probably the simplest and most widely used method for estimating
prediction error is cross-validation.
An example is the widely used k-fold cross-validation that splits the training dataset
into k folds where each example appears in a test set only once.
Another is leave-one-out cross-validation (LOOCV), where the test set is comprised of a single
sample and each sample is given an opportunity to be the test set, requiring N (the
number of samples in the training set) models to be constructed and evaluated.
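A minimal sketch of both schemes, assuming scikit-learn and a toy array:

# Sketch: k-fold and leave-one-out splits with scikit-learn (toy data)
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(20).reshape(10, 2)                 # 10 samples, 2 features

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kfold.split(X):
    print("k-fold test indices:", test_idx)      # each sample appears in a test fold exactly once

loo = LeaveOneOut()
print("LOOCV builds", loo.get_n_splits(X), "models, one per sample")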
Training a Model
Training a model is an iterative process called “model fitting”. The quality of the
training and validation datasets is critical to the accuracy of the model.
Supervised learning is possible when the training data contains both the input and
output values. Each set of data that has the inputs and the expected output is called a
supervisory signal. The training is done based on the deviation of the processed result
from the documented result when the inputs are fed into the model.
The process of training an ML model involves providing an ML algorithm (that is, the
learning algorithm) with training data to learn from. The term ML model refers to the
model artifact that is created by the training process.
The training data must contain the correct answer, which is known as a target or
target attribute. The learning algorithm finds patterns in the training data that map the
input data attributes to the target (the answer that you want to predict), and it outputs
an ML model that captures these patterns.
You can use the ML model to get predictions on new data for which you do not know
the target. For example, let's say that you want to train an ML model to predict if an
email is spam or not spam. You would provide Amazon ML with training data that
contains emails for which you know the target (that is, a label that tells whether an
email is spam or not spam). Amazon ML would train an ML model by using this data,
resulting in a model that attempts to predict whether new email will be spam or not
spam.
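Outside Amazon ML, the same idea can be sketched with any library. The hypothetical example below, assuming scikit-learn and made-up emails and labels, trains a classifier from labelled data and predicts the target for a new email:

# Hedged sketch: train a classifier from labelled examples, then predict on new data
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

emails = ["win a free prize now", "meeting agenda for monday",
          "cheap pills free offer", "project status report attached"]
labels = [1, 0, 1, 0]                    # 1 = spam, 0 = not spam (the target)

vec = CountVectorizer()
X = vec.fit_transform(emails)            # map raw text to input attributes
model = LogisticRegression().fit(X, labels)

print(model.predict(vec.transform(["free prize offer"])))   # predict the target for an unseen email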
Types of ML Models
Binary Classification Model
ML models for binary classification problems predict a binary outcome (one of two
possible classes). To train binary classification models, Amazon ML uses the
industry-standard learning algorithm known as logistic regression.
Regression Model
ML models for regression problems predict a numeric value. For training regression
models, Amazon ML uses the industry-standard learning algorithm known as linear
regression.
Examples of Regression Problems
Training Process
Training Parameters
Typically, machine learning algorithms accept parameters that can be used to control
certain properties of the training process and of the resulting ML model. In Amazon
Machine Learning, these are called training parameters. You can set these parameters
using the Amazon ML console, API, or command line interface (CLI). If you do not
set any parameters, Amazon ML will use default values that are known to work well
for a large range of machine learning tasks.
Maximum model size
Maximum number of passes over the training data
Shuffle type
Regularization type
Regularization amount
The maximum model size is the total size, in units of bytes, of patterns that Amazon
ML creates during the training of an ML model.
If Amazon ML can't find enough patterns to fill the model size, it creates a smaller
model. For example, if you specify a maximum model size of 100 MB, but Amazon
ML finds patterns that total only 50 MB, the resulting model will be 50 MB. If
Amazon ML finds more patterns than will fit into the specified size, it enforces a
maximum cut-off by trimming the patterns that least affect the quality of the learned
model.
Choosing the model size allows you to control the trade-off between a model's
predictive quality and the cost of use. Smaller models can cause Amazon ML to
remove many patterns to fit within the maximum size limit, affecting the quality of
predictions. Larger models, on the other hand, cost more to query for real-time
predictions.
For best results, Amazon ML may need to make multiple passes over your data to
discover patterns. By default, Amazon ML makes 10 passes, but you can change the
default by setting a number up to 100. Amazon ML keeps track of the quality of
patterns (model convergence) as it goes along, and automatically stops the training
when there are no more data points or patterns to discover. For example, if you set the
number of passes to 20, but Amazon ML discovers that no new patterns can be found
by the end of 15 passes, then it will stop the training at 15 passes.
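Amazon ML's internals are not shown here, but the analogous idea in other libraries is a cap on the number of passes (epochs) combined with early stopping. A sketch assuming scikit-learn's SGDClassifier and synthetic data:

# Analogy sketch (not Amazon ML): cap the number of passes over the data and stop
# early once the model stops improving, using scikit-learn's SGDClassifier.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

clf = SGDClassifier(max_iter=20,                       # at most 20 passes over the data
                    early_stopping=True, n_iter_no_change=3, tol=1e-3,
                    random_state=0)
clf.fit(X, y)
print("passes actually used:", clf.n_iter_)            # may stop before max_iter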
In Amazon ML, you must shuffle your training data. Shuffling mixes up the order of
your data so that the SGD algorithm doesn't encounter one type of data for too many
observations in succession. For example, suppose you are training an ML model to predict a
product type, and your training data includes movie, toy, and video game product
types. If you sorted the data by the product type column before uploading it, the
algorithm would see the data alphabetically by product type. The algorithm sees all of your
data for movies first, and your ML model begins to learn patterns for movies. Then,
when your model encounters data on toys, every update that the algorithm makes
would fit the model to the toy product type, even if those updates degrade the patterns
that fit movies. This sudden switch from movie to toy type can produce a model that
doesn't learn how to predict product types accurately.
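A minimal sketch of the same idea outside Amazon ML, assuming scikit-learn's shuffle utility and a toy dataset sorted by class:

# Sketch: shuffle class-sorted training data before SGD-style training
import numpy as np
from sklearn.utils import shuffle

X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array(["movie", "movie", "toy", "toy", "video game", "video game"])  # sorted by class

X_shuffled, y_shuffled = shuffle(X, y, random_state=0)   # mix the class order
print(y_shuffled)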
Regularization helps prevent linear models from overfitting training data examples by
penalizing extreme weight values. L1 regularization reduces the number of features
used in the model by pushing the weight of features that would otherwise have very
small weights to zero. L1 regularization produces sparse models and reduces the
amount of noise in the model. L2 regularization results in smaller overall weight
values, which stabilizes the weights when there is high correlation between the
features. You can control the amount of L1 or L2 regularization by using the
Regularization amount parameter. Specifying an extremely large Regularization
amount value can cause all features to have zero weight.
Selecting and tuning the optimal regularization value is an active subject in machine
learning research. You will probably benefit from selecting a moderate amount of L2
regularization, which is the default in the Amazon ML console. Advanced users can
choose among three types of regularization (none, L1, or L2) and the amount. For more
information about regularization, see the Wikipedia article Regularization (mathematics).
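As a general (non-Amazon ML) illustration, the sketch below assumes scikit-learn, where the regularization type is set with the penalty parameter and the amount with C (smaller C means a stronger penalty):

# Sketch: L1 vs. L2 regularization in a linear model
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2 = LogisticRegression(penalty="l2", C=0.1).fit(X, y)

print("non-zero weights with L1:", np.sum(l1.coef_ != 0))   # sparse model
print("non-zero weights with L2:", np.sum(l2.coef_ != 0))   # small but mostly non-zero weights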
The following table lists the Amazon ML training parameters, along with the default values
and the allowable range for each.

(Table not reproduced here; its columns are Training Parameter, Type, Default Value, and Description.)
Creating an ML Model
After you've created a datasource, you are ready to create an ML model. If you use
the Amazon Machine Learning console to create a model, you can choose to use the
default settings or customize your model by applying custom options.
Split the input data to use the first 70 percent for training and use the
remaining 30 percent for evaluation
Choose a training/evaluation splitting ratio other than the default 70/30 ratio or
provide another datasource that you have already prepared for evaluation.
You can also choose the default values for any of these settings.
If you've already created a model using the default options and want to improve your
model's predictive performance, use the Custom option to create a new model with
some customized settings. For example, you might add more feature transformations
to the recipe or increase the number of passes in the training parameter.
A machine learning model is a program that can find patterns or make decisions from
a previously unseen dataset. For example, in natural language processing, machine
learning models can parse and correctly recognize the intent behind previously
unheard sentences or combinations of words. In image recognition, a machine
learning model can be taught to recognize objects - such as cars or dogs. A machine
learning model can perform such tasks by having it ‘trained’ with a large dataset.
During training, the machine learning algorithm is optimized to find certain patterns
or outputs from the dataset, depending on the task. The output of this process - often a
computer program with specific rules and data structures - is called a machine
learning model.
The process of running a machine learning algorithm on a dataset (called training data)
and optimizing the algorithm to find certain patterns or outputs is called model
training. The resulting function with rules and data structures is called the trained
machine learning model.
What is Unsupervised Machine Learning?
In unsupervised machine learning, the algorithm is provided an input dataset, but not
rewarded or optimized to specific outputs, and instead trained to group objects by
common characteristics. For example, recommendation engines on online stores rely
on unsupervised machine learning, specifically a technique called clustering.
What is Reinforcement Learning?
In reinforcement learning, the algorithm is made to train itself using many trial and
error experiments. Reinforcement learning happens when the algorithm interacts
continually with the environment, rather than relying on training data. One of the
most popular examples of reinforcement learning is autonomous driving.
There are many machine learning models, and almost all of them are based on certain
machine learning algorithms. Popular classification and regression algorithms fall
under supervised machine learning, and clustering algorithms are generally deployed
in unsupervised machine learning scenarios.
Hierarchical Clustering: Hierarchical clustering builds a tree of nested clusters
without having to specify the number of clusters.
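A minimal sketch of hierarchical (agglomerative) clustering, assuming scikit-learn and synthetic blob data:

# Sketch: agglomerative (hierarchical) clustering on toy data
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, _ = make_blobs(n_samples=60, centers=3, random_state=0)

# distance_threshold=0 with n_clusters=None builds the full tree of nested clusters
tree = AgglomerativeClustering(n_clusters=None, distance_threshold=0).fit(X)
print("merges in the hierarchy:", tree.children_.shape[0])

# Cutting the tree at a chosen distance yields flat clusters without pre-setting their number
flat = AgglomerativeClustering(n_clusters=None, distance_threshold=10.0).fit(X)
print("clusters found:", flat.n_clusters_)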
Regression in data science and machine learning is a statistical method that enables
predicting outcomes based on a set of input variables. The outcome is often a variable
that depends on a combination of the input variables.
What is a Classifier in Machine Learning?
How Many Machine Learning Models Are There?
Many! Machine learning is an evolving field and there are always more machine
learning models being developed.
The machine learning model most suited for a specific situation depends on the
desired outcome. For example, to predict the number of vehicle purchases in a city
from historical data, a supervised learning technique such as linear regression might
be most useful. On the other hand, to identify if a potential customer in that city
would purchase a vehicle, given their income and commuting history, a decision tree
might work best.
Model Deployment
Model deployment is the process of making a machine learning model available for
use in a target environment, for testing or production. The model is usually
integrated with other applications in the environment (such as databases and UI)
through APIs. Deployment is the stage after which an organization can actually make
a return on the heavy investment made in model development.
Model Interpretability
Some models, like logistic regression, are considered to be fairly straightforward and
therefore highly interpretable, but as you add features or use more complicated
machine learning models such as deep learning, interpretability gets more and more
difficult.
Evaluating Performance of a Model
Training set – this refers to a subset of a dataset used to build predictive models. It
includes a set of input examples that will be used to train a model by adjusting its
parameters.
Test set – this is also known as unseen data. It is the final evaluation that a model
undergoes after the training phase. A test set is a subset of a dataset used to assess the
likely future performance of a model. For example, if a model fits the training set better
than the test set, overfitting is likely present.
Overfitting – refers to when a model contains more parameters than can be accounted
for by the dataset. Noisy data also contributes to overfitting. The generalization of such
models is unreliable since the model learns more than it is meant to from the dataset.
Machine learning has become integral to our daily lives. We interact with some form
of machine learning every single day. Since we truly depend on machine learning
models for various reasons, it’s important to have models that provide accurate and
trustworthy predictions for their respective use cases. We must always test how a
model generalizes on unseen data.
For example, in an enterprise setting, these models need to offer real value to the
business by producing the highest levels of performance. But how do we evaluate the
performance of a model? For classification problems, a very common and obvious
answer is to measure its accuracy.
The techniques to evaluate the performance of a model can be divided into two parts:
cross-validation and holdout. Both these techniques make use of a test set to assess
model performance.
Cross validation
Let’s explore how.
First, we split the dataset into groups of instances equal in size. These groups are
called folds. The model to be evaluated is trained on all the folds except one. After
training, we test the model on the fold that was excluded. This process is then
repeated over and over again, depending on the number of folds.
If there are six folds, we will repeat the process six times. The reason for the
repetition is that each fold gets to be excluded and act as the test set. Last, we measure
the average performance across all folds to get an estimation of how effective the
algorithm is on a problem.
Suppose instead there are four folds. The model will be trained and tested four times.
Say the first round trains on folds 1, 2 and 3; the testing will be on fold 4. For the
second round, it may train on folds 1, 2 and 4 and test on fold 3. For the third, it may
train on folds 1, 3 and 4 and test on fold 2.
The last round will train on folds 2, 3 and 4 and test on fold 1. The interchange between
training and test data makes this method very effective. However, compared to the
holdout technique, cross-validation takes more time to run and uses more
computational resources.
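A compact sketch of averaging performance across folds, assuming scikit-learn's cross_val_score, four folds, and the Iris data:

# Sketch: mean performance across folds with cross-validation
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=4)
print(scores, "mean accuracy:", scores.mean())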
Holdout
It’s important to get an unbiased estimate of model performance. This is exactly what
the holdout technique offers. To get this unbiased estimate, we test a model on data
different from the data we trained it on. This technique divides a dataset into three
subsets: training, validation, and test sets.
From the terms we defined at the start of the article, we know that the training set
helps the model make predictions and that the test set assesses the performance of the
model. The validation set also helps to assess the performance of the model by
providing an environment to fine-tune the parameters of the model. From this, we
select the best performing model.
The holdout method is ideal when dealing with a very large dataset; it helps prevent
model overfitting and incurs lower computational costs.
When a function fits too tightly to a set of data points, an error known as overfitting
occurs. As a result, a model performs poorly on unseen data. To detect overfitting, we
could first split our dataset into training and test sets. We then monitor the
performance of the model on both training data and test data.
If our model offers superior performance on the training set when compared to the test
set, there’s a good chance overfitting is present. For instance, a model might offer
90% accuracy on the training set yet give 50% on the test set.
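A small sketch of this check, assuming scikit-learn, synthetic noisy data, and an unpruned decision tree that tends to overfit:

# Sketch: detect overfitting by comparing training and test accuracy
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)   # unpruned tree overfits easily
print("train accuracy:", model.score(X_train, y_train))                # typically near 1.0
print("test accuracy: ", model.score(X_test, y_test))                  # noticeably lower => overfitting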
Predictions for classification problems yield four types of outcomes: true positives,
true negatives, false positives, and false negatives. We’ll define them later on. We
look at a few metrics for classification problems.
Classification accuracy
The most common evaluation metric for classification problems is accuracy. It’s taken
as the number of correct predictions against the total number of predictions made (or
input samples). However, as much as accuracy is used to evaluate a model, it’s not a
clear indicator of model performance as we stated earlier.
Classification accuracy works best if the samples belonging to each class are equal in
number. Consider a scenario with 97% samples from class X and 3% from class Y in
a training set. A model can very easily achieve 97% training accuracy by predicting
each training sample in class X.
Testing the same model on a test set with 55% samples of X and 45% samples of Y,
the test accuracy is reduced to 55%. This is why classification accuracy is not a clear
indicator of performance. It provides a false sense of attaining high levels of accuracy.
Confusion matrix
The confusion matrix forms the basis for the other types of classification metrics. It’s
a matrix that fully describes the performance of the model. A confusion matrix gives
an in-depth breakdown of the correct and incorrect classifications of each class.
Confusion Matrix

The four terms represented in a confusion matrix are very important:
True Positive (TP) – the model predicts the positive class and the actual class is positive.
True Negative (TN) – the model predicts the negative class and the actual class is negative.
False Positive (FP) – the model predicts the positive class but the actual class is negative.
False Negative (FN) – the model predicts the negative class but the actual class is positive.
From the definitions of the four terms above, the takeaway is that it's important to
maximize true positives and true negatives. False positives and false negatives represent
misclassifications that could be costly in real-world applications. Consider instances
of misdiagnosis in a medical deployment of a model.
A model may wrongly predict that a healthy person has cancer. It may also classify
someone who actually has cancer as cancer-free. Both these outcomes would have
unpleasant consequences in terms of the well being of the patients after being
diagnosed (or finding out about the misdiagnosis), treatment plans as well as expenses.
Therefore it’s important to minimize false negatives and false positives.
In the confusion matrix, the cells on the main diagonal represent cases where the model
makes the correct prediction, while the off-diagonal cells represent cases where the model
makes wrong predictions. The rows of the matrix represent the actual classes while the
columns represent the predicted classes.
We can calculate accuracy from the confusion matrix. The accuracy is given by summing
the values on the main diagonal (the correct predictions) and dividing by the total number
of observations.
That translates to: Accuracy = Total Number of Correct Predictions / Total Number of
Observations
Since the confusion matrix visualizes the four possible outcomes of classification
mentioned above, aside from accuracy, we have insight into precision, recall, and
ultimately, F-score. They can easily be calculated from the matrix. Precision, recall,
and F-score are defined in the section below.
F-score
F-score is a metric that incorporates both the precision and recall of a test to
determine the score. It is defined as the harmonic mean of recall and precision:
F-score = 2 x (Precision x Recall) / (Precision + Recall). F-score is also known as
F-measure or F1 score.
Precision refers to the number of true positives divided by the total number of positive
results predicted by a classifier, i.e. Precision = TP / (TP + FP). Simply put, precision
aims to understand what fraction of all positive predictions were actually correct.
On the other hand, recall is the number of true positives divided by all the samples
that should have been predicted as positive, i.e. Recall = TP / (TP + FN). Recall aims
to capture what fraction of the actual positives were identified correctly.
In addition to robustness, the F-score shows us how precise a model is by letting us
know how many correct classifications are made. The F-score ranges between 0 and 1.
The higher the F-score, the greater the performance of the model.
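A short sketch computing these metrics from predictions, assuming scikit-learn and made-up labels:

# Sketch: confusion matrix, accuracy, precision, recall and F1 from predictions
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))                 # rows = actual classes, columns = predicted
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))    # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))       # TP / (TP + FN)
print("F1 score :", f1_score(y_true, y_pred))           # harmonic mean of precision and recall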
ROC Curve and AUC
ROC curve
True Positive Rate (TPR) is a synonym for recall and is therefore defined as
TPR = TP / (TP + FN). False Positive Rate (FPR) is defined as FPR = FP / (FP + TN).
An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the
classification threshold classifies more items as positive, thus increasing both False
Positives and True Positives.
AUC stands for "Area under the ROC Curve." That is, AUC measures the entire
two-dimensional area underneath the entire ROC curve (think integral calculus) from
(0,0) to (1,1).
Figure 5. AUC (Area under the ROC Curve).
AUC represents the probability that the model ranks a random positive example higher
than a random negative example.
AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an
AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.
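A brief sketch, assuming scikit-learn and synthetic data, of computing ROC curve points and AUC from predicted probabilities:

# Sketch: ROC curve points and AUC from predicted probabilities
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, probs)    # TPR vs. FPR at different thresholds
print("AUC:", roc_auc_score(y_test, probs))        # area under the ROC curve, between 0 and 1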
Metrics for regression problems
Classification models deal with discrete data. The already covered metrics are ideal
for classification tasks since they are concerned with whether a prediction is correct.
There is no in-between.
Regression models, on the other hand, deal with continuous data. Predictions are in a
continuous range. This is the distinction between the metrics for classification and
regression problems.
The mean absolute error (MAE) represents the average of the absolute differences between
the original and predicted values: MAE = (1/n) Σ |yi - ŷi|. Mean absolute error provides
an estimate of how far off the predictions were from the actual output. However, since it
uses absolute values, it does not indicate the direction of the error.
The mean squared error (MSE) is quite similar to the mean absolute error, except that it
uses the average of the squares of the differences between original and predicted values:
MSE = (1/n) Σ (yi - ŷi)². Since the errors are squared, larger errors become very notable.
The root mean squared error (RMSE) measures how well the model fits by calculating the
square root of the average of the squared differences between the predicted and actual
values: RMSE = √MSE. It is a measure of the average error magnitude.
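A tiny sketch of these three metrics, assuming scikit-learn and made-up values:

# Sketch: MAE, MSE and RMSE for a regression prediction
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)    # average absolute difference
mse = mean_squared_error(y_true, y_pred)     # average squared difference
rmse = np.sqrt(mse)                          # square root of MSE
print(mae, mse, rmse)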
Improving Performance of a Model
Machine learning development may not be difficult for ML engineers, but ensuring
good performance is essential to get accurate and reliable results. There are various
methods you can use to improve the performance of your machine learning model.
Algorithms are the key factor used to train ML models. The data fed into them is what
the model learns from in order to make accurate predictions. Hence, choosing the
right algorithm is important to ensure the performance of your machine learning
model.
Linear Regression, Logistic Regression, Decision Tree, SVM, Naive Bayes, kNN,
K-Means, Random Forest, Dimensionality Reduction Algorithms and Gradient
Boosting are among the leading ML algorithms you can choose from, depending on
which is compatible with your ML problem.
The next important factor to consider while developing a machine learning model is
choosing the right quantity of data. Multiple factors play a role here, and for deep
learning-based ML models, a huge quantity of data is required.
Depending on the complexity of the problem and of the learning algorithm, model skill,
data size evaluation and the use of statistical heuristic rules are the leading factors that
determine the quantity and type of training data, which in turn helps in improving the
performance of the model.
Just like quantity, the quality of the machine learning training data set is another key
factor to keep in mind while developing an ML model. If the quality of the training data
is not good or accurate, your model will never give accurate results, affecting its overall
performance and making it unsuitable for use in real life.
There are different methods to measure the quality of a training data set. Standard
quality-assurance methods and detailed in-depth quality assessments are the two most
popular methods you can use to ensure the quality of data sets.
Quality of data is important to get unbiased decisions from ML models, so you
need to make sure you use training data of the right quality to improve the
performance of your ML model.
Supervised or Unsupervised ML
Building a machine learning model is not enough to get the right predictions; you
have to check its accuracy and validate it to ensure you get precise results. Validating
the model will improve its performance.
There are various types of validation techniques you can follow, but you need
to make sure you choose the one that best suits your ML model, helps you improve its
overall performance, and lets it predict in an unbiased manner. Similarly, testing of the
model is also important to ensure its accuracy and performance.
Data tells a story only if you have enough of it. Every data sample adds some input and
perspective to the overall story your data is trying to tell you. Perhaps the easiest and
most straightforward way to improve your model's performance and increase its
accuracy is to add more data samples to the training data.
Doing so will add more detail to your data and fine-tune your model, resulting in a
more accurate performance. Remember, after all, the more information you give your
model, the more it will learn and the more cases it will be able to identify correctly.
Sometimes adding more data isn't the answer to your model's inaccuracy problem.
You're providing your model with a good technique and the correct dataset, but you're
not getting the results you hope for. Why?
Maybe you’re just asking the wrong questions or trying to hear the wrong story.
Looking at the problem from a new perspective can add valuable information to your
model and help you uncover hidden relationships between the story variables. Asking
different questions may lead to better results and, eventually, better accuracy.
More context can always lead to a better understanding of the problem and, eventually,
better performance of the model. Imagine I tell you I am selling a car, a BMW. That
alone doesn’t give you much information about the car. But, if I add the color, model
and distance traveled, then you’ll start to have a better picture of the car and its
possible value.
Training a machine learning model is a skill that you can only hone with practice. Yes,
there are rules you can follow to train your model, but these rules don't give you the
answer you're seeking, only the way to reach that answer.
However, to get the answer, you will need to do some trial and error until you reach it.
When I first started learning the different machine learning algorithms, such as
K-means, I was lost on how to choose the best number of clusters to reach optimal
results. The way to optimize the results is to tune the algorithm's hyper-parameters,
and tuning them well will generally lead to better accuracy.
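For instance, a rough sketch of picking the number of K-means clusters by inspecting inertia across candidate values (assuming scikit-learn, synthetic data, and the common "elbow" heuristic):

# Sketch: tune a hyper-parameter, here the number of K-means clusters
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))    # look for the k where inertia stops dropping sharply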
A related approach is cross-validation: we enhance the training process by training the
algorithm on different chunks of the data and averaging over the results. Cross-validation
is used to optimize the model's performance, and it is popular because it is simple and
easy to implement.
Experiment with a different algorithm.
What if you tried all the approaches we talked about so far and your model still results
in a low or average accuracy? What then?
Sometimes we choose an algorithm to implement that doesn't really apply to the data
we have, and so we don't get the results we expect. Consider changing the algorithm
you're using to implement your solution. Trying out different algorithms will lead you to
uncover more details about your data and the story it's trying to tell.
Feature Engineering
Feature engineering refers to the process of using domain knowledge to select and
transform the most relevant variables from raw data when creating a predictive model
using machine learning or statistical modeling. The goal of feature engineering and
selection is to improve the performance of machine learning (ML) algorithms.
The feature engineering pipeline is the preprocessing steps that transform raw data
into features that can be used in machine learning algorithms, such as predictive
models. Predictive models consist of an outcome variable and predictor variables, and
it is during the feature engineering process that the most useful predictor variables are
created and selected for the predictive model. Automated feature engineering has been
available in some machine learning software since 2016. Feature engineering in ML
consists of four main steps: Feature Creation, Transformations, Feature Extraction,
and Feature Selection.
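As a small hypothetical illustration of feature creation and transformation (the column names and derived features below are made up, not taken from these notes), new predictor variables can be derived from raw columns with pandas:

# Hedged sketch: deriving new features from raw columns (hypothetical column names)
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "order_date": pd.to_datetime(["2023-01-05", "2023-02-14", "2023-03-20"]),
    "price": [100.0, 250.0, 80.0],
    "quantity": [2, 1, 5],
})

df["revenue"] = df["price"] * df["quantity"]      # feature creation: combine raw columns
df["order_month"] = df["order_date"].dt.month     # feature extraction from a datetime
df["log_price"] = np.log(df["price"])             # transformation: reduce skew
print(df)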
Feature Selection: Feature selection algorithms essentially analyze, judge,
and rank various features to determine which features are irrelevant and
should be removed, which features are redundant and should be removed, and
which features are most useful for the model and should be prioritized.
The art of feature engineering may vary among data scientists; however, the general
steps for performing feature engineering are similar for most machine learning
algorithms.
Each Kaggle competition provides a training data set to train the predictive model,
and a testing data set to work with. The Titanic Competition also provides information
about passengers onboard the Titanic.
Feature Transformation
2. It is also known as Feature Engineering, which is creating new features from
existing features that may help in improving the model performance.
3. It refers to the family of algorithms that create new features using the existing
features. These new features may not have the same interpretation as the original
features, but they may have more explanatory power in a different space rather than in
the original space.
4. This can also be used for Feature Reduction. It can be done in many ways, by
linear combinations of original features or by using non-linear functions.
1. Some Machine Learning models, like Linear and Logistic regression, assume that
the variables follow a normal distribution. More likely, variables in real datasets will
follow a skewed distribution.
1. Function Transformation
2. Power Transformation
3. Quantile transformation
Function Transformations
LOG TRANSFORMATION:
– Generally, these transformations make our data close to a normal distribution but
are not able to exactly abide by a normal distribution.
– This transformation is not applied to those features which have negative values.
– This transformation is mostly applied to right-skewed data.
– Converts data from a multiplicative scale to an additive scale, i.e. more linearly distributed
data.
RECIPROCAL TRANSFORMATION
– This transformation reverses the order among values of the same sign, so large
values become smaller and vice-versa.
SQUARE TRANSFORMATION
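A tiny sketch of the function transformations above, assuming NumPy and a made-up, right-skewed, strictly positive feature:

# Sketch: log, reciprocal and square transformations on a toy feature
import numpy as np

x = np.array([1.0, 2.0, 5.0, 10.0, 100.0])   # right-skewed, strictly positive values

log_x = np.log(x)        # log transformation (only for positive values)
recip_x = 1.0 / x        # reciprocal transformation (reverses the order within a sign)
square_x = x ** 2        # square transformation
print(log_x, recip_x, square_x, sep="\n")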
Power Transformations
– Box-Cox requires the input data to be strictly positive (not even zero is acceptable).
– For features that have zeros or negative values, Yeo-Johnson comes to the rescue.
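A hedged sketch of both power transforms, assuming scikit-learn's PowerTransformer and made-up values:

# Sketch: Box-Cox (strictly positive data) vs. Yeo-Johnson (any real values)
import numpy as np
from sklearn.preprocessing import PowerTransformer

positive = np.array([[1.0], [2.0], [5.0], [10.0], [100.0]])
with_zeros = np.array([[-3.0], [0.0], [2.0], [5.0], [40.0]])

print(PowerTransformer(method="box-cox").fit_transform(positive).ravel())
print(PowerTransformer(method="yeo-johnson").fit_transform(with_zeros).ravel())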
Quantile transformation
The transformation is applied on each feature independently. First an estimate of the
cumulative distribution function of a feature is used to map the original values to a
uniform distribution. The obtained values are then mapped to the desired output
distribution using the associated quantile function. Feature values of new/unseen data
that fall below or above the fitted range will be mapped to the bounds of the output
distribution. Note that this transform is non-linear. It may distort linear correlations
between variables measured at the same scale but renders variables measured at
different scales more directly comparable.
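A minimal sketch of this transform, assuming scikit-learn's QuantileTransformer (one common implementation) and a synthetic right-skewed feature:

# Sketch: map a skewed feature to a uniform or normal output distribution
import numpy as np
from sklearn.preprocessing import QuantileTransformer

x = np.random.default_rng(0).exponential(size=(1000, 1))     # right-skewed feature

uniform = QuantileTransformer(output_distribution="uniform").fit_transform(x)
normal = QuantileTransformer(output_distribution="normal").fit_transform(x)
print(uniform.min(), uniform.max())     # bounded to the fitted range
print(round(normal.mean(), 2))          # roughly standard normal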
Feature Selection
Feature Selection is the most critical pre-processing activity in any machine learning
process. It intends to select a subset of attributes or features that makes the most
meaningful contribution to a machine learning activity. In order to understand it, let
us consider a small example, i.e. predicting the weight of students based on past
information about similar students, which is captured inside a ‘Student Weight’
data set. The data set has four features: Roll Number, Age, Height and Weight. Roll
Number has no effect on the weight of the students, so we eliminate this feature. The
new data set will thus have only three features. This subset of the data set is
expected to give better results than the full set.
Before proceeding further, we should look at why we have reduced the dimensionality
of the above dataset, or in other words, what the issues with high-dimensional data are.
Reducing the dimensionality offers the following benefits:
1. Having a faster and more cost-effective (less need for computational resources)
learning model.
2. Having a better understanding of the underlying model that generates the data.
3. Improving the efficacy of the learning model.
a. Feature Relevance: In the case of supervised learning, the input data set (which is
the training data set) has a class label attached. A model is induced based on the
training data set, so that the induced model can assign class labels to new,
unlabeled data. Each of the predictor variables is expected to contribute information
to decide the value of the class label. If a variable is not contributing any
information, it is said to be irrelevant. If the information contribution for
prediction is very little, the variable is said to be weakly relevant. The remaining
variables, which make a significant contribution to the prediction task, are said to be
strongly relevant variables.
In the case of unsupervised learning, there is no training data set or labelled data.
Similar data instances are grouped together, and the similarity of data instances is
evaluated based on the values of different variables. Certain variables do not contribute
any useful information for deciding the similarity or dissimilarity of data instances.
Hence, those variables make no significant contribution to the grouping process. These
variables are marked as irrelevant variables in the context of the unsupervised
machine learning task.
We can understand the concept by taking a real-world example: At the start of the
article, we took a random dataset of the student. In that, Roll Number doesn’t
contribute any significant information in predicting what the Weight of a student
would be. Similarly, if we are trying to group together students with similar academic
capabilities, Roll No can really not contribute any information. So, in the context of
grouping students with similar academic merit, the variable Roll No is quite
irrelevant. Any feature which is irrelevant in the context of a machine learning task
is a candidate for rejection when we are selecting a subset of features.
b. Feature Redundancy: A feature may contribute information that is similar to the
information contributed by one or more other features; such features are potentially
redundant. All features having potential redundancy are candidates for rejection in the final
feature subset. Only a few representative features out of a set of potentially redundant
features are considered for being a part of the final feature subset. So in short, the
main objective of feature selection is to remove all features which are irrelevant and
take a representative subset of the features which are potentially redundant. This leads
to a meaningful feature subset in the context of a specific learning task.
In the case of supervised learning, the information contribution of a feature can be
measured by the feature-to-class mutual information:

MI(C, f) = H(C) + H(f) - H(C, f)

where the marginal entropy of the class is

H(C) = - Σ (i = 1 to K) p(Ci) log2 p(Ci)

and K = number of classes, C = class variable, f = feature set that takes discrete values.
The higher the mutual information of a feature, the more relevant that feature is.
In the case of unsupervised learning, there is no class variable. Hence, feature-to-class
mutual information cannot be used to measure the information contribution of the
features. In the case of unsupervised learning, the entropy of the set of features
without one feature at a time is calculated for all features. Then the features are
ranked in descending order of information gain from a feature, and the top β percentage
of features (the value of β is a design parameter of the algorithm) are selected as relevant
features. The entropy of a feature f is calculated using Shannon's formula:

H(f) = - Σx p(f = x) log2 p(f = x)

This formula is used only for features that take discrete values. For continuous features,
discretization should be performed first to estimate the probabilities p(f = x).
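As a small sketch, the entropy of a discrete feature can be computed directly from this formula (assuming NumPy and a made-up feature):

# Sketch: Shannon entropy of a discrete feature
import numpy as np

def entropy(values):
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()              # estimated probabilities p(f = x)
    return -np.sum(p * np.log2(p))

feature = ["a", "a", "b", "b", "c", "c", "c", "c"]
print(entropy(feature))                    # higher entropy = more information content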
The similarity between two features can be measured using:
Correlation-based measures
Distance-based measures
Other coefficient-based measures
Correlation is a measure of linear dependency between two features. The Pearson
correlation between two features F1 and F2 is

ρ(F1, F2) = cov(F1, F2) / (σF1 · σF2)

where cov(F1, F2) is the covariance of F1 and F2, and σF1, σF2 are their standard deviations.
Correlation value ranges between +1 and -1. A correlation of 1 (+/-) indicates perfect
correlation. In case the correlation is zero, then the features seem to have no linear
relationship. Generally for all feature selection problems, a threshold value is adopted
to decide whether two features have adequate similarity or not.
The most common distance measure is the Euclidean distance, which, between two
features F1 and F2, is calculated as

d(F1, F2) = sqrt( Σ (i = 1 to n) (F1i - F2i)² )

where each feature has n values (an n-dimensional dataset). For example, if the dataset
has two features, Subjects (F1) and Marks (F2), under consideration, the Euclidean
distance between the two features is obtained by applying this formula to their n values.

A more generalized form of the Euclidean distance is the Minkowski distance, measured as

d(F1, F2) = ( Σ (i = 1 to n) |F1i - F2i|^r )^(1/r)

Minkowski distance takes the form of Euclidean distance (also called the L2 norm) when
r = 2. At r = 1, it takes the form of Manhattan distance (also called the L1 norm):

d(F1, F2) = Σ (i = 1 to n) |F1i - F2i|
For two features that take binary (0/1) values, the Jaccard index is measured as

J = n11 / (n01 + n10 + n11)

where n11 = number of cases where both features have value 1,
n01 = number of cases where feature 1 has value 0 and feature 2 has value 1,
and n10 = number of cases where feature 1 has value 1 and feature 2 has value 0.

Jaccard distance: dJ = 1 - J

Note that the cases where both values are 0 are excluded from the calculation of the
Jaccard coefficient.
Cosine similarity between two feature vectors x and y is measured as

cos(x, y) = (x · y) / (||x|| · ||y||)

where x · y is the vector dot product of x and y, i.e. Σ (i = 1 to n) xi·yi, and
||x|| = sqrt( Σ xi² ), ||y|| = sqrt( Σ yi² ).

Cosine similarity measures the angle between the x and y vectors. Hence, if the cosine
similarity is 1, the angle between x and y is 0°, which means x and y are the same except
for magnitude. If the cosine similarity is 0, the angle between x and y is 90°, and they do
not share any similarity.
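A brief sketch computing these feature-to-feature measures, assuming NumPy/SciPy and two made-up feature vectors:

# Sketch: correlation, distance and similarity measures between two toy features
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine, jaccard

f1 = np.array([1, 0, 1, 1, 0], dtype=float)
f2 = np.array([1, 1, 1, 0, 0], dtype=float)

print("Pearson correlation:", np.corrcoef(f1, f2)[0, 1])
print("Euclidean distance :", euclidean(f1, f2))
print("Manhattan distance :", cityblock(f1, f2))
print("Jaccard distance   :", jaccard(f1, f2))          # for binary-valued features
print("Cosine similarity  :", 1 - cosine(f1, f2))       # SciPy returns the cosine *distance*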
Even after all these steps, there are a few more steps, which can be understood from the
following flowchart:

Feature Selection Process

After the successful completion of this cycle, we get the desired features, and we have
finally tested them as well.
1. Filter Method: In this method, features are dropped based on their relation to the
output, or how they correlate with the output. We use correlation to check if the
features are positively or negatively correlated with the output labels and drop features
accordingly. Eg: Information Gain, Chi-Square Test, Fisher’s Score, etc.
Figure 5: Filter Method flowchart
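A minimal sketch of a filter method, assuming scikit-learn's SelectKBest with a mutual-information score and the Iris data:

# Sketch of a filter method: rank features by a univariate score and keep the best k
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=mutual_info_classif, k=2).fit(X, y)
print("scores:", selector.scores_)
print("selected feature indices:", selector.get_support(indices=True))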
2. Wrapper Method: We split our data into subsets and train a model using them.
Based on the output of the model, we add and remove features and train the model
again. It forms the subsets using a greedy approach and evaluates the accuracy of all
the possible combinations of features. Eg: Forward Selection, Backward Elimination,
etc.
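A minimal sketch of a wrapper method, assuming scikit-learn's recursive feature elimination (RFE) and the Iris data:

# Sketch of a wrapper method: recursive feature elimination (a greedy backward search)
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
print("selected feature indices:", rfe.get_support(indices=True))
print("feature ranking:", rfe.ranking_)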
3. Intrinsic Method: This method combines the qualities of both the Filter and
Wrapper method to create the best subset.
This method takes care of the iterative model-training process while keeping the
computation cost to a minimum. Eg: Lasso and Ridge Regression.
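A minimal sketch of an intrinsic (embedded) method, assuming scikit-learn's Lasso and synthetic regression data:

# Sketch of an intrinsic (embedded) method: Lasso drives irrelevant weights to zero
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0)
lasso = Lasso(alpha=1.0).fit(X, y)
print("kept features:", np.flatnonzero(lasso.coef_))     # features with non-zero weight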
How do we know which feature selection model will work out for our model? The
process is relatively simple, with the model depending on the types of input and
output variables.
Input Variable    Output Variable    Feature Selection Model
Numerical         Numerical          Pearson’s correlation coefficient, Spearman’s rank coefficient