Feature selection
Overfitting
● A scenario where the machine learning model learns the details along with the noise in the data and tries to fit every data point on the curve is called Overfitting.
Underfitting
● A scenario where a machine learning model can neither learn the relationship between the variables in the training data nor predict or classify a new data point is called Underfitting.
Regularization
● The regularization approach prevents the model from overfitting by adding extra information (a penalty term) to the cost function of the model.
● It keeps all variables or features in the model but reduces the magnitude of their coefficients, giving better performance and generalization of the model.
● The primary process is regularizing, i.e., reducing the magnitude of, the feature coefficients without changing the number of features.
Working Principle
● Regularization works by adding a penalty or complexity term to the complex model.
○ Regularization = Loss + λ |w|
Where |w| = |w1| + |w2| + …. + |wn|
● The cost function with this added penalty term is called the regularized cost function; a squared (L2) penalty gives the Ridge Regression penalty, while an absolute-value (L1) penalty, as shown above, gives the Lasso penalty.
Ridge Regression (L2 regularization)
● The penalty term is calculated by multiplying the squared weight (coefficient) of each feature by lambda.
● The equation for the cost function:
Cost Function = Error(𝑦𝑖, 𝑦𝑖′) = J(𝜃) = (1/2M) Σᵢ₌₁..M (𝑦𝑖 − 𝑦𝑖′)² + λ Σⱼ₌₁..n 𝜃ⱼ²
● λ is the regularization parameter (λ ≥ 0), M = No. of samples, n = No. of features, and 𝜃0 is not penalized.
● The penalty term regularizes the coefficients of the model.
● If λ tends to zero, the equation reduces to the cost function of the linear regression model.
● Hence, for very small values of λ, the model resembles the plain linear regression model.
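A minimal sketch of this behaviour on made-up synthetic data (not from the slides); in scikit-learn's Ridge the parameter alpha plays the role of λ:

```python
# Minimal sketch on synthetic data; `alpha` acts as lambda in the formula above.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([3.0, -2.0, 0.5, 0.0, 0.0])
y = X @ true_w + rng.normal(scale=0.5, size=100)

for alpha in (0.01, 1.0, 10.0, 100.0):
    ridge = Ridge(alpha=alpha).fit(X, y)
    # Small alpha: coefficients close to plain linear regression.
    # Large alpha: all coefficients shrink toward 0, but none becomes exactly 0.
    print(alpha, np.round(ridge.coef_, 3))
```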
Ridge Regression (L2 regularization) - Example
Cost Function = loss + λ × (slope of the curve)²
For the Linear Regression line, let’s consider two points that are on the line:
● Loss = 0 (considering the two points on the line)
● λ = 1
● The slope of the curve = 1.4
Cost function = 0 + 1 × (1.4)² = 1.96
For Ridge Regression, let’s assume:
● Loss = 0.3² + 0.2² = 0.13
● λ = 1
● The slope of the curve = 0.7
Then, Cost function = 0.13 + 1 × (0.7)² = 0.62
Ridge regression line fits the model more accurately than the linear regression line.
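The arithmetic above can be verified in a couple of lines of Python (the numbers are taken directly from the example; the helper function is only illustrative):

```python
# Quick check of the arithmetic above.
def ridge_cost(loss, lam, slope):
    return loss + lam * slope ** 2   # cost = loss + lambda * slope^2

print(ridge_cost(loss=0.0, lam=1.0, slope=1.4))              # 1.96 (linear regression line)
print(ridge_cost(loss=0.3**2 + 0.2**2, lam=1.0, slope=0.7))  # ~0.62 (ridge regression line)
```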
Lasso Regression (L1 regularization)
● Lasso regression is also a regularization technique that reduces the complexity of the
model.
● Lasso regression is the combination of linear regression and the L1 norm, i.e., the sum of the absolute values |𝜃j| of the coefficients.
● Because of the absolute values, it can shrink a coefficient (slope) exactly to 0, whereas Ridge Regression can only shrink it close to 0.
● The equation for the cost function of Lasso regression:
Cost Function = Error(𝑦𝑖, 𝑦𝑖′) = J(𝜃) = (1/2M) Σᵢ₌₁..M (𝑦𝑖 − 𝑦𝑖′)² + λ Σⱼ₌₁..n |𝜃ⱼ|
Lasso regression line fits the model more accurately than the linear regression line.
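To see why the absolute penalty can produce an exact zero while the squared penalty cannot, here is a tiny one-coefficient sketch; the slope 0.8 and λ = 2 are made-up numbers:

```python
# One-coefficient illustration with made-up numbers: minimize
# (w - w_ols)^2 + penalty(w) over a grid of candidate slopes.
import numpy as np

w_ols = 0.8                       # hypothetical unregularized (least-squares) slope
lam = 2.0                         # hypothetical regularization strength
w = np.arange(-200, 201) / 100.0  # candidate slopes from -2.00 to 2.00

w_l2 = w[np.argmin((w - w_ols) ** 2 + lam * w ** 2)]      # ridge-style (squared) penalty
w_l1 = w[np.argmin((w - w_ols) ** 2 + lam * np.abs(w))]   # lasso-style (absolute) penalty
print("L2 solution:", w_l2)   # shrunk toward 0 but still nonzero (0.27 on this grid)
print("L1 solution:", w_l1)   # exactly 0.0: the absolute penalty can zero the slope out
```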
Difference between Ridge Regression and Lasso
Regression
● Ridge regression is mostly used to reduce the overfitting in the model and includes all
the features present in the model.
● It reduces the complexity of the model by shrinking the coefficients.
● Lasso regression helps reduce overfitting and also performs feature selection, since it can shrink some coefficients exactly to zero; a comparison of the two is sketched below.
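A comparison sketch on assumed synthetic data (the coefficients and alpha values are arbitrary): Ridge keeps every feature with a small nonzero weight, while Lasso drops some entirely.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
true_w = np.array([4.0, 0.0, -3.0, 0.0, 0.0, 2.0, 0.0, 0.0])  # only 3 features matter
y = X @ true_w + rng.normal(scale=0.5, size=200)

ridge_coef = Ridge(alpha=1.0).fit(X, y).coef_
lasso_coef = Lasso(alpha=0.5).fit(X, y).coef_
print("features dropped by Ridge:", int(np.sum(ridge_coef == 0)))  # expected 0: all kept
print("features dropped by Lasso:", int(np.sum(lasso_coef == 0)))  # expected > 0: some removed
```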
Curse of Dimensionality
● The Curse of Dimensionality in Machine Learning arises when working with
high-dimensional data, leading to increased computational complexity, overfitting, and
spurious correlations.
● In high-dimensional spaces, data points become sparse, making it challenging to discern
meaningful patterns or relationships due to the vast amount of data required to adequately
sample the space.
● The Curse of Dimensionality significantly impacts machine learning algorithms in
various ways.
● It leads to increased computational complexity, longer training times, and higher resource
requirements.
● Moreover, it escalates the risk of overfitting and spurious correlations, hindering the
algorithms’ ability to generalize well to unseen data.
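A small illustration (not from the slides) of this sparsity effect: for uniformly random points, the nearest and farthest neighbour distances become almost indistinguishable as the number of dimensions grows.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))                       # 500 random points in [0, 1]^d
    dists = np.linalg.norm(X - X[0], axis=1)[1:]   # distances from the first point
    print(d, round(dists.min() / dists.max(), 3))  # ratio approaches 1 in high dimensions
```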
Wrapper based methods
● In the wrapper methodology, feature selection is treated as a search problem: different combinations of features are made, evaluated, and compared with one another. The learning algorithm is trained iteratively on these subsets of features (forward and backward selection are sketched after this list).
● Forward selection -
○ Forward selection is an iterative process, which begins with an empty set of features.
○ In each iteration, it adds one more feature and evaluates whether the performance of the model improves.
○ The process continues until the addition of a new variable/feature does not improve the
performance of the model.
● Backward elimination -
○ Backward elimination is also an iterative approach, but it is the opposite of forward selection.
○ This technique begins the process by considering all the features and removes the least
significant feature.
○ This elimination process continues until removing the features does not improve the
performance of the model.
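A hedged sketch of both strategies using scikit-learn's SequentialFeatureSelector; note that this implementation stops at a requested number of features rather than "when performance stops improving", and the dataset, estimator, and n_features_to_select=5 below are placeholder choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
estimator = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Forward selection: start empty, greedily add the feature that helps most.
forward = SequentialFeatureSelector(
    estimator, n_features_to_select=5, direction="forward", cv=3).fit(X, y)
# Backward elimination: start with all features, greedily remove the least useful one.
backward = SequentialFeatureSelector(
    estimator, n_features_to_select=5, direction="backward", cv=3).fit(X, y)
print("forward selection kept:", forward.get_support(indices=True))
print("backward elimination kept:", backward.get_support(indices=True))
```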
Wrapper based methods
● Exhaustive Feature Selection-
○ Exhaustive feature selection evaluates every possible feature set by brute force, which makes it one of the most thorough (and most computationally expensive) feature selection methods.
○ It tries each possible combination of features and returns the best-performing feature set.
● Recursive Feature Elimination -
○ Recursive feature elimination is a recursive greedy optimization approach, where features are selected by recursively considering smaller and smaller subsets of features.
○ An estimator is trained on each set of features, and the importance of each feature is determined through the coef_ attribute or the feature_importances_ attribute.
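A short sketch of recursive feature elimination with scikit-learn's RFE (the dataset, estimator, and target of 5 features are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # scaling keeps the linear model well behaved

# Repeatedly drop the least important feature (smallest |coef_|) until 5 remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=5, step=1).fit(X_scaled, y)
print("selected feature indices:", rfe.get_support(indices=True))
print("feature ranking (1 = selected):", rfe.ranking_)
```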
Subset selection
Filter based methods
● In the filter method, features are selected on the basis of statistical measures.
● This method does not depend on the learning algorithm; it filters out irrelevant and redundant features by ranking them with different statistical metrics.
● Information Gain:
○ Information gain measures the reduction in entropy obtained by splitting the dataset on a feature.
○ It can be used as a feature selection technique by calculating the information gain of each
variable with respect to the target variable.
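A sketch of this idea with scikit-learn, where mutual_info_classif estimates the information gain of each feature with respect to the target (the dataset and k = 5 are placeholder choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)
# Score every feature by its estimated mutual information with the target, keep the top 5.
selector = SelectKBest(score_func=mutual_info_classif, k=5).fit(X, y)
print("information gain per feature:", selector.scores_.round(3))
print("top-5 feature indices:", selector.get_support(indices=True))
```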
Filter based methods
● Chi-square Test:
○ Chi-square test is a technique to determine the relationship between the categorical variables.
○ The chi-square value is calculated between each feature and the target variable, and the
desired number of features with the best chi-square value is selected.
● Fisher's score:
○ Fisher's score ranks the variables according to Fisher's criterion in descending order.
○ We can then select the variables with the largest Fisher's scores.
● Missing Value Ratio:
○ The missing value ratio of each feature can be evaluated against a chosen threshold value.
○ A variable whose missing value ratio is higher than the threshold can be dropped (the chi-square and missing-value-ratio filters are sketched after this list).
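A combined sketch of the chi-square and missing-value-ratio filters; the random data, the small DataFrame, and the 0.4 threshold are purely illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Chi-square: score each (non-negative) feature against the target, keep the best k.
rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(100, 6)))          # chi2 requires non-negative features
y = rng.integers(0, 2, size=100)
chi2_selector = SelectKBest(score_func=chi2, k=3).fit(X, y)
print("chi-square keeps feature indices:", chi2_selector.get_support(indices=True))

# Missing value ratio: drop any column whose fraction of missing values exceeds the threshold.
df = pd.DataFrame({"a": [1, None, 3, None], "b": [1, 2, 3, 4], "c": [None, None, None, 4]})
missing_ratio = df.isna().mean()
threshold = 0.4
print(df.drop(columns=missing_ratio[missing_ratio > threshold].index))
```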
Embedded methods
● Embedded methods combine the advantages of both filter and wrapper methods by considering the interaction of features while keeping the computational cost low. They are fast, like filter methods, but more accurate than filter methods.
● Regularization - Regularization adds a penalty term to the parameters of the machine learning model to avoid overfitting. Because the penalty is applied to the coefficients, an L1 penalty shrinks some coefficients exactly to zero, and the features with zero coefficients can be removed from the dataset. The regularization techniques used for this are L1 regularization (Lasso) and Elastic Net (combined L1 and L2 regularization); see the sketch after this list.
● Random Forest Importance - Tree-based methods of feature selection provide feature-importance scores that give a natural way of selecting features. Here, feature importance specifies which features matter most for model building or have the greatest impact on the target variable. Random Forest is such a tree-based method: a bagging algorithm that aggregates a number of decision trees. It automatically ranks the nodes by their performance, i.e., the decrease in impurity (Gini impurity) over all the trees. Nodes are arranged according to their impurity values, which allows pruning the trees below a particular node; the remaining nodes form a subset of the most important features.
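A hedged sketch of both embedded approaches via scikit-learn's SelectFromModel; the dataset, alpha = 0.05, and forest size are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# L1 regularization: features whose Lasso coefficient is shrunk to (near) zero are discarded.
l1_selector = SelectFromModel(Lasso(alpha=0.05)).fit(X_scaled, y)
print("kept by L1 regularization:", l1_selector.get_support(indices=True))

# Random forest importance: keep features whose impurity-based importance is above the mean.
rf_selector = SelectFromModel(RandomForestClassifier(n_estimators=200, random_state=0)).fit(X, y)
print("kept by random forest importance:", rf_selector.get_support(indices=True))
```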