
UNIT -3

Topics: Introduction to Statistical Learning Theory, Feature extraction - Principal component analysis, Singular value decomposition. Feature selection - feature ranking and subset selection, filter, wrapper and embedded methods, Evaluating Machine Learning algorithms and Model Selection.

Introduction to Statistical Learning Theory


Statistical learning theory is a framework for machine learning that draws from
statistics and functional analysis. It deals with finding a predictive function based on
the data presented. The main idea in statistical learning theory is to build a model that
can draw conclusions from data and make predictions.

Types of Data in Statistical Learning:

With statistical learning theory, there are two main types of data:

 Dependent Variable — a variable (y) whose values depend on the values of other
variables (a dependent variable is sometimes also referred to as a target variable)

 Independent Variables — a variable (x) whose value does not depend on the values of
other variables (independent variables are sometimes also referred to as predictor
variables, input variables, explanatory variables, or features)

A common example of an Independent Variable is Age. There is nothing one can do to increase or decrease age, so this variable is independent.

Some common examples of Dependent Variables are:

 Weight — a person’s weight is dependent on his or her age, diet, and activity levels (as
well as other factors)

 Temperature – temperature is impacted by altitude, distance from the equator (latitude), and distance from the sea

In graphs, the independent variable is often plotted along the x-axis while the dependent variable is plotted along the y-axis.

In the example of how the price of a home is affected by the size of the home, sq. ft is the independent variable while the price of the home is the dependent variable.

Statistical Model:

A statistical model defines the relationship between a dependent and an independent variable. In the home-price example, the relationship between the size of the home and the price of the home is illustrated by a straight line. We can define this relationship by using

y = mx + c

where m represents the gradient and c is the intercept. Another way that this equation can be expressed is in regression notation, which would look something like:

y = β0 + β1x + ε

where β0 is the intercept, β1 is the coefficient of the independent variable, and ε is the error term.

If we suppose that the size of the home is not the only independent variable when determining the price and that the number of bathrooms is also an independent variable, the equation would look like:

y = β0 + β1x1 + β2x2 + ε

where x1 is the size of the home and x2 is the number of bathrooms.
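As an illustration, here is a minimal sketch of fitting both models with scikit-learn; the sizes, bathroom counts and prices below are made-up values, and LinearRegression simply estimates the intercept and coefficients described above.

# Minimal sketch (assumed toy data) of fitting y = mx + c and the
# two-variable model with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: size in sq. ft, number of bathrooms, price in $1000s
X = np.array([[1000, 1], [1500, 2], [2000, 2], [2500, 3]])
y = np.array([200, 280, 350, 440])

# Simple linear regression: price vs. size only
simple = LinearRegression().fit(X[:, [0]], y)
print(simple.coef_[0], simple.intercept_)   # gradient m and intercept c

# Multiple regression: price vs. size and number of bathrooms
multiple = LinearRegression().fit(X, y)
print(multiple.coef_, multiple.intercept_)  # one coefficient per feature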
Model Generalization:

In order to build an effective model, the available data needs to be used in a way that makes the model generalize to unseen situations. Common problems that occur when building models are that the model under-fits or over-fits the data.

 Under-fitting — when a statistical model does not adequately capture the underlying
structure of the data and, therefore, does not include some parameters that would
appear in a correctly specified model.

 Over-fitting — when a statistical model contains more parameters than can be justified by the data and includes the residual variation (“noise”) as if the variation represented underlying model structure.
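A small sketch of how under-fitting and over-fitting show up in practice, using synthetic data and illustrative polynomial degrees (all values here are assumptions, not from the text):

# Sketch: comparing an under-fitted, a well-fitted and an over-fitted model
# on synthetic data (data and degrees are illustrative assumptions).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60))[:, None]
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):          # too simple, reasonable, too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),   # training error
          mean_squared_error(y_te, model.predict(X_te)))   # test error

A large gap between training and test error signals over-fitting, while a model with high error on both signals under-fitting.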
Feature extraction - Principal component analysis

 Principal Component Analysis (PCA) is an unsupervised learning algorithm that is used for dimensionality reduction in machine learning.
 It is a statistical process that converts the observations of correlated features into a set of linearly uncorrelated features with the help of an orthogonal transformation. These new transformed features are called the Principal Components.
 PCA works by considering the variance of each attribute, because attributes with high variance tend to show a better split between the classes; this is what allows the dimensionality to be reduced.
 Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels.
 It is a feature extraction technique, so it retains the important variables and drops the least important ones.

Some common terms used in PCA algorithm:

o Dimensionality: It is the number of features or variables present in the given dataset. More simply, it is the number of columns in the dataset.
o Correlation: It signifies how strongly two variables are related to each other, i.e., if one changes, the other also changes. The correlation value ranges from -1 to +1: -1 indicates that the variables are inversely proportional to each other, and +1 indicates that they are directly proportional.
o Orthogonal: It defines that variables are not correlated to each other, and hence the correlation between the pair of variables is zero.
o Eigenvectors: Given a square matrix M and a non-zero vector v, v is an eigenvector of M if Mv is a scalar multiple of v.
o Covariance Matrix: A matrix containing the covariance between the pair of
variables is called the Covariance Matrix.

The PCA algorithm is based on some mathematical concepts such as:

o Variance and Covariance


o Eigenvalues and Eigenvectors

Principal Components in PCA


As described above, the transformed new features or the output of PCA are the Principal Components. The number of these PCs is either equal to or less than the number of original features present in the dataset. Some properties of these principal components are given below:

o The principal component must be the linear combination of the original


features.
o These components are orthogonal, i.e., the correlation between a pair of
variables is zero.
o The importance of each component decreases when going from 1 to n: the 1st PC has the most importance, and the nth PC has the least importance.

Steps for PCA algorithm


1.Getting the dataset
Firstly, we need to take the input dataset and divide it into two subparts X and Y,
where X is the training set, and Y is the validation set.

2.Representing data in a structure

Now we will represent our dataset as a structure: a two-dimensional matrix of the independent variables X. Here each row corresponds to a data item, and each column corresponds to a feature. The number of columns is the dimensionality of the dataset.
3.Standardizing the data
In this step, we will standardize our dataset. Within a particular column, features with high variance are treated as more important than features with lower variance. If the importance of a feature should be independent of its variance, we divide each data item in a column by the standard deviation of that column. We will name the resulting matrix Z.

4.Calculating the Covariance of Z

To calculate the covariance of Z, we take the matrix Z and transpose it. After transposing, we multiply it by Z. The output matrix is the covariance matrix of Z.
5.Calculating the Eigenvalues and Eigenvectors
Now we need to calculate the eigenvalues and eigenvectors of the resulting covariance matrix. The eigenvectors of the covariance matrix are the directions of the axes with the most information (variance), and the corresponding eigenvalues measure the amount of variance along each of those directions.

6.Sorting the Eigenvectors

In this step, we take all the eigenvalues and sort them in decreasing order, i.e., from largest to smallest, and simultaneously sort the corresponding eigenvectors into a matrix P of eigenvectors. The resulting matrix is named P*.

7.Calculating the new features, or Principal Components

Here we calculate the new features. To do this, we multiply the standardized matrix Z by P*. In the resulting matrix Z*, each observation is a linear combination of the original features, and the columns of Z* are uncorrelated with each other.

8.Removing less important features from the new dataset

Now that the new feature set is available, we decide what to keep and what to remove: we keep only the relevant or important features in the new dataset, and the unimportant features are removed.
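A rough NumPy sketch of steps 3 to 8 above; the random data, the choice of k and the variable names are assumptions for illustration only.

# NumPy sketch of the PCA steps described above (assumed toy data).
import numpy as np

X = np.random.rand(100, 5)                 # 100 observations, 5 features

# Step 3: standardize each column (zero mean, unit standard deviation)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 4: covariance matrix of the standardized data
cov = np.cov(Z, rowvar=False)

# Step 5: eigenvalues and eigenvectors of the covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(cov)

# Step 6: sort eigenvectors by eigenvalue in decreasing order -> P*
order = np.argsort(eig_vals)[::-1]
P_star = eig_vecs[:, order]

# Step 7: project the standardized data onto the principal components
Z_star = Z @ P_star

# Step 8: keep only the first k components (k = 2 is an assumed choice)
k = 2
X_reduced = Z_star[:, :k]
print(X_reduced.shape)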
Applications of Principal Component Analysis
o PCA is mainly used as the dimensionality reduction technique in various AI
applications such as computer vision, image compression, etc.
o It can also be used for finding hidden patterns if data has high dimensions.
Some fields where PCA is used are Finance, data mining, Psychology, etc.

Feature Selection Techniques in Machine Learning
Feature selection is a way of selecting the subset of the most relevant features from
the original features set by removing the redundant, irrelevant, or noisy features.

Feature selection is one of the important concepts of machine learning, which highly
impacts the performance of the model. As machine learning works on the concept of
"Garbage In Garbage Out", so we always need to input the most appropriate and
relevant dataset to the model in order to get a better result.

In this topic, we will discuss different feature selection techniques for machine
learning. But before that, let's first understand some basics of feature selection.

o What is Feature Selection?


o Need for Feature Selection
o Feature Selection Methods/Techniques
o Feature Selection statistics

What is Feature Selection?


Each machine learning process depends on feature engineering, which mainly contains two processes: Feature Selection and Feature Extraction. Although feature selection and extraction may have the same objective, they are completely different from each other. The main difference between them is that feature selection is about selecting a subset of the original feature set, whereas feature extraction creates new features. Feature selection is a way of reducing the input variables for the model by using only relevant data in order to reduce overfitting in the model.


So, we can define feature selection as "a process of automatically or manually selecting the subset of most appropriate and relevant features to be used in model building." Feature selection is performed by either including the important features or excluding the irrelevant features of the dataset without changing them.

Need for Feature Selection


Below are some benefits of using feature selection in machine learning:

o It helps in avoiding the curse of dimensionality.


o It helps in the simplification of the model so that it can be easily interpreted by
the researchers.
o It reduces the training time.
o It reduces overfitting and hence enhances generalization.

Feature Selection Techniques


There are mainly two types of Feature Selection techniques, which are:

o Supervised Feature Selection technique


Supervised Feature selection techniques consider the target variable and can be used
for the labelled dataset.
o Unsupervised Feature Selection technique
Unsupervised Feature selection techniques ignore the target variable and can be used
for the unlabelled dataset.
There are mainly three techniques under supervised feature selection:

1. Wrapper Methods
In wrapper methodology, selection of features is done by considering it as a search
problem, in which different combinations are made, evaluated, and compared with
other combinations. It trains the algorithm by using the subset of features iteratively.
On the basis of the output of the model, features are added or subtracted, and with this new feature set the model is trained again.

Some techniques of wrapper methods are:

o Forward selection - Forward selection is an iterative process, which begins with an empty set of features. After each iteration, it keeps adding a feature and evaluates the performance to check whether it is improving the performance or not. The process continues until the addition of a new variable/feature no longer improves the performance of the model.
o Backward elimination - Backward elimination is also an iterative approach, but it is
the opposite of forward selection. This technique begins the process by considering
all the features and removes the least significant feature. This elimination process
continues until removing the features does not improve the performance of the
model.
o Exhaustive Feature Selection - Exhaustive feature selection is one of the best feature selection methods, which evaluates each feature set by brute force. It means this method tries every possible combination of features and returns the best performing feature set.
o Recursive Feature Elimination -
Recursive feature elimination is a recursive greedy optimization approach, where features are selected by recursively taking a smaller and smaller subset of features. An estimator is trained on each set of features, and the importance of each feature is determined using the coef_ attribute or the feature_importances_ attribute, as in the sketch below.
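A hedged sketch of recursive feature elimination with scikit-learn's RFE; the dataset, the estimator and the choice of 10 features are assumptions for illustration.

# Sketch: recursive feature elimination with scikit-learn (assumed dataset).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# The estimator exposes coef_, which RFE uses to drop the weakest feature
# at every iteration until only 10 features remain.
selector = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10)
selector.fit(X, y)
print(selector.support_)   # boolean mask of the selected features
print(selector.ranking_)   # rank 1 = selected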

2. Filter Methods
In filter methods, features are selected on the basis of statistical measures. This method does not depend on the learning algorithm and chooses the features as a pre-processing step.

The filter method filters out the irrelevant features and redundant columns from the model by ranking them with different metrics.

The advantage of using filter methods is that they need little computational time and do not overfit the data.

Some common techniques of filter methods are as follows:

o Information Gain
o Chi-square Test
o Fisher's Score
o Missing Value Ratio

Information Gain: Information gain determines the reduction in entropy when transforming the dataset. It can be used as a feature selection technique by calculating the information gain of each variable with respect to the target variable.

Chi-square Test: The chi-square test is a technique to determine the relationship between categorical variables. The chi-square value is calculated between each feature and the target variable, and the desired number of features with the best chi-square values is selected.

Fisher's Score:

Fisher's score is one of the popular supervised techniques for feature selection. It returns the rank of each variable on Fisher's criterion in descending order. We can then select the variables with a large Fisher's score.

Missing Value Ratio:

The missing value ratio can be used for evaluating a feature set against a threshold value. The ratio is obtained by dividing the number of missing values in each column by the total number of observations. Variables whose ratio is greater than the threshold value can be dropped.
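A small sketch of filter-style selection with scikit-learn, ranking features by the chi-square statistic and by mutual information (a common stand-in for information gain); the dataset and the value of k are assumed.

# Sketch of a filter method: ranking features by statistical scores
# (assumed example dataset; chi2 requires non-negative features).
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Chi-square test between each feature and the target, keep the best 2
chi2_selector = SelectKBest(score_func=chi2, k=2)
X_chi2 = chi2_selector.fit_transform(X, y)
print(chi2_selector.scores_)

# Information-gain-style ranking via mutual information
mi_selector = SelectKBest(score_func=mutual_info_classif, k=2)
mi_selector.fit(X, y)
print(mi_selector.scores_)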

3. Embedded Methods
 Embedded methods combine the advantages of both filter and wrapper methods.
 These are fast processing methods similar to the filter method, but more accurate than the filter method.
These methods are also iterative: they evaluate each training iteration and optimally find the most important features that contribute the most to that iteration. Some techniques of embedded methods are:

o Regularization - Regularization adds a penalty term to the parameters of the machine learning model to avoid overfitting; the penalty term is applied to the coefficients.
o The types of regularization techniques are L1 regularization (Lasso regularization), L2 regularization (Ridge regularization), and Elastic Nets (L1 + L2 regularization).

What is L1 Regularization?

L1 regularization is the preferred choice when there is a high number of features, as it provides sparse solutions. The regression model that uses the L1 regularization technique is called Lasso Regression.

Mathematical Formula for L1 regularization

For instance, we define a simple linear regression model Y with independent variables to understand how L1 regularization works, where W denotes the weights and b the bias:

W = w1, w2, w3, ..., wn

and Ŷ is the predicted result such that

Ŷ = w1x1 + w2x2 + ... + wnxn + b

The function below calculates the error without the regularization term:

Loss = Error(Y, Ŷ)

and the function that calculates the error with the L1 regularization term is

Loss = Error(Y, Ŷ) + λ (|w1| + |w2| + ... + |wn|)

where λ is called the regularization parameter.

What is L2 regularization?

L2 regularization can deal with multicollinearity problems (independent variables that are highly correlated) by constricting the coefficients while keeping all the variables.
L2 regression can be used to estimate the significance of predictors and, based on that, penalize the insignificant predictors.
A regression model that uses the L2 regularization technique is called Ridge Regression.

Mathematical Formula for L2 regularization

For instance, we define a simple linear regression model Y with independent variables to understand how L2 regularization works. For this model, W and b represent the weights and the bias respectively, such that

W = w1, w2, w3, ..., wn

and Ŷ is the predicted result such that

Ŷ = w1x1 + w2x2 + ... + wnxn + b

The function below calculates the error without the regularization term:

Loss = Error(Y, Ŷ)

Putting the L2 penalty into the above equation gives

Loss = Error(Y, Ŷ) + λ (w1² + w2² + ... + wn²)

where λ is the regularization parameter and the wi are the slope coefficients of the line.
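A brief sketch contrasting the two penalties with scikit-learn's Lasso and Ridge; the dataset and the alpha (lambda) value are illustrative assumptions.

# Sketch: Lasso (L1) drives some coefficients exactly to zero, while Ridge (L2)
# only shrinks them (assumed dataset and alpha).
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge

X, y = load_diabetes(return_X_y=True)

lasso = Lasso(alpha=1.0).fit(X, y)     # loss = error + lambda * sum(|w|)
ridge = Ridge(alpha=1.0).fit(X, y)     # loss = error + lambda * sum(w^2)

print("Lasso coefficients:", lasso.coef_)   # sparse: several exact zeros
print("Ridge coefficients:", ridge.coef_)   # shrunk but non-zero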

o Random Forest Importance - Random Forest is a tree-based method that aggregates a number of decision trees. It automatically ranks the nodes by their performance, i.e., the decrease in impurity (Gini impurity) over all the trees. Nodes are arranged by their impurity values, which allows the trees to be pruned below a specific node. The remaining nodes create a subset of the most important features, as sketched below.
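A minimal sketch of ranking features by random forest impurity-based importance; the dataset, the number of trees and the top-5 cut-off are assumptions.

# Sketch: ranking features with random forest feature_importances_ (assumed data).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(data.data, data.target)

# Mean decrease in Gini impurity, aggregated over all trees
importances = forest.feature_importances_
order = np.argsort(importances)[::-1]
for idx in order[:5]:
    print(data.feature_names[idx], round(importances[idx], 3))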

Evaluation and Selection of Models in Machine Learning
Model evaluation is a method of assessing the correctness of models on test data.
The test data consists of data points that have not been seen by the model before.

Model selection is a technique for selecting the best model after the individual
models are evaluated based on the required criteria.

How to evaluate ML models?


Models can be evaluated using multiple metrics. However, the right choice of an evaluation metric is crucial and often depends upon the problem that is being solved.

Evaluation methods for Classification based problems:

1)Classification metrics

For every classification model prediction, a matrix called the confusion matrix can
be constructed which demonstrates the number of test cases correctly and
incorrectly classified.
It looks something like this (considering 1 = Positive and 0 = Negative as the target classes):

              Actual 0               Actual 1
Predicted 0   True Negatives (TN)    False Negatives (FN)
Predicted 1   False Positives (FP)   True Positives (TP)

 TN: Number of negative cases correctly classified


 TP: Number of positive cases correctly classified
 FN: Number of positive cases incorrectly classified as negative
 FP: Number of negative cases incorrectly classified as positive
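A quick sketch of computing these four counts with scikit-learn; the label vectors are made-up toy values.

# Sketch: building the confusion matrix for a toy set of predictions.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)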

2)Accuracy

Accuracy is the simplest metric and can be defined as the number of test cases
correctly classified divided by the total number of test cases.

It can be applied to most generic problems but is not very useful when it comes to
unbalanced datasets.

For instance, if we are detecting frauds in bank data, the ratio of fraud to non-fraud cases can be 1:99. In such cases, if accuracy is used, the model will turn out to be 99% accurate by predicting all test cases as non-fraud. The 99% accurate model will be completely useless.

Therefore, for such a case, a metric is required that can focus on the ten fraud data
points which were completely missed by the model.

3.Precision

Precision is the metric used to identify the correctness of classification. It is defined as

Precision = TP / (TP + FP)

Intuitively, this equation is the ratio of correct positive classifications to the total number of predicted positive classifications. The greater the fraction, the higher is the precision, which means the better is the ability of the model to correctly classify the positive class.
4.Recall

Recall tells us the number of positive cases correctly identified out of the total number of actual positive cases:

Recall = TP / (TP + FN)

Going back to the fraud problem, the recall value will be very useful in fraud cases
because a high recall value will indicate that a lot of fraud cases were identified out
of the total number of frauds.

5.F1 Score

The F1 score is the harmonic mean of precision and recall:

F1 = 2 * (Precision * Recall) / (Precision + Recall)

It is useful in cases where both recall and precision can be valuable – like in the identification of plane parts that might require repairing. Here, precision will be required to save on the company’s cost (because plane parts are extremely expensive), and recall will be required to ensure that the machinery is stable and not a threat to human lives.
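A short sketch computing accuracy, precision, recall and F1 with scikit-learn, reusing the same made-up labels as in the confusion-matrix sketch above (all values are illustrative):

# Sketch: the four classification metrics on toy labels.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1 score :", f1_score(y_true, y_pred))          # harmonic mean of the two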

6.AUC-ROC

The ROC curve is a plot of the true positive rate (recall) against the false positive rate (FP / (FP + TN)). AUC-ROC stands for Area Under the Receiver Operating Characteristic curve, and the higher the area, the better is the model performance.

If the curve is somewhere near the 50% diagonal line, it suggests that the model randomly predicts the output variable.
7.Log Loss

Log loss is a very effective classification metric and is equivalent to -1* log
(likelihood function) where the likelihood function suggests how likely the model
thinks the observed set of outcomes was.

Since the likelihood function provides very small values, a better way to interpret
them is by converting the values to log and the negative is added to reverse the
order of the metric such that a lower loss score suggests a better model.
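A small sketch of both metrics; unlike the metrics above, AUC-ROC and log loss are computed from predicted probabilities, and the probability values below are assumptions.

# Sketch: AUC-ROC and log loss from predicted probabilities (toy values).
from sklearn.metrics import roc_auc_score, log_loss

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.4, 0.8, 0.7, 0.3, 0.2, 0.9, 0.35]

print("AUC-ROC :", roc_auc_score(y_true, y_prob))
print("Log loss:", log_loss(y_true, y_prob))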

8.Gain and Lift Charts

Gain and lift charts are tools that evaluate model performance just like the confusion
matrix but with a subtle, yet significant difference. The confusion matrix determines
the performance of the model on the whole population or the entire test set, whereas
the gain and lift charts evaluate the model on portions of the whole population.
Therefore, we have a score (y-axis) for every % of the population (x-axis).

Lift charts measure the improvement that a model brings in compared to random
predictions. The improvement is referred to as the ‘lift’.

9.K-S Chart

The K-S chart or Kolmogorov-Smirnov chart determines the degree of separation


between two distributions – the positive class distribution and the negative class
distribution. The higher the difference, the better is the model at separating the
positive and negative cases.
Evaluation metrics for Regression-based problems

Regression models provide a continuous output variable, unlike classification models


that have discrete output variables. Therefore, the metrics for assessing the
regression models are accordingly designed.

1.Mean Squared Error or MSE

MSE is a simple metric that calculates the difference between the actual value and the predicted value (error), squares it and then provides the mean of all the errors.

MSE is very sensitive to outliers and will show a very high error value even if a few outliers are present in the otherwise well-fitted model predictions.

2.Root Mean Squared Error or RMSE

RMSE is the root of MSE and is beneficial because it helps to bring down the scale of the errors closer to the actual values, making it more interpretable.

3.Mean Absolute Error or MAE

MAE is the mean of the absolute error values (actuals – predictions).

If one wants to ignore the outlier values to a certain degree, MAE is the choice, since it reduces the penalty of the outliers significantly by removing the square terms.

4.Root Mean Squared Log Error or RMSLE

RMSLE follows the same equation as RMSE, except that a log function is applied to the actual and predicted values:

RMSLE = sqrt( mean( (log(y + 1) - log(x + 1))² ) )

where x is the actual value and y is the predicted value. This helps to scale down the effect of the outliers by downplaying the higher error rates with the log function. Also, RMSLE helps to capture a relative error (by comparing all the error values) through the use of logs.

5.R-Squared

R-squared helps to identify the proportion of variance of the target variable that can be captured with the help of the independent variables or predictors.
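A compact sketch computing MSE, RMSE, MAE, RMSLE and R-squared with scikit-learn; the actual and predicted values are made-up toy numbers.

# Sketch: the common regression metrics on a toy prediction set.
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_squared_log_error, r2_score)

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

mse = mean_squared_error(y_true, y_pred)
print("MSE  :", mse)
print("RMSE :", np.sqrt(mse))
print("MAE  :", mean_absolute_error(y_true, y_pred))
print("RMSLE:", np.sqrt(mean_squared_log_error(y_true, y_pred)))
print("R^2  :", r2_score(y_true, y_pred))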

Evaluation metrics for Clustering

Clustering algorithms predict groups of data points and hence, distance-based metrics are most effective.

1.Dunn Index

The Dunn Index focuses on identifying clusters that have low variance (among all members in the cluster) and are compact. The mean values of the different clusters also need to be far apart:

Dunn Index = min δ(Xi, Xj) / max Δ(Xk)

 δ(Xi, Xj) is the inter-cluster distance, i.e. the distance between clusters Xi and Xj
 Δ(Xk) is the intra-cluster distance of cluster Xk, i.e. the distance within the cluster Xk

However, the disadvantage of the Dunn index is that with a higher number of clusters and more dimensions, the computation cost increases.

2.Silhouette Coefficient

The Silhouette Coefficient tracks how close every point in one cluster is to the points in the other clusters, in the range of -1 to +1:

 Higher silhouette values (closer to +1) indicate that the sample points of two different clusters are far away from each other.
 0 indicates that the points are close to the decision boundary.
 Values closer to -1 suggest that the points have been incorrectly assigned to the cluster.

3.Elbow method

The elbow method is used to determine the number of clusters in a dataset by plotting the number of clusters on the x-axis against the percentage of variance explained on the y-axis. The point on the x-axis where the curve suddenly bends (the elbow) is considered to suggest the optimal number of clusters.
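A brief sketch that prints the within-cluster variance (the quantity plotted for the elbow method) and the silhouette score for several values of k; the synthetic blobs and the range of k are assumptions.

# Sketch: silhouette score and elbow-style inertia for k-means (synthetic data).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k,
          km.inertia_,                      # within-cluster variance (elbow plot y-axis)
          silhouette_score(X, km.labels_))  # closer to +1 is better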

Types of model selection


Resampling methods

Resampling methods, as the name suggests, are simple techniques of rearranging


data samples to inspect if the model performs well on data samples that it has not
been trained on. In other words, resampling helps us understand if the model will
generalize well.

1)Random Split

Random Splits are used to randomly sample a percentage of data into training,
testing, and preferably validation sets. The advantage of this method is that there is
a good chance that the original population is well represented in all the three sets. In
more formal terms, random splitting will prevent a biased sampling of data.
2)Time-Based Split

There are some types of data where random splits are not possible. For
example, if we have to train a model for weather forecasting, we cannot randomly
divide the data into training and testing sets. This will jumble up the seasonal pattern!
Such data is often referred to by the term – Time Series.

In such cases, a time-wise split is used. The training set can have data for the last
three years and 10 months of the present year. The last two months can be reserved
for the testing or validation set.

3)K-Fold Cross-Validation

The cross-validation technique works by randomly shuffling the dataset and then
splitting it into k groups. Thereafter, on iterating over each group, the group needs
to be considered as a test set while all other groups are clubbed together into the
training set. The model is tested on the test group and the process continues for k
groups.

Thus, by the end of the process, one has k different results on k different test groups.
The best model can then be selected easily by choosing the one with the highest
score.
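A minimal sketch of k-fold cross-validation with scikit-learn; the dataset, the model and k = 5 are assumptions for illustration.

# Sketch: 5-fold cross-validation (assumed dataset and model).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores, scores.mean())   # one score per fold, then the average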

4.Bootstrap

Bootstrap is one of the most powerful ways to obtain a stabilized model. It is close to
the random splitting technique since it follows the concept of random sampling.

The first step is to select a sample size (which is usually equal to the size of the
original dataset). Thereafter, a sample data point must be randomly selected from
the original dataset and added to the bootstrap sample. After the addition, the
sample needs to be put back into the original sample. This process needs to be
repeated for N times, where N is the sample size.

Therefore, it is a resampling technique that creates the bootstrap sample by


sampling data points from the original dataset with replacement. This means that
the bootstrap sample can contain multiple instances of the same data point.

The model is trained on the bootstrap sample and then evaluated on all those data
points that did not make it to the bootstrapped sample. These are called the out-of-
bag samples.
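A small NumPy sketch of drawing one bootstrap sample with replacement and identifying its out-of-bag points; the dataset size N is a toy assumption.

# Sketch: one bootstrap sample and its out-of-bag points (toy data).
import numpy as np

rng = np.random.RandomState(0)
N = 10
data = np.arange(N)                                # stand-in for a dataset of N rows

boot_idx = rng.choice(N, size=N, replace=True)     # sampling with replacement
oob_idx = np.setdiff1d(np.arange(N), boot_idx)     # rows never drawn

print("Bootstrap sample :", data[boot_idx])        # may repeat the same point
print("Out-of-bag points:", data[oob_idx])         # used for evaluation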

Probabilistic measures
Probabilistic Measures do not just take into account the model performance but
also the model complexity. Model complexity is the measure of the model’s ability
to capture the variance in the data.
For example, a highly biased model like the linear regression algorithm is less
complex and on the other hand, a neural network is very high on complexity.

Another important point to note here is that the model performance taken into account in probabilistic measures is calculated from the training set only; a hold-out test set is typically not required.

A fair disadvantage, however, lies in the fact that probabilistic measures do not consider the uncertainty of the models and have a chance of selecting simpler models over complex models.

1)Akaike Information Criterion (AIC)

It is common knowledge that no model is completely accurate: there is always some information loss, which can be measured using the KL information metric. Kullback–Leibler (KL) divergence is the measure of the difference in the probability distribution of two variables. AIC is built from this idea and is computed from:

 K = number of independent variables or predictors
 L = maximum likelihood of the model
 N = number of data points in the training set (especially helpful in the case of small datasets)

The limitation of AIC is that it is not very good with generalizing models as it tends to
select complex models that lose less training information.

2)Bayesian Information Criterion (BIC)

BIC was derived from the Bayesian probability concept and is suited for models that are trained under maximum likelihood estimation. It is computed from:

 K = number of independent variables
 L = maximum likelihood
 N = number of samples/data points in the training set

BIC penalizes the model for its complexity and is preferably used when the size of
the dataset is not very small (otherwise it tends to settle on very simple models).
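The formulas themselves are not reproduced in the text above, so the sketch below uses the standard textbook forms AIC = 2K - 2 ln(L) and BIC = K ln(N) - 2 ln(L); the log-likelihood values and parameter counts are made-up for comparison.

# Sketch: computing AIC and BIC from a model's maximized log-likelihood
# (standard textbook forms; toy numbers only).
import numpy as np

def aic(log_likelihood, k):
    """k = number of estimated parameters, log_likelihood = max log-likelihood."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """n = number of data points in the training set."""
    return k * np.log(n) - 2 * log_likelihood

# Toy comparison: a simpler model with slightly worse fit vs. a complex one
print(aic(-120.0, 3), aic(-115.0, 10))
print(bic(-120.0, 3, 200), bic(-115.0, 10, 200))   # lower values are preferred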

3)Minimum Description Length (MDL)

MDL is derived from information theory, which deals with quantities such as entropy that measure the average number of bits required to represent an event from a probability distribution or a random variable. MDL, or the minimum description length, is the minimum number of such bits required to represent the model:

MDL = L(h) + L(D | h)

where
 h = the model
 D = the predictions made by the model
 L(h) = number of bits required to represent the model
 L(D | h) = number of bits required to represent the predictions from the model
