
Feature Selection

Outline
• Introduction
• What is Feature selection?
• Is Feature Selection required?
• Motivation for Feature Selection.
• Relevance of Features
• Variable Ranking
• Feature Subset Selection
Introduction
• The volume of data is practically exploding by the day. Not only this, the data that is available now is becoming increasingly unstructured.
• A universal problem of intelligent (learning) agents is where to focus their attention.
• It is critical to understand “Which aspects of the problem at hand are important/necessary to solve it?”
– i.e. discriminate between the relevant and irrelevant parts
of experience.
What is Feature selection?
(or Variable Selection)
• Problem of selecting some subset of a learning
algorithm’s input variables upon which it should
focus attention, while ignoring the rest.
• In other words, Dimensionality Reduction. As humans, we do this constantly!
What is Feature selection?
(or Variable Selection)
• Given a set of features F = { f1, …, fi, …, fn }, the Feature Selection problem is to find a subset F′ ⊆ F that “maximizes the learner’s ability to classify patterns”.
• Formally, F′ should maximize some scoring function, as written out below.
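A standard way to write this objective (a common formalization, not taken verbatim from the original slides) is

F' = \arg\max_{G \subseteq F} S(G)

where S(·) is the chosen scoring function, estimated from the training data.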
Is Feature Selection required?
Two Thoughts
Motivation for Feature Selection.
• Especially when dealing with a large number of variables
there is a need for Dimensionality Reduction.
• Feature Selection can significantly improve a learning
algorithm’s performance.
• The Curse of Dimensionality
Feature Selection — Optimality?
• In theory, the goal is to find an optimal feature-
subset (one that maximizes the scoring function).
• In real world applications this is usually not
possible.
– For most problems it is computationally intractable to
search the whole space of possible feature subsets.
– One usually has to settle for approximations of the
optimal subset.
– Most of the research in this area is devoted to finding
efficient search-heuristics.
Relevance of Features
• There are several definitions of relevance in
literature.
– Relevance of one variable, relevance of a variable given other variables, relevance given a certain learning algorithm, ...
– Most definitions are problematic, because there are
problems where all features would be declared to be
irrelevant
– This can be defined through two degrees of
relevance: weak and strong relevance.
• A feature is relevant iff it is weakly or strongly relevant, and irrelevant (redundant) otherwise.
Relevance of Features
• Strong Relevance of a variable/feature:
– Let Si = {f1, …, fi−1, fi+1, …, fn} be the set of all features except fi. Denote by si a value-assignment to all features in Si.
– A feature fi is strongly relevant iff there exist some xi, y and si for which p(fi = xi, Si = si) > 0 such that
p(Y = y | fi = xi, Si = si) ≠ p(Y = y | Si = si)
– This means that removal of fi alone will always result
in a performance deterioration of an optimal Bayes
classifier.
Relevance of Features
• Weak Relevance of a variable/feature:
– A feature fi is weakly relevant iff it is not strongly relevant, and there exists a subset of features Si′ of Si for which there exist some xi, y and si′ with p(fi = xi, Si′ = si′) > 0 such that
p(Y = y | fi = xi, Si′ = si′) ≠ p(Y = y | Si′ = si′)
– This means that there exists a subset of features Si′ such that the performance of an optimal Bayes classifier on Si′ is worse than on Si′ ∪ { fi }.
Variable Ranking
• Variable Ranking is the process of ordering the
features by the value of some scoring function,
which usually measures feature-relevance.

• Resulting set: the features ordered (ranked) by their scores S(fi).
• The score S(fi) is computed from the training data, measuring some criterion of relevance for feature fi. By convention, a high score is indicative of a valuable (relevant) feature.
Variable Ranking
• A simple method for feature selection using variable
ranking is to select the k highest ranked features
according to S.

• This is usually not optimal, but often preferable to other, more complicated methods.
• It is computationally efficient: only the calculation and sorting of n scores is required (see the sketch below).
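As an illustration, here is a minimal sketch (assuming scikit-learn is available) of ranking features and keeping the k highest-scoring ones, using the ANOVA F-statistic as the scoring function S; the synthetic data set and the value of k are made up for the example:

# Select the k highest-ranked features according to a univariate score S.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 20 features, only 5 of them informative.
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, n_redundant=2,
                           random_state=0)

selector = SelectKBest(score_func=f_classif, k=5)   # S = ANOVA F-score
X_top_k = selector.fit_transform(X, y)

ranking = np.argsort(selector.scores_)[::-1]        # features ordered by score
print("Feature ranking (best first):", ranking)
print("Selected feature indices:", selector.get_support(indices=True))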
Ranking Criteria
• Correlation criteria
• Information-theoretic criteria
Ranking Criteria poses some questions
• Can variables with small score be automatically discarded?
• The answer is NO!
• Even variables with a small score can improve class separability.
• In the slide’s two-variable illustration, whether this helps depends on the correlation between x1 and x2.
Ranking Criteria poses some questions
• Can a useless variable (i.e. one with a small score) be useful
together with others?
• The answer is YES!
• The correlation between individual variables and the target is not enough to assess relevance.
• The correlation / co-variance between pairs of variables has to
be considered too (potentially difficult).
• Also, the diversity of features needs to be considered.
Ranking Criteria poses some questions
• Can two variables that are useless by themselves be useful together?
• The answer is YES!
• Such interactions can be detected using information-theoretic criteria.
• Mutual information can also detect non-linear dependencies among variables, but it is harder to estimate than correlation.
• It is a measure of “how much information (in terms of entropy) two random variables share” (see the sketch below).
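A minimal sketch of the XOR case (a standard illustration, not taken from the original slides), assuming NumPy and scikit-learn: each variable alone carries essentially no information about the target (near-zero correlation and near-zero single-variable mutual information), but the pair determines the target exactly:

# Two variables that are useless by themselves can be useful together (XOR).
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, size=2000)
x2 = rng.integers(0, 2, size=2000)
y = x1 ^ x2                                   # target = XOR of the two variables

X = np.column_stack([x1, x2])

# Individually, each variable tells us (almost) nothing about y ...
print("corr(x1, y):", np.corrcoef(x1, y)[0, 1])
print("corr(x2, y):", np.corrcoef(x2, y)[0, 1])
print("MI of each variable with y:",
      mutual_info_classif(X, y, discrete_features=True, random_state=0))

# ... but together they determine y completely.
joint = (2 * x1 + x2).reshape(-1, 1)          # encode the pair as one variable
print("MI of the pair with y:",
      mutual_info_classif(joint, y, discrete_features=True, random_state=0))
print("CV accuracy using both features:",
      cross_val_score(DecisionTreeClassifier(), X, y, cv=5).mean())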
Variable Ranking
Single Variable Classifiers
• Idea: Select variables according to their individual
predictive power
• Criterion: Performance of a classifier built with 1
variable e.g. the value of the variable itself
• The predictive power is usually measured in terms of error rate (or criteria based on the false positive and false negative rates).
• Also, a combination of single-variable classifiers (SVCs) can be deployed using ensemble methods (boosting, ...). A single-variable ranking sketch follows below.
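A minimal sketch of single-variable-classifier ranking, assuming scikit-learn; the use of decision stumps and ROC-AUC as the metric is an illustrative choice, not prescribed by the slides:

# Rank features by the performance of a classifier trained on each feature alone.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=4, random_state=0)

scores = []
for j in range(X.shape[1]):
    stump = DecisionTreeClassifier(max_depth=1)        # one-variable classifier
    auc = cross_val_score(stump, X[:, [j]], y, cv=5,
                          scoring="roc_auc").mean()
    scores.append(auc)

ranking = np.argsort(scores)[::-1]                     # best single features first
print("Single-variable AUC scores:", np.round(scores, 3))
print("Feature ranking:", ranking)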
Feature Subset Selection
The Goal of Feature Subset Selection is to find the optimal
feature subset. Feature Subset Selection Methods can be
classified into three broad categories.
– Filter Methods
– Wrapper Methods
– Embedded Methods
For Feature Subset Selection we need:
– A measure for assessing the goodness of a feature subset
(scoring function)
– A strategy to search the space of possible feature subsets
– Finding a minimal optimal feature set for an arbitrary target concept is hard; good search heuristics are needed.
Filter Methods
Feature Subset Selection
• Filter Methods:
– Filter methods select features from a dataset independently of any machine learning algorithm.
– These methods rely only on the characteristics of the variables, so features are filtered out of the data before learning begins.
– These methods are powerful and simple and help to
quickly remove features.
– These are generally the first step in any feature
selection pipeline.
Feature Subset Selection
• Advantages of Filter Methods:
– Selected features can be used in any machine learning algorithm.
– They’re computationally inexpensive: you can process thousands of features in a matter of seconds.
– Filter methods are very good for eliminating
irrelevant, redundant, constant, duplicated, and
correlated features.
Feature Subset Selection
• Filter methods are of two types:
– Univariate
– Multivariate
Feature Subset Selection
Univariate filter methods
• They evaluate and rank a single feature according to certain
criteria.
• They treat each feature individually and independently of
the feature space.
• This is how it functions in practice:
– It ranks features according to certain criteria.
– Then select the highest ranking features according to those
criteria.
• One problem that can occur with univariate methods is that they may select redundant variables, as they don’t take the relationships between features into consideration.
Feature Subset Selection
• Multivariate filter methods, on the other hand, evaluate
the entire feature space.
• They take into account features in relation to other ones
in the dataset.
• These methods are able to handle duplicated, redundant,
and correlated features.
Feature Subset Selection
• Basic Filter Methods:
– Constant features, which show a single value in all the observations in the dataset. These features provide no information that allows ML models to predict the target.
– Quasi-constant features, in which a single value occupies the majority of the records.
– Duplicated features, which are self-explanatory: the same feature appears more than once (a removal sketch follows below).
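A minimal sketch of these basic filters with pandas; the basic_filters helper and the quasi-constant threshold are illustrative assumptions:

# Remove constant, quasi-constant, and duplicated features from a DataFrame.
import pandas as pd

def basic_filters(df: pd.DataFrame, quasi_constant_threshold: float = 0.99) -> pd.DataFrame:
    # Constant features: a single unique value across all observations.
    constant = [c for c in df.columns if df[c].nunique() == 1]
    df = df.drop(columns=constant)

    # Quasi-constant features: one value occupies most of the records.
    quasi = [c for c in df.columns
             if df[c].value_counts(normalize=True).iloc[0] >= quasi_constant_threshold]
    df = df.drop(columns=quasi)

    # Duplicated features: identical columns (transpose, then use duplicated()).
    duplicated = df.columns[df.T.duplicated()].tolist()
    return df.drop(columns=duplicated)

df = pd.DataFrame({"a": [1, 1, 1, 1],          # constant
                   "b": [0, 0, 0, 1],          # quasi-constant (75% zeros)
                   "c": [1, 2, 3, 4],
                   "d": [1, 2, 3, 4]})         # duplicate of c
print(basic_filters(df, quasi_constant_threshold=0.75).columns.tolist())  # ['c']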
Correlation Filter Methods
• Correlation is defined as a measure of the linear relationship
between two quantitative variables, like height and weight. You could also define correlation as a measure of how strongly one variable depends on another.
• A high correlation is often a useful property—if two variables are
highly correlated, we can predict one from the other.
• Therefore, we generally look for features that are highly
correlated with the target, especially for linear machine learning
models.
• However, if two variables are highly correlated with each other, they provide redundant information with regard to the target. Essentially, we can make an accurate prediction of the target with just one of the redundant variables (see the sketch below).
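A minimal sketch of this idea as a multivariate filter, assuming pandas and NumPy; the drop_correlated helper and the 0.9 threshold are illustrative assumptions:

# Drop one feature from each highly correlated pair, keeping the first one seen.
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] >= threshold).any()]
    return df.drop(columns=to_drop)

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({"x": x,
                   "x_copy": x + rng.normal(scale=0.01, size=200),  # near-duplicate of x
                   "z": rng.normal(size=200)})
print(drop_correlated(df).columns.tolist())   # ['x', 'z']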
Correlation Filter Methods
• There are a number of methods to measure the
correlation between variables.
• Pearson correlation coefficient: it’s used to summarize the strength of the linear relationship between two variables, and can vary between +1 and −1:
– 1 means a positive correlation: the values of one variable
increase as the values of another increase.
– -1 means a negative correlation: the values of one variable
decrease as the values of another increase.
– 0 means no linear correlation between the two variables.
Correlation Filter Methods
• The assumptions of the Pearson correlation coefficient:
– Both variables should be normally distributed.
– There is a straight-line relationship between the two variables.
– Data is equally distributed around the regression line.
• The formula for the Pearson correlation coefficient between variables x and y is:
r = Σ (xi − x̄)(yi − ȳ) / √( Σ (xi − x̄)² · Σ (yi − ȳ)² )
• Sometimes two variables can be related in a nonlinear relationship, which can be stronger or weaker across the distribution of the variables.
Correlation Filter Methods
• Spearman’s rank correlation coefficient is a non-parametric
test that’s used to measure the degree of association
between two variables with a monotonic function (an increasing or decreasing relationship).
• The measured strength between the variables using Spearman’s correlation varies between +1 and −1.
• Spearman’s coefficient is suitable for both continuous and
discrete ordinal variables.
• The Spearman’s rank correlation test doesn’t carry any
assumptions about the distribution of the data.
Correlation Filter Methods
• Kendall’s rank correlation coefficient is a non-parametric
test that measures the strength of the ordinal association
between two variables.
• It calculates a normalized score for the number of
matching or concordant rankings between the two data
samples.
• Kendall’s correlation varies between 1 (high) and -1 (low).
• This type of correlation is best suited for discrete data.

• τ = (C − D) / (n(n − 1)/2), where
C = number of concordant pairs (ordered in the same way)
D = number of discordant pairs (ordered differently).
Correlation Filter Methods
• Concordant: Ordered in the same way (consistency).
– A pair of observations is considered concordant if (x2 − x1) and (y2 − y1) have the same sign.
• Discordant: Ordered differently (inconsistency).
– A pair of observations is considered discordant if (x2 − x1) and (y2 − y1) have opposite signs (a sketch computing all three coefficients follows below).
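A minimal sketch that computes the three coefficients discussed above, assuming pandas (whose DataFrame.corr supports all three methods) and SciPy:

# Compute Pearson, Spearman, and Kendall correlations for the same data.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = x ** 3 + rng.normal(scale=0.1, size=100)   # monotonic but non-linear relation

df = pd.DataFrame({"x": x, "y": y})
for method in ("pearson", "spearman", "kendall"):
    print(method, round(df.corr(method=method).loc["x", "y"], 3))

# Equivalent per-pair calls from SciPy:
print("pearsonr  :", stats.pearsonr(x, y)[0])
print("spearmanr :", stats.spearmanr(x, y).correlation)
print("kendalltau:", stats.kendalltau(x, y).correlation)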
Correlation Filter Methods
• Calculate correlation coefficient for the following data
Filter Methods
• Filter methods tend to select features independently and work with (essentially) any machine learning algorithm.
• These methods tend to ignore the effect of the
selected feature subset on the performance of the
algorithm.
• In addition, filter methods often evaluate features
individually. In that case, some variables can be useless
for prediction in isolation, but they can be quite useful
when combined with other variables.
• To prevent those issues, wrapper methods join the
party in selecting the best feature subsets.
Wrapper Methods
Wrapper Methods
• Wrapper methods work by evaluating subsets of features with a machine learning algorithm: a search strategy moves through the space of possible feature subsets, and each subset is evaluated by the quality of the performance of a given algorithm trained on it.
• These methods are called greedy algorithms because they aim to find the combination of features that results in the best-performing model, which is computationally expensive and often impractical in the case of exhaustive search.
Wrapper Methods
• Practically any combination of a search strategy and a machine learning algorithm can be used as a wrapper.
• Wrapper Methods: Advantages
– They detect the interaction between variables.
– They find the optimal feature subset for the desired machine learning algorithm.
• The wrapper methods usually result in better predictive accuracy than filter methods.
Wrapper Methods: Process
• Search for a subset of features: using a search method, we select a subset of features from the available ones.
• Build a machine learning model: a chosen ML algorithm is trained on the previously selected subset of features.
• Evaluate model performance: we evaluate the newly trained ML model with a chosen metric.
• Repeat: the whole process starts again with a new subset of features, a new ML model trained, and so on.
Stopping Criteria
• At some point in time, we need to stop searching for a
subset of features.
• To do this, we have to put in place some pre-set
criteria.
• These criteria need to be defined by the machine
learning engineer.
• Here are a couple of examples of these criteria:
– Model performance decreases.
– Model performance increases.
– A predefined number of features is reached.
• The pre-set criteria can, for example, be based on metrics like ROC-AUC for classification or RMSE for regression.
Search methods
• Forward Feature Selection: this method starts with no features and adds one at a time.
• Backward Feature Elimination: this method starts with all features present and removes one feature at a time.
• Exhaustive Feature Selection: this method tries all possible feature combinations.
• Bidirectional Search: this last one does both forward and backward feature selection simultaneously in order to get one unique solution.
Forward Feature Selection
• Forward feature selection, or sequential forward feature selection (SFS), is an iterative method in which we start by evaluating all features individually and then select the one that results in the best performance.
• In the next step, it tests all possible combinations of the selected feature with the remaining features and retains the pair that produces the best algorithmic performance.
• The loop continues, adding one feature at a time in each iteration, until the pre-set criterion is reached (see the sketch below).
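A minimal sketch using scikit-learn's SequentialFeatureSelector; the estimator, metric, and number of features to select are illustrative assumptions:

# Sequential forward selection (SFS) wrapped around a logistic regression model.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=15,
                           n_informative=5, random_state=0)

sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=5,    # stopping criterion
                                direction="forward",
                                scoring="roc_auc", cv=5)
sfs.fit(X, y)
print("Selected feature indices:", sfs.get_support(indices=True))
X_selected = sfs.transform(X)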
Backward Feature Elimination
• In backward feature elimination, or sequential backward feature selection (SBS), we start with all the features in the dataset and evaluate the performance of the algorithm.
• Then, at each iteration, backward feature elimination removes the one feature whose removal produces the best-performing algorithm according to an evaluation metric.
• This feature can also be described as the least significant feature among the remaining ones.
• The process continues, removing feature after feature, until a certain criterion is satisfied (see the sketch below).
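The same scikit-learn selector can run in the backward direction; a minimal sketch under the same illustrative assumptions as before:

# Sequential backward selection (SBS): start from all features, remove one at a time.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=15,
                           n_informative=5, random_state=0)

sbs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=5,
                                direction="backward",      # remove features one by one
                                scoring="roc_auc", cv=5)
sbs.fit(X, y)
print("Remaining feature indices:", sbs.get_support(indices=True))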
Exhaustive Feature Selection
• It searches across all possible feature combinations. Its aim
is to find the best performing feature subset.
• It creates all the subsets of features from 1 to N,
with N being the total number of features, and for each
subset, it builds a machine learning algorithm and selects
the subset with the best performance.
• The parameters that you can adjust here are the minimum and the maximum number of features to consider (the 1 and N above).
• That way, we can reduce this method’s computation time if we choose reasonable values for these parameters (see the sketch below).
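A minimal brute-force sketch with itertools, kept deliberately small because the number of candidate subsets grows combinatorially; min_k and max_k play the role of the minimum and maximum number of features:

# Exhaustive feature selection: score every subset of size min_k..max_k.
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8,
                           n_informative=3, random_state=0)

min_k, max_k = 1, 3                      # bounds on the subset size
best_score, best_subset = -1.0, None
for k in range(min_k, max_k + 1):
    for subset in combinations(range(X.shape[1]), k):
        score = cross_val_score(LogisticRegression(max_iter=1000),
                                X[:, list(subset)], y, cv=5).mean()
        if score > best_score:
            best_score, best_subset = score, subset

print("Best subset:", best_subset, "CV accuracy:", round(best_score, 3))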
Bidirectional Feature Selection
• Bidirectional search begins the search in both directions, performing SFS and SBS concurrently.
• The two searches stop in two cases:
– (1) when one search finds the best subset comprised of m features before it reaches the exact middle of the search space, or
– (2) when both searches reach the middle of the search space.
• Bidirectional search takes advantage of both SFS and SBS.
• But this can lead to an issue of converging to a different
solution. To avoid this and to guarantee SFS and SBS converge
to the same solution, we make the following constraints:
– Features already selected by SFS are not removed by SBS.
– Features already removed by SBS are not added by SFS.
Embedded Methods
Embedded Methods
• Wrapper methods provide a good way to ensure that the
selected features are the best for a specific machine
learning model.
• These methods will provide better results in terms of
performance, but they’ll also cost us a lot of computation
time/resources.
• But what if we could include the feature selection
process in ML model training itself? That could lead us to
even better features for that model, in a shorter amount
of time. This is where embedded methods come into
play.
Embedded Methods
• Embedded methods complete the feature selection
process within the construction of the machine
learning algorithm itself.

• A learning algorithm takes advantage of its own variable selection process and performs feature selection and classification/regression at the same time.
Embedded Methods: Advantages
• They take into consideration the interaction of features
like wrapper methods do.
• They are fast, like filter methods.
• They are more accurate than filter methods.
• They find the feature subset for the algorithm being
trained.
• They are much less prone to overfitting.
Embedded Methods: Process
• First, these methods train a machine learning model.
• Then they derive feature importance from that model, which is a measure of how important each feature is when making a prediction.
• Finally, they remove non-important features using the derived feature importance.
A few embedded methods for feature selection
• Regularization in machine learning adds a penalty to the different parameters of a model to reduce its freedom.
• This penalty is applied to the coefficient that
multiplies each of the features in the linear model,
and is done to avoid overfitting, make the model
robust to noise, and to improve its generalization.
Types of Regularization
• L1 regularization shrinks some of the coefficients to zero, indicating that certain predictors or features will be multiplied by zero when estimating the target. Thus, they won’t be added to the final prediction of the target; this means that these features can be removed because they aren’t contributing to the final prediction (a selection sketch follows below).
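A minimal sketch of L1-based selection with scikit-learn's SelectFromModel; the L1-penalized logistic regression and the value of C are illustrative assumptions:

# Embedded selection with an L1-penalized logistic regression:
# features whose coefficients are driven to zero are discarded.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(l1_model).fit(X, y)

coefs = selector.estimator_.coef_.ravel()
print("Non-zero coefficients:", np.sum(coefs != 0))
print("Selected feature indices:", selector.get_support(indices=True))
X_selected = selector.transform(X)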
Types of Regularization
• L2 regularization, on the other hand, doesn’t set the coefficients to zero; it only shrinks them toward zero. That’s why we use only L1 for feature selection.
• L1/L2 regularization (the elastic net) is a combination of the L1 and L2 penalties. Because it incorporates the L1 penalty, we can still end up with features whose coefficient is zero, similar to L1.
Tree-based Feature Importance
• Tree-based algorithms and models (e.g., random forests) are well-established algorithms that not only offer good predictive performance but can also provide us with what we call feature importance as a way to select features.
• Feature importance
– Feature importance tells us which variables are more important in
making accurate predictions on the target variable/class. In other
words, it identifies which features are the most used by the
machine learning algorithm in order to predict the target.
• Random forests provide us with feature importance through two straightforward measures: mean decrease in impurity and mean decrease in accuracy (a sketch follows below).
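A minimal sketch, assuming scikit-learn: feature_importances_ exposes the impurity-based importances, permutation_importance approximates the mean-decrease-accuracy view, and SelectFromModel keeps features above a threshold (the "median" threshold is an illustrative choice):

# Tree-based feature importance with a random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, n_features=15,
                           n_informative=5, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Mean decrease in impurity (built-in) and mean decrease in accuracy (permutation).
print("Impurity-based importances:", forest.feature_importances_.round(3))
perm = permutation_importance(forest, X, y, n_repeats=10, random_state=0)
print("Permutation importances:  ", perm.importances_mean.round(3))

# Keep only the features whose impurity importance exceeds the median importance.
selector = SelectFromModel(RandomForestClassifier(n_estimators=200, random_state=0),
                           threshold="median").fit(X, y)
print("Selected feature indices:", selector.get_support(indices=True))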
