Feature Selection
1. Overview
2. Perspectives
3. Aspects
4. Most Representative Methods
5. Related and Advanced Topics
6. Experimental Comparative Analyses
Overview
• Why we need FS:
1. to improve performance (in terms of speed, predictive power,
simplicity of the model).
2. to visualize the data for model selection.
3. to reduce dimensionality and remove noise.
Perspectives:
Search of a Subset of Features
• Search Directions:
• Sequential Backward Generation (SBG): It starts with the full set of features and, iteratively,
removes them one at a time. Here, the criterion must point out the worst or least
important feature. By the end, the subset is composed of a single feature, which is
considered to be the most informative of the whole set. As in the forward case (SFG),
different stopping criteria can be used (a code sketch of SFG and SBG follows this list).
• Bidirectional Generation (BG): It begins the search in both directions, performing SFG and SBG
concurrently. The searches stop in two cases: (1) when one search finds the best subset comprised
of m features before it reaches the exact middle, or (2) when both searches reach the middle of the
search space. It takes advantage of both SFG and SBG.
• Random Generation (RG): It starts the search in a random direction. The choice of adding or
removing a feature is a random decision. RG tries to avoid stagnation in local optima
by not following a fixed path for subset generation. Unlike SFG or SBG, the size of the subset
of features cannot be stipulated.
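A minimal sketch of the greedy forward and backward generation schemes described above. The evaluation function score(subset) is a hypothetical placeholder for any of the selection criteria discussed later; it returns the quality of a candidate feature subset.

def sequential_forward_generation(features, score, m):
    # Greedy SFG: start empty and add the best feature at each step until m are selected.
    selected, remaining = [], list(features)
    while remaining and len(selected) < m:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

def sequential_backward_generation(features, score, m):
    # Greedy SBG: start with the full set and drop, at each step, the feature whose
    # removal leaves the highest-scoring subset (i.e., the least important one).
    selected = list(features)
    while len(selected) > m:
        least_important = max(selected, key=lambda f: score([g for g in selected if g != f]))
        selected.remove(least_important)
    return selected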
Perspectives:
Search of a Subset of Features
• Search Strategies:
• Exhaustive Search: It corresponds to exploring all possible subsets to find the optimal ones;
the space complexity is O(2^M). If we establish a threshold m of minimum features to be
selected and fix the direction of search, the search space becomes smaller, independently of
forward or backward generation. Only exhaustive search can guarantee optimality. Nevertheless,
it is impractical in real data sets with a high M (a sketch follows these bullets).
• Heuristic Search: It employs heuristics to carry out the search. Thus, it avoids brute-force
search, but it may find a non-optimal subset of features. It draws a path connecting the
beginning and the end of the search space (see the previous figure), in the manner of a
depth-first search. The maximum length of this path is M and the number of subsets generated
is O(M). The choice of the heuristic is crucial to finding a near-optimal subset of features quickly.
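For illustration, a brute-force exhaustive search over all non-empty feature subsets, again assuming the hypothetical score(subset) evaluation function; the number of candidates grows as O(2^M), which is why this is impractical for large M.

from itertools import combinations

def exhaustive_search(features, score):
    # Evaluate every non-empty subset and keep the best one found.
    best_subset, best_score = None, float("-inf")
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            s = score(list(subset))
            if s > best_score:
                best_subset, best_score = list(subset), s
    return best_subset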
Brute force
A brute force algorithm is a simple and straightforward approach that solves a problem by trying every
possible solution until it finds the best one. It does not use any clever tricks or shortcuts to reduce
the search space or improve efficiency.
Example
Brute force algorithms can be used to solve certain types of problems, such as searching for an
element in a list or array, sorting a list or array, calculating the factorial of a number, or calculating
the nth term of the Fibonacci sequence.
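As a trivial illustration (not from the original slides), a brute-force linear search simply checks every position until it finds the target.

def linear_search(items, target):
    # Brute force: inspect every position in order until the target is found.
    for i, value in enumerate(items):
        if value == target:
            return i
    return -1  # target not present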
Advantages of brute force algorithms
One of the main advantages of brute force algorithms is that they are easy to understand and
implement. You do not need deep knowledge of the problem domain or complex data structures or
techniques; you simply follow a logical and systematic process to check every possible solution.
Disadvantages of brute force algorithms
One of the main disadvantages of brute force algorithms is that they are very inefficient and time-
consuming. They can consume a lot of computational resources, such as memory, CPU, or network
bandwidth, depending on the size and complexity of the problem.
Perspectives:
Search of a Subset of Features
• Search Strategies:
• Nondeterministic Search: A complementary combination of the previous two.
It is also known as the random search strategy; it constantly generates new subsets
and keeps improving the quality of the selected features as time goes
by. In each step, the next subset is obtained at random (sketched after these bullets).
• It is unnecessary to wait until the search ends.
• We do not know when the optimal set is obtained, although we know which subset is better
than the previous one and which one is the best so far.
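A minimal sketch of such a nondeterministic search that keeps the best subset seen so far (anytime behaviour), again assuming the hypothetical score(subset) function.

import random

def random_search(features, score, n_iterations=1000, seed=0):
    # Sample random subsets and remember the best one found so far.
    rng = random.Random(seed)
    best_subset, best_score = None, float("-inf")
    for _ in range(n_iterations):
        k = rng.randint(1, len(features))        # random subset size
        subset = rng.sample(list(features), k)   # random subset of that size
        s = score(subset)
        if s > best_score:                       # the current best is always available
            best_subset, best_score = subset, s
    return best_subset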
Perspectives:
Selection Criteria
• Information Measures.
• Information serves to measure the uncertainty of the receiver when she/he receives a
message.
• Shannon’s Entropy: H(X) = - Σ_x p(x) log2 p(x).
• Information gain: IG(X | A) = H(X) - Σ_v p(A = v) H(X | A = v), i.e., the reduction in the
entropy of the class after observing feature A.
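A small sketch (assuming discrete feature values and class labels) of computing Shannon’s entropy and the information gain of a single feature, which a filter method could use to rank features.

from collections import Counter
from math import log2

def entropy(labels):
    # H(X) = -sum p(x) log2 p(x) over the empirical label distribution.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    # IG = H(labels) - sum_v p(v) * H(labels | feature = v).
    n = len(labels)
    conditional = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        conditional += (len(subset) / n) * entropy(subset)
    return entropy(labels) - conditional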
Perspectives:
Selection Criteria
• Distance Measures.
• Also known as measures of separability, discrimination or divergence. The most typical is
derived from the distance between the class-conditional density functions.
Perspectives:
Selection Criteria
• Dependence Measures.
• Also known as measures of association or correlation.
• Their main goal is to quantify how strongly two variables are correlated or present some
association with each other, in such a way that, knowing the value of one of them, we can
derive the value of the other.
• Pearson correlation coefficient:
r = Σ_i (x_i - mean(x))(y_i - mean(y)) / sqrt( Σ_i (x_i - mean(x))² · Σ_i (y_i - mean(y))² ).
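A short sketch of a dependence-based filter that ranks features by the absolute Pearson correlation with the target; the variable names are illustrative.

from math import sqrt

def pearson(x, y):
    # Pearson correlation coefficient between two equally long numeric sequences.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)   # assumes neither variable is constant

def rank_by_correlation(columns, target):
    # columns: dict mapping feature name -> list of values; most correlated features first.
    return sorted(columns, key=lambda name: abs(pearson(columns[name], target)), reverse=True)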
Perspectives:
Selection Criteria
• Consistency Measures.
• They attempt to find a minimum number of features that separate classes as consistently as
the full set of features can.
• An inconsistency is defined as the case of two examples with the same inputs
(same feature values) but with different output feature values (classes in
classification).
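A sketch of counting inconsistencies for a candidate subset of discrete features: two examples are inconsistent if they agree on the selected features but carry different class labels.

from collections import defaultdict, Counter

def inconsistency_count(rows, labels, subset):
    # rows: list of dicts mapping feature name -> value.
    # For each pattern over the selected features, count the examples outside the majority class.
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        key = tuple(row[f] for f in subset)   # projection onto the selected features
        groups[key].append(label)
    return sum(len(g) - max(Counter(g).values()) for g in groups.values())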
Perspectives:
Selection Criteria
• Accuracy Measures.
• This form of evaluation relies on the classifier or learner. Among the various possible subsets
of features, the subset which yields the best predictive accuracy is chosen.
Perspectives
• Filters:
• measuring uncertainty, distances, dependence or consistency is usually
cheaper than measuring the accuracy of a learning process. Thus, filter
methods are usually faster.
• it does not rely on a particular learning bias, in such a way that the selected
features can be used to learn different models from different DM techniques.
• it can handle larger sized data, due to the simplicity and low time complexity
of the evaluation measures.
Perspectives
• Wrappers:
• can achieve the purpose of improving the particular learner’s predictive
performance.
• they allow the usage of internal statistical validation to control overfitting, ensembles of
learners and hybridizations with heuristic learning like Bayesian classifiers or
Decision Tree induction.
• filter models cannot allow a learning algorithm to fully exploit its bias,
whereas wrapper methods do.
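A minimal wrapper-style evaluation sketch, assuming scikit-learn is available: each candidate subset is scored by the cross-validated accuracy of the very learner that will be used afterwards. It can be plugged directly into the SFG/SBG or random-search sketches above as their score function.

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def wrapper_score(X, y, subset, cv=5):
    # Score a feature subset (a list of column indices of X) by cross-validated accuracy.
    model = KNeighborsClassifier(n_neighbors=3)
    return cross_val_score(model, X[:, subset], y, cv=cv, scoring="accuracy").mean()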
Perspectives
• Embedded FS:
• similar to the wrapper approach in the sense that the features are specifically
selected for a certain learning algorithm, but in this approach, the features
are selected during the learning process.
• they can take advantage of the available data by not requiring the training data to be split
into separate training and validation sets, and they can reach a solution faster by avoiding
the re-training of a predictor for each feature subset explored.
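A common embedded example (a sketch, assuming scikit-learn) is L1-regularized logistic regression: the training process itself drives the weights of irrelevant features to zero, so selection happens during learning.

import numpy as np
from sklearn.linear_model import LogisticRegression

def embedded_selection(X, y, C=0.1):
    # Fit an L1-penalized model; the features with non-zero coefficients are the selected ones.
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    model.fit(X, y)
    return np.flatnonzero(np.abs(model.coef_).max(axis=0) > 0)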
Aspects:
Output of Feature Selection
• Feature Ranking Techniques:
• we expect as the output a ranked list of features which are ordered according
to evaluation measures.
• they return the relevance of the features.
• For performing actual FS, the simplest way is to choose the first m features for
the task at hand, whenever we know the most appropriate m value.
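A tiny sketch of turning a feature ranking into an actual selection: order the features by their relevance scores (for instance, the correlation or information-gain scores from the earlier sketches) and keep the first m.

def rank_features(scores):
    # scores: dict mapping feature name -> relevance; most relevant features first.
    return sorted(scores, key=scores.get, reverse=True)

def select_top_m(scores, m):
    # Feature ranking output -> actual selection: keep the m most relevant features.
    return rank_features(scores)[:m]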
Aspects:
Output of Feature Selection
• Minimum Subset Techniques:
• The number of relevant features is a parameter that is often not known by
the practitioner.
• There must be a second category of techniques focused on obtaining the
minimum possible subset without ordering the features.
• Whatever lies inside the returned subset is relevant; everything outside it is considered
irrelevant.
Aspects:
Evaluation
• Goals:
• Inferability: For predictive tasks, considered as an
improvement of the prediction of unseen examples with
respect to the direct usage of the raw training data.
• Interpretability: Given that raw data are hard for humans to
comprehend, DM is also used for generating a more
understandable representation of the structure that can explain
the behavior of the data.
• Data Reduction: It is better and simpler to handle data
with lower dimensions in terms of efficiency and
interpretability.
Aspects:
Evaluation
• We can derive three assessment measures from these
three goals:
• Accuracy
• Complexity