
A Review of Feature Selection and Its Methods

B. Venkatesh, J. Anuradha
SCOPE, Vellore Institute of Technology, Vellore, TN, 632014, India
E-mail: [email protected]

Abstract: In the digital era, the data generated by various applications is growing drastically both row-wise and column-wise, creating a bottleneck for analytics and increasing the burden on machine learning algorithms used for pattern recognition. This curse of dimensionality can be handled through reduction techniques. Dimensionality Reduction (DR) can be performed in two ways, namely Feature Selection (FS) and Feature Extraction (FE). This paper presents a survey of feature selection methods. From this extensive survey we conclude that most FS methods assume static data. However, with the emergence of IoT and web-based applications, data is generated dynamically and grows at a fast rate, so it is likely to be noisy, which hinders the performance of the algorithms. As the size of the data set increases, the scalability of FS methods is jeopardized, and existing DR algorithms do not address these issues for dynamic data. Using FS methods not only reduces the burden of the data but also avoids overfitting of the model.
Keywords: Dimensionality Reduction (DR), Feature Selection (FS), Feature
Extraction (FE).

1. Introduction
As data increases exponentially, the quality of data available for processing by data mining, pattern recognition, image processing, and other machine learning algorithms decreases gradually. Bellman calls this scenario the “Curse of Dimensionality”. Higher-dimensional data leads to the prevalence of noisy, irrelevant, and redundant data, which in turn causes overfitting of the model and increases the error rate of the learning algorithm. To handle these problems, “Dimensionality Reduction” techniques are applied as part of the preprocessing stage. Feature Selection (FS) and Feature Extraction (FE) are the most commonly used dimensionality reduction approaches. FS is used to clean up the noisy, redundant, and irrelevant data; as a result, performance is boosted.
In FS, a subset of features is selected from the original set of features based on feature redundancy and relevance. Based on relevance and redundancy, Yu and Liu [1] in 2004 classified feature subsets into four types: 1) noisy and irrelevant; 2) redundant and weakly relevant; 3) weakly relevant and non-redundant; 4) strongly relevant. A feature that is not required for prediction accuracy is known as an irrelevant feature. Some of the popular approaches that fit into the filter and wrapper methods involve models, search strategies, feature quality measures, and feature evaluation.
The set of features is a key factor in determining the hypothesis space of a predictive model. The number of features and the size of the hypothesis space are directly proportional, i.e., as the number of features increases, the hypothesis space also grows. For example, if there are M features with a binary class label in a dataset, the search space contains $2^{2^M}$ candidate hypotheses. The hypothesis space can be reduced further by discarding redundant and irrelevant features.
The relevance of a feature is measured based on the characteristics of the data, not by its value. Statistics is one such technique that shows the relationship between the features and their importance.
The distortion caused by irrelevant and redundant features is not due to the presence of useless information; it is because the features do not have a statistical relationship with the other features. Any feature may be irrelevant individually but become relevant when joined with other features [2].

FS methods are classified into three types based on their interaction with the learning model: Filter, Wrapper, and Embedded methods. In the Filter method, features are selected based on statistical measures. It is independent of the learning algorithm and requires less computational time. Information gain [3], the chi-square test [3], Fisher score, correlation coefficient, and variance threshold are some of the statistical measures used to assess the importance of the features. The performance of the Wrapper method depends on the classifier: the best subset of features is selected based on the results of the classifier. Wrapper methods are computationally more expensive than filter methods due to the repeated learning steps and cross-validation; however, they are more accurate than the filter method. Some examples are Recursive Feature Elimination [4], Sequential Feature Selection algorithms [5], and Genetic Algorithms. The third approach is the Embedded method, which uses ensemble learning and hybrid learning methods for feature selection. Since it makes a collective decision, its performance is better than that of the other two models; Random Forest is one such example. It is computationally less intensive than wrapper methods, but it has the drawback of being specific to a learning model.
1.1. FS procedure
It has been shown in the literature that feature selection can improve the prediction performance, scalability, and generalization capability of the classifier. In knowledge discovery, FS plays a fundamental role in reducing computational complexity, storage, and cost [6].
Fig. 1 shows the various stages of the FS process [7], which are explained below. Its performance depends on the decision taken at every stage.
1. Search direction. Ang et al. [7] state that the first stage of the FS process is finding the search direction and the starting point. Search directions are broadly classified into three types: forward search, backward search, and random search. The search process can start with an empty set to which new features are added recursively in every iteration; this is known as forward searching. Conversely, backward elimination starts with the complete set of features, and features are removed iteratively until the desired subset of features is reached. The other variant is the random search method, which builds the feature subset by both adding and removing features iteratively. After determining the search direction, a search strategy can be applied.
2. Determine search strategy. From the literature, we learn that search strategies can be randomized, exponential, or sequential. Table 1 enumerates the different search strategies and their algorithms. The drawback of exponential search is that it requires 2^N combinations of features to be examined for N features; it is an exhaustive search strategy and an NP-hard problem [8]. To overcome this drawback, researchers have introduced randomized search strategies.
In sequential search, features are sequentially added to an empty set or removed from the complete set, which is referred to as Sequential Forward Selection (SFS) and Sequential Backward Selection (SBS), respectively. The drawback of these methods is that the features that are eliminated are not reconsidered in further iterations; this phenomenon is known as the nesting effect. To overcome this disadvantage, Ferri and Pudil [9] in 1994 proposed the Plus-l-minus-r (l – r) search method. These methods have Θ(2M) complexity for selecting l features from a set of M features.
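The sequential strategies above can be reproduced with standard tooling. The following sketch uses scikit-learn's SequentialFeatureSelector; the k-NN evaluation model, the number of retained features, and the wine dataset are illustrative choices, not taken from the surveyed papers.

# Sequential Forward Selection (SFS) and Sequential Backward Selection (SBS)
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# SFS: start from the empty set and greedily add features.
sfs = SequentialFeatureSelector(knn, n_features_to_select=5, direction="forward", cv=5)
sfs.fit(X, y)
print("SFS keeps features:", sfs.get_support(indices=True))

# SBS: start from the full set and greedily remove features.
sbs = SequentialFeatureSelector(knn, n_features_to_select=5, direction="backward", cv=5)
sbs.fit(X, y)
print("SBS keeps features:", sbs.get_support(indices=True))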
A good search strategy should provide an optimal solution, local search ability, and computational effectiveness [2]. Based on these requirements, searching algorithms are further classified into optimal and suboptimal feature selection algorithms. The nesting effect of the SFS and SBS algorithms was overcome by Pudil, Novovičová and Kittler [10] in 1994 with the SFFS and SBFS algorithms. Different search techniques, their categories, merits, and demerits are stated in Table 2.

3. Evaluation criteria. The best features are selected based on the evaluation criteria. Based on the evaluation method, FS methods [23] are classified into the Filter method [24-26], Wrapper method [27-29], Embedded method [30], and Hybrid method [31, 32].
4. Stopping criteria. Stopping criteria specify when the FS process should stop. A good stopping criterion leads to low computational complexity in finding an optimal feature subset and also prevents over-fitting. The choice of stopping criterion is influenced by the decisions made in the previous stages. Some of the common stopping criteria are (a simple loop combining two of them is sketched after this list):
 A pre-defined number of features
 A pre-defined number of iterations
 A percentage of improvement between two successive iterations
 The value of the evaluation function.
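The sketch below illustrates a forward-selection loop that stops either when a pre-defined number of features is reached or when the improvement between successive iterations falls below a minimum gain; the dataset, the model, and the MAX_FEATURES / MIN_GAIN thresholds are illustrative assumptions, not values from the surveyed literature.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
MAX_FEATURES, MIN_GAIN = 10, 0.005          # assumed stopping thresholds

def evaluate(cols):
    # Evaluation function: mean cross-validated accuracy of the candidate subset.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    return cross_val_score(model, X[:, cols], y, cv=5).mean()

selected, best_score = [], 0.0
remaining = list(range(X.shape[1]))
while remaining and len(selected) < MAX_FEATURES:        # criterion: feature count
    scores = {f: evaluate(selected + [f]) for f in remaining}
    f_best = max(scores, key=scores.get)
    if scores[f_best] - best_score < MIN_GAIN:           # criterion: small improvement
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = scores[f_best]
print("Selected:", selected, "CV accuracy:", round(best_score, 3))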
5. Validate the results. To validate the results, feature set validation techniques are used. Cross-validation, the confusion matrix, Jaccard-similarity-based measures, and the Rand Index are some of the validation techniques. Cross-Validation (CV) is the most commonly used validation method; its main advantage is that it gives an unbiased error estimate. A confusion matrix is generated for the evaluation of the classifier (a short validation sketch follows the lists below). Some of the commonly used classification and clustering measures are:
 Classification measures: error rate, TP rate/recall/sensitivity, specificity, ROC (Receiver Operating Characteristic) curve, precision, F-score/F-measure.
 Clustering measures: Davies-Bouldin index, Dunn index, F-measure, Jaccard index, Dice index, Fowlkes-Mallows index.
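As a concrete illustration of validating a chosen subset with cross-validation and a confusion matrix, the sketch below assumes the subset indices, the random-forest classifier, and the breast-cancer dataset; none of these come from the surveyed papers.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
subset = [0, 3, 7, 20, 27]                    # assumed output of an FS method
clf = RandomForestClassifier(random_state=0)

scores = cross_val_score(clf, X[:, subset], y, cv=10)       # unbiased error estimate
y_pred = cross_val_predict(clf, X[:, subset], y, cv=10)
print("10-fold CV accuracy:", scores.mean().round(3))
print("Confusion matrix:\n", confusion_matrix(y, y_pred))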
The rest of the paper is outlined as follows: Section 2 explains how feature selection methods are divided based on the evaluation criteria. Section 3 elaborates how feature selection methods are applied based on the class label. Section 4 explains the application areas of feature selection. Section 5 presents the summary and future scope of feature selection methods.

2. Feature selection based on Evaluation criteria


In this section, we discuss FS algorithms according to their evaluation criteria. Based on the evaluation criteria and the interaction with the learning algorithm, feature selection methods are classified into three types: 1) Filter method, 2) Wrapper method, and 3) Embedded method.
2.1. Filter method
In this method, the model starts with all features and selects the best feature subset based on statistical measures such as Pearson's correlation [33], Linear Discriminant Analysis (LDA), ANOVA, Chi-square [34], the Wilcoxon-Mann-Whitney test [35], and Mutual Information (MI) [36-39]. All these statistical methods depend on the response and feature variables present in the dataset. Pearson's Correlation (PC) and Mutual Information are the most commonly used statistical methods.
2.1.1. Pearson’s correlation
Correlation is a method of finding the relationship between two quantities, for example, age and height. PC is used for detecting the linear relationship between two variables. Equation (1) is used to calculate the PC (ρ) between the independent variable x and the dependent variable y:

(1) \rho(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}.

Generally, the PC value lies in [−1, 1]: if the value is −1, the variables are negatively correlated; if the value is 1, they are positively correlated; and if the value is 0, there is no correlation between the variables.
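A minimal sketch of this criterion used as a filter is given below; the diabetes dataset and the threshold of 0.3 on |ρ| are assumptions for illustration only.

import numpy as np
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)
# Pearson correlation of each feature with the target variable.
rho = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
threshold = 0.3                                  # assumed cut-off on |rho|
selected = np.where(np.abs(rho) >= threshold)[0]
print("Pearson rho per feature:", np.round(rho, 2))
print("Features kept:", selected)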

2.1.2. Mutual information
MI is another statistical method used in FS. It measures how mutually dependent two variables (a, b) are: it evaluates the amount of information obtained about one random variable through the other. Equation (2) is used to calculate the MI between two discrete random variables a and b:

(2) I(A, B) = \sum_{b \in B} \sum_{a \in A} p(a, b) \log\left(\frac{p(a, b)}{p(a)\, p(b)}\right),

where p(a, b) is the joint probability function of A and B, and p(a) and p(b) are the marginal probability distribution functions of A and B, respectively.
For continuous random variables, the summation is replaced by a double integral:

(3) I(A, B) = \int_B \int_A p(a, b) \log\left(\frac{p(a, b)}{p(a)\, p(b)}\right) \mathrm{d}a\, \mathrm{d}b.
In the filter method, each feature is assigned a score using statistical measures. Features are ranked in descending order of their scores, and a subset of features is selected using a threshold value. The filter method takes less computational time to select the best features. However, since the correlation between the independent variables is not considered during selection, redundant features may be selected.
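A brief sketch of this score-and-threshold scheme with mutual information follows; the breast-cancer dataset and keeping the top 10 features are assumptions for illustration.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)
mi = mutual_info_classif(X, y, random_state=0)   # score every feature against the class
ranking = np.argsort(mi)[::-1]                   # descending order of MI score
k = 10                                           # assumed threshold on the number kept
print("Top-k features by mutual information:", ranking[:k])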
The filter method uses characteristics such as information gain, consistency, dependency, correlation, and distance measures. Kira and Rendell [40] in 1992 proposed a method called Relief, which works on the basis of instance-based learning [41]. It uses the Euclidean distance for selecting the Near-hit and the Near-miss. Let X denote an instance: an instance Xi is called a Near-hit of X when it is the closest neighbour of X with the same class label, and a Near-miss of X when it is the closest neighbour of X with a different class label; T denotes a relevance threshold in the range 0 ≤ T ≤ 1. The algorithm calculates feature weights based on the average differences from the Near-hits and Near-misses and selects the features whose weight is greater than T. The drawback of this algorithm is that it is applicable only to two-class classification problems and fails to discard redundant and incomplete features.
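A minimal sketch of this two-class weighting scheme is given below; the squared feature differences, the sampling size, and the default threshold are implementation assumptions, and each class is assumed to have at least two instances.

import numpy as np

def relief(X, y, n_samples=100, T=0.0, rng=np.random.default_rng(0)):
    n, d = X.shape
    W = np.zeros(d)
    for _ in range(n_samples):
        i = rng.integers(n)
        dist = np.linalg.norm(X - X[i], axis=1)   # Euclidean distance to every instance
        dist[i] = np.inf                          # exclude the sampled instance itself
        same, diff = (y == y[i]), (y != y[i])
        near_hit = np.argmin(np.where(same, dist, np.inf))
        near_miss = np.argmin(np.where(diff, dist, np.inf))
        W -= (X[i] - X[near_hit]) ** 2 / n_samples    # penalise separation from the hit
        W += (X[i] - X[near_miss]) ** 2 / n_samples   # reward separation from the miss
    return np.where(W > T)[0], W                  # keep features whose weight exceeds T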
To overcome the problems with Relief, Kononenko [42] in 1994 proposed extensions to Relief: Relief-A addresses the incomplete-data problem, and Relief-B handles the case where at least one of two instances has an unknown value for a given attribute. To address the multiclass problem, Relief-F was introduced. Relief uses only one Near-hit/miss for selecting a feature, whereas Relief-A uses the k nearest hits/misses and considers their average instead of a single near hit/miss. In Relief-F, instead of finding one near miss M from a different class, the algorithm finds one near miss for each different class and averages their contributions when updating the feature weights. This algorithm also fails to remove redundant data.
Mutual-information-based feature selection was used by Battiti [43] in 1994 to address the above issue. It not only finds the relevance between the target class and the individual features but also the relevance between the individual features themselves. In the selection process, the information gained by the variables helps to rule out the redundant features.
Yang and Moody [44] in 1999 proposed a Joint Mutual Information based approach (JMI) for the classification and visualization of non-Gaussian data. Traditional MI has the drawback that it discards only a few redundant variables. For identifying the relevant features, the JMI between the target feature and the individual features is calculated. To discard all redundant features, JMI uses conditional MI instead of plain MI.
Similarly, Peng, Long and Ding [38] in 2005 proposed a heuristic algorithm called minimal-Redundancy-Maximal-Relevance (mRMR) for the removal of redundant data. It also finds the most relevant features during optimal feature subset selection, without compromising classification accuracy.
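A small sketch of this greedy relevance-minus-redundancy idea follows, with mutual information estimated via scikit-learn; the particular estimators and the simple difference form of the criterion are assumptions, not the exact formulation of [38].

import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr(X, y, k):
    relevance = mutual_info_classif(X, y, random_state=0)   # MI of each feature with the class
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        candidates = [f for f in range(X.shape[1]) if f not in selected]
        scores = []
        for f in candidates:
            # Mean MI between the candidate and the already selected features (redundancy).
            redundancy = np.mean([mutual_info_regression(X[:, [f]], X[:, s], random_state=0)[0]
                                  for s in selected])
            scores.append(relevance[f] - redundancy)
        selected.append(candidates[int(np.argmax(scores))])
    return selected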
Based on an information-theoretic (mutual-information-like) selection criterion, Meyer and Bontempi [45] in 2006 proposed a new filter method called Double Input Symmetrical Relevance (DISR) for supervised classification. It uses a variable complementarity measure to find intrinsic features, i.e., subsets that carry more information about the target class than any individual feature.
Song, Ni and Wang [46] in 2013 proposed an algorithm for feature subset selection called the Fast clustering-based feature selection algorithm (FAST), based on graph theory. The algorithm uses the Minimum Spanning Tree (MST) graph technique for clustering the features. FAST effectively removes irrelevant and redundant features by using the symmetric uncertainty measure [47], and it chooses the optimal features with cluster-based methods.
2.2. Wrapper method
According to Kohavi and John [48] in 1997, feature subset selection in the wrapper method treats the learner as a black box, i.e., no knowledge about the underlying algorithm is used. Feature subsets are selected based on an induction algorithm: the chosen feature subset is used to estimate the accuracy of the training model, and depending on the accuracy measured in the previous step, the method decides whether to add or remove a feature from the selected subset. Due to this, wrapper methods are computationally more complex.
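Recursive Feature Elimination, cited earlier as a typical wrapper, can be sketched as follows with cross-validation; the logistic-regression black box and the breast-cancer dataset are assumed choices for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
estimator = LogisticRegression(max_iter=2000)
selector = RFECV(estimator, step=1, cv=5)     # repeatedly drop the weakest feature
selector.fit(StandardScaler().fit_transform(X), y)
print("Number of features retained:", selector.n_features_)
print("Selected feature mask:", selector.support_)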
Korfiatis et al. [49] in 2013 proposed a novel wrapper FS algorithm called the LM-FM method; it comprises two stages, namely Local Maximization (LM) followed by Floating Maximization (FM). In the LM stage, the best subset of features is selected from the original set based on credit score values between the features and the target class. This subset is then taken as input for the FM stage, where optimal features are selected by a floating-size feature selection algorithm such as Sequential Floating Forward Selection (SFFS). Korfiatis et al. combined the SVM classifier with LM-FM to demonstrate better classification.
Wrapper methods are good at classification accuracy but poor in computational efficiency. To overcome this problem, G. Chen and J. Chen [50] in 2015 proposed a new wrapper method, namely Cosine Similarity Measure Support Vector Machines (CSMSVM). This technique uses the SVM classifier itself to select relevant features at the time of classifier construction, by including the cosine distance in the SVM. It not only decreases the intraclass distances to reduce the classification error rate but also optimizes the SVM margin. The proposed method greatly increases computational efficiency.

Panthong and Srivihok [51] in 2015 proposed a wrapper FS algorithm based on ensemble learning. In this approach, three types of search strategies are used in the wrapper method, namely SFS, SBS, and Optimize selection, combined with ensemble learners. The empirical analysis covers different combinations of the search strategies (sequential and heuristic), bagging, boosting, decision trees, and Naïve Bayes classifiers. They are: 1) FBDT (SFS + Bagging + Decision Tree), 2) BBDT (SBS + Bagging + Decision Tree), 3) OBNB (Optimize selection (evolutionary) + Bagging + Naïve Bayes), 4) OADT (Optimize selection (evolutionary) + AdaBoost + Decision Tree), and 5) OANB (Optimize selection (evolutionary) + AdaBoost + Naïve Bayes). The study of FBDT, BBDT, OBNB, OADT, and OANB reveals that the prediction results are more accurate when an evolutionary (heuristic) search is combined with feature selection.
Das et al. [52] in 2017 used a wrapper FS method based on harmony search. Harmony search is a meta-heuristic algorithm that mimics the musical process of searching for an ideal harmony. In this work, harmony search is used instead of heuristic search for subset selection and is applied to identify suitable words in a native language (Bangla, of Indian origin).
In general, the wrapper method has high time complexity. To overcome this, Wang et al. [53] in 2017 proposed a novel approach combining the wrapper and filter methods. This method uses the Markov blanket technique along with wrapper-based FS to reduce the computational time. The Markov blanket technique can explicitly remove redundant features by considering the relevance between features, using a cross-entropy-based measure for this purpose; the features considered redundant are conditionally independent of the target class. To reduce the number of wrapper computations, the unnecessary features are identified using the filter method rather than the wrapper method.
Over the past years, filter-based FS methods have been used to extract meaningful information from sensor data. However, this approach has high time complexity and low occupancy-estimation accuracy. Masood, Soh and Jiang [54] in 2017 proposed a new ranking-based incremental search strategy method, WRANK-ELM, which performs wrapper-ranking-based feature selection and is named after the Extreme Learning Machine (ELM) classifier. Its time complexity is improved by adopting an incremental search strategy rather than a sequential or exhaustive search, which considerably saves computational cost. The best-selected features are evaluated using the ELM classifier. Experimental results of this approach outperform those of other wrapper methods.
Bermejo [55] in 2017 proposed a different view of the wrapper method, namely applying combined wrapper methods for feature selection. This ensemble wrapper approach was tested on fish age prediction. Ensemble techniques are used within the wrapper method, i.e., the CV error is calculated for each selected feature subset and the mean CV error is computed from these values; the subset with the minimum CV error is considered the best set. The collective performance of the wrapper method shows a good improvement in accuracy.

The combination of a Genetic Algorithm and Logistic Regression (GA-LR) applied by Khammassi and Krichen [56] in 2017 was used to detect intruders in a network. The heuristic search strategy using the GA derives the best optimal subset of features, which is then evaluated using Logistic Regression. The LR model describes the relationship between the dependent and the explanatory variables. This approach is also capable of dealing with categorical data.

2.3. Embedded method


The feature selection methods discussed so far apply FS at the pre-processing level. The algorithms discussed next are embedded methods, in which the best features are selected during the learning process itself. Blending feature selection into the learning process has the advantages of reducing computational cost, improving classification accuracy, and avoiding retraining the model each time a new feature is added.
The embedded method differs from the other feature selection methods in how it interacts with the learning algorithm: filter-based methods do not use the learning algorithm for feature selection at all, whereas wrapper-based methods use it only to test the quality of the selected feature subsets. The embedded method overcomes this computational complexity: appropriate feature selection and model learning are performed at the same time, and the features are selected during the training stage of the model. Due to this, the computational cost of this method is decidedly lower than that of the wrapper method, and the model need not be retrained each time a new feature subset is explored.
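Two common instances of this idea are L1-regularised linear models and tree ensembles, which rank features as a by-product of training; the sketch below shows both and is not the specific method of any paper surveyed here.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
Xs = StandardScaler().fit_transform(X)

# L1 penalty drives irrelevant coefficients to exactly zero during training.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(Xs, y)
kept_l1 = np.where(l1_model.coef_[0] != 0)[0]

# A random forest exposes impurity-based importances after fitting.
forest = RandomForestClassifier(random_state=0).fit(X, y)
top_rf = np.argsort(forest.feature_importances_)[::-1][:10]

print("Features kept by the L1 model:", kept_l1)
print("Top-10 features by forest importance:", top_rf)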
Mohsenzadeh et al. [57] in 2013 proposed an algorithm called the Relevant Sample-Feature Machine (RSFM), based on sparse Bayesian machine learning. The RSFM learning model is sparse due to the adoption of Gaussian priors and the Bayesian approach. RSFM is an extension of the Relevance Vector Machine (RVM) [58] algorithm, a sparse kernel-based learning method. In this method, the output is predicted using the kernel expansion

(4) f(x \mid w) = w_0 + \sum_{m=1}^{M} w_m k(x, x_m),

where k(x, x_m) is a kernel function and w = (w_0, w_1, ..., w_M)^T is the weight vector.


Mirzaei, Mohsenzadeh and Sheikhzadeh [59] in 2017 proposed an embedded FS method called Variational RSFM (VRSFM), based on the Bayesian model of RSFM [57]. The proposed feature selection method is used for both classification and regression. It defines Gaussian prior distributions on the model parameters and their hyper-parameters, and employs Variational Bayesian approximation to find the hyper-parameters and the posterior distributions of the parameters. The algorithm works well for small datasets.
The strengths and gaps of the FS methods are listed in Table 3. Fig. 2 represents
the steps involved in the process of Filter, Wrapper, and Embedded methods.

2.4. Findings
From the literature survey above, we find that most filter methods use statistical measures for feature selection. Kira and Rendell [40] in 1992 introduced the distance-based Relief method, but it fails to discard redundant and incomplete features. To address these problems, Kononenko [42] in 1994 extended Relief to Relief-A, -B and -F, addressing the incomplete-data, unknown-value and multiclass problems, respectively. These methods also fail to remove redundant data, so Battiti [43] in 1994 used MI-based FS to rule out the redundant features. All these methods are computationally fast but lack model accuracy. To overcome these drawbacks, wrapper-based FS methods were introduced.
In the wrapper method, features are selected based on the underlying learning algorithm, but it is computationally slow because it iteratively searches for the best subset of features. Initially, sequential search strategies were used for selecting the subset of features, but they suffer from the nesting effect; to overcome it, Pudil, Novovičová and Kittler [10] in 1994 came up with the idea of SFFS (Sequential Forward Floating Selection) and SBFS (Sequential Backward Floating Selection). These methods still carry a searching overhead. To overcome this, heuristic and optimal search strategies based on bio-inspired algorithms are applied to select optimal features with less overhead. So, there is considerable future scope in optimizing the search strategies for better selection of feature subsets.
Another direction is combining the filter and wrapper methods into a hybrid algorithm for better accuracy and time complexity.

3. FS methods based on Learning method


Based on the presence or absence of class labels, feature selection methods are categorized as Supervised FS or Unsupervised FS, respectively. When the dataset has both labeled and unlabeled data, Semi-supervised feature selection can be used.
3.1. Supervised FS
This approach uses the class label for selecting relevant features. It can often cause overfitting due to the presence of noisy data in the dataset. Some of the widely used supervised feature selection methods are the Fisher score [60], the Hilbert-Schmidt Independence Criterion (HSIC) [61], the Fisher criterion [62], the Pearson correlation coefficient [63], the trace ratio criterion [64], and mutual information [38].
Song et al. [61] in 2007 proposed a supervised feature selection method called BAHSIC. The dependence is estimated using the Hilbert-Schmidt Independence Criterion (HSIC) [65], with an HSIC kernel used to measure the dependencies, and the features are selected by backward elimination. Most feature selection methods are applicable either to binary classification or to regression, but not both. The BAHSIC method has the advantage of being applicable to regression, binary-class, and multi-class classification problems with less computational time compared to other FS methods.
Tutkan, Ganiz and Akyokuş [66] in 2016 proposed a novel feature selection method called Meaning Based Feature Selection (MBFS) for text mining that uses supervised and unsupervised learning. MBFS is based on the Helmholtz principle [67] and the Gestalt theory of human perception [68]; it selects features using a meaning measure. The Helmholtz principle from Gestalt theory is used to assign a meaning score to each word in the document. The meaning score is measured with Equation (5):
(5) \mathrm{meaning}(w, c_j) = -\frac{1}{m}\log\binom{k}{m} - (m-1)\log N,

where w is a feature that appears k times in the dataset s and m times in documents of class c_j, and N is the rate of the length of the dataset.
Martín-Smith et al. [69] in 2017 used a supervised filter method for Brain-Computer Interface (BCI) classification with a Linear Discriminant Analysis (LDA) classifier. Features are extracted from ElectroEncephaloGram (EEG) signals, and the Multi-Resolution Analysis (MRA) method is used to analyze the extracted signals. The proposed filter approach improved the formulation of multi-objective FS. For obtaining an optimal feature subset, [69] pursued two objectives: first, the method should increase the accuracy of the classifier, and second, it should overcome overfitting problems. These were achieved by evaluating the classifier and tuning suitable parameters during the training phase.
In the past, Spectral Feature Selection (SFS) has been used for feature selection, but it fails to preserve either the local or the global structure of the data set in the form of a graph matrix. Another drawback is that it uses the original data for matrix learning every time. To overcome these problems, [70] proposed a novel supervised feature selection method that preserves both the global and the local structure of the data set. To maximize the objective function quickly, it uses an optimization method in which graph matrix learning and low-dimensional feature space learning are coupled in a unified framework. For preserving the global and local structure it uses the subspace-learning methods LDA and Locality Preserving Projection (LPP), respectively: LDA uses a low-rank constraint, whereas LPP uses graph structure learning. For eliminating irrelevant features, it uses an l2,1-norm regularizer for sparse feature selection.
3.2. Graph-based unsupervised FS
In unsupervised feature selection, the data set has no class labels, so the task is more challenging than supervised and semi-supervised FS. Redundant features are removed based on similarity measures.
If a feature is similar to one or more other features, one of them is removed; similarly, if a feature makes no contribution to clustering, it is eliminated during the feature selection process (a minimal sketch of this similarity-based removal is given after the list below). Unsupervised FS is essential for exploratory analysis of biological data and is also useful for effectively finding unknown disease types. It has some demerits: the selected subsets do not consider the correlation between different features. Some of the well-known unsupervised feature selection algorithms are the Variance Score [71], Unsupervised Feature Selection using Feature Similarity measure (FSFS) [72], the Laplacian Score for Feature Selection (LSFS) [73], spectral-analysis-based feature selection [74], Multi-Cluster Feature Selection (MCFS) [75], and Unsupervised Discriminative Feature Selection (UDFS) [76].
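The sketch below shows the similarity-based removal idea mentioned above: when the absolute Pearson correlation between two features exceeds an assumed cut-off, one of the pair is dropped; no class labels are used.

import numpy as np

def drop_similar_features(X, cutoff=0.9):
    corr = np.abs(np.corrcoef(X, rowvar=False))      # feature-by-feature similarity matrix
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < cutoff for k in keep):   # keep only if dissimilar to kept ones
            keep.append(j)
    return keep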
In early unsupervised feature selection [73], He, Cai and Niyogi in 2006 used feature-ranking-based techniques as the fundamental criterion for feature selection. As the feature measures are calculated independently, the relationships between features are not considered. To overcome this issue, Z. Li et al. [77] in 2012 and Y. Yang et al. [76] in 2011 proposed spectral-clustering-based approaches. In these methods, the cluster structure of the data is explored using matrix factorization, and the features are selected with a sparsity regularization model based on the learned graph Laplacian.
Bandyopadhyay et al. [78] in 2014 used dense-subgraph-based feature clustering for unsupervised feature selection. In this method, the original feature set is first represented as a graph in which all the features are portrayed as vertices and the edge weights are given by the inter-feature similarity, computed using mutual information. Feature selection is then performed in two stages: in the first stage, the densest subgraph with non-redundant features is obtained, and in the second stage, the feature set is minimized by feature clustering on the graph.
X. Wang et al. [79] in 2016 proposed Unsupervised Spectral Feature Selection (USFS) with an l1-norm graph, based on Spectral Embedded Clustering [80]. For selecting the discriminative features, USFS uses an l1-norm graph and spectral clustering. Spectral clustering obtains the cluster indicators from the unlabelled data sets, and the l1-norm graph is imposed to cross-check the selected features. Since it is not clear whether existing spectral feature selection methods can exploit the manifold structure of the data, USFS uses the manifold structure explicitly for clarity.
Wen et al. [81] in 2016 proposed Unsupervised Optimal Feature Selection (UOFS). UOFS is based on an l2,1-norm regularization matrix instead of an l1-norm graph, because the l1-norm graph requires two phases before classification, namely 1) graph construction and 2) subspace learning, and these phases are not optimal for classification. So Wen et al. [81] use an l2,1-norm-based sparse representation model and l2,1-norm regularization for subspace learning. The sparse representation in this method is used for feature selection and extraction for classification.
S. Wang and H. Wang [82] in 2017 proposed a novel method for unsupervised feature selection based on low-rank approximation and structure learning. Using low-rank approximation, one can exactly evaluate the number of connected components of the embedded graphs in structure learning. The primary step in this method is to represent the feature selection problem as a matrix factorization with low-rank constraints, using a self-representation of the data matrix. To capture the sparsity of the feature selection matrix, the l2,1-norm is used. Based on these structure learning and low-rank approximation techniques, an efficient algorithm is implemented. A remaining open issue with this method is how to learn the feature subsets adaptively.
Y. Liu et al. [83] in 2017 proposed a novel method for unsupervised feature selection called Diversity-Induced Self-Representation (DISR), based on the self-representation property [84], and used the Augmented Lagrange Method (ALM) for efficient optimization. By using the diversity property, more information about the data can be captured, which helps in discarding similar features; due to this, redundant features are significantly removed. The similarity between the m-th and n-th features can be calculated using the dot-product weight s_mn = f_m^T f_n, m, n = 1, 2, ..., i. A larger s_mn means that the m-th and n-th features are more similar. For selecting the most valuable features, DISR uses both the diversity and the representativeness properties.
Hu et al. [85] in 2017 proposed a novel method called Graph Self-Representation Sparse Feature Selection (GSR-SFS) for unsupervised feature selection. The stability of the feature selection is improved by integrating a subspace-learning model (i.e., LPP) into a sparse feature-level self-representation method. To achieve interpretability, the technique uses a feature-level self-representation loss function; similarly, to provide stability for the subspace learning, it uses l2,1-norm regularization.
Du et al. [86] in 2017 proposed the Robust Unsupervised Feature Selection via Matrix Factorization (RUFSM) method. The data matrix is decomposed into two matrices that contain the latent cluster centers and a sparse representation using the l2,1-norm. Highly accurate discriminative feature selection is achieved by estimating the orthogonal cluster centers.
Qi et al. [87] in 2017 proposed a novel method called Regularized Matrix Factorization Feature Selection (RMFFS). Matrix factorization determines the correlation among features. To make the feature weight matrix sparse, it considers the absolute values of the inner product of the feature weight matrix. A combination of the l1-norm and the l2-norm is used for the matrix factorization.
3.2.1. Findings
He, Cai and Niyogi [73] in 2006 used ranking-based criteria for feature selection, but they do not consider the relationships between features. Li et al. [77] in 2012 and Yang et al. [76] in 2011 addressed these issues with spectral-clustering-based approaches. Later, most researchers used graph-based learning for feature clustering. Initially, l1-norm-based graph learning was used, but the resulting classifications are not optimal; to overcome this, l2,1-norm graph-based methods were introduced. l2,1-based methods are gaining popularity due to their optimal feature selection.
3.3. Semi-supervised feature selection
In semi-supervised feature selection, the learning data set contains both labeled and unlabeled data.
Graph Laplacian methods have gained the attention of many researchers working on semi-supervised feature selection. A weighted graph is constructed for the given data, to which feature selection is then applied [88].
There are mainly three stages in Graph-based Semi-Supervised Learning (GSSL). First, to assess the proclivity (affinity) between pairs of samples, the user chooses a kernel or similarity function; Belkin and Niyogi [89] in 2008 used a Gaussian kernel as the similarity function, and their empirical study proves its performance. Second, the user chooses an appropriate algorithm for constructing a sparse weighted subgraph from the completely weighted graph over all nodes; some of the regularly used algorithms for building sparse subgraphs are k-Nearest Neighbour (k-NN) and Ꜫ-neighbourhood. Finally, the user applies a graph-based SSL algorithm to diffuse the class labels from the known nodes of the graph to the unknown data nodes.
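These three stages can be sketched with scikit-learn's LabelSpreading, which builds a sparse k-NN affinity graph and diffuses the known labels to the unlabelled nodes (marked with -1); the iris data, the fraction of hidden labels, and the neighbourhood size are assumptions for illustration.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelSpreading

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
y_partial = np.copy(y)
y_partial[rng.random(len(y)) < 0.7] = -1              # hide 70% of the labels

model = LabelSpreading(kernel="knn", n_neighbors=7)   # sparse k-NN graph over the samples
model.fit(X, y_partial)                               # diffuse labels through the graph
print("Predicted labels for unlabelled points:",
      model.transduction_[y_partial == -1][:10])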
Some of the graph-based SSL algorithms are the graph min-cut method [90], Gaussian fields and harmonic methods [91], the global and local consistency method [92], manifold regularization [89], and the alternating graph transduction method [93].
Most of the methods in GSSL use neighborhood constructions like k-NN. From the literature, it is clear that the neighborhood approach used in GSSL generates irregular and imbalanced graphs for real and synthetic data; the greedy approach of adding nodes to the graph based on the k nearest points is the cause of this issue. To overcome this drawback of k-NN-based GSSL, [93] proposed a method named maximum weight b-matching, in which each node has exactly b neighbours in the graph.
Zhao and Liu [94] in 2007 proposed an algorithm called sSelect for semi-supervised learning based on spectral graph analysis. The algorithm ranks the features in a way similar to the Fisher score by using a regularization framework. It selects the features one by one without considering the relationship between them.
In general, graph-based semi-supervised feature selection has broad applications in image annotation. Ma et al. [95] in 2012 proposed an algorithm called Structural Feature Selection with Sparsity (SFSS) for automatic image annotation. Y. Yang et al. [96] in 2012 used a joint framework combining shared structure learning and graph-based learning for annotating web images. For annotating noisily tagged web images, Tang et al. [97] in 2011 proposed a novel k-NN sparse-graph-based SSL approach.
3.3.1. Findings
From the literature, we find two main drawbacks of graph-based semi-supervised feature selection. First, these methods are not suitable for large-scale data sets, because of the large number of training samples and the time consumed in constructing graph structures such as the Laplacian matrix. Second, they select features independently, without considering the correlation between features.

4. Applications
These days there is a demand for computational power, processing capacity, and storage to handle the volume of data in various fields such as large-scale image processing, microarray data, graph theory, gene selection, network security, and so on. The massive data is a major concern for learning models. To improve the performance of the learner, it is essential to apply dimensionality reduction techniques to generate compact and error-free data for better results. In the following paragraphs we explain each application area in detail.

4.1. Hyperspectral images
In a standard image, only the RGB spectral bands are present, whereas hyperspectral images contain several hundred spectral bands, so each pixel can be used to characterize the objects. Hyperspectral images are widely used in applications like remote sensing and medical imaging. The data contain rich information for different applications, but not all measurements are crucial for a particular application, and the large number of spectral bands leads to redundancy between the bands. Feature selection methods are therefore needed to eliminate the redundant bands and select the relevant subset of features. Gabor-wavelet-transformation-based feature extraction has increased the performance of classifiers, but this method extracts too many Gabor features, burdening the computation and the efficiency of the technique. To overcome this, [98] proposed a Gabor cube feature selection based on a multi-task joint sparse representation framework. In [99], various issues and challenges in FS for hyperspectral image analysis are explained.
4.2. Intrusion detection
Nowadays network-based technologies are spreading rapidly, and attacks on them also increase. To overcome this problem, we have to build a highly secure Intrusion Detection System (IDS). Such an IDS needs to deal with high-dimensional data containing noisy, redundant, and irrelevant features, which decreases the intrusion detection rate and requires more computation time. So, to achieve a high detection rate, FS methods are needed.
Amiri et al. [100] in 2011 proposed an FS method for IDS using Mutual Information, a filter method approach. Moreover, Y. Chen et al. [101] in 2006 explain different feature selection methods available for IDS. From the literature, it is clear that hybrid methods are more reliable and suitable for this application compared to the wrapper method. However, wrapper-based methods are useful when the data size is small, whereas the filter method is fast but less accurate.
4.3. Microarray data
Generally, microarray gene data consist of hundreds or thousands of features and only a few rows (samples). This becomes challenging for learning models, so there is a need to reduce the dimensions of the data. Ang et al. [7] in 2016 clearly explain various gene selection methods for supervised, unsupervised, and semi-supervised learning models. Mandal and Mukhopadhyay [102] in 2013 proposed an improved mRMR feature selection for gene expression data. In the existing literature, most methods use either redundancy-based or relevance-based feature selection, whereas Mandal and Mukhopadhyay consider redundancy and relevance in parallel.
Interestingly, semi-supervised and unsupervised feature selection outperform supervised feature selection models in selecting gene features. Other approaches, like hybrid and ensemble frameworks, also produce significant and good classification results. Only a few researchers have attempted hybrid and ensemble approaches, and they have shown promising results; therefore, there is more scope for improvement along these lines.
Table 4. Applications of FS methods
Author & Year | Application | Algorithm | Approach
Huerta, Duval and Hao [103], 2006 | Microarray data | Genetic Algorithm | Hybrid
Duval, Hao and Hernandez [104], 2009 | Microarray | Genetic Algorithm and iterated local search | Embedded
Chuang, Yang and Yang [105], 2009 | Microarray data | PSO + Tabu search | Wrapper
Jirapech-Umpai and Aitken [106], 2005 | Microarray | Genetic Algorithm | Wrapper
Roffo and Melzi [107], 2016 | Microarrays | Eigenvector Centrality FS | Filter
Du et al. [86], 2017 | Handwritten digit | RUFSM | Unsupervised FS
Peng, Long and Ding [38], 2005 | Handwritten digits | mRMR | Wrapper
Oh, Lee and Suen [108], 1999 | Handwriting recognition | Class dependent features | FS
Kapetanios [109], 2005 | Economy | Simulated Annealing and Genetic Algorithm | Wrapper
Al-Ani [110], 2005 | Texture classification | Ant Colony Optimization | Feature subset selection
Shen et al. [111], 2013 | Hyperspectral image classification | Symmetrical Uncertainty (SU) and Approximate Markov Blanket (AMB) | Filter
Yao et al. [112], 2017 | Image recognition | Locally Linear Embedding (LLE) | Filter
Zhang et al. [113], 2014 | Spam detection | Mutation + Binary PSO | Wrapper
Ambusaidi et al. [114], 2016 | Intrusion detection | Mutual Information | Filter
Alonso-Atienza et al. [115], 2014 | Detection of life-threatening arrhythmias | F-score and mRMR | Filter
Roffo, Melzi and Cristani [116], 2015 | Computer vision | Infinite FS | Filter
Zhang et al. [117], 2015 | Alzheimer's disease | Welch's t-test | Filter
Martin-Smith et al. [69], 2017 | Brain-computer interface | Linear discriminant analysis | Supervised FS
Li et al. [118], 2017 | Fault detection and diagnosis | Information Greedy Feature Filter (IGFF) | Filter

Table 4 presents the different FS methods and their applications. The filter-based approaches are the most widely used.

5. Conclusion and future scope


In the digital era, millions or billions of data records are generated every moment. This increases the processing burden, which in turn affects decision making in any application, and it draws the attention of researchers to come up with the best feature selection model that suits any application irrespective of the constraints. So, we need to reduce the dimensions of the data using some of the dimensionality reduction methods mentioned above.

Fig. 3. Ratio of FS approaches used in different domains

It is observed from the literature that filter-based feature selection is computationally faster than the wrapper method but less accurate, whereas the wrapper method is more accurate but computationally costlier. Dimensionality reduction provides several advantages: it results in a low-dimensional model, requires less memory space, reduces the risk of overfitting, gives better accuracy, and reduces the time complexity.
Fig. 3 shows that most researchers use gene selection as the application area. Moreover, Fig. 4 shows that correlation-based criteria are the algorithms used by most researchers.
After a critical literature survey, it is clear that most of the experimental analysis is carried out on static datasets. In reality, many applications generate dynamic and live data, which tends to cause frequent concept drift. So, there is scope for understanding concept drift and proposing suitable dimensionality reduction models.
In wrapper feature selection methods, sequential search is used for selecting feature subsets, which increases the time complexity. To overcome this problem, some researchers have introduced nature-inspired search methods; genetic algorithms, Ant Colony Optimization, and Particle Swarm Optimization are some of the commonly used ones. By using these optimization methods, there is a significant improvement in feature subset selection, so much of the future research is being carried out with these methods, and there is more future scope in this area.
Another area of future scope is hybrid methods combining the filter and wrapper methods for better performance of the reduction techniques and of the classification algorithms.
Finally, another promising direction for dimensionality reduction is graph-based feature selection for unsupervised feature selection.
