
Feature Selection

Techniques in Machine
Learning
Dr. Shalini Gambhir



Understanding Feature Selection in Machine Learning:
When working with real-world datasets, it is rare for all variables to contribute equally to predicting the target variable. By implementing feature selection techniques, we narrow the set of features down to those most relevant to the machine learning model, ultimately leading to more accurate and efficient predictions. This helps improve model performance, reduce overfitting, and enhance generalization by eliminating redundant or irrelevant features.



Importance of Feature Selection
1. Improved Model Performance: By focusing on the most relevant features, the model’s accuracy in predicting new, unseen data can be enhanced.
2. Reduced Overfitting: Fewer redundant features mean less noise in the data, decreasing the chances of making decisions based on irrelevant information.
3. Faster Training Times: With a reduced feature set, algorithms can train more quickly, which is particularly important for large-scale applications.
4. Enhanced Interpretability: By focusing on the most important features, better insights can be gained into the factors driving the model’s predictions.
5. Dimensionality Reduction: Feature selection reduces the complexity of the model by decreasing the number of input variables.
Table comparing model performance with
and without feature selection:
Metric               Without Feature Selection    With Feature Selection
Accuracy             82%                          89%
Training Time        120 seconds                  75 seconds
Number of Features   100                          25
Interpretability     Low                          High



Types of Feature Selection Techniques
Feature selection techniques in machine learning can be broadly classified into two
main categories:
Supervised Feature Selection
Unsupervised Feature Selection.



1. Supervised Feature Selection
• Supervised feature selection techniques use labeled data to identify the most
relevant features for the model. These methods can be further divided into
three subcategories:
1.Filter Methods: These methods assess the value of each feature
independently of any specific machine learning algorithm. They’re fast,
computationally inexpensive, and ideal for high-dimensional data.
2.Wrapper Methods: These techniques train a model using a subset of features
and iteratively add or remove features based on the model’s performance.
While they often result in better predictive accuracy, they can be
computationally expensive.
3.Embedded Methods: These approaches combine the best aspects of filter
and wrapper methods by implementing algorithms with built-in feature
selection capabilities. They’re faster than wrapper methods and more
accurate than filter methods.



2. Unsupervised Feature Selection
• Unsupervised feature selection techniques work with unlabeled data, allowing us to
explore and discover important data characteristics without using a target variable.
These methods are particularly useful when we don’t have labeled data or when we
want to identify patterns and similarities in our dataset.



Supervised Feature Selection Methods
Supervised feature selection techniques in machine learning aim to identify the most relevant features for predicting a target variable. These methods can significantly enhance model performance, reduce overfitting, and improve interpretability.

There are three main categories of supervised feature selection methods: filter-based, wrapper-based, and embedded approaches.

• 1) Filter-based Methods
• Filter methods evaluate the intrinsic properties of features using univariate
statistics, making them computationally efficient and independent of the machine
learning algorithm.



• Some popular filter-based methods

• 1.1) Information Gain


• Information gain measures the reduction in entropy by splitting a dataset
according to a given feature. It’s particularly useful for decision tree
algorithms and feature selection in classification tasks.

• 1.2) Chi-Squared Test


• The Chi-squared test is used to measure categorical features’ independence from
the target variable. Features with higher Chi-squared scores are considered more
relevant.

• 1.3) Fisher’s Score


• Fisher’s Score ranks features based on their ability to differentiate between
classes. It’s particularly useful for continuous features in classification problems.
• 1.4) Missing Value Ratio
• The Missing Value Ratio method removes features with a high percentage of
missing values, which may not contribute significantly to the model’s
performance.



1.1 Information Gain
• Information Gain (IG) is a measure used in decision trees and feature
selection to determine how well a given feature separates the target
classes. It is based on the concept of entropy, which quantifies
uncertainty or randomness in data.
• If a dataset has all instances belonging to a single class, it has zero
entropy (pure set).
• If a dataset is equally split between classes, it has maximum entropy
(high uncertainty).
• IG calculates the reduction in entropy when a feature is used to split
the dataset.
Information Gain Cont..
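For reference, the quantities involved are the entropy of a dataset S and the information gain of a feature A (standard definitions):

H(S) = -\sum_{i} p_i \log_2 p_i

IG(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \, H(S_v)

where p_i is the proportion of class i in S and S_v is the subset of S for which feature A takes the value v.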



IG Example:
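A minimal Python sketch of a worked example, computing information gain by hand on a small hypothetical "outlook vs. play" table (the data and names below are illustrative, not taken from the slide):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """Reduction in entropy obtained by splitting the labels on a categorical feature."""
    weighted = 0.0
    for value in np.unique(feature):
        subset = labels[feature == value]
        weighted += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - weighted

# Hypothetical toy data
outlook = np.array(["sunny", "sunny", "overcast", "rain", "rain", "overcast"])
play    = np.array(["no",    "no",    "yes",      "yes",  "no",   "yes"])

print(information_gain(outlook, play))  # ≈ 0.667 bits; higher => more informative feature
```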



1.2 Chi-Squared Test
• The Chi-Squared test is used to determine whether a categorical
feature is independent of the target variable. If a feature is strongly
related to the target variable, it is useful for classification.



Chi-Squared cont..
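For reference, the Chi-squared statistic for a feature/target contingency table is (standard definition):

\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{N}

where O_ij is the observed count in cell (i, j), E_ij is the count expected if the feature and the target were independent, and N is the total number of samples. A large \chi^2 value suggests the feature is not independent of the target.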



Chi-Squared Example:
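A minimal scikit-learn sketch of Chi-squared feature scoring; the iris dataset and k = 2 are illustrative choices (chi2 requires non-negative feature values):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)             # all feature values are non-negative
selector = SelectKBest(score_func=chi2, k=2)  # keep the 2 highest-scoring features
X_selected = selector.fit_transform(X, y)

print(selector.scores_)                       # higher score => stronger dependence on the target
print(selector.get_support(indices=True))     # indices of the selected features
```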



1.3 Fisher’s Score
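Fisher’s Score for a feature j is typically computed as

F_j = \frac{\sum_{k=1}^{K} n_k \,(\mu_{jk} - \mu_j)^2}{\sum_{k=1}^{K} n_k \,\sigma_{jk}^2}

where K is the number of classes, n_k is the number of samples in class k, \mu_{jk} and \sigma_{jk} are the mean and standard deviation of feature j within class k, and \mu_j is the overall mean of feature j. A larger score indicates that the feature separates the classes better.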



Fisher’s Score provides a ranking of features based on their ability to separate classes: the higher the score, the more relevant the feature for classification. This technique is useful for feature selection before applying machine learning models.



1.4 Missing Value Ratio
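A minimal pandas sketch of the Missing Value Ratio filter; the DataFrame contents and the 40% threshold are assumptions for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, np.nan, 29],
    "income": [50_000, np.nan, np.nan, np.nan, 62_000, np.nan],
    "city":   ["NY", "LA", "NY", "SF", "LA", "SF"],
})

missing_ratio = df.isnull().mean()                    # fraction of missing values per column
threshold = 0.4
keep = missing_ratio[missing_ratio <= threshold].index
df_reduced = df[keep]                                 # drops "income" (4/6 ≈ 0.67 missing)

print(missing_ratio)
print(df_reduced.columns.tolist())                    # ['age', 'city']
```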



• 2) Wrapper-based Methods
• Wrapper methods evaluate subsets of features by training and testing a specific
machine-learning model. While computationally expensive, they often yield better
results than filter methods.



• 2.1) Forward Selection
• Forward selection starts with an empty feature set and iteratively adds features that
improve model performance the most.
• 2.2) Backward Selection
• Backward selection starts with all features and iteratively removes the least
significant ones.
• 2.3) Exhaustive Feature Selection
• Exhaustive feature selection evaluates all possible combinations of features to find
the optimal subset.
• 2.4) Recursive Feature Elimination
• Recursive Feature Elimination (RFE) recursively removes features, building
models with the remaining features at each step.



2.1 Forward Selection
• Forward Selection is a feature selection technique used in machine learning and statistics to build
a model by progressively adding features that contribute the most to improving model
performance. It is a type of greedy algorithm that starts with an empty feature set and keeps
adding features one by one based on their impact on the model.
• How Forward Selection Works
1. Start with an Empty Set
1. The algorithm begins with no features (i.e., an empty feature set).
2. Evaluate Each Feature Individually
1. Each feature is tested individually by training a model with just that feature and evaluating its performance
using a predefined metric (e.g., accuracy, R², or any other relevant score).
3. Add the Best Feature
1. The feature that provides the highest improvement in model performance is added to the feature set.
4. Repeat the Process
1. In each iteration, one more feature is added from the remaining pool of features.
2. The feature that contributes the most to improving the model's performance is chosen at each step.
5. Stopping Criterion
1. The process continues until:
1. Adding a new feature does not significantly improve performance.
2. A predefined number of features have been selected.
3. A threshold score (e.g., a certain accuracy level) is reached.
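A minimal sketch of forward selection using scikit-learn's SequentialFeatureSelector (available in scikit-learn 0.24+); the estimator, scoring metric, and the 5-feature stopping criterion are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)        # scaling helps the logistic regression converge

sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5,                  # stopping criterion: a predefined number of features
    direction="forward",                     # start empty, add the best feature at each step
    scoring="accuracy",
    cv=5,
)
sfs.fit(X, y)
print(sfs.get_support(indices=True))         # indices of the selected features
```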



Forward Selection cont..
• Advantages of Forward Selection
• Computationally Efficient compared to evaluating all possible feature subsets.
• Interpretability: Features are selected based on their individual contributions.
• Avoids Overfitting: Reduces the risk of including irrelevant features.

• Disadvantages of Forward Selection


• Greedy Approach: Once a feature is added, it cannot be removed, even if a better combination exists later.
• Computational Cost: Still requires multiple model training iterations.
• Feature Dependencies Ignored: Some features might be useful only in the presence of others, which
forward selection might overlook.

• Use Cases
• Used in regression models (e.g., linear regression).
• Applied in machine learning classification problems.
• Useful in high-dimensional datasets where feature reduction is required.



2.2 Backward Selection
• Backward Selection, also known as Backward Elimination, is a feature selection technique used in
machine learning and statistical modeling. Unlike Forward Selection, which starts with an empty
feature set and adds features, Backward Selection begins with all available features and
systematically removes the least significant ones.
• How Backward Selection Works
1. Start with All Features
1. The model is initially trained using all available features.
2. Evaluate Feature Significance
1. The significance of each feature is determined using statistical tests (such as p-values in regression) or
performance metrics (like accuracy or R²).
3. Remove the Least Significant Feature
1. The feature with the least contribution to the model's performance is removed.
4. Retrain and Repeat
1. The model is retrained without the removed feature.
2. The process is repeated, eliminating the next least significant feature in each iteration.
5. Stopping Criterion
1. The process continues until:
1. Removing additional features does not improve performance.
2. A predefined number of features remain.
3. All remaining features are statistically significant.
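A minimal sketch of backward elimination, mirroring the forward-selection sketch above; only the direction argument changes (the estimator and feature count are again illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

sbs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5,
    direction="backward",        # start with all features, drop the least useful one each step
    scoring="accuracy",
    cv=5,
)
sbs.fit(X, y)
print(sbs.get_support(indices=True))
```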



Backward Selection cont..
• Advantages of Backward Selection
• Thorough Evaluation: Since it starts with all features, it considers feature interactions.
• More Reliable Than Forward Selection: It avoids the risk of missing important features
early in the selection process.

• Disadvantages of Backward Selection


• Computationally Expensive: Since it starts with all features, it requires more training
iterations, making it slower for large datasets.
• Assumes All Features Are Initially Useful: If the dataset contains many irrelevant
features, it may take longer to eliminate them.

• Use Cases
• Used in regression models (e.g., linear regression with p-values).
• Applied in machine learning to reduce dimensionality.
• Useful when dealing with a moderate number of features (not too large).
2.3 Exhaustive Feature Selection
• What is Exhaustive Feature Selection?
• Exhaustive Feature Selection is a comprehensive feature selection
technique that evaluates all possible combinations of features to
determine the optimal subset. Unlike Forward or Backward Selection,
which add or remove features iteratively, Exhaustive Selection tests every
feature combination to find the best-performing one.
• How Exhaustive Feature Selection Works
1.Generate All Possible Feature Combinations
1. The algorithm considers every possible subset of features, from a single feature to
all available features.
2.Train and Evaluate a Model for Each Combination
1. A model is trained and tested for each possible feature subset.
2. A predefined metric (e.g., accuracy, R², AUC-ROC) is used to measure performance.
3.Select the Best Subset
1. The feature subset that yields the highest model performance is selected.
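A minimal sketch of exhaustive feature selection using itertools and cross-validation; the dataset, estimator, and scoring metric are assumptions, and the approach is only practical for small feature counts:

```python
from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]

best_score, best_subset = -1.0, None
for k in range(1, n_features + 1):                   # subsets of size 1 .. n (2^n - 1 in total)
    for subset in combinations(range(n_features), k):
        score = cross_val_score(LogisticRegression(max_iter=1000),
                                X[:, list(subset)], y, cv=5).mean()
        if score > best_score:
            best_score, best_subset = score, subset

print(best_subset, best_score)                       # best-performing feature subset
```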



Exhaustive Feature Selection Cont..
• Advantages of Exhaustive Feature Selection
• Guaranteed Best Subset: Since it evaluates all combinations, it finds the most optimal feature set.
• Thorough Feature Evaluation: It captures interactions between features that other methods might miss.

• Disadvantages of Exhaustive Feature Selection


• Computationally Expensive: The number of possible feature subsets grows exponentially with the
number of features: 2^n − 1, where n is the number of features.
• Not Practical for Large Datasets: Due to the high computational cost, it is infeasible for datasets with many
features.

• Use Cases
• Suitable for small datasets with a limited number of features.
• Used in high-stakes applications where model accuracy is critical, such as healthcare and finance.
• Helpful when computational resources are not a constraint.
2.4 Recursive Feature Elimination (RFE)
• Recursive Feature Elimination (RFE) is an iterative feature selection
technique that removes the least significant features step by step, refining
the model at each iteration until the best subset of features is selected.
• How RFE Works
1.Train a Model on All Features
1. A machine learning model (e.g., linear regression, SVM, decision tree) is trained on
the full feature set.
2.Rank Feature Importance
1. The model assigns importance scores to each feature.
3.Remove the Least Important Feature(s)
1. The feature with the lowest importance score is removed.
4.Repeat Until Desired Number of Features is Reached
1. The process continues recursively until the specified number of features remains.
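A minimal sketch of RFE with scikit-learn; the estimator and the choice of 10 features are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

rfe = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=10,     # keep eliminating until 10 features remain
    step=1,                      # remove one feature per iteration
)
rfe.fit(X, y)

print(rfe.support_)              # boolean mask of selected features
print(rfe.ranking_)              # 1 = selected; larger values were eliminated earlier
```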



Recursive Feature Elimination (RFE) cont..
• Advantages of RFE
• Efficient for Medium-Sized Datasets: Compared to exhaustive selection, it is computationally less expensive.
• Works with Different Models: Can be used with any model that provides feature importance scores.
• Improves Model Generalization: By removing irrelevant features, it reduces overfitting.

• Disadvantages of RFE
• Computational Cost: More expensive than simple methods like Forward or Backward Selection.
• Feature Importance is Model-Dependent: The ranking depends on the choice of the model, which may lead
to different selections for different algorithms.

• Use Cases
• Commonly used in predictive modeling for selecting the most important features.
• Useful in scenarios where feature selection needs to be automated without exhaustive search.
• Works well for models like Support Vector Machines (SVM), Logistic Regression, and Decision Trees.



3) Embedded Approach
Embedded methods combine feature selection with the model training
process, offering a balance between computational efficiency and
performance.



• 3.1) Regularization
• Regularization techniques such as Lasso (L1) and Ridge (L2) shrink less important feature
coefficients toward zero; Lasso can drive coefficients exactly to zero, which makes it suitable for feature selection.
• 3.2) Random Forest Importance
• Random Forest algorithms provide feature importance scores based on how well
each feature improves the purity of node splits.



3.1 Regularization
• Regularization is a technique used in machine learning to prevent overfitting by adding a penalty term to the loss function, which
discourages the model from assigning too much importance to certain features.
• Types of Regularization for Feature Selection
1. Lasso (L1 Regularization)
1. Uses an L1 penalty that adds the absolute values of feature coefficients to the loss function.
2. Can shrink some feature coefficients to exactly zero, effectively removing them.
3. Best for feature selection when only a subset of features is needed.
2. Ridge (L2 Regularization)
1. Uses an L2 penalty that adds the squared values of feature coefficients to the loss function.
2. Shrinks coefficients towards zero but does not remove them completely.
3. Useful for reducing feature importance but not ideal for feature selection.
3. Elastic Net (Combination of L1 & L2)
1. Combines both Lasso and Ridge penalties.
2. Useful when features are highly correlated.
• Advantages of Regularization for Feature Selection
• Helps in automatic feature selection (especially Lasso).
• Improves model generalization by preventing overfitting.
• Works well with high-dimensional data.
• Use Cases
• Lasso regression is widely used in linear models for selecting the most important features.
• Regularization is applied in finance, healthcare, and text classification, where feature selection is crucial.
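A minimal sketch of Lasso-based (L1) feature selection with scikit-learn; the dataset, alpha value, and use of SelectFromModel are illustrative choices:

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)          # L1 penalties are sensitive to feature scale

selector = SelectFromModel(Lasso(alpha=1.0))   # larger alpha => more coefficients pushed to zero
X_selected = selector.fit_transform(X, y)

print(selector.estimator_.coef_)               # weak features may have exactly-zero coefficients
print(X_selected.shape)                        # only features with non-zero coefficients remain
```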
3.2 Random Forest
• Random Forest is an ensemble learning method that builds multiple decision trees and aggregates their results. It provides feature
importance scores by measuring how much each feature contributes to improving the model's decision-making.
• How is Feature Importance Calculated?
• Random Forest assigns importance scores based on how well each feature improves node purity.
• Two main ways to determine feature importance:
• Mean Decrease in Impurity (Gini Importance)
• Measures how much each feature reduces impurity (Gini Index) across all decision trees.
• Permutation Importance
• Shuffles feature values and observes how model accuracy drops.
• Advantages of Random Forest for Feature Selection
• Works well with high-dimensional data
• Handles non-linearity and feature interactions
• Computationally efficient compared to exhaustive search methods
• Useful for both classification and regression tasks
• Use Cases
• Medical Diagnosis: Identifying the most important biomarkers.
• Finance: Selecting the most predictive economic indicators.
• Natural Language Processing (NLP): Determining key words/features.
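A minimal sketch of ranking features by Random Forest Gini importance; the dataset and hyperparameters are illustrative:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(data.data, data.target)

importances = forest.feature_importances_      # mean decrease in impurity, averaged over trees
top = np.argsort(importances)[::-1][:5]        # indices of the 5 most important features
for idx in top:
    print(f"{data.feature_names[idx]}: {importances[idx]:.3f}")
```

(For the permutation-based alternative, scikit-learn's permutation_importance function can be used instead.)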



Unsupervised Feature Selection Techniques
• Unsupervised feature selection techniques allow us to explore and discover important
data characteristics without using labeled data.
• These methods are particularly useful when dealing with high-dimensional
datasets where we want to identify patterns and similarities without explicit instructions.
1.1) Principal Component Analysis (PCA)
• PCA is a powerful technique for dimensionality reduction that helps you
identify the most important features in your dataset. It works by finding the
principal components that capture the maximum variance in the data.

1.2) Independent Component Analysis (ICA)


• ICA is a technique that separates a multivariate signal into independent, non-
Gaussian components. It’s particularly useful when you want to identify the
sources of a signal rather than just the principal components.



1.3) Non-negative Matrix Factorization (NMF)
• NMF is a technique that decomposes a non-negative matrix into two non-negative matrices. It’s particularly
useful for text mining and image processing tasks.

1.4) T-distributed Stochastic Neighbor Embedding (t-SNE)


• t-SNE is a powerful technique for visualizing high-dimensional data in two or three dimensions. It’s
particularly useful for exploring similarities and patterns in complex datasets.

1.5) Autoencoder
• Autoencoders are neural networks that learn to compress and reconstruct data. The compressed representation
can be used for feature selection.



1.1 Principal Component Analysis (PCA)
• What is PCA?
• PCA is a dimensionality reduction technique used in machine learning and statistics.
• It transforms high-dimensional data into a lower-dimensional space while preserving
maximum variance.
• Common applications include data visualization, noise reduction, and feature
extraction.

• Benefits of PCA
• Reduces Dimensionality – Helps in handling large datasets.
• Removes Redundancy – Eliminates correlated features.
• Improves Computation Speed – Useful in ML models.
• Enhances Visualization – Converts high-dimensional data into 2D or 3D for better
interpretation.



Steps of PCA
1. Standardizing the Data
• Normalize all features to have mean = 0 and standard deviation = 1.
• Formula:
X' = (X - μ) / σ
where X is the original dataset, μ is the mean, and σ is the standard deviation.
2. Compute the Covariance Matrix
• The covariance matrix helps understand relationships between variables.
• Formula for an n × n covariance matrix:
C = (1/m) * (X')^T * X'
where m is the number of samples.
3. Compute Eigenvalues and Eigenvectors
• Solve the eigenvalue equation:
C*v=λ*v
• Eigenvalues (λ) represent the variance captured by each principal component.
• Eigenvectors (v) define the directions of the principal components.
4. Select the Top Principal Components
• Sort eigenvectors based on eigenvalues in descending order.
• Choose the top k eigenvectors that retain the most variance.
5. Transform the Data
• Project the original dataset onto the selected principal components:
X_new = X' * V_k
where V_k contains the top k eigenvectors.
• This results in a lower-dimensional representation of the data.
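The steps above can be reproduced with scikit-learn; a minimal sketch, assuming the iris dataset and 2 components purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

X_std = StandardScaler().fit_transform(X)   # step 1: standardize (mean 0, std 1)
pca = PCA(n_components=2)                   # steps 2-4: covariance, eigendecomposition, top-k
X_new = pca.fit_transform(X_std)            # step 5: project onto the principal components

print(pca.explained_variance_ratio_)        # share of variance captured by each component
print(X_new.shape)                          # (150, 2)
```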



1.2 Independent Component Analysis (ICA)
• What is ICA?
• ICA is a statistical technique used for blind source separation.
• It finds independent signals from a mixture of signals by maximizing statistical
independence.
• Common applications include audio processing (e.g., separating voices in a recording),
biomedical signal analysis (e.g., EEG data), and image processing.

• Benefits of ICA
• Separates Mixed Signals – Used in noise removal and feature extraction.
• Enhances Data Interpretability – Useful in medical and financial applications.
• Removes Redundant Information – Makes data analysis more efficient.

• ICA is widely used in speech processing, EEG signal analysis, and financial data
modeling



Steps of ICA
• 1. Centering and Whitening the Data
• Centering: Subtract the mean from each feature to ensure zero mean.
• Whitening: Transform data to have unit variance using Principal Component Analysis
(PCA).
• 2. Define an Independence Criterion
• ICA assumes that source signals are statistically independent.
• Common criteria:
• Minimizing mutual information
• Maximizing non-Gaussianity (e.g., using Kurtosis or Negentropy)
• 3. Apply an Iterative Algorithm
• Popular ICA algorithms include:
• FastICA (Fast Independent Component Analysis)
• Infomax ICA
• These algorithms adjust weights to separate independent components iteratively.
• 4. Extract Independent Components
• The transformed data represents independent components that correspond to the
original hidden sources.
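A minimal FastICA sketch on synthetic mixed signals; the sources and mixing matrix below are made up purely for illustration:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                              # source 1: sinusoid
s2 = np.sign(np.sin(3 * t))                     # source 2: square wave
S = np.c_[s1, s2] + 0.05 * rng.standard_normal((2000, 2))

A = np.array([[1.0, 0.5], [0.5, 2.0]])          # mixing matrix
X = S @ A.T                                     # observed mixed signals

ica = FastICA(n_components=2, random_state=0)   # centering/whitening is handled internally
S_estimated = ica.fit_transform(X)              # recovered independent components
print(S_estimated.shape)                        # (2000, 2)
```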
1.3 Non-negative Matrix Factorization (NMF)
• What is NMF?
• NMF is a matrix decomposition technique that factors a non-negative matrix
into two lower-dimensional non-negative matrices.
• It is used in dimensionality reduction, feature extraction, and data compression.
• Unlike PCA and ICA, NMF enforces non-negativity, making it ideal for
interpretability in applications like topic modeling and image processing.

• Benefits of NMF
• Enhances Interpretability – Outputs are easily understandable.
• Sparse Representations – Captures essential features with reduced redundancy.
• Used in Various Applications – Topic modeling, image processing, bioinformatics.

NMF is widely used in recommender systems, text mining, and signal processing.
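A minimal NMF sketch on a tiny hypothetical text corpus; the documents and the choice of 2 components are illustrative:

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock prices rose sharply today",
    "the market closed higher on strong earnings",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)  # non-negative document-term matrix
nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(X)      # document-topic weights
H = nmf.components_           # topic-term weights

print(W.shape, H.shape)
```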



1.4 T-distributed Stochastic Neighbor Embedding
(t-SNE)
• What is t-SNE?
• t-SNE is a non-linear dimensionality reduction technique used for
visualizing high-dimensional data in 2D or 3D.
• It preserves the local structure of data by mapping similar points
closer together in lower dimensions.
• Applications of t-SNE
• Visualizing high-dimensional datasets in machine learning
• Understanding word embeddings (NLP)
• Analyzing genomic and biomedical data
• t-SNE is widely used in data exploration and pattern discovery
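A minimal t-SNE sketch; the digits dataset and perplexity value are illustrative choices:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)           # 64-dimensional digit images

tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_2d = tsne.fit_transform(X)                  # 2-D embedding that preserves local structure

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=5, cmap="tab10")
plt.title("t-SNE embedding of the digits dataset")
plt.show()
```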
1.5 Autoencoder
• What is an Autoencoder?
• An unsupervised neural network used for dimensionality reduction, feature learning, and data compression.
• It consists of two parts:
• Encoder: Compresses input data into a lower-dimensional latent space.
• Decoder: Reconstructs the original data from the compressed representation.

• Architecture of an Autoencoder
1. Input Layer → Takes the raw data (e.g., images, text).
2. Encoder → Maps input to a lower-dimensional representation.
3. Bottleneck (Latent Space) → Captures the most important features.
4. Decoder → Reconstructs the original data from the latent space.
5. Output Layer → Produces a reconstructed version of the input.

• Applications of Autoencoders
• Image Denoising & Compression
• Anomaly Detection (Fraud, Medical Imaging)
• Data Generation (Variational Autoencoders - VAEs)

• Autoencoders are powerful tools for learning compact data representations
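A minimal Keras sketch of the encoder / bottleneck / decoder architecture described above; the layer sizes, optimizer, epochs, and random placeholder data are assumptions:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

input_dim, latent_dim = 784, 32                   # e.g. flattened 28x28 images -> 32-D code

inputs = keras.Input(shape=(input_dim,))          # input layer
encoded = layers.Dense(128, activation="relu")(inputs)               # encoder
latent = layers.Dense(latent_dim, activation="relu")(encoded)        # bottleneck (latent space)
decoded = layers.Dense(128, activation="relu")(latent)               # decoder
outputs = layers.Dense(input_dim, activation="sigmoid")(decoded)     # reconstruction

autoencoder = keras.Model(inputs, outputs)
encoder = keras.Model(inputs, latent)             # reusable for feature extraction
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.rand(1000, input_dim).astype("float32")                # placeholder data
autoencoder.fit(X, X, epochs=5, batch_size=64, verbose=0)            # learn to reconstruct the input
features = encoder.predict(X)                     # compressed representation for downstream use
print(features.shape)                             # (1000, 32)
```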

