CS 464 Introduction To Machine Learning: Feature Selection
Introduction to Machine Learning
Feature Selection
Score Features → Rank Features Based on Score → Select Top k Features → Train Model
• Scores do not represent prediction performance since no validation is done at this stage
• Do NOT use validation/test samples to compute the scores
• k can be chosen heuristically
• Standard rules of thumb can be used to set a threshold (e.g., use features with statistically significant scores)
• Can use cross-validation to select an optimal value of k (using prediction performance as the criterion)
Scoring Features for Filtering
• Mutual information
– Reduction in uncertainty about the value of the outcome variable upon observing the value of the feature
– Already discussed
• Statistical tests
– t-statistic: Standardized difference of the mean value of the
feature in different classes (continuous features)
– Chi-square statistic: Difference between counts in different
classes (discrete features, related to mutual information)
• Variance/frequency
– Continuous features with low variance are usually not useful
– Discrete features that are too frequent or too rare are usually
not useful
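• A rough sketch of these filter scores in code (not from the slides; scikit-learn and a generic feature matrix X with labels y are assumptions):

import numpy as np
from sklearn.feature_selection import (SelectKBest, VarianceThreshold,
                                        chi2, f_classif, mutual_info_classif)

# Hypothetical data: 100 samples, 20 non-negative features, binary labels
rng = np.random.default_rng(0)
X = rng.random((100, 20))
y = rng.integers(0, 2, size=100)

mi_scores = mutual_info_classif(X, y, random_state=0)  # mutual information per feature
chi2_scores, _ = chi2(X, y)      # chi-square (needs non-negative features, e.g., counts)
f_scores, _ = f_classif(X, y)    # ANOVA F-test (related to the t-statistic for 2 classes)

X_var = VarianceThreshold(threshold=1e-3).fit_transform(X)          # drop near-constant features
X_top = SelectKBest(mutual_info_classif, k=5).fit_transform(X, y)   # keep top-k features by MI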
Feature Selection – In Text Classification
• In text classification, we usually represent documents with a high-dimensional feature vector:
– Each dimension corresponds to a term
– Many dimensions correspond to rare words
– Rare words can mislead the classifier
Noisy Features
• A noise feature is one that, when included, increases the classification error on new data (e.g., a rare term that by chance appears only in documents of one class in the training set).
Filtering-Based Selection
• Use a simple measure to assess the relevance of
each feature to the outcome variable (class)
• Mutual information – reduction in the uncertainty about the class upon observing the value of the feature
• Chi-square test – a statistical test that compares the frequencies of a term between different classes
Information
• The information provided by observing that an event with probability $p(x)$ occurs:
$I(X = x) = \log_2 \frac{1}{p(x)} = -\log_2 p(x)$
• If the probability of the event is small and it nevertheless happens, the information gained is large.
Entropy
• The entropy of a random variable is the sum of the
information provided by its possible values, weighted by the
probability of each value
• Entropy is a measure of uncertainty
• Definition: $H(X) = -\sum_{x} p(x) \log_2 p(x)$
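• A small code sketch of this definition (illustrative, not from the slides):

import math

def entropy(probs):
    """Entropy in bits: H(X) = -sum over x of p(x) * log2 p(x)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit, maximal uncertainty for two outcomes
print(entropy([0.9, 0.1]))  # biased coin: about 0.47 bits, much more predictable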
Mutual Information
• If a term’s occurrence is independent of the class (i.e., the term’s distribution is the same within the class as in the collection as a whole), then MI is 0
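• The underlying definition (reconstructed here, since the slide's formula is not in the extracted text): the mutual information between the term-occurrence indicator $U$ and the class indicator $C$ is

$I(U;C) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} P(U = e_t, C = e_c) \log_2 \frac{P(U = e_t, C = e_c)}{P(U = e_t)\,P(C = e_c)}$

which is 0 exactly when $U$ and $C$ are independent.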
Mutual Information Example
• Consider the class poultry and the term export
• $N_{10}$: number of documents that contain $t$ ($e_t = 1$) and are not in $c$ ($e_c = 0$)
• $N_{11}$: number of documents that contain $t$ ($e_t = 1$) and are in $c$ ($e_c = 1$)
• $N_{01}$: number of documents that do not contain $t$ ($e_t = 0$) and are in $c$ ($e_c = 1$)
• $N_{00}$: number of documents that do not contain $t$ ($e_t = 0$) and are not in $c$ ($e_c = 0$)
How to compute Mutual Information
• Based on maximum likelihood estimates, the formula we
actually use is:
• $N_{10}$: number of documents that contain $t$ ($e_t = 1$) and are not in $c$ ($e_c = 0$)
• $N_{11}$: number of documents that contain $t$ ($e_t = 1$) and are in $c$ ($e_c = 1$)
• $N_{01}$: number of documents that do not contain $t$ ($e_t = 0$) and are in $c$ ($e_c = 1$)
• $N_{00}$: number of documents that do not contain $t$ ($e_t = 0$) and are not in $c$ ($e_c = 0$)
• $N = N_{00} + N_{01} + N_{10} + N_{11}$
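• The formula itself did not survive extraction; the standard count-based maximum-likelihood estimate (a reconstruction, writing $N_{1.} = N_{10} + N_{11}$, $N_{.1} = N_{01} + N_{11}$, etc. for the marginal counts) is:

$I(U;C) = \frac{N_{11}}{N}\log_2\frac{N\,N_{11}}{N_{1.}N_{.1}} + \frac{N_{01}}{N}\log_2\frac{N\,N_{01}}{N_{0.}N_{.1}} + \frac{N_{10}}{N}\log_2\frac{N\,N_{10}}{N_{1.}N_{.0}} + \frac{N_{00}}{N}\log_2\frac{N\,N_{00}}{N_{0.}N_{.0}}$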
Mutual Information Example
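• The slide's worked numbers are not in the extracted text; as an illustrative stand-in, a direct implementation of the count-based estimate with hypothetical counts:

import math

def mutual_information(n11, n10, n01, n00):
    """Count-based MLE of I(U;C) in bits for a single term/class pair."""
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00   # documents containing / not containing the term
    n_1, n_0 = n11 + n01, n10 + n00   # documents in / not in the class
    mi = 0.0
    for n_cell, n_t, n_c in [(n11, n1_, n_1), (n10, n1_, n_0),
                             (n01, n0_, n_1), (n00, n0_, n_0)]:
        if n_cell > 0:                # empty cells contribute 0 in the limit
            mi += (n_cell / n) * math.log2(n * n_cell / (n_t * n_c))
    return mi

# Hypothetical counts (not the slide's poultry/export numbers)
print(mutual_information(n11=30, n10=70, n01=20, n00=880))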
Chi-square statistic
• The Chi-square test is applied to test the
independence of two events, where two events A
and B are defined to be independent if
• P(AB) = P(A)P(B) or, equivalently,
• P(A|B) = P(A) and P(B|A) = P(B).
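• A quick sketch of this independence test on a 2x2 term/class table (illustrative; scipy is an assumption, not something the slides require):

import numpy as np
from scipy.stats import chi2_contingency

# Rows: term present / absent; columns: in class / not in class
table = np.array([[30, 70],
                  [20, 880]])

chi2_stat, p_value, dof, expected = chi2_contingency(table)
print(chi2_stat, p_value)   # a large statistic / small p-value suggests dependence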
t-statistic
• We have n1 and n2 samples from each class, respectively
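• The slide's formula is not in the extracted text; a common (Welch-style) form, shown here as a sketch, standardizes the difference of the per-class means of the feature by its standard error:

$t = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$

where $\bar{x}_i$ and $s_i^2$ are the mean and variance of the feature within class $i$.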
Wrapper-Based Selection
• Treat feature selection as a search over feature subsets (states):
– Operators: add/subtract a feature
– Scoring function: cross-validation accuracy using the learning method on a given state's feature set
Forward Selection
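• The slide's illustration is not in the extracted text; a minimal greedy forward-selection sketch under the search framing above (the scikit-learn model and CV settings are assumptions):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, k, model=None):
    """Greedily add the feature that most improves 5-fold CV accuracy."""
    model = model or LogisticRegression(max_iter=1000)
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < k:
        # score every candidate state "selected + one new feature"
        scores = {j: cross_val_score(model, X[:, selected + [j]], y, cv=5).mean()
                  for j in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

In practice one would also stop early once adding a feature no longer improves the cross-validation score.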
Ridge Regression
• Linear model: $f(x) = w_0 + \sum_{i=1}^{d} x_i w_i$
• Minimize the squared error plus an L2 penalty on the weights (the intercept $w_0$ is not penalized):
$E(\mathbf{w}) = \sum_{n=1}^{N} \Big( y_n - w_0 - \sum_{j \ge 1} x_{nj} w_j \Big)^2 + \lambda \sum_{j \ge 1} w_j^2$
LASSO
• Ridge regression shrinks the weights, but does not
necessarily reduce the number of features
– We would like to force some coefficients to be set to 0
Plot of the contours of the unregularized error function (red) along with the constraint regions for lasso (left) and ridge (right); the 𝛽's are the weights we learn.
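• Lasso replaces the squared (L2) penalty above with an L1 penalty $\lambda \sum_{j \ge 1} |w_j|$, which can drive some weights exactly to zero. A small comparison sketch (illustrative; scikit-learn and synthetic data are assumptions):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first 3 of 20 features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
true_w = np.zeros(20)
true_w[:3] = [3.0, -2.0, 1.5]
y = X @ true_w + 0.1 * rng.normal(size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Ridge shrinks all weights; Lasso typically zeroes out the irrelevant ones
print("nonzero ridge coefficients:", int(np.sum(np.abs(ridge.coef_) > 1e-6)))
print("nonzero lasso coefficients:", int(np.sum(lasso.coef_ != 0)))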
Generalizing Regularization
• L1 and L2 penalties can be used with other learning
methods (logistic regression, neural nets, SVMs,
etc.)
– Both can help avoid overfitting by reducing variance
• There are many variants with somewhat different
biases
– Elastic net: includes L1 and L2 penalties
– Group Lasso: bias towards selecting defined groups of
features
– Graph Lasso: bias towards selecting “adjacent” features
in a defined graph
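• As an illustration of reusing these penalties with another learner (a sketch; the solver choices below are requirements of scikit-learn, not something the slides specify):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=30, n_informative=5,
                           random_state=0)

# L2 penalty (default): shrinks weights, rarely makes them exactly zero
l2_clf = LogisticRegression(penalty="l2", C=1.0, max_iter=2000).fit(X, y)

# L1 penalty: sparse weights, acts as embedded feature selection
l1_clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)

# Elastic net: mixes L1 and L2; l1_ratio controls the mix, requires the saga solver
en_clf = LogisticRegression(penalty="elasticnet", solver="saga",
                            l1_ratio=0.5, C=1.0, max_iter=5000).fit(X, y)

print(int((l1_clf.coef_ != 0).sum()), "nonzero weights with the L1 penalty")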