CS 464 Introduction To Machine Learning: Feature Selection
Introduction to Machine Learning
Feature Selection
Score Features → Rank Features Based on Score → Select Top k Features → Train Model
• Scores do not represent prediction performance since no validation is done at this stage
• Do NOT use validation/test samples to compute the scores
• k can be chosen heuristically
• Standard rules of thumb can be used to set a threshold (e.g., use features with statistically significant scores)
• Can use cross-validation to select an optimal value of k (using prediction performance as the criterion)
Scoring Features for Filtering
• Mutual information
– Reduction in uncertainty about the value of the outcome variable upon observing the value of the feature
– Already discussed
• Statistical tests
– t-statistic: Standardized difference of the mean value of the
feature in different classes (continuous features)
– Chi-square statistic: Difference between counts in different
classes (discrete features, related to mutual information)
• Variance/frequency
– Continuous features with low variance are usually not useful
– Discrete features that are too frequent or too rare are usually
not useful
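• A rough sketch of these filter scores in code (not from the slides; scikit-learn and a generic feature matrix X with labels y are assumptions):

import numpy as np
from sklearn.feature_selection import (SelectKBest, VarianceThreshold,
                                        chi2, f_classif, mutual_info_classif)

# Hypothetical data: 100 samples, 20 non-negative features, binary labels
rng = np.random.default_rng(0)
X = rng.random((100, 20))
y = rng.integers(0, 2, size=100)

mi_scores = mutual_info_classif(X, y, random_state=0)  # mutual information per feature
chi2_scores, _ = chi2(X, y)      # chi-square (needs non-negative features, e.g., counts)
f_scores, _ = f_classif(X, y)    # ANOVA F-test (related to the t-statistic for 2 classes)

X_var = VarianceThreshold(threshold=1e-3).fit_transform(X)          # drop near-constant features
X_top = SelectKBest(mutual_info_classif, k=5).fit_transform(X, y)   # keep top-k features by MI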
Feature Selection – In Text Classification
• In text classification, we usually represent documents with a high-dimensional feature vector:
– Each dimension corresponds to a term
– Many dimensions correspond to rare words
– Rare words can mislead the classifier
Noisy Features
• A noise feature is one that, when included, increases the classification error on new data (e.g., a rare term that by chance appears only in documents of one class in the training set).
Filtering-Based Selection
• Use a simple measure to assess the relevance of
each feature to the outcome variable (class)
• Mutual information – reduction in the uncertainty about the class upon observing the value of the feature
• Chi-square test – a statistical test that compares the frequencies of a term between different classes
Information
• The information provided by observing that an event with probability $p(x)$ occurs:
$I(X = x) = \log_2 \frac{1}{p(x)} = -\log_2 p(x)$
• If the probability of the event is small and it nevertheless happens, the information gained is large.
Entropy
• The entropy of a random variable is the sum of the
information provided by its possible values, weighted by the
probability of each value
• Entropy is a measure of uncertainty
• Definition: $H(X) = -\sum_{x} p(x) \log_2 p(x)$
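• A small code sketch of this definition (illustrative, not from the slides):

import math

def entropy(probs):
    """Entropy in bits: H(X) = -sum over x of p(x) * log2 p(x)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit, maximal uncertainty for two outcomes
print(entropy([0.9, 0.1]))  # biased coin: about 0.47 bits, much more predictable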
Mutual Information
• If a term’s occurrence is independent of the class (i.e., the term’s distribution is the same within the class as in the collection as a whole), then MI is 0
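• The underlying definition (reconstructed here, since the slide's formula is not in the extracted text): the mutual information between the term-occurrence indicator $U$ and the class indicator $C$ is

$I(U;C) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} P(U = e_t, C = e_c) \log_2 \frac{P(U = e_t, C = e_c)}{P(U = e_t)\,P(C = e_c)}$

which is 0 exactly when $U$ and $C$ are independent.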
Mutual Information Example
• Consider the class poultry and the term export
• $N_{10}$: number of documents that contain $t$ ($e_t = 1$) and are not in $c$ ($e_c = 0$)
• $N_{11}$: number of documents that contain $t$ ($e_t = 1$) and are in $c$ ($e_c = 1$)
• $N_{01}$: number of documents that do not contain $t$ ($e_t = 0$) and are in $c$ ($e_c = 1$)
• $N_{00}$: number of documents that do not contain $t$ ($e_t = 0$) and are not in $c$ ($e_c = 0$)
How to compute Mutual Information
• Based on maximum likelihood estimates, the formula we
actually use is:
• $N_{10}$: number of documents that contain $t$ ($e_t = 1$) and are not in $c$ ($e_c = 0$)
• $N_{11}$: number of documents that contain $t$ ($e_t = 1$) and are in $c$ ($e_c = 1$)
• $N_{01}$: number of documents that do not contain $t$ ($e_t = 0$) and are in $c$ ($e_c = 1$)
• $N_{00}$: number of documents that do not contain $t$ ($e_t = 0$) and are not in $c$ ($e_c = 0$)
• $N = N_{00} + N_{01} + N_{10} + N_{11}$
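• The formula itself did not survive extraction; the standard count-based maximum-likelihood estimate (a reconstruction, writing $N_{1.} = N_{10} + N_{11}$, $N_{.1} = N_{01} + N_{11}$, etc. for the marginal counts) is:

$I(U;C) = \frac{N_{11}}{N}\log_2\frac{N\,N_{11}}{N_{1.}N_{.1}} + \frac{N_{01}}{N}\log_2\frac{N\,N_{01}}{N_{0.}N_{.1}} + \frac{N_{10}}{N}\log_2\frac{N\,N_{10}}{N_{1.}N_{.0}} + \frac{N_{00}}{N}\log_2\frac{N\,N_{00}}{N_{0.}N_{.0}}$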
Mutual Information Example
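• The slide's worked numbers are not in the extracted text; as an illustrative stand-in, a direct implementation of the count-based estimate with hypothetical counts:

import math

def mutual_information(n11, n10, n01, n00):
    """Count-based MLE of I(U;C) in bits for a single term/class pair."""
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00   # documents containing / not containing the term
    n_1, n_0 = n11 + n01, n10 + n00   # documents in / not in the class
    mi = 0.0
    for n_cell, n_t, n_c in [(n11, n1_, n_1), (n10, n1_, n_0),
                             (n01, n0_, n_1), (n00, n0_, n_0)]:
        if n_cell > 0:                # empty cells contribute 0 in the limit
            mi += (n_cell / n) * math.log2(n * n_cell / (n_t * n_c))
    return mi

# Hypothetical counts (not the slide's poultry/export numbers)
print(mutual_information(n11=30, n10=70, n01=20, n00=880))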
Chi-square statistic
• The Chi-square test is applied to test the
independence of two events, where two events A
and B are defined to be independent if
• P(AB) = P(A)P(B) or, equivalently,
• P(A|B) = P(A) and P(B|A) = P(B).
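• A quick sketch of this independence test on a 2x2 term/class table (illustrative; scipy is an assumption, not something the slides require):

import numpy as np
from scipy.stats import chi2_contingency

# Rows: term present / absent; columns: in class / not in class
table = np.array([[30, 70],
                  [20, 880]])

chi2_stat, p_value, dof, expected = chi2_contingency(table)
print(chi2_stat, p_value)   # a large statistic / small p-value suggests dependence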
t-statistic
• We have n1 and n2 samples from each class, respectively
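• The slide's formula is not in the extracted text; a common (Welch-style) form, shown here as a sketch, standardizes the difference of the per-class means of the feature by its standard error:

$t = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$

where $\bar{x}_i$ and $s_i^2$ are the mean and variance of the feature within class $i$.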
Wrapper-Based Selection
• Treat feature selection as a search over feature subsets (states):
– Operators: add/subtract a feature
– Scoring function: cross-validation accuracy using the learning method on a given state's feature set
Forward Selection
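• The slide's illustration is not in the extracted text; a minimal greedy forward-selection sketch under the search framing above (the scikit-learn model and CV settings are assumptions):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, k, model=None):
    """Greedily add the feature that most improves 5-fold CV accuracy."""
    model = model or LogisticRegression(max_iter=1000)
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < k:
        # score every candidate state "selected + one new feature"
        scores = {j: cross_val_score(model, X[:, selected + [j]], y, cv=5).mean()
                  for j in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

In practice one would also stop early once adding a feature no longer improves the cross-validation score.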
Ridge Regression
• Linear model: $f(x) = w_0 + \sum_{i=1}^{d} x_i w_i$
• Minimize the squared error plus an L2 penalty on the weights (the intercept $w_0$ is not penalized):
$E(\mathbf{w}) = \sum_{n=1}^{N} \Big( y_n - w_0 - \sum_{j \ge 1} x_{nj} w_j \Big)^2 + \lambda \sum_{j \ge 1} w_j^2$
LASSO
• Ridge regression shrinks the weights, but does not
necessarily reduce the number of features
– We would like to force some coefficients to be set to 0
Plot of the contours of the unregularized error function (red) along with the constraint regions for lasso (left) and ridge (right); the 𝛽's are the weights we learn.
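• Lasso replaces the squared (L2) penalty above with an L1 penalty $\lambda \sum_{j \ge 1} |w_j|$, which can drive some weights exactly to zero. A small comparison sketch (illustrative; scikit-learn and synthetic data are assumptions):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first 3 of 20 features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
true_w = np.zeros(20)
true_w[:3] = [3.0, -2.0, 1.5]
y = X @ true_w + 0.1 * rng.normal(size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Ridge shrinks all weights; Lasso typically zeroes out the irrelevant ones
print("nonzero ridge coefficients:", int(np.sum(np.abs(ridge.coef_) > 1e-6)))
print("nonzero lasso coefficients:", int(np.sum(lasso.coef_ != 0)))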
Generalizing Regularization
• L1 and L2 penalties can be used with other learning
methods (logistic regression, neural nets, SVMs,
etc.)
– Both can help avoid overfitting by reducing variance
• There are many variants with somewhat different
biases
– Elastic net: includes L1 and L2 penalties
– Group Lasso: bias towards selecting defined groups of
features
– Graph Lasso: bias towards selecting “adjacent” features
in a defined graph
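• As an illustration of reusing these penalties with another learner (a sketch; the solver choices below are requirements of scikit-learn, not something the slides specify):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=30, n_informative=5,
                           random_state=0)

# L2 penalty (default): shrinks weights, rarely makes them exactly zero
l2_clf = LogisticRegression(penalty="l2", C=1.0, max_iter=2000).fit(X, y)

# L1 penalty: sparse weights, acts as embedded feature selection
l1_clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)

# Elastic net: mixes L1 and L2; l1_ratio controls the mix, requires the saga solver
en_clf = LogisticRegression(penalty="elasticnet", solver="saga",
                            l1_ratio=0.5, C=1.0, max_iter=5000).fit(X, y)

print(int((l1_clf.coef_ != 0).sum()), "nonzero weights with the L1 penalty")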