
ADVICE ABOUT PRACTICAL ASPECTS OF ML

Jesse Davis
Goals of this Lecture: Address Practical Aspects of Machine Learning

 Massaging the data for better performance
 Discussing how to set up an appropriate empirical evaluation
 Identifying potential pitfalls
 At a high level: a bunch of things I wish I had known for
   Performing academic empirical evaluations
   Dealing with real-world "applied" tasks
Part I: Selecting Features
Dimensionality Reduction

 Represent the data with fewer dimensions! ☺
 Effectively: Alter the given feature space
 Two broad ways:
   Construct a new feature space
   Simply drop dimensions in the given space
Why Dimensionality Reduction?

 Easier learning – fewer parameters
   What if |features| ≫ |training examples|?
 Better visualization
   Hard to understand more than 3D or 4D
 Discover the "intrinsic dimensionality" of the data
   High-dimensional data may truly be low dimensional
 More interpretable models
   Interested in which features are relevant for the task
 Improved efficiency
   Fewer features = less memory / runtime
Don’t Some Algorithms Do This?

 Decision trees:
   Select the most promising feature at each node
   The tree only contains a subset of the features
 Problem: Irrelevant attributes can degrade performance due to data fragmentation
   Data is split into smaller and smaller sets
   Even a random attribute can look good on little data, purely by chance
   More data does not help

Principal Component Analysis

 First principal component: the direction of the largest variance
 Each subsequent principal component:
   Orthogonal to the previous ones, and
   The direction of the largest variance of the residuals

[Figure: 2D data in the (x1, x2) plane with the first principal component direction u1]

Big Idea: Rotate the axes and drop irrelevant ones!
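To make the rotate-and-drop idea concrete, here is a minimal sketch using scikit-learn's PCA (illustrative, not from the slides; the toy data and the choice of two components are assumptions):

```python
# Minimal PCA sketch: fit on toy data, keep only the highest-variance directions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))            # 200 examples, 10 features (toy data)

pca = PCA(n_components=2)                 # keep the 2 directions of largest variance
X_reduced = pca.fit_transform(X)          # rotate the axes, then drop the rest

print(X_reduced.shape)                    # (200, 2)
print(pca.explained_variance_ratio_)      # fraction of variance captured per component
```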


Eigenfaces [Turk, Pentland ’91]

Input images:
 N images
 Each 50 × 50 pixels
 2500 features

The figure can be misleading: it is best to think of the data as an N × 2500 matrix, i.e., |examples| × |features|.
Reduce Dimensionality: 2500 → 15

[Figure: the average face, the first principal component, and the other eigenface components]
Problematic Data Set for PCA

PCA cannot capture NON-LINEAR structure!


PCA Conclusions

 PCA:
   Rotate the axes and sort the new dimensions in order of "importance"
   Discard the low-significance dimensions
 Uses:
   Get a compact description
   Ignore noise
   Improve classification (hopefully)
 Not magic:
   Doesn't know the class labels
   Can only capture linear variation
 One of many tricks to reduce dimensionality!

Feature Selection: Two Approaches

Filtering-based feature selection:
 All features → FS algorithm scores and ranks each feature and picks the top k → ML algorithm → model

Wrapper-based feature selection:
 All features → FS algorithm calls the ML algorithm many times and uses it to help select features → ML algorithm → model
Filter-Based Approaches

 Idea: Measure each feature's usefulness in isolation (i.e., independently of the other features)
 Pro: Very fast, so it scales to large feature sets or large data sets
 Cons:
   Misses feature interactions
   May select many redundant features
Approach 1: Correlation

 Information gain (for a discrete attribute A):

$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)$

 Pearson correlation (for a continuous feature $f_i$):

$R(f_i, y) = \frac{\mathrm{cov}(f_i, y)}{\sqrt{\mathrm{var}(f_i)\,\mathrm{var}(y)}} = \frac{\sum_{k=1}^{m} (f_{k,i} - \bar{f}_i)(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{m} (f_{k,i} - \bar{f}_i)^2 \, \sum_{k=1}^{m} (y_k - \bar{y})^2}}$
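A minimal sketch of a correlation-based filter (illustrative; the data X, target y, and the value of k are assumptions, not from the slides):

```python
# Correlation-based filter: score each feature in isolation, keep the top k.
import numpy as np

def top_k_by_correlation(X, y, k):
    """X: (m, d) feature matrix, y: (m,) target. Returns indices of the k
    features with the largest absolute Pearson correlation with y."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]

# Toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 3] + 0.1 * rng.normal(size=100)   # feature 3 is informative
print(top_k_by_correlation(X, y, k=5))     # feature 3 should rank first
```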
Approach 2: Single-Variable Classifier

 Select each variable according to its individual predictive performance
 Build a classifier with just that one variable:
   Discrete: Decision stump
   Continuous: Threshold the variable's value
 Measure performance using accuracy, balanced accuracy, AUC, etc.
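For continuous features and a binary label, one convenient single-variable score is the AUC obtained by thresholding the feature itself; a minimal sketch (illustrative, not from the slides):

```python
# Score each continuous feature as a one-variable threshold classifier via AUC.
# (Illustrative sketch; X, y are assumed to be a binary-labeled toy dataset.)
import numpy as np
from sklearn.metrics import roc_auc_score

def single_variable_auc(X, y):
    """Returns one AUC per feature; using the raw feature value as the score
    is equivalent to sweeping over all possible thresholds."""
    aucs = np.array([roc_auc_score(y, X[:, j]) for j in range(X.shape[1])])
    return np.maximum(aucs, 1.0 - aucs)    # a feature that ranks "backwards" is still useful

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X[:, 2] > 0).astype(int)              # feature 2 determines the label
print(single_variable_auc(X, y).round(2))  # feature 2 should score near 1.0
```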
Wrapper-Based Feature Selection

 Feature selection = search
   State = a set of features
   Start state:
    ◼ Forward selection: Empty set
    ◼ Backward elimination: Full set
 Operators:
   Forward: add a feature
   Backward: remove a feature
 Scoring function: The learned model's performance (on a tuning set or via cross-validation) using the state's feature set
Forward Feature Selection

Greedy search (aka "hill climbing"):

 Start with the empty set {}: 50%
 Add one feature: {F1} 62%, {F2} 72%, ..., {Fd} 52% → keep the best: {F2}
 Add a second feature to {F2}: {F1,F2} 74%, {F2,F3} 73%, ..., {F2,Fd} 84% → keep the best, and continue
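A minimal sketch of greedy forward selection as a wrapper (illustrative; the logistic regression model, 5-fold CV scoring, and stopping rule are assumptions, not from the slides):

```python
# Greedy forward selection: repeatedly add the feature that most improves CV score.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

def forward_select(X, y, max_features):
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_features:
        scores = {j: cross_val_score(LogisticRegression(max_iter=1000),
                                     X[:, selected + [j]], y, cv=5).mean()
                  for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_score:       # no candidate improves: stop
            break
        best_score = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected, best_score

# Toy usage
X, y = make_classification(n_samples=300, n_features=15, n_informative=4, random_state=0)
print(forward_select(X, y, max_features=5))
```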
Backward Feature Selection

Greedy search (aka "hill climbing"):

 Start with all features {F1,…,Fd}: 75%
 Remove one feature: {F2,…,Fd} 72%, {F1,F3,…,Fd} 82%, ..., {F1,…,Fd-1} 78% → keep the best: remove F2
 Remove a second feature from {F1,F3,…,Fd}: {F3,…,Fd} 80%, {F1,F4,…,Fd} 83%, ..., {F1,F3,…,Fd-1} 81% → keep the best, and continue
Forward vs. Backward Selection

Forward:
 Faster in the early steps because there are fewer features to test
 Fast when choosing a small subset of the features
 Misses features whose usefulness requires other features (feature synergy)

Backward:
 Fast when choosing all but a small subset of the features
 Preserves features whose usefulness requires other features (e.g., area requires both length and width)
[Figure: Impact of feature selection on classification of fMRI data [Pereira et al. ’05]]
Feature Selection vs. Dimensionality Reduction

 Feature selection: Project onto a lower-dimensional subspace perpendicular to the removed feature
 Dimensionality reduction: Allows other kinds of projection

[Figure: left, dropping x2 projects the data onto the x1 axis; right, projecting the data onto rotated axes]
Feature Selection in Practice

 You cannot globally select the best features over the entire data set before evaluation
   This is cheating: data leaks from the test set into the training set
   Results would be overoptimistic
 Feature selection must be performed separately for each fold (see the sketch below)
 Implication: Each fold could have a different feature set
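One way to guarantee that selection happens per fold is to wrap it in a pipeline, so the selector is refit on each training split; a minimal scikit-learn sketch (illustrative; the SelectKBest filter and the classifier are assumptions, not from the slides):

```python
# Feature selection refit inside each CV fold via a Pipeline,
# so no information from a fold's test split leaks into selection.
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=50, n_informative=5, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),       # filter chosen per training split
    ("clf", LogisticRegression(max_iter=1000)),
])
print(cross_val_score(pipe, X, y, cv=5).mean())     # honest estimate: no leakage
```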


Advice for Evaluation
Empirical Evaluation: Think about What You Want to Demonstrate

 Many relevant questions:
   Do we beat competitors?
   Are we more data efficient than the competition?
   Are we faster than the competition?
 Good practices:
   Pose a question / hypothesis and answer it
   Also include a naive baseline (see the sketch below), such as one that:
    ◼ Always predicts the majority class
    ◼ Returns the mean value in the training data
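scikit-learn's dummy estimators implement exactly these naive baselines; a minimal sketch (illustrative, not from the slides):

```python
# Naive baselines: always predict the majority class / the training-set mean.
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

X = np.zeros((6, 1))                       # features are ignored by dummy estimators
y_cls = np.array([0, 0, 0, 0, 1, 1])       # imbalanced toy labels
y_reg = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

majority = DummyClassifier(strategy="most_frequent").fit(X, y_cls)
mean_reg = DummyRegressor(strategy="mean").fit(X, y_reg)

print(majority.predict(X))                 # all 0s: the majority class
print(mean_reg.predict(X))                 # all 3.5: the training mean
# If your model cannot clearly beat these baselines, the result is not convincing.
```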
Case Study: RPE for Professional Soccer Players

 Given: GPS and accelerometer data from a player's training session
 Predict: The player's rating of perceived exertion (RPE)
 Question: Is the model valid across seasons?

[Figure: MAE (0.00–1.20) of the train-set average baseline, a neural net, and LASSO]
Results: Is an Individual Model More Accurate Than a Team Model?

[Figure: mean absolute error (0.65–0.90) for individual vs. team models, for a neural net, a boosted tree, and LASSO; lower is better]
How Does the Amount of Data Affect Performance?

[Figure: AUCPR (0.00–0.50) vs. number of training databases (1, 2, 3) for TODTLER, DTM, LSM, and a random baseline]

 Learning curve: Show performance as a function of the amount of training data
Case Study: Activity Recognition

 Given: 3D accelerometer data from a phone
 Predict: The person's activity (walking, ascending stairs, descending stairs, cycling, jogging)
 Hypothesis: Deriving new signals will help
 Setup: Simulate different attachments by rotating the axes
 Approaches compared:
   TSFuse + GBT
   TSFresh + GBT (time-series features, but no fusion)
   RNN (LSTM)
Results: Activity Recognition

[Figure: performance comparison of TSFuse, TSFresh, and the RNN]
Case Study: Energy-Efficient Prediction

 Motivation: Learned models are often deployed on devices with resource constraints (e.g., battery)
 Question: How does the feature selection strategy affect performance?
   Static selection: Always consider k features
   Dynamic selection: May ignore some features
 Approach: Fix a maximum feature budget
RCV: Speedup and Weighted Accuracy vs. Feature Budget

[Figure: speedup factor (0–6) and Δ weighted accuracy (−0.01 to 0.02) as a function of the feature budget (0–1000) for IG and ΔCP. Annotation: our approach makes 4X more predictions on the same resource budget]
Comparing Run Times Is a Dark Art

 What to measure: Wall-clock time or CPU time? (see the sketch below)
 Be sure to run everything on identically configured machines
 Should you include the time to tune models?
   Easy to manipulate
   Also very relevant…
 Differences can be due to:
   Programming languages
   How optimized the code is (definitely relevant)
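For reference, Python's standard library exposes both clocks; a minimal sketch of the distinction (illustrative, not from the slides):

```python
# Wall-clock time vs. CPU time for the same piece of work.
import time

def work():
    return sum(i * i for i in range(10_000_000))

wall_start, cpu_start = time.perf_counter(), time.process_time()
work()
wall = time.perf_counter() - wall_start   # elapsed real time (includes waiting, other load)
cpu = time.process_time() - cpu_start     # CPU time consumed by this process only

print(f"wall: {wall:.3f}s  cpu: {cpu:.3f}s")
# Report which one you measured; on a busy or multi-threaded machine they can differ a lot.
```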
Evaluate Design Decisions: Ablation or Lesion Study

 When designing your algorithm / model you make lots of design choices:
   Which features
   Which normalizations
   Which functionality
 Ablative analysis tries to explain the difference between some (much poorer) baseline performance and the current performance
 Remove aspects of the system and measure the effect on performance (see the sketch below)
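A minimal sketch of an ablation loop over feature groups (illustrative; the feature groups, the gradient-boosted model, and the CV metric are assumptions, not from the slides):

```python
# Ablation over feature groups: drop one group at a time, retrain, and compare.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=400, n_features=12, n_informative=6, random_state=0)
groups = {"gps": [0, 1, 2, 3], "accelerometer": [4, 5, 6, 7], "derived": [8, 9, 10, 11]}  # hypothetical

def cv_score(cols):
    return cross_val_score(GradientBoostingClassifier(), X[:, cols], y, cv=5).mean()

full = cv_score(list(range(X.shape[1])))
print(f"full model: {full:.3f}")
for name, cols in groups.items():
    kept = [c for c in range(X.shape[1]) if c not in cols]
    score = cv_score(kept)
    print(f"without {name}: {score:.3f}  (drop = {full - score:.3f})")
```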
Case Study: Fatigue Protocol Data

 Rating of perceived exertion (RPE): scale of 6–20
 Given: IMU data from a runner (sensor placements: upper arm, wrist, tibia, or both)
 Predict: The runner's current fatigue level
Pre-processing: Normalizations Based on Domain Knowledge

 RPE evolution is trial-dependent: Normalize to the first value
 Normalize features based on the change from the first windows
 Domain insight: The change in feature values over time is key
Effects of Feature Normalization for Gradient Boosted Trees

[Figure: MAE of RPE for two no-learning baselines that make constant predictions (median RPE and personalized median) and for gradient boosted trees with and without feature normalization]
Case Study: Resource Monitoring

 Given: Real water usage data from a retail store
   Univariate measurement, sampled every 5 minutes
   Patterns of interest include maintenance and abnormally high usage
 Do: Detect periods of high usage
 Approach: Semi-supervised learning
   Simple statistical features, day of week, etc.
   The above features plus learned shape patterns
Results: Anomaly Detection for Water Usage

[Figure: area under the ROC curve over time for simple features vs. simple features plus learned patterns]
Potential Problems or Pitfalls
Cross-Validation Errors

 You must repeat the entire data processing pipeline on every fold of cross-validation, using only that fold's TRAINING DATA
   E.g., you cannot do preprocessing over the entire data set (feature selection, parameter tuning, etc.)
 Did I tweak my algorithm a million times until I got good results?
   Solution: Use one or two datasets for development, then expand the evaluation
 Are there temporal dependencies in the data?
Temporal Data Is Trickier!

 Setting 1: One season of data from training sessions of a professional football team (see the sketch below)
   [Timeline: from season start to season end; training uses the first 80% of the data, testing the last 20%]
 Setting 2: Predicting adverse drug reactions
   [Timeline: the patient's history forms the training data; after the first prescription, a censoring window determines whether an adverse reaction occurs]
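A minimal sketch of a chronological split for such temporal data (illustrative; the function and argument names are assumptions, not from the slides):

```python
# Chronological train/test split: never let the model peek at the future.
import numpy as np

def chronological_split(X, y, timestamps, train_fraction=0.8):
    """Sort by time, train on the earliest train_fraction, test on the rest."""
    order = np.argsort(timestamps)
    cut = int(len(order) * train_fraction)
    train_idx, test_idx = order[:cut], order[cut:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Toy usage
t = np.arange(100)                              # timestamps in chronological order
X = np.arange(100, dtype=float).reshape(-1, 1)
y = np.arange(100) % 2
X_tr, y_tr, X_te, y_te = chronological_split(X, y, t)
print(len(X_tr), len(X_te))                     # 80 20

# A standard shuffled K-fold split would mix late-season sessions into training
# while earlier sessions land in the test set, leaking information that would
# not be available at prediction time.
```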


Class Imbalance

 Real-world problems often have more examples of one class (negatives) than the other (positives)
 One class is rare: anomaly detection, cancer, goals in a soccer match, etc.
 This causes difficulties for learners: It is hard to beat always predicting the majority class!
Idea 1: Sampling

 Oversample the minority class: May lead to overfitting
 Undersample the majority class: Odd to throw away data
 SMOTE: Generate synthetic minority examples (see the sketch below)
   Find a minority example's nearest neighbors
   Interpolate between them

[Figure: a minority example and one of its neighbors, with a synthetic example placed on the line segment between them]
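A minimal sketch of the SMOTE-style interpolation step (illustrative; real implementations, e.g., imbalanced-learn's SMOTE, add more machinery):

```python
# SMOTE-style oversampling: interpolate between a minority point and one of
# its nearest minority-class neighbors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min, n_synthetic, k=5, seed=0):
    """X_min: (n, d) minority-class examples. Returns n_synthetic new points."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: each point is its own neighbor
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))                      # pick a minority example
        j = idx[i][rng.integers(1, k + 1)]                # pick one of its k nearest neighbors
        lam = rng.random()                                # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

# Toy usage
X_min = np.random.default_rng(1).normal(size=(20, 3))    # toy minority class
print(smote_like(X_min, n_synthetic=5).shape)             # (5, 3)
```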
Idea 2: Manipulate the Learner

 Change the cost function: Penalize mistakes on the minority class more heavily (see the sketch below)
 Optimize towards something that better captures the skew:
   Balanced accuracy: $\mathrm{BA} = \frac{0.5 \times TP}{TP + FN} + \frac{0.5 \times TN}{FP + TN}$
   F1: $F_1 = 2 \times \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$
   ROC curves
   Precision / recall curves
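Both ideas are available off the shelf; a minimal scikit-learn sketch (illustrative; the logistic regression model and the synthetic 5%-positive dataset are assumptions, not from the slides):

```python
# Penalize minority-class mistakes more heavily and score with skew-aware metrics.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)  # 5% positives
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" reweights the loss inversely to class frequency
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)
pred = clf.predict(X_te)

print("balanced accuracy:", balanced_accuracy_score(y_te, pred))
print("F1:", f1_score(y_te, pred))
print("ROC AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```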
My Model Is Not Accurate Enough

 Suppose: Activity recognition into walking, running, ascending stairs, and descending stairs
   Five minutes of data from ten subjects
   Divide the data into 5-second windows, which yields 600 examples
   Use five simple features from the X, Y, Z acceleration
   Train a linear separator using the log loss
   Optimize using gradient descent
   Leave-one-subject-out CV: 70% accuracy

Question: What do I do?
Possible Fixes

 More data
 More / better features
 Change the optimizer
 Change the objective function
 Change the model class

Question: What do I do?
 Option 1: "Grad student descent" (try everything)
 Option 2: Debug the learning process
Look at the Learning Curve

[Figure: two learning curves plotting training and test error against the number of training examples]

 Case 1: Train and test error are both high and close together
   ⇒ More/better features
   ⇒ A more expressive model?
 Case 2: Train error is low, but test error is high
   ⇒ More data
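scikit-learn can generate such curves directly; a minimal sketch (illustrative; the model and synthetic data are assumptions, not from the slides):

```python
# Plot training vs. cross-validated test error as the training set grows.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5)

plt.plot(sizes, 1 - train_scores.mean(axis=1), label="train error")
plt.plot(sizes, 1 - test_scores.mean(axis=1), label="test error")
plt.xlabel("# training examples"); plt.ylabel("error"); plt.legend()
plt.show()
# A large gap between the curves suggests more data will help;
# two high, close curves suggest better features or a more expressive model.
```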


Conclusions

 Feature selection is important in practice
 Think about what you want to show in your empirical evaluation
 Practical issues are hard:
   It is a lot of guess-and-check at first
   Eventually you develop intuitions
 Generally speaking: Features and data matter more than the model

Questions?