FEATURE ENGINEERING
David Epstein
Senior Data Scientist, Socure
Open Data Science Conference, Boston, MA
May 30-31, 2015
#ODSC, @opendatasci
Talk Outline
• What is feature engineering?
• Limits on number of features
• How to select a “good” set of features
• Standard FE techniques
• TL;DR: As we get better and better models,
focus shifts to what we put into them
• FE interacts with other key areas of DS
Feature Engineering
• (My) Definition: Transforming data to create
model inputs.
[Figure: Data Workflow. Raw Data → Data Cleaning → Feature Engineering → Model Building; the diagram also carries a "Pre-Processing" label.]
Examples from Kaggle Competitions
Netflix
Titanic
Portuguese Taxis
How does it work?
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.29879    0.48994    0.61    0.559
x           -0.00923    0.10256   -0.09    0.930
x2           0.98803    0.03672   26.91 3.92e-09 ***
---
Residual standard error: 1.076 on 8 degrees of freedom
Multiple R-squared: 0.9891, Adjusted R-squared: 0.9863
F-statistic: 362 on 2 and 8 DF, p-value: 1.427e-08

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.17912    2.92472    3.48  0.00693 **
x           -0.00923    0.92488   -0.01  0.99225
---
Residual standard error: 9.7 on 9 degrees of freedom
Multiple R-squared: 1.107e-05, Adjusted R-squared: -0.11
F-statistic: 9.96e-05 on 1 and 9 DF, p-value: 0.9923
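The contrast between the two outputs above can be imitated with a minimal sketch. It assumes made-up data in which y depends on x only through x^2; the data, the noise values, and the helper r_squared are illustrative assumptions, not the numbers behind the slide.

```python
def r_squared(xs, ys):
    """R-squared of a one-predictor least-squares fit (squared Pearson correlation)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

# Hypothetical data: 11 points, y is roughly x squared plus small fixed noise.
x = list(range(-5, 6))
noise = [0.5, -0.3, 0.2, -0.6, 0.4, 0.0, -0.4, 0.6, -0.2, 0.3, -0.5]
y = [xi ** 2 + e for xi, e in zip(x, noise)]

r2_raw = r_squared(x, y)                           # raw x as the only input
r2_squared = r_squared([xi ** 2 for xi in x], y)   # engineered feature x^2

print(f"R^2 using x:   {r2_raw:.4f}")
print(f"R^2 using x^2: {r2_squared:.4f}")
```

The model class is the same in both fits; only the input changes, which is the point of the slide.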
Features are engineered everywhere
Things to be explained/measured/predicted:
• Finance: EBITDA
• Baseball: Batting Avg.
• Politics: Partisan Performance
The Big Questions
• Seen in this light, FE is ubiquitous (as all
truly important concepts are)
• Any time you construct an intermediate
variable, you’re doing FE
• Two questions naturally arise:
1. How do you construct “good” features?
2. What are the limits on this process?
• I’ll answer the second one first, because it’s
easier…
Limits on Feature Engineering
• In medical studies, social science, financial analysis, etc.,
two main problems emerge
• Eating up degrees of freedom: relatively small data sets
• # of respondents in survey
• # of patients in trial
• # of elections to Congress
• If your data lives in an NxK matrix, you want to make sure that K is
small relative to N
• Relevance to hypothesis testing, emphasis on explanation
• You generally start with an equation defining the relationship
between the key independent and dependent variables
• Other variables enter your model as controls, not really interested
in their functional form
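The degrees-of-freedom point can be made concrete with a small sketch. It assumes ordinary least squares with an intercept, and takes N = 11 as an inference from the earlier output (8 residual degrees of freedom with 2 predictors); the function name residual_df is hypothetical.

```python
def residual_df(n_rows, n_predictors):
    """Residual degrees of freedom for least squares with an intercept: N - K - 1."""
    return n_rows - n_predictors - 1

N = 11  # assumption: inferred from the regression outputs shown earlier

# Each added feature uses up one degree of freedom on a small data set.
for k in (1, 2, 5, 9):
    print(f"K={k}: {residual_df(N, k)} residual degrees of freedom")
```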
Limits on Feature Engineering
• In most modern data science applications, neither is an
issue
• We start with lots of data, and
• Care more about prediction than explanation
• So why not add in lots of extra variables?
• Think of your data not as what goes into your model, but a starting
point for the creation of new variables, which can then be combined…
Limits on Feature Engineering
• First, adding many correlated predictors can decrease model
performance
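• Adding an x^4 term to the earlier example lowers adjusted R-squared (0.9863 to 0.986) and raises the residual standard error (1.076 to 1.088), even though multiple R-squared rises slightly: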
• More variables make models less interpretable
• Models have to be generalizable to other data
• Too much feature engineering can lead to overfitting
• Close connection between feature engineering and cross-validation
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.648123   0.628178   1.032 0.336513
x           -0.009230   0.103740  -0.089 0.931593
x2           0.866738   0.139087   6.232 0.000432 ***
x4           0.004852   0.005361   0.905 0.395572
---
Residual standard error: 1.088 on 7 degrees of freedom
Multiple R-squared: 0.9902, Adjusted R-squared: 0.986
F-statistic: 236.1 on 3 and 7 DF, p-value: 2.15e-07
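The adjusted R-squared values in these outputs can be checked with the standard formula 1 - (1 - R^2)(N - 1)/(N - K - 1). The sketch below assumes N = 11, which is inferred from the reported degrees of freedom rather than stated on the slide; because the inputs are the rounded R-squared values, the results agree with the printed ones only to about three decimals.

```python
def adjusted_r_squared(r2, n_rows, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_rows - 1) / (n_rows - n_predictors - 1)

N = 11  # assumption: inferred from "8 degrees of freedom" with 2 predictors

adj_quadratic = adjusted_r_squared(0.9891, N, 2)  # model with x and x2
adj_quartic = adjusted_r_squared(0.9902, N, 3)    # model with x, x2 and x4

print(f"x + x2:      {adj_quadratic:.4f}")
print(f"x + x2 + x4: {adj_quartic:.4f}")
```

Raw R-squared can only go up when a term is added; the penalty for the extra predictor is what makes the larger model look worse here.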
How To Select a “Good” Set of Features
• This is the open-ended question in the field
• Separate from (but related to) the question of feature selection
– which variables to retain in a model.
• You can use some metrics to tell which features are
useful, one at a time, like Pearson correlation coefficient
• But this can’t tell you which set of features works best together
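• Searching over all subsets is a combinatorial problem (K features give 2^K candidate subsets; best-subset selection is NP-hard in general), so exhaustive search is computationally infeasible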
• Many new data analysis services include automated
feature engineering as part of their packages
• But if there are features you want in your model, it’s best to add
them in explicitly, rather than depend on these generators.
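A minimal sketch of why one-at-a-time metrics can miss features that only work together. The data are a hypothetical toy case in which the target is the product of two ±1 variables; the helper pearson is written out by hand for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: the target depends on a and b only through their product.
a = [1, 1, -1, -1]
b = [1, -1, 1, -1]
y = [ai * bi for ai, bi in zip(a, b)]

corr_a = pearson(a, y)       # each variable alone looks useless
corr_b = pearson(b, y)
corr_ab = pearson([ai * bi for ai, bi in zip(a, b)], y)  # engineered interaction

print(corr_a, corr_b, corr_ab)
```

Screening a and b individually would discard both, even though together they determine the target exactly.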
A “Middle Theory” of FE
• Start with a reasonable-sized set of features
• Include features suggested by domain knowledge
• Test these out individually, build from the bottom up
• Number and type of features depend on model used
• Can include more features if the model does some feature selection itself
• L1-regularized (lasso) models, e.g., logit with an L1 penalty (but not L2/ridge)
• GBM with backward pruning (but not random forests)
• Stepwise regression, with either forward or backward feature selection
• Some models are invariant to monotonic variable transformations
• Tree-based approaches divide variables into two groups at each branch
• So, no perfect answer. But there are some standard
techniques every data scientist should have in their bag of
tricks.
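The invariance of tree-based splits to monotonic transformations can be illustrated with a single-split "stump". This is a sketch under assumptions: the data are made up, best_split is a hypothetical helper that scores splits by squared error, and the transform is strictly increasing with no tied values.

```python
import math

def best_split(xs, ys):
    """Return the row indices sent left by the single split on xs with lowest squared error."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])

    def sse(rows):
        mean = sum(ys[i] for i in rows) / len(rows)
        return sum((ys[i] - mean) ** 2 for i in rows)

    best_score, best_left = None, None
    for cut in range(1, len(order)):
        left, right = order[:cut], order[cut:]
        score = sse(left) + sse(right)
        if best_score is None or score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left

# Hypothetical data: the target switches from 0 to 1 once x passes 4.
x = [1, 2, 4, 8, 16, 32, 64, 128]
y = [0, 0, 0, 1, 1, 1, 1, 1]

left_raw = best_split(x, y)
left_log = best_split([math.log(v) for v in x], y)

print(sorted(left_raw), sorted(left_log))
```

A split only depends on the ordering of the values, so log(x) sends the same rows left and right as x does; a linear model, by contrast, would treat the two inputs very differently.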
Non-numeric to numeric
1. Count # of times each value appears
Zip Code Count
10024 4
63105 2
94304 1
06443 3
10024 4
63105 2
06443 3
10024 4
10024 4
06443 3
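The count encoding in the table above can be reproduced with a short sketch using the standard library. Zip codes are kept as strings (an assumption made here so that the leading zero in 06443 survives).

```python
from collections import Counter

# Zip codes from the table above, stored as strings to keep leading zeros.
zip_codes = ["10024", "63105", "94304", "06443", "10024",
             "63105", "06443", "10024", "10024", "06443"]

counts = Counter(zip_codes)                      # value -> number of occurrences
count_feature = [counts[z] for z in zip_codes]   # one numeric feature per row

for z, c in zip(zip_codes, count_feature):
    print(z, c)
```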
Non-numeric to numeric
2. One-hot encoding
Religion Catholic Protestant Jewish Muslim
Catholic 1 0 0 0
Muslim 0 0 0 1
Jewish 0 0 1 0
Protestant 0 1 0 0
Catholic 1 0 0 0
Catholic 1 0 0 0
Jewish 0 0 1 0
Protestant 0 1 0 0
Muslim 0 0 0 1
Protestant 0 1 0 0
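A minimal sketch of the one-hot encoding shown in the table, written without any library so each step is visible. The column order is copied from the table header; the helper name one_hot is hypothetical.

```python
religions = ["Catholic", "Muslim", "Jewish", "Protestant", "Catholic",
             "Catholic", "Jewish", "Protestant", "Muslim", "Protestant"]

# Column order taken from the table header above.
columns = ["Catholic", "Protestant", "Jewish", "Muslim"]

def one_hot(value, columns):
    """Return a 0/1 indicator list with a single 1 in the matching column."""
    return [1 if value == column else 0 for column in columns]

encoded = [one_hot(r, columns) for r in religions]

for religion, row in zip(religions, encoded):
    print(religion, row)
```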
Non-numeric to numeric
3. The “hash trick”
string h(string)
“The” 36
“quick” 8
“brown” 92
“fox” 14
“jumps” 75
“over” 25
“the” 36
“lazy” 44
“dog” 21
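The hash trick can be sketched as follows. The choices here are assumptions for illustration: SHA-256 as the hash, 100 buckets, and lower-casing before hashing (suggested by "The" and "the" sharing a value in the table). The bucket numbers produced will not match the illustrative values on the slide.

```python
import hashlib

def bucket(token, n_buckets=100):
    """Map a string to a bucket index in [0, n_buckets) using a stable hash."""
    digest = hashlib.sha256(token.lower().encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def hashed_counts(tokens, n_buckets=100):
    """Fixed-length count vector: each token increments its bucket."""
    vector = [0] * n_buckets
    for token in tokens:
        vector[bucket(token, n_buckets)] += 1
    return vector

tokens = "The quick brown fox jumps over the lazy dog".split()
features = hashed_counts(tokens)

print({t: bucket(t) for t in tokens})
```

The vector length is fixed in advance, so previously unseen strings still map to a column; the price is that unrelated strings can collide in the same bucket.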
Non-numeric to numeric
4. Leave-one-out encoding
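The slide gives no detail for this technique, so the following is a sketch of one common formulation, stated as an assumption: each row's category is replaced by the mean of the target over the other rows in the same category, with the global mean as a fallback for categories that appear only once. The data and names are hypothetical.

```python
from collections import defaultdict

def leave_one_out_encode(categories, targets):
    """Encode each row as the target mean of the other rows sharing its category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1

    global_mean = sum(targets) / len(targets)
    encoded = []
    for category, target in zip(categories, targets):
        if counts[category] > 1:
            # Exclude the row's own target so it cannot leak into its own feature.
            encoded.append((sums[category] - target) / (counts[category] - 1))
        else:
            encoded.append(global_mean)  # fallback choice for singletons (assumption)
    return encoded

# Hypothetical data: a categorical variable and a binary target.
cities = ["a", "a", "a", "b", "b", "c"]
clicked = [1, 0, 1, 1, 0, 1]
loo = leave_one_out_encode(cities, clicked)

print(loo)
```

At prediction time there is no target to leave out, so the plain per-category training mean would typically be used instead.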
Single Variable Transformations
[Figure: single-variable transformations of x, including x^2, log(x), and scaling]
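A minimal sketch of the single-variable transformations named on this slide. The data are hypothetical; base-10 log, standardization, and min-max scaling are chosen here as examples of "Log(x)" and "scaling", which the slide does not specify further.

```python
import math
import statistics

# Hypothetical skewed variable spanning several orders of magnitude.
x = [1.0, 10.0, 100.0, 1000.0, 10000.0]

squared = [v ** 2 for v in x]
logged = [math.log10(v) for v in x]   # compresses the long right tail

# Standardization: subtract the mean, divide by the (population) standard deviation.
mean = statistics.fmean(x)
sd = statistics.pstdev(x)
standardized = [(v - mean) / sd for v in x]

# Min-max scaling: map the observed range onto [0, 1].
lo, hi = min(x), max(x)
min_max = [(v - lo) / (hi - lo) for v in x]

print(logged)
print(standardized)
print(min_max)
```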
Two-variable combinations
1. Add: Sum similar-scaled variables
Q1 Q2 Q3 Q4 Total
33 88 51 81 253
11 11 72 30 124
15 36 70 55 176
70 82 8 50 210
99 56 35 86 276
7 20 10 71 108
65 0 25 74 164
96 25 2 89 212
60 29 56 92 237
63 50 96 61 270
Two-variable combinations
2. Subtract: Difference relative to baseline
ViewerID MovieID Date Rating Days Since First Rating
44972004 8825 1/1/13 5 0
44972004 0471 2/1/13 4 31
44972004 3816 3/1/13 5 59
44972004 8243 4/1/13 3 90
44972004 2855 5/1/13 5 120
44972004 9923 6/1/13 2 151
44972004 1023 7/1/13 4 181
44972004 8306 8/1/13 3 212
44972004 2771 9/1/13 2 243
44972004 5281 10/1/13 2 273
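The "Days Since First Rating" column can be derived with a short sketch using the standard library. It assumes the dates in the table are written month/day/two-digit-year, which is consistent with the day counts shown.

```python
from datetime import datetime

# Rating dates for viewer 44972004, copied from the table above.
rating_dates = ["1/1/13", "2/1/13", "3/1/13", "4/1/13", "5/1/13",
                "6/1/13", "7/1/13", "8/1/13", "9/1/13", "10/1/13"]

parsed = [datetime.strptime(d, "%m/%d/%y") for d in rating_dates]
first_rating = min(parsed)  # baseline: the viewer's earliest rating

days_since_first = [(d - first_rating).days for d in parsed]

print(days_since_first)
```

With several viewers, the baseline would be computed per ViewerID rather than over the whole column.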
Two-variable combinations
3. Multiply: Interactive effects
ViewerID Country Domestic DSFR Dom*DSFR
8825 US 1 38 38
0471 CA 0 277 0
3816 FR 0 187 0
8243 US 1 33 33
2855 US 1 87 87
9923 GB 0 42 0
1023 IT 0 192 0
8306 CA 0 365 0
2771 US 1 505 505
5281 FI 0 49 0
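A minimal sketch of the interaction feature in the table. It assumes, as the table implies, that Domestic is 1 for US viewers and 0 otherwise; the variable names are hypothetical.

```python
# (Country, DSFR) pairs copied from the table above.
rows = [("US", 38), ("CA", 277), ("FR", 187), ("US", 33), ("US", 87),
        ("GB", 42), ("IT", 192), ("CA", 365), ("US", 505), ("FI", 49)]

HOME_COUNTRY = "US"  # assumption implied by the Domestic column

features = []
for country, dsfr in rows:
    domestic = 1 if country == HOME_COUNTRY else 0
    # The product is nonzero only for domestic viewers, so a linear model
    # can fit a separate DSFR slope for that group.
    features.append((domestic, dsfr, domestic * dsfr))

interaction = [f[2] for f in features]
print(interaction)
```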
Two-variable combinations
4. Divide: Scaling/Normalizing
Country GDP Population GDP/Capita
US 159849 275 581
CA 731812 111 6593
FR 826320 90 9181
IT 573494 80 7169
IR 609223 22 27692
GB 717673 60 11961
NE 605257 15 40350
MX 687944 124 5548
RU 203319 402 506
FI 744983 40 18625
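The ratio feature in the table can be reproduced as follows. The figures are copied from the slide as given (their units are not stated there), and rounding to the nearest integer is an assumption that matches the displayed column.

```python
# (Country, GDP, Population) copied from the table above; units as on the slide.
rows = [("US", 159849, 275), ("CA", 731812, 111), ("FR", 826320, 90),
        ("IT", 573494, 80), ("IR", 609223, 22), ("GB", 717673, 60),
        ("NE", 605257, 15), ("MX", 687944, 124), ("RU", 203319, 402),
        ("FI", 744983, 40)]

# Dividing by population puts large and small countries on a comparable scale.
gdp_per_capita = {country: round(gdp / population)
                  for country, gdp, population in rows}

for country, value in gdp_per_capita.items():
    print(country, value)
```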
Multivariate/Model-based Methods
1. PCA/Factor Analysis/Clustering: Dimension reduction
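As a rough illustration of dimension reduction, the sketch below finds the first principal component of two-column data using the closed-form angle for a 2x2 covariance matrix. This is a deliberately small special case with hypothetical data; with more than two columns one would normally use an eigendecomposition or SVD from a numerical library.

```python
import math

def first_principal_component(points):
    """Return the unit direction of maximum variance and the 1-D scores (2-D input only)."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    sxx = sum(x * x for x, _ in centered) / n
    syy = sum(y * y for _, y in centered) / n
    sxy = sum(x * y for x, y in centered) / n

    # Angle of the leading eigenvector of the covariance matrix [[sxx, sxy], [sxy, syy]].
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))
    scores = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, scores

# Hypothetical data: two perfectly collinear columns (second = 2 * first).
points = [(-2.0, -4.0), (-1.0, -2.0), (0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]
direction, scores = first_principal_component(points)

print(direction)
print(scores)
```

Here the single score column carries all of the variance of the two original columns, so it could replace them as a model input.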
Multivariate/Model-based Methods
2. Model Stacking: Outputs of one model are inputs
to the next
• Do this, e.g., on half the data and use as input to the other
half, and vice-versa
(Figure from Owen Zhang)
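The half-and-half scheme described above can be sketched as follows. The choices are illustrative assumptions: the base model is a one-variable least-squares line, the halves are the even and odd rows, and the data are made up. Each row's stacked feature comes from a model that never saw that row.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x

def out_of_fold_predictions(xs, ys):
    """Train on one half, predict the other half, then swap the roles."""
    n = len(xs)
    half_a = list(range(0, n, 2))
    half_b = list(range(1, n, 2))
    predictions = [None] * n
    for train_rows, holdout_rows in ((half_a, half_b), (half_b, half_a)):
        slope, intercept = fit_line([xs[i] for i in train_rows],
                                    [ys[i] for i in train_rows])
        for i in holdout_rows:
            predictions[i] = slope * xs[i] + intercept
    return predictions

# Hypothetical data with an exactly linear relationship.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [2 * v + 1 for v in x]

stacked_feature = out_of_fold_predictions(x, y)  # input column for the next model
print(stacked_feature)
```

Predicting each half with a model trained on the other half keeps the first model's view of the target from leaking into the second model's training rows.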
Conclusion
• Data science grew over the last decade thanks to
improved modeling algorithms and better
cross-validation procedures
• We now have a number of good models and can
approach a problem from many different angles
• Most further improvement will come from thinking
carefully about what we put into our models, i.e.,
Feature Engineering
• This can be (semi-) automated, but it’s still one of
the true arts of the profession
• Domain knowledge remains very important in practice
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Aqusag Technologies
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
SOFTTECHHUB
 
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Impelsys Inc.
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
 
Technology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data AnalyticsTechnology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data Analytics
InData Labs
 
Linux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdfLinux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdf
RHCSA Guru
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 

Feature Engineering

  • 1. FEATURE ENGINEERING David Epstein. Open Data Science Conference, Boston 2015. @opendatasci
  • 2. FEATURE ENGINEERING David Epstein Senior Data Scientist, Socure Open Data Science Conference, Boston, MA May 30-31, 2015 #ODSC, @opendatasci
  • 3. Talk Outline
      • What is feature engineering?
      • Limits on number of features
      • How to select a “good” set of features
      • Standard FE techniques
      • TL;DR: As we get better and better models, focus shifts to what we put into them
      • FE interacts with other key areas of DS
  • 4. Feature Engineering
      • (My) Definition: Transforming data to create model inputs.
      • Data workflow (figure): Raw Data → Data Cleaning → Feature Engineering → Model Building. Figure label: Pre-Processing.
  • 5. Examples from Kaggle Competitions: Netflix, Titanic, Portuguese Taxis
  • 6. How does it work?
      Fit with x and x2:
        Coefficients:
                     Estimate  Std. Error  t value  Pr(>|t|)
        (Intercept)   0.29879     0.48994     0.61     0.559
        x            -0.00923     0.10256    -0.09     0.930
        x2            0.98803     0.03672    26.91  3.92e-09 ***
        ---
        Residual standard error: 1.076 on 8 degrees of freedom
        Multiple R-squared: 0.9891, Adjusted R-squared: 0.9863
        F-statistic: 362 on 2 and 8 DF, p-value: 1.427e-08
      Fit with x only:
        Coefficients:
                     Estimate  Std. Error  t value  Pr(>|t|)
        (Intercept)  10.17912     2.92472     3.48   0.00693 **
        x            -0.00923     0.92488    -0.01   0.99225
        ---
        Residual standard error: 9.7 on 9 degrees of freedom
        Multiple R-squared: 1.107e-05, Adjusted R-squared: -0.11
        F-statistic: 9.96e-05 on 1 and 9 DF, p-value: 0.9923
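
The contrast between the two fits on slide 6 can be reproduced in a small sketch. The data below are synthetic and purely an assumption (the slide does not give its data): `y` is roughly `x` squared on a symmetric grid, so raw `x` explains almost nothing while the engineered squared feature explains almost everything. The helper name `r_squared` is hypothetical.

```python
# Synthetic data (an assumption, not the slide's data): y is roughly x squared.
xs = list(range(-5, 6))
noise = [0.5, -0.3, 0.8, -0.6, 0.1, -0.4, 0.7, -0.2, 0.3, -0.5, 0.2]
ys = [x * x + e for x, e in zip(xs, noise)]


def r_squared(feature, target):
    """R-squared of a one-predictor least-squares fit (squared Pearson correlation)."""
    n = len(feature)
    mean_f = sum(feature) / n
    mean_t = sum(target) / n
    sxy = sum((f - mean_f) * (t - mean_t) for f, t in zip(feature, target))
    sxx = sum((f - mean_f) ** 2 for f in feature)
    syy = sum((t - mean_t) ** 2 for t in target)
    return sxy * sxy / (sxx * syy)


x_squared = [x * x for x in xs]  # the engineered feature

r2_raw = r_squared(xs, ys)
r2_engineered = r_squared(x_squared, ys)
print(f"raw x: {r2_raw:.4f}, x squared: {r2_engineered:.4f}")
```

The model is unchanged between the two calls; only the input differs, which is the point the slide makes.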
  • 7. Features are engineered everywhere
      • Things to be explained/measured/predicted
      • Finance: EBITDA
      • Baseball: Batting Avg.
      • Politics: Partisan Performance
  • 8. The Big Questions
      • Seen in this light, FE is ubiquitous (as all truly important concepts are)
      • Any time you construct an intermediate variable, you’re doing FE
      • Two questions naturally arise:
          1. How do you construct “good” features?
          2. What are the limits on this process?
      • I’ll answer the second one first, because it’s easier…
  • 9. Limits on Feature Engineering
      • In medical studies, social science, financial analysis, etc., two main problems emerge
      • Eating up degrees of freedom: relatively small data sets
          • # of respondents in survey
          • # of patients in trial
          • # of elections to Congress
          • If your data lives in an NxK matrix, you want to make sure that K is small relative to N
      • Relevance to hypothesis testing, emphasis on explanation
          • You generally start with an equation defining the relationship between the key independent and dependent variables
          • Other variables enter your model as controls; you are not really interested in their functional form
  • 10. Limits on Feature Engineering
      • In most modern data science applications, neither is an issue
          • We start with lots of data, and
          • Care more about prediction than explanation
      • So why not add in lots of extra variables?
      • Think of your data not as what goes into your model, but as a starting point for the creation of new variables, which can then be combined…
  • 11. Limits on Feature Engineering
      • First, adding many correlated predictors can decrease model performance
      • Adding an x4 term to the above example actually reduces adjusted model fit: adjusted R-squared falls from 0.9863 to 0.986 and residual standard error rises from 1.076 to 1.088, even though multiple R-squared rises slightly from 0.9891 to 0.9902 (output at the end of this slide)
      • More variables make models less interpretable
      • Models have to be generalizable to other data
      • Too much feature engineering can lead to overfitting
      • Close connection between feature engineering and cross-validation
      Fit with x, x2 and x4:
        Coefficients:
                     Estimate  Std. Error  t value  Pr(>|t|)
        (Intercept)  0.648123    0.628178    1.032  0.336513
        x           -0.009230    0.103740   -0.089  0.931593
        x2           0.866738    0.139087    6.232  0.000432 ***
        x4           0.004852    0.005361    0.905  0.395572
        ---
        Residual standard error: 1.088 on 7 degrees of freedom
        Multiple R-squared: 0.9902, Adjusted R-squared: 0.986
        F-statistic: 236.1 on 3 and 7 DF, p-value: 2.15e-07
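
The penalty for the extra x4 term can be checked with the standard adjusted R-squared formula. This is a sketch using only the R-squared values reported on slides 6 and 11; the sample size of 11 is inferred from the residual degrees of freedom (8 with two predictors plus an intercept), and the function name is hypothetical.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R-squared: penalizes R-squared for each predictor used."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


N_OBS = 11  # inferred: 8 residual df + 2 predictors + 1 intercept

adj_without_x4 = adjusted_r_squared(0.9891, N_OBS, 2)  # model with x, x2
adj_with_x4 = adjusted_r_squared(0.9902, N_OBS, 3)     # model with x, x2, x4

print(f"without x4: {adj_without_x4:.4f}, with x4: {adj_with_x4:.4f}")
```

Raw R-squared can only go up when a predictor is added, so the adjusted figure (or, better, held-out performance from cross-validation) is the one to watch when adding features.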
  • 12. How To Select a “Good” Set of Features
      • This is the open-ended question in the field
      • Separate from (but related to) the question of feature selection: which variables to retain in a model
      • You can use some metrics, like the Pearson correlation coefficient, to tell which features are useful one at a time
      • But this can’t tell you which set of features works best together
      • Finding the best set is an NP-complete problem, too computationally hard to solve by exhaustive search
      • Many new data analysis services include automated feature engineering as part of their packages
      • But if there are features you want in your model, it’s best to add them in explicitly, rather than depend on these generators
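
A toy case shows why one-at-a-time screening cannot find the best set. In this sketch (all data and names are illustrative assumptions), each of two features has zero Pearson correlation with the target, yet their product predicts it perfectly.

```python
def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


x1 = [-1, -1, 1, 1]
x2 = [-1, 1, -1, 1]
y = [a * b for a, b in zip(x1, x2)]        # target depends only on the pair
interaction = [a * b for a, b in zip(x1, x2)]  # engineered feature

corr_x1 = pearson(x1, y)
corr_x2 = pearson(x2, y)
corr_interaction = pearson(interaction, y)
print(corr_x1, corr_x2, corr_interaction)
```

A univariate filter would discard both `x1` and `x2` here, which is why the slide treats set selection as a separate and harder problem.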
  • 13. A “Middle Theory” of FE
      • Start with a reasonable-sized set of features
          • Include features suggested by domain knowledge
          • Test these out individually, build from the bottom up
      • Number and type of features depend on the model used
      • Can include more features if the model does some feature selection
          • Lasso regression, e.g., logit with L1 regularization (but not L2/ridge)
          • GBM with backward pruning (but not random forests)
          • Stepwise regression, with either forward or backward feature selection
      • Some models are invariant to monotonic variable transformations
          • Tree-based approaches divide variables into two groups at each branch
      • So, no perfect answer. But there are some standard techniques every data scientist should have in their bag of tricks.
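
The invariance of tree splits to monotonic transformations can be illustrated with a single split. The sketch below is an assumption-laden toy, not a full tree learner: it finds the squared-error-minimizing split on a feature, then on the log of that feature, and compares which rows land on the left. The data and the helper name `best_split_groups` are hypothetical, and distinct feature values are assumed so that every cut position is a valid threshold.

```python
import math


def best_split_groups(feature, target):
    """Return the row indices on the left side of the SSE-minimizing split."""
    order = sorted(range(len(feature)), key=lambda i: feature[i])
    best_sse, best_left = None, None
    for cut in range(1, len(order)):
        left, right = order[:cut], order[cut:]
        sse = 0.0
        for group in (left, right):
            mean = sum(target[i] for i in group) / len(group)
            sse += sum((target[i] - mean) ** 2 for i in group)
        if best_sse is None or sse < best_sse:
            best_sse, best_left = sse, frozenset(left)
    return best_left


income = [12, 15, 40, 300, 800, 1500]      # illustrative values
target = [1.0, 1.2, 0.9, 5.1, 4.8, 5.3]
log_income = [math.log(v) for v in income]  # monotonic transformation

groups_raw = best_split_groups(income, target)
groups_log = best_split_groups(log_income, target)
print(sorted(groups_raw), sorted(groups_log))
```

Because the log preserves the ordering of the rows, the candidate partitions are identical, so taking logs of an input is worthwhile for a linear model but changes nothing for a split like this one.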
  • 14. Non-numeric to numeric 1. Count # of times each value appears
      Zip Code  Count
      10024     4
      63105     2
      94304     1
      06443     3
      10024     4
      63105     2
      06443     3
      10024     4
      10024     4
      06443     3
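
Count encoding as shown on slide 14 takes only a few lines. This sketch uses the zip codes from the slide, kept as strings so that the leading zero in 06443 survives; the variable names are illustrative.

```python
from collections import Counter

zip_codes = ["10024", "63105", "94304", "06443", "10024",
             "63105", "06443", "10024", "10024", "06443"]

# Count how often each value occurs, then map every row to its value's count.
frequency = Counter(zip_codes)
count_encoded = [frequency[z] for z in zip_codes]
print(count_encoded)
```

In practice the counts would be computed on training data only and then applied to new rows, with some default (an assumption worth making explicit) for values never seen in training.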
  • 15. Non-numeric to numeric 2. One-hot encoding
      Religion    Catholic  Protestant  Jewish  Muslim
      Catholic    1         0           0       0
      Muslim      0         0           0       1
      Jewish      0         0           1       0
      Protestant  0         1           0       0
      Catholic    1         0           0       0
      Catholic    1         0           0       0
      Jewish      0         0           1       0
      Protestant  0         1           0       0
      Muslim      0         0           0       1
      Protestant  0         1           0       0
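
The one-hot table on slide 15 can be generated as follows. This is a minimal sketch with the column order fixed by hand to match the slide; the names are illustrative.

```python
religions = ["Catholic", "Muslim", "Jewish", "Protestant", "Catholic",
             "Catholic", "Jewish", "Protestant", "Muslim", "Protestant"]

# Fixed column order, matching the slide's header row.
categories = ["Catholic", "Protestant", "Jewish", "Muslim"]

# One indicator column per category; exactly one 1 per row.
one_hot = [[1 if value == category else 0 for category in categories]
           for value in religions]

for value, row in zip(religions, one_hot):
    print(f"{value:<11}", row)
```

Fixing the category list up front, rather than deriving it from whatever values happen to appear, keeps the columns stable between training and scoring data.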
  • 16. Non-numeric to numeric 3. The “hash trick”
      string   h(string)
      “The”    36
      “quick”  8
      “brown”  92
      “fox”    14
      “jumps”  75
      “over”   25
      “the”    36
      “lazy”   44
      “dog”    21
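
A sketch of the hash trick follows. The slide does not say which hash function or bucket count produced its values, so both choices here are assumptions: SHA-256 from the standard library (Python's built-in `hash` is salted per process, so it is not reproducible) and 100 buckets. Lowercasing before hashing is also an assumption, suggested by “The” and “the” sharing a value on the slide. The resulting numbers will not match the slide's.

```python
import hashlib


def hash_bucket(token, n_buckets=100):
    """Map a string to a stable bucket index in [0, n_buckets)."""
    digest = hashlib.sha256(token.lower().encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


words = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
buckets = [hash_bucket(w) for w in words]

for word, bucket in zip(words, buckets):
    print(f"{word:<6} {bucket}")
```

The appeal is that no vocabulary has to be stored: any string, including one never seen before, maps to a column index. The cost is that unrelated strings can collide in the same bucket.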
  • 17. Non-numeric to numeric 4. Leave-one-out encoding
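
The slide gives only the name of this technique, so the following is one common formulation rather than the speaker's: each row's category is replaced by the mean of the target over the other rows in the same category, which keeps a row's own target out of its feature. The fallback to the global mean for categories with a single row is an assumption, as are the function name and the toy data.

```python
from collections import defaultdict


def leave_one_out_encode(categories, targets):
    """Encode each row as the target mean of the other rows in its category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1

    global_mean = sum(targets) / len(targets)
    encoded = []
    for category, target in zip(categories, targets):
        if counts[category] > 1:
            # Remove this row's own target before averaging.
            encoded.append((sums[category] - target) / (counts[category] - 1))
        else:
            encoded.append(global_mean)  # assumed fallback for singletons
    return encoded


cats = ["a", "a", "a", "b", "b"]
clicked = [1, 0, 1, 1, 0]
loo = leave_one_out_encode(cats, clicked)
print(loo)
```

For rows being scored (where no target is available), the usual practice is to use the plain per-category mean from the training data.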
  • 19. Two-variable combinations 1. Add: Sum similar-scaled variables
      Q1  Q2  Q3  Q4  Total
      33  88  51  81  251
      11  11  72  30  124
      15  36  70  55  176
      70  82  8   50  209
      99  56  35  86  276
      7   20  10  71  107
      65  0   25  74  164
      96  25  2   89  211
      60  29  56  92  238
      63  50  96  61  269
  • 20. Two-variable combinations 2. Subtract: Difference relative to baseline
      ViewerID  MovieID  Date     Rating  Days Since First Rating
      44972004  8825     1/1/13   5       0
      44972004  0471     2/1/13   4       31
      44972004  3816     3/1/13   5       59
      44972004  8243     4/1/13   3       90
      44972004  2855     5/1/13   5       120
      44972004  9923     6/1/13   2       151
      44972004  1023     7/1/13   4       181
      44972004  8306     8/1/13   3       212
      44972004  2771     9/1/13   2       243
      44972004  5281     10/1/13  2       273
  • 21. Two-variable combinations 3. Multiply: Interactive effects
      ViewerID  Country  Domestic  DSFR  Dom*DSFR
      8825      US       1         38    38
      0471      CA       0         277   0
      3816      FR       0         187   0
      8243      US       1         33    33
      2855      US       1         87    87
      9923      GB       0         42    0
      1023      IT       0         192   0
      8306      CA       0         365   0
      2771      US       1         505   505
      5281      FI       0         49    0
  • 22. Two-variable combinations 4. Divide: Scaling/Normalizing
      Country  GDP     Population  GDP/Capita
      US       159849  275         581
      CA       731812  111         6593
      FR       826320  90          9181
      IT       573494  80          7169
      IR       609223  22          27692
      GB       717673  60          11961
      NE       605257  15          40350
      MX       687944  124         5548
      RU       203319  402         506
      FI       744983  40          18625
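
The four two-variable combinations on slides 19 to 22 can be sketched together. The inputs are a few rows taken from the slides' tables; the dates are read as month/day/year, which the slide's day counts confirm, and the variable names are illustrative.

```python
from datetime import date

# 1. Add: sum similar-scaled variables (second row of the quarterly table).
quarters = [11, 11, 72, 30]
total = sum(quarters)

# 2. Subtract: difference relative to a baseline (first rating date).
rating_dates = [date(2013, 1, 1), date(2013, 2, 1), date(2013, 3, 1)]
first = min(rating_dates)
days_since_first = [(d - first).days for d in rating_dates]

# 3. Multiply: interaction of a 0/1 flag with a numeric variable.
domestic = [1, 0, 0, 1]
dsfr = [38, 277, 187, 33]  # days since first rating
dom_x_dsfr = [flag * days for flag, days in zip(domestic, dsfr)]

# 4. Divide: scale one variable by another (GDP per capita, rounded).
gdp = [159849, 731812]
population = [275, 111]
gdp_per_capita = [round(g / p) for g, p in zip(gdp, population)]

print(total, days_since_first, dom_x_dsfr, gdp_per_capita)
```

Each result is a new column built from two or more existing ones, which is the sense in which the data is a starting point rather than the final model input.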
  • 23. Multivariate/Model-based Methods 1. PCA/Factor Analysis/Clustering: Dimension reduction
  • 24. Multivariate/Model-based Methods 2. Model Stacking: Outputs of one model are inputs to the next • Do this, e.g., on half the data and use as input to the other half, and vice-versa (Figure from Owen Zhang)
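
The half-and-half scheme on slide 24 can be sketched as follows. Everything here is an illustrative assumption: the base model is a one-variable least-squares line, the two halves are the even and odd rows, and the function names are hypothetical. Each row's stacked feature comes from a model that never saw that row.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def out_of_fold_predictions(xs, ys):
    """Fit on one half, predict the other half, and vice versa."""
    n = len(xs)
    folds = [list(range(0, n, 2)), list(range(1, n, 2))]
    predictions = [None] * n
    for held_out in folds:
        train = [i for i in range(n) if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        for i in held_out:
            predictions[i] = slope * xs[i] + intercept
    return predictions


xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [2 * x + 1 for x in xs]  # toy target, exactly linear
stacked_feature = out_of_fold_predictions(xs, ys)
print(stacked_feature)
```

The `stacked_feature` column would then be passed, alongside the original features, to the next model. Predicting each half from the other is what keeps the first model's training targets from leaking into the second model's inputs.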
  • 25. Conclusion
      • Data science grew over the last decade due to improved modeling algorithms and better cross-validation procedures
      • Now we have a number of good models and can get at a problem from lots of different angles
      • Most improvement will come from thinking carefully about what we put into our models = Feature Engineering
      • This can be (semi-) automated, but it’s still one of the true arts of the profession
      • Domain knowledge remains very important in practice
