Lead Data Scientist DSc. student
TDC 2017
TDC2017 | São Paulo - Trilha Java EE How we figured out we had a SRE team at - Feature Engineering
Data
Models
Features
Useful attributes for your modeling task
"Feature engineering is the process of
transforming raw data into features that better
represent the underlying problem to the
predictive models, resulting in improved
model accuracy on unseen data."
Jason Brownlee
Raw data → Dataset → Model → Task
Raw data → ? → ML-ready dataset → Features → Model → Task
Here are some Feature Engineering techniques
for your Data Science toolbox...
Dataset
● Sample of users' page views and clicks over 14 days in June 2016
● 2 Billion page views
● 17 million click records
● 700 Million unique users
● 560 sites
Can you predict which recommended content each user will click?
More details of my solution in this post series
Numerical
Spatial
Temporal
Categorical
Target
Homogenize missing values and inconsistent value types within the same feature; fix input errors, types, etc.
Original data
Cleaned data
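A minimal sketch of this kind of cleaning with pandas. The column name `age` and the placeholder strings are hypothetical; real data will have its own conventions for "missing".

```python
import pandas as pd

# Hypothetical raw column: numbers arrive as strings or ints, and "missing"
# is written in several different ways.
raw = pd.DataFrame({"age": ["34", "n/a", "", "27", "unknown", 41]})

# Assumed list of placeholders that all mean "missing".
PLACEHOLDERS = ["n/a", "", "unknown"]

cleaned = raw.copy()
# Homogenize: every placeholder becomes the same missing-value marker (NaN).
cleaned["age"] = cleaned["age"].where(~cleaned["age"].isin(PLACEHOLDERS))
# Homogenize types: convert everything that remains to a numeric dtype.
cleaned["age"] = pd.to_numeric(cleaned["age"], errors="coerce")
```

After this step the column has a single numeric type and a single representation for missing values, which is what imputation and scaling steps expect.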
Fields (Features)
Instances
Tabular data (rows and columns)
● Usually denormalized in a single file/dataset
● Each row contains information about one instance
● Each column is a feature that describes a property of the instance
Necessary when the entity to model is an aggregation from the provided data.
Original data (list of playbacks)
Aggregated data (list of users)
Necessary when the entity to model is an aggregation from the provided data.
Aggregated data with pivoted columns
Original data
# playbacks by device Play duration by device
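As a rough illustration of aggregation and pivoting, the sketch below turns a list of playbacks into one row per user. The table and its column names (`user_id`, `device`, `duration`) are made up for the example.

```python
import pandas as pd

# Hypothetical playback log: one row per playback.
playbacks = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2],
    "device":   ["tv", "mobile", "tv", "mobile", "mobile"],
    "duration": [30.0, 5.0, 45.0, 10.0, 20.0],
})

# Aggregation: one row per user, with the number of playbacks and total duration.
per_user = playbacks.groupby("user_id")["duration"].agg(["count", "sum"])

# Pivoting: one column per device holding that user's total play duration.
duration_by_device = playbacks.pivot_table(
    index="user_id", columns="device", values="duration",
    aggfunc="sum", fill_value=0,
)
```

Using `aggfunc="count"` instead of `"sum"` would give the number of playbacks by device rather than the play duration by device.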
● Usually easy to ingest by mathematical models.
● Can be prices, measurements, counts, ...
● Easier to impute missing data
● Distribution and scale matter to many models
● Datasets contain missing values, often encoded as blanks, NaNs or other placeholders
● Ignoring rows and/or columns with missing values is possible, but at the price of losing data which might be valuable
● A better strategy is to infer them from the known part of the data
● Strategies
○ Mean: Basic approach
○ Median: More robust to outliers
○ Mode: Most frequent value
○ Using a model: Can expose algorithmic bias
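One possible way to apply the median strategy is scikit-learn's `SimpleImputer`; the small matrix below is invented for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: two numeric features with one missing value each.
# The value 100.0 is an outlier in the first column.
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [np.nan, 30.0],
              [100.0, 20.0]])

# Median imputation: each NaN is replaced by the median of the known
# values in its column, which is less affected by the outlier than the mean.
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X)
```

In practice the imputer would be fitted on the training set only and then reused with `imputer.transform` on validation and test data.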
● Transform discrete or continuous numeric features into binary features
Example: Number of user views of the same document
>>> from sklearn import preprocessing
>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> # Values strictly greater than the threshold map to 1, all others to 0
>>> binarizer = preprocessing.Binarizer(threshold=1.0)
>>> binarizer.transform(X)
array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 0.]])
Binarization with scikit-learn
● Split numerical values into bins and encode with a bin ID
● Can be set arbitrarily or based on distribution
● Fixed-width binning
Does fixed-width binning make sense for this long-tailed distribution?
Most users (458,234,809 ~ 5*10^8) had only 1 pageview during the period.
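To make the question concrete, here is a small sketch of fixed-width binning on a made-up long-tailed sample. The bin width of 10 is an arbitrary choice for illustration.

```python
import numpy as np

# Hypothetical page-view counts per user: mostly small, with a long tail.
views = np.array([1, 1, 1, 2, 3, 9, 15, 48, 120])

# Fixed-width binning with width 10: bin ID = value // 10.
bin_ids = views // 10

# How many users fall into each occupied bin?
occupied, counts = np.unique(bin_ids, return_counts=True)
```

Most values land in bin 0 while the few large values occupy distant, nearly empty bins, which is why fixed-width bins tend to be a poor fit for long-tailed counts.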
● Adaptive or quantile binning
Divides data into equal portions (e.g. by median, quartiles, deciles)
>>> deciles = dataframe['review_count'].quantile([.1, .2, .3, .4, .5, .6, .7, .8, .9])
>>> deciles
0.1     3.0
0.2     4.0
0.3     5.0
0.4     6.0
0.5     8.0
0.6    12.0
0.7    17.0
0.8    28.0
0.9    58.0
Quantile binning with Pandas
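Computing the quantiles is only half of the job; each value still has to be assigned a bin ID. A short sketch with `pandas.qcut`, using quartiles and invented `review_count` values:

```python
import pandas as pd

# Hypothetical long-tailed counts.
reviews = pd.DataFrame({"review_count": [1, 1, 2, 3, 5, 8, 40, 900]})

# Quantile binning into 4 bins (quartiles). labels=False returns the
# integer bin ID for each row instead of an interval label.
reviews["review_bin"] = pd.qcut(reviews["review_count"], q=4, labels=False)
```

Each bin receives roughly the same number of rows, regardless of how skewed the raw values are.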
Compresses the range of large numbers and expands the range of small numbers.
E.g. the larger x is, the more slowly log(x) grows.
[Figure: two histograms of # views by user, raw and smoothed by log(1+x)]
Smoothing long-tailed data with log
● Models that are smooth functions of input features are sensitive to the scale of the input (e.g. Linear Regression)
● Scale numerical variables into a certain range by dividing values by a normalization constant (the shape of each single-feature distribution does not change)
● Popular techniques
○ MinMax Scaling
○ Standard (Z) Scaling
● Min-max scaling squeezes (or stretches) all values into the range [0, 1], which adds robustness to very small standard deviations and preserves zeros in sparse data.
>>> import numpy as np
>>> from sklearn import preprocessing
>>> X_train = np.array([[ 1., -1.,  2.],
...                     [ 2.,  0.,  0.],
...                     [ 0.,  1., -1.]])
>>> min_max_scaler = preprocessing.MinMaxScaler()
>>> X_train_minmax = min_max_scaler.fit_transform(X_train)
>>> X_train_minmax
array([[0.5       , 0.        , 1.        ],
       [1.        , 0.5       , 0.33333333],
       [0.        , 1.        , 0.        ]])
Min-max scaling with scikit-learn
After standardization, a feature has a mean of 0 and a variance of 1 (an assumption of many learning algorithms)
>>> from sklearn import preprocessing
>>> import numpy as np
>>> X = np.array([[ 1., -1., 2.],
... [ 2., 0., 0.],
... [ 0., 1., -1.]])
>>> X_scaled = preprocessing.scale(X)
>>> X_scaled
array([[ 0. ..., -1.22..., 1.33...],
[ 1.22..., 0. ..., -0.26...],
[-1.22..., 1.22..., -1.06...]])
>>> X_scaled.mean(axis=0)
array([ 0., 0., 0.])
>>> X_scaled.std(axis=0)
array([ 1., 1., 1.])
Standardization with scikit-learn
● Simple linear models use a linear combination of the individual input features $x_1, x_2, \dots, x_n$ to predict the outcome $y$:
$y = w_1 x_1 + w_2 x_2 + \dots + w_n x_n$
● An easy way to increase the complexity of the linear model is to create
feature combinations (nonlinear features).
Example (House Pricing Prediction)
Degree 2 interaction features for vector $x = (x_1, x_2)$:
$y = w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + w_5 x_2^2$
[Figure axes: Area (m2), # Rooms, Price]
>>> import numpy as np
>>> from sklearn.preprocessing import PolynomialFeatures
>>> X = np.arange(6).reshape(3, 2)
>>> X
array([[0, 1],
[2, 3],
[4, 5]])
>>> poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=True)
>>> poly.fit_transform(X)
array([[ 1., 0., 1., 0., 0., 1.],
[ 1., 2., 3., 4., 6., 9.],
[ 1., 4., 5., 16., 20., 25.]])
Polynomial features with scikit-learn
● Nearly always need some treatment to be suitable for models
● Examples:
Platform: [“desktop”, “tablet”, “mobile”]
Document_ID or User_ID: [121545, 64845, 121545]
● High cardinality can create very sparse data
● Difficult to impute missing values
● Transform a categorical feature with m possible values into m binary features.
● If the variable cannot be multiple categories at once, then only one bit in the
group can be on.
● Sparse format is memory-friendly
● Example: “platform=tablet” can be sparsely encoded as “2:1”
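A sketch of one-hot encoding with scikit-learn, using the `platform` values from the earlier example. The sample rows are invented, and the column order shown is simply the sorted category order that scikit-learn uses, so column indices may differ from other tools.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical feature (platform) for four hypothetical page views.
platforms = np.array([["desktop"], ["tablet"], ["mobile"], ["tablet"]])

# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(platforms)  # SciPy sparse matrix by default

# Dense view, for inspection only; columns follow encoder.categories_,
# here: desktop, mobile, tablet.
dense = encoded.toarray()
```

The sparse result stores only the four non-zero entries, which is what keeps the representation memory-friendly when m is large.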
● Common in applications like targeted advertising and fraud detection
● Example:
Some large categorical features from Outbrain Click Prediction competition
● Hashes categorical values into fixed-length vectors.
● Lower sparsity and higher compression compared to OHE
● Deals with new and rare categorical values (e.g. new user-agents)
● May introduce collisions
100 hashed columns
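A possible sketch of feature hashing into 100 columns with scikit-learn's `FeatureHasher`. The user-agent strings and the `user_agent=` prefix are hypothetical.

```python
from sklearn.feature_extraction import FeatureHasher

# Hash each categorical value into a fixed-length vector of 100 columns.
hasher = FeatureHasher(n_features=100, input_type="string")

# Hypothetical user-agent values; each sample is a list of string tokens.
user_agents = ["Mozilla/5.0", "curl/7.64", "Mozilla/5.0"]
samples = [[f"user_agent={ua}"] for ua in user_agents]
hashed = hasher.transform(samples)  # sparse matrix of shape (3, 100)

# A value never seen before still maps to one of the 100 columns:
# there is no vocabulary to fit, so no refit is needed.
unseen = hasher.transform([["user_agent=BrandNewBot/1.0"]])

dense = hashed.toarray()
```

Two different values can land in the same column, which is the collision risk mentioned above; increasing `n_features` makes this less likely at the cost of a wider matrix.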
● Instead of using the actual categorical value, use a global statistic of this
category on historical data.
● Useful for both linear and non-linear algorithms
● May give collisions (same encoding for different categories)
● Be careful about leakage
Example statistics: counts, or click-through rate (CTR):
P(click | ad) = ad_clicks / ad_views
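The CTR statistic can be sketched with pandas as follows. The tables and column names (`ad_id`, `clicked`) are made up, and the statistic is computed on historical rows only, as one way to limit leakage.

```python
import pandas as pd

# Hypothetical historical log: one row per ad view, clicked is 0 or 1.
history = pd.DataFrame({
    "ad_id":   [10, 10, 10, 10, 20, 20],
    "clicked": [1, 0, 0, 1, 0, 0],
})

# Per-ad counts on historical data, then P(click | ad) = ad_clicks / ad_views.
stats = history.groupby("ad_id")["clicked"].agg(ad_views="count", ad_clicks="sum")
stats["ctr"] = stats["ad_clicks"] / stats["ad_views"]

# Encode new rows with the historical statistic instead of the raw ad ID.
# Ads with no history (ad 30 here) get NaN and need a fallback value.
new_rows = pd.DataFrame({"ad_id": [10, 20, 30]})
new_rows["ad_ctr"] = new_rows["ad_id"].map(stats["ctr"])
```

Computing the statistic on the same rows that are later used for training would leak the target into the feature, so the history and the rows being encoded should not overlap.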
Factors to consider:
● Multiple time zones in some countries
● Daylight Saving Time (DST)
○ Start and end DST dates
● Apply binning on time data to make it categorical and more general.
● Bin a time into hours or periods of the day, as below.
● Extraction: weekday/weekend, weeks, months, quarters, years...
Hour range            Bin ID  Bin Description
[5, 8)                1       Early Morning
[8, 11)               2       Morning
[11, 14)              3       Midday
[14, 19)              4       Afternoon
[19, 22)              5       Evening
[22, 24) and [0, 5)   6       Night
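The period-of-day table can be turned into a small lookup function. This is only a sketch using the standard library; the function names are hypothetical, and timestamps are assumed to be already in the user's local time.

```python
from datetime import datetime

# (start_hour, end_hour, bin_id, description); end_hour is exclusive.
PERIODS = [
    (5, 8, 1, "Early Morning"),
    (8, 11, 2, "Morning"),
    (11, 14, 3, "Midday"),
    (14, 19, 4, "Afternoon"),
    (19, 22, 5, "Evening"),
]

def period_of_day(ts: datetime) -> tuple:
    """Return (bin_id, description) for the hour of a local timestamp."""
    for start, end, bin_id, name in PERIODS:
        if start <= ts.hour < end:
            return bin_id, name
    # Hours 22-23 and 0-4 are not covered above: they form the Night bin.
    return 6, "Night"

def is_weekend(ts: datetime) -> bool:
    """Saturday is weekday() == 5 and Sunday is weekday() == 6."""
    return ts.weekday() >= 5
```

For example, `period_of_day(datetime(2016, 6, 14, 9, 30))` falls in the Morning bin, and `is_weekend(datetime(2016, 6, 18))` is true because that date was a Saturday.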
● Instead of encoding only total spend, encode things like spend in the last week, spend in the last month, and spend in the last year.
● Gives a trend to the algorithm: two customers with equal total spend can behave very differently. One may be starting to spend more, while the other is starting to spend less.
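A sketch of such windowed spend features with pandas. The purchase log, the reference date `now`, and the helper `spend_since` are all invented for the example.

```python
import pandas as pd

# Hypothetical purchase log for two customers with the same total spend.
purchases = pd.DataFrame({
    "customer": ["a", "a", "a", "b", "b", "b"],
    "date": pd.to_datetime(["2016-01-10", "2016-06-05", "2016-06-28",
                            "2016-01-05", "2016-01-20", "2016-06-02"]),
    "amount": [10.0, 40.0, 50.0, 60.0, 30.0, 10.0],
})
now = pd.Timestamp("2016-06-30")  # assumed reference date for the windows

def spend_since(days: int) -> pd.Series:
    """Total spend per customer within the last `days` days before `now`."""
    recent = purchases[purchases["date"] > now - pd.Timedelta(days=days)]
    return recent.groupby("customer")["amount"].sum()

# Customers with no purchases in a window are missing from that Series,
# so they are filled with 0 after alignment.
features = pd.DataFrame({
    "spend_total": purchases.groupby("customer")["amount"].sum(),
    "spend_last_week": spend_since(7),
    "spend_last_month": spend_since(30),
}).fillna(0.0)
```

Both customers have a total spend of 100, but customer `a` spent 90 of it in the last month while customer `b` spent only 10, which is exactly the trend the windowed features expose.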
● Spatial variables encode a location in space, like:
○ GPS-coordinates (lat. / long.) - sometimes require projection to a different
coordinate system
○ Street Addresses - require geocoding
○ ZipCodes, Cities, States, Countries - usually enriched with the centroid
coordinate of the polygon (from external GIS data)
● Derived features
○ Distance between a user location and searched hotels (Expedia competition)
○ Impossible travel speed (fraud detection)
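Both derived features rest on a distance between two coordinates. A minimal sketch using the haversine (great-circle) formula follows; the Earth radius is the usual mean-radius approximation, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def travel_speed_kmh(distance_km, hours):
    """Speed implied by two events `hours` apart and `distance_km` apart."""
    return distance_km / hours

# Approximate coordinates of São Paulo and Rio de Janeiro.
distance = haversine_km(-23.55, -46.63, -22.91, -43.17)
# Two events 15 minutes apart in those two cities imply a very high speed.
speed = travel_speed_kmh(distance, 0.25)
```

A fraud model could take `speed` as a feature, or compare it against a threshold chosen for the use case; no specific threshold is implied here.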
Reduces model complexity and training time
● Filtering - e.g. correlation or mutual information between each feature and the response variable
● Wrapper methods - expensive, as they search for the best subset of features (e.g. Stepwise Regression)
● Embedded methods - feature selection as part of the model training process (e.g. feature importances of Decision Trees or Tree Ensembles)
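As an illustration of the filtering approach, the sketch below ranks features by their absolute Pearson correlation with the response. The data is synthetic and seeded, with one informative feature and one pure-noise feature.

```python
import numpy as np

# Synthetic data: y depends on the first feature only.
rng = np.random.default_rng(0)
n = 500
informative = rng.normal(size=n)
noise = rng.normal(size=n)
y = 3.0 * informative + rng.normal(scale=0.5, size=n)
X = np.column_stack([informative, noise])

# Filter method: score each feature independently of any model,
# here by the absolute Pearson correlation with the response.
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                   for j in range(X.shape[1])])

# Feature indices ordered from most to least correlated.
ranking = np.argsort(scores)[::-1]
```

Correlation only captures linear relationships; mutual information, the other filter mentioned above, can also pick up nonlinear ones.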
Outbrain Click Prediction - Leaderboard score of my approaches
Towards Automated Feature
Engineering
Deep Learning....
Questions?
Gabriel Moreira
Lead Data Scientist
@gspmoreira
Extended version:
bit.ly/feature_eng_tdc
