Many people think data science is like a Kaggle competition; there are, however, big differences in approach. This presentation is about carefully designing your evaluation scheme to avoid overfitting and unexpected production performance.
This talk is composed of three major parts: the iterative creation of a recommender engine, the labeling of images, and the post-processing of images.
After introducing the main topic, labeling images to improve recommendation engine performance, we start with a discussion of recommendation engines. We briefly describe the “classical” recommender systems (collaborative filtering, content-based filtering) along with their advantages and limitations. We then describe the re-ranking approach we used to combine different engines into one. Re-ranking is a method (used by Google, for example) that takes the different rankings as features and optimizes a certain loss. In our case we combine our different recommendations through a logistic regression that predicts the probability of purchase for each (user, sale) tuple. This version of the engine led to +7% revenue per customer and is now running in production.
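As a minimal sketch of the re-ranking idea (synthetic data, hypothetical engine names; not the production code), a logistic regression can take the scores of several recommenders as features and predict the purchase probability used for the final ranking:

```python
# Illustrative re-ranking sketch: scores from two hypothetical recommenders
# become features of a logistic regression predicting purchase probability
# per (user, sale) pair. All data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
# Hypothetical scores from a collaborative-filtering and a content-based engine
cf_score = rng.random(n)
cb_score = rng.random(n)
# Synthetic purchase labels, loosely correlated with both scores
purchased = ((0.6 * cf_score + 0.4 * cb_score
              + 0.1 * rng.standard_normal(n)) > 0.5).astype(int)

X = np.column_stack([cf_score, cb_score])
reranker = LogisticRegression().fit(X, purchased)

# Re-rank candidates by predicted purchase probability, best first
proba = reranker.predict_proba(X)[:, 1]
ranking = np.argsort(-proba)
```

The point of the combination is that the meta-model learns how much to trust each engine, rather than relying on hand-tuned weights.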
We then explain why we wanted to use image information. It seemed that sales with certain images were performing better than others. If we had labels on all images, we could use them in a content-based recommender system (itself used in the re-ranking engine). We then describe how to label our images using pre-trained models, transfer learning, and external APIs. We also show how easy it is to steal these APIs.
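To illustrate the transfer-learning pattern without downloading an actual pre-trained network, the sketch below stands in for a frozen CNN with a fixed random projection; only the small classification head is trained. All names and data here are invented for illustration:

```python
# Transfer-learning sketch under a strong simplification: a fixed random
# projection plays the role of frozen pre-trained CNN features; a small
# logistic-regression head is trained on top with modest labeled data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

def pretrained_embed(images, proj):
    """Stand-in for frozen CNN features (e.g. the penultimate layer)."""
    flat = images.reshape(len(images), -1)
    return np.tanh(flat @ proj)

# Fake "images": two classes with different mean pixel intensity
imgs = np.concatenate([rng.normal(0.2, 0.1, (50, 8, 8)),
                       rng.normal(0.8, 0.1, (50, 8, 8))])
labels = np.array([0] * 50 + [1] * 50)

proj = rng.standard_normal((64, 16)) / 8.0   # frozen "pre-trained" weights
features = pretrained_embed(imgs, proj)

# Only the small head is trained -- the cheap part of transfer learning
head = LogisticRegression().fit(features, labels)
accuracy = head.score(features, labels)
```

In practice the embedding would come from a network pre-trained on a large open dataset, which is what lets a team without deep learning expertise still benefit from state-of-the-art models.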
The final part deals with post-processing of the images. Since most pre-trained models output only one class prediction, we need to reshape these into broad themes that can be used in our engine. We use non-negative matrix factorization (NMF) for this purpose and show that the results are very interpretable. We conclude by visually comparing the different engines.
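A minimal NMF sketch of this post-processing step, on a synthetic image-by-class score matrix with planted themes (all sizes and names are illustrative):

```python
# NMF sketch: rows are images, columns are the fine-grained classes a
# pre-trained model predicted; NMF groups them into a few broad themes.
# The matrix below is synthetic, with three planted themes.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(1)
n_images, n_classes, n_themes = 100, 20, 3

basis = rng.random((n_themes, n_classes)) ** 3        # sparse-ish class groups
weights = rng.random((n_images, n_themes))
scores = weights @ basis + 0.01 * rng.random((n_images, n_classes))

model = NMF(n_components=n_themes, init="nndsvda", random_state=0, max_iter=500)
image_themes = model.fit_transform(scores)   # (n_images, n_themes)
theme_classes = model.components_            # (n_themes, n_classes)

# Most representative classes per theme -- the interpretable part
top_classes = np.argsort(-theme_classes, axis=1)[:, :5]
```

Because both factors are non-negative, each theme reads as an additive mixture of classes, which is what makes the decomposition easy to inspect by eye.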
The key takeaways (more information in the pitch part) are these:
- Machine learning: an overview of recommender systems, re-ranking, image labeling, and transfer learning.
- Do iterative data science: start simple, then try more complex systems.
- Avoid rushing into deep learning without checking what you can find on the Internet; use pre-trained models and transfer learning.
There is a lot of hype around deep learning and image recognition. However, there are not that many success stories among pure-play web companies. In our case, we explain how we started with simple recommender systems before improving them gradually and finally using image information.
One of the key takeaways is this: do iterative data science. Always prefer shipping a minimum viable product before building something complex. At our clients, we commonly see teams rushing into image projects for the sole purpose of doing deep learning, without a clear ROI in mind.
We insist on the fact that deep learning is not an end in itself. Here, it boils down to making new information available in the system. In this sense, deep learning methods are just an extension of Business Intelligence.
Beyond Churn Prediction: An Introduction to Uplift Modeling - Pierre Gutierrez
These slides are from a talk I gave at the PAPIs conference in Boston in 2016. The main subject is uplift modelling. Starting from a churn-model approach for an e-gaming company, we introduce when to apply uplift methods, how to model them mathematically, and finally how to evaluate them.
I tried to bridge the gap between causal inference theory and uplift theory, especially concerning how to properly cross-validate the results. The notation used is that of uplift modelling.
From Labelling Open Data Images to Building a Private Recommender System - Pierre Gutierrez
Recommender systems are paramount for e-business companies. There is an increasing need to take all user information into account to tailor the best product proposition. One piece of that information is the content the user actually sees: the product visual.
When it comes to hostels, some people may be more attracted by pictures of the room, the building, or even the nearby beach.
In this talk, we will describe how we improved an e-business vacation retailer's recommender system using the content of images. We'll explain how to leverage open datasets and pre-trained deep learning models to derive user taste information. This transfer learning approach enables companies to use state-of-the-art machine learning methods without having deep learning expertise.
This document discusses how to model customer churn through machine learning. It defines churn as customers leaving or stopping usage. There are two types of churn: in subscription models, leaving can be clearly defined, while in non-subscription models, leaving must be approximated. The document recommends predicting churn through classification models to identify potential churners, using customer behavioral and profile features over time. It also discusses evaluating models on validation data and using models to predict future churn and inform retention offers.
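A toy version of the classification approach described above, on synthetic behavioral features (the feature names and the label-generating rule are invented for illustration):

```python
# Illustrative churn-classification sketch on synthetic data: behavioral
# features over an observation window predict churn in the following period,
# evaluated on a held-out validation split.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
n = 500
sessions_last_30d = rng.poisson(5, n)
days_since_last_visit = rng.exponential(10, n)
tenure_months = rng.integers(1, 36, n)

# Synthetic label: inactivity raises churn probability
churn_prob = 1 / (1 + np.exp(-(0.15 * days_since_last_visit
                               - 0.4 * sessions_last_30d)))
churned = (rng.random(n) < churn_prob).astype(int)

X = np.column_stack([sessions_last_30d, days_since_last_visit, tenure_months])
X_train, X_val, y_train, y_val = train_test_split(
    X, churned, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
val_auc = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
```

The held-out split is the important habit here: churn scores are only useful for retention offers if their ranking quality has been checked on data the model never saw.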
This document summarizes a presentation about personalizing artwork selection on Netflix using multi-armed bandit algorithms. Bandit algorithms were applied to choose representative, informative and engaging artwork for each title to maximize member satisfaction and retention. Contextual bandits were used to personalize artwork selection based on member preferences and context. Netflix deployed a system that precomputes personalized artwork using bandit models and caches the results to serve images quickly at scale. The system was able to lift engagement metrics based on A/B tests of the personalized artwork selection models.
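As a much-simplified illustration of the bandit idea (a non-contextual epsilon-greedy policy over made-up Bernoulli click-through rates, not Netflix's actual contextual-bandit system):

```python
# Toy epsilon-greedy bandit: pick the artwork with the best observed
# click rate most of the time, explore a random one occasionally.
# The true click-through rates below are invented for the example.
import random

random.seed(0)
true_ctr = {"artwork_a": 0.05, "artwork_b": 0.12, "artwork_c": 0.08}
counts = {a: 0 for a in true_ctr}
clicks = {a: 0 for a in true_ctr}
epsilon = 0.1

def choose():
    if random.random() < epsilon:                      # explore
        return random.choice(list(true_ctr))
    # exploit: highest observed click rate so far
    return max(true_ctr, key=lambda a: clicks[a] / counts[a] if counts[a] else 0.0)

for _ in range(20000):
    arm = choose()
    counts[arm] += 1
    clicks[arm] += random.random() < true_ctr[arm]     # simulated impression

best = max(counts, key=counts.get)                     # most-served artwork
```

A contextual bandit replaces the single per-arm estimate with a model of reward given member features, but the explore/exploit trade-off is the same.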
H2O World - Top 10 Data Science Pitfalls - Mark LandrySri Ambati
H2O World 2015 - Mark Landry
Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
The document discusses various machine learning concepts like model overfitting, underfitting, missing values, stratification, feature selection, and incremental model building. It also discusses techniques for dealing with overfitting and underfitting like adding regularization. Feature engineering techniques like feature selection and creation are important preprocessing steps. Evaluation metrics like precision, recall, F1 score and NDCG are discussed for classification and ranking problems. The document emphasizes the importance of feature engineering and proper model evaluation.
Machine Learning: Business Perspective - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
Machine Learning has become a must to improve insight, quality and time to market. But it's also been called the 'high interest credit card of technical debt' with challenges in managing both how it's applied and how its results are consumed.
The document discusses clustering and nearest neighbor algorithms for deriving knowledge from data at scale. It provides an overview of clustering techniques like k-means clustering and discusses how they are used for applications such as recommendation systems. It also discusses challenges like class imbalance that can arise when applying these techniques to large, real-world datasets and evaluates different methods for addressing class imbalance. Additionally, it discusses performance metrics like precision, recall, and lift that can be used to evaluate models on large datasets.
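A quick k-means sketch on two synthetic, well-separated blobs, as a stand-in for the clustering examples the slides cover:

```python
# Minimal k-means example: cluster two synthetic 2-D blobs.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
blob_a = rng.normal(0.0, 0.5, (100, 2))   # cluster around the origin
blob_b = rng.normal(5.0, 0.5, (100, 2))   # cluster around (5, 5)
points = np.vstack([blob_a, blob_b])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = km.labels_
```

Real datasets are rarely this clean; the class-imbalance and evaluation issues the document raises are exactly what separates this toy case from practice.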
This document outlines an agenda for a data science boot camp covering various machine learning topics over several hours. The agenda includes discussions of decision trees, ensembles, random forests, data modelling, and clustering. It also provides examples of data leakage problems and discusses the importance of evaluating model performance. Homework assignments involve building models with Weka and identifying the minimum attributes needed to distinguish between red and white wines.
This document appears to be lecture slides for a course on deriving knowledge from data at scale. It covers many topics related to building machine learning models including data preparation, feature selection, classification algorithms like decision trees and support vector machines, and model evaluation. It provides examples applying these techniques to a Titanic passenger dataset to predict survival. It emphasizes the importance of data wrangling and discusses various feature selection methods.
How to Perform Churn Analysis for Your Mobile Application? - Tatvic Analytics
For every mobile application marketer, acquiring new customers requires more effort in terms of time and money. On the other hand, a firm can always focus on maintaining its existing customer base and getting the maximum out of it. If this is the case, predictive analysis is the correct approach.
The primary goal of this webinar is to predict the segments of mobile application users who:
* will uninstall the app, or
* will remain inactive for quite a long time (such users are also termed churners) and are expected to churn.
Churn analysis is the approach by which we predict the likelihood of this event occurring.
Our webinar covers:
* How to extract data from Google Analytics using R
* How to build churn model in R
* Identifying the customer/subscriber segments, classified from past data patterns, that are likely to churn (studying customer behavior patterns)
Watch Full Webinar - http://www.tatvic.com/webinar/churn-analysis-for-mobile-application/
This document discusses the past, present, and future of machine learning. It outlines how machine learning has evolved from early attempts at neural networks and expert systems to today's deep learning techniques powered by large datasets and distributed computing. The document argues that machine learning and predictive analytics will be core capabilities that impact many industries and applications going forward, including personalized insurance, fraud detection, equipment monitoring, and more. Intelligence from machine learning will become "ambient" and help solve hard problems by extracting value from big data.
The document discusses an agenda for a lecture on deriving knowledge from data at scale. The lecture will include a course project check-in, a thought exercise on data transformation, and a deeper dive into ensembling techniques. It also provides tips on gaining experience and intuition for data science, including becoming proficient in tools, deeply understanding algorithms, and focusing on specific data types through hands-on practice of experiments. Attribute selection techniques like filters, wrappers and embedded methods are also covered. Finally, the document discusses support vector machines and handling missing values in data.
Innovation in technology has revolutionized financial services to such an extent that large financial institutions like Goldman Sachs claim to be technology companies! It is no secret that technological innovations like data science and AI are fundamentally changing how financial products are created, tested, and delivered. While it is exciting to learn about the technologies themselves, there is very little guidance available on how companies and financial professionals should retool and gear themselves for the upcoming revolution.
In this master class, we will discuss key innovations in data science and AI and connect these novel fields to applications in forecasting and optimization. Through case studies and examples, we will demonstrate why now is the time to invest in learning about the topics that will reshape the financial services industry of the future!
Topic
- Frontier topics in Optimization
This document discusses various techniques for machine learning when labeled training data is limited, including semi-supervised learning approaches that make use of unlabeled data. It describes assumptions like the clustering assumption, low density assumption, and manifold assumption that allow algorithms to learn from unlabeled data. Specific techniques covered include clustering algorithms, mixture models, self-training, and semi-supervised support vector machines.
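A compact self-training sketch, one of the semi-supervised techniques mentioned (1-D Gaussian classes with only ten labeled points; the confidence threshold and sizes are arbitrary):

```python
# Self-training sketch: fit on the few labeled points, then absorb the
# unlabeled points the model is most confident about, and repeat.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
# Two 1-D Gaussian classes; only 10 of 200 points start out labeled
X = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)]).reshape(-1, 1)
y_true = np.array([0] * 100 + [1] * 100)
labeled = np.zeros(200, dtype=bool)
labeled[:5] = True
labeled[100:105] = True

y = np.where(labeled, y_true, -1)   # -1 marks "unlabeled"

for _ in range(5):                  # a few self-training rounds
    clf = LogisticRegression().fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X)[:, 1]
    confident = (~labeled) & ((proba > 0.95) | (proba < 0.05))
    if not confident.any():
        break
    y[confident] = (proba[confident] > 0.5).astype(int)   # pseudo-labels
    labeled |= confident

accuracy = (clf.predict(X) == y_true).mean()
```

This works here because the low-density assumption holds: the two classes are separated by a sparse region, so confident pseudo-labels are mostly correct.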
This document provides an introduction to machine learning, including:
- It discusses how the human brain learns to classify images and how machine learning systems are programmed to perform similar tasks.
- It provides an example of image classification using machine learning and discusses how machines are trained on sample data and then used to classify new queries.
- It outlines some common applications of machine learning in areas like banking, biomedicine, and computer/internet applications. It also discusses popular machine learning algorithms like Bayes networks, artificial neural networks, PCA, SVM classification, and K-means clustering.
"You can't just turn the crank": Machine learning for fighting abuse on the c... - David Freeman
Fighting fake registrations, phishing, spam and other types of abuse on the consumer web appears at first glance to be an application tailor-made for machine learning: you have lots of data and lots of features, and you are looking for a binary response (is it an attack or not) on each request. However, building machine learning systems to address these problems in practice turns out to be anything but a textbook process. In particular, you must answer such questions as:
- How do we obtain quality labeled data?
- How do we keep models from "forgetting the past"?
- How do we test new models in adversarial environments?
- How do we stop adversaries from learning our classifiers?
In this talk I will explain how machine learning is typically used to solve abuse problems, discuss these and other challenges that arise, and describe some approaches that can be implemented to produce robust, scalable systems.
DutchMLSchool. Logistic Regression, Deepnets, Time Series - BigML, Inc
DutchMLSchool. Logistic Regression, Deepnets, and Time Series (Supervised Learning II) - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
The document outlines Scott Triglia's recommendations for building an initial recommender system at Yelp. It recommends focusing on solving the specific retrieval problem, building for the available infrastructure and team size, and creating a good product rather than beating benchmarks. The proposed system uses multiple experts that each handle a single recommendation reason, like liked businesses from friends. The experts' suggestions are efficiently searched and combined to produce the final results. Future plans include adding more context and personalized ranking.
Anatomy of an Application: Machine Learning End-to-End - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
RecSys Challenge 2016: Job Recommendation Based on... - Vasily Leksin
These slides describe our solution to the RecSys Challenge 2016. In the challenge, several datasets were provided from XING, a social network for business. The goal of the competition was to use these data to predict job postings that a user would interact with positively (click, bookmark, or reply). Our solution includes three different types of models: a factorization machine, item-based collaborative filtering, and a content-based topic model on tags. Thus, we combined collaborative and content-based approaches in our solution.
Our best submission, a blend of ten models, achieved 7th place on the challenge's final leaderboard with a score of 1677898.52. The approaches presented here are general and scalable, and can therefore be applied to other problems of this type.
The document provides guidance on building an end-to-end machine learning project to predict California housing prices using census data. It discusses getting real data from open data repositories, framing the problem as a supervised regression task, preparing the data through cleaning, feature engineering, and scaling, selecting and training models, and evaluating on a held-out test set. The project emphasizes best practices like setting aside test data, exploring the data for insights, using pipelines for preprocessing, and techniques like grid search, randomized search, and ensembles to fine-tune models.
To download please go to: http://www.intelligentmining.com/knowledge-base.html
Slides as presented by Alex Lin to the NYC Predictive Analytics Meetup group: http://www.meetup.com/NYC-Predictive-Analytics/ on April 1, 2010 (no joke!) :)
The ACM RecSys Challenge 2016 focused on the problem of job recommendations: given a user, return a ranked list of jobs the user is likely to be interested in. More than 100 teams actively participated and submitted solutions. All the winning teams used an ensemble of recommender strategies (e.g., learning-to-rank approaches, matrix factorization techniques, etc.). More details: http://2016.recsyschallenge.com/
Traffic and Market Report – On the Pulse of the Networked Society - Ericsson... - Ericsson France
Global mobile subscriptions reached 6.2 billion in Q1 2012 and are expected to grow to around 9 billion by 2017. Mobile broadband subscriptions grew 60% year-over-year in Q1 2012 and are predicted to reach 5 billion by 2017. Mobile data traffic is expected to grow 15 times by 2017 due to increased smartphone and mobile broadband device usage. Regional trends show Asia Pacific having the largest growth in subscriptions while North America is transitioning to LTE earlier than other regions.
Customer churn is not just an abstract figure in reports. It means lost revenue, negative customer reviews, and low morale among your employees.
- Why do customers leave?
- How do you diagnose customer churn?
- What countermeasures exist?
- How do you estimate the damage from losing customers?
- Is it possible to win back departed customers?
Learn how to reduce churn and work more effectively with your customers.
How Coyote Systems uses Dataiku's Data Science Studio to optimi... - Le_GFII
A talk by Hugo Le Squeren, Sales Engineer at Dataiku, and Florian Servaux, Project Manager at Coyote.
DIXIT seminar: The new frontiers of "data intelligence": content analytics, machine learning, and predictive analytics.
Abstract: As in the telecom business, the COYOTE model is subscription-based. As such, retention of the subscriber base is a key success factor. To optimize its retention actions and increase customer knowledge, COYOTE, in partnership with DATAIKU, cross-referenced the various data sources at its disposal. The result is predictive analysis of customer behavior.
Source: http://www.gfii.fr/fr/document/seminaire-dixit-les-nouvelles-frontieres-de-la-data-intelligence-content-analytics-machine-learning-predictif
Dataiku, Pitch Data Innovation Night, Boston, September 16th - Dataiku
The document discusses how Dataiku aims to help data scientists focus on real problems by providing a ready-to-use data science studio platform. The platform offers visual and interactive data preparation tools for data cleaning, guided machine learning for non-ML experts, and production-ready models and insights. Dataiku was founded in 2013 to make data science accessible to anyone by handling real-life data challenges through a common and democratic data science environment.
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013 - Dataiku
Our pitch at Data-Driven NYC meetup on September 17th (http://datadrivennyc.com).
Speaking about data scientists' pains and how Dataiku Data Science Studio can help them be more than data cleaners and data leak fixers!
These slides are from a talk I gave at Google Campus Madrid for the Machine Learning Meetup. The main subject is uplift modelling. Starting from a churn model approach for an e-gaming company, we introduce when to apply uplift methods, how to mathematically model them, and finally, how to evaluate them.
Machine learning and Internet of Things, the future of medical preventionPierre Gutierrez
Title:
"Machine learning and Internet of Things, the future of medical prevention"
Abstract:
In this talk, Pierre Gutierrez, a data scientist at Dataiku, will discuss Dataiku's experiences using machine learning on IOT data. We will talk about the challenges processing and cleaning IoT data, and how to successfully train a model that can be deployed in production. We will illustrate our talk with two examples from our previous work. Creating algorithm for early epilepsy seizure detection based on wearable tech and Detecting people activity through sensor data.
This document proposes three options for renovating a corporate fitness center. Option 1 costs $149,962 and includes Cybex cardio and strength equipment plus a large functional training cage. Option 2 costs $149,596 and is similar but with a smaller training cage. Option 3 costs $146,752 and replaces some Cybex machines with less expensive options but adds more functional training accessories. The proposals include layout diagrams, equipment specifications, and options for financing.
The Rise of the DataOps - Dataiku - J On the Beach 2016 Dataiku
Many organisations are creating groups dedicated to data. These groups have many names : Data Team, Data Labs, Analytics Teams….
But whatever the name, the success of those teams depends a lot on the quality of the data infrastructure and their ability to actually deploy data science applications in production.
In that regards a new role of “DataOps” is emerging. Similar, to Dev Ops for (Web) Dev, the Data Ops is a merge between a data engineer and a platform administrator. Well versed in cluster administration and optimisation, a data ops would have also a perspective on the quality of data quality and the relevance of predictive models.
Do you want to be a Data Ops ? We’ll discuss its role and challenges during this talk
Livre Blanc Attribution Management : entre technologie, marketing et statistiqueConverteo
Converteo et Adversitement, sociétés de conseil et services spécialisées en mesure et amélioration de l’efficacité marketing, ont conjointement élaboré ce livre blanc pour éclairer les annonceurs sur une discipline souvent évoquée dans le monde de l’investissement media mais encore mal maîtrisée : l’Attribution Management.
Each month, join us as we highlight and discuss hot topics ranging from the future of higher education to wearable technology, best productivity hacks and secrets to hiring top talent. Upload your SlideShares, and share your expertise with the world!
Not sure what to share on SlideShare?
SlideShares that inform, inspire and educate attract the most views. Beyond that, ideas for what you can upload are limitless. We’ve selected a few popular examples to get your creative juices flowing.
SlideShare is a global platform for sharing presentations, infographics, videos and documents. It has over 18 million pieces of professional content uploaded by experts like Eric Schmidt and Guy Kawasaki. The document provides tips for setting up an account on SlideShare, uploading content, optimizing it for searchability, and sharing it on social media to build an audience and reputation as a subject matter expert.
DataEngConf SF16 - Three lessons learned from building a production machine l...Hakka Labs
This document discusses three lessons learned from building machine learning systems at Stripe.
1. Don't treat models as black boxes. Early on, Stripe focused only on training with more data and features without understanding algorithms, results, or deeper reasons behind results. This led to overfitting. Introspecting models using "score reasons" helped debug issues.
2. Have a plan for counterfactual evaluation before production. Stripe's validation results did not predict poor production performance because the environment changed. Counterfactual evaluation using A/B testing with probabilistic reversals of block decisions allows estimating true precision and recall.
3. Invest in production monitoring of models. Monitoring inputs, outputs, action rates, score
This document discusses automated testing and different levels of testing. It explains that automated tests should be written to achieve good design, clarify how a system works, understand the system, and minimize risks. While automated tests require time to write, maintain, and execute, these efforts can be minimized by making tests easy to run and maintain with minimal dependencies. The document also discusses different perspectives on testing, including what developers and customers want. Developers are concerned with complexity and risk, while customers care about functionality and user scenarios. It notes that while unit tests have limitations, acceptance tests are also limited as they are slow, fragile, and not isolated. The "lost layer" of the testing pyramid is also mentioned - the presentation, service, and persistence
From science to engineering, the process to build a machine learning productBruce Kuo
This document discusses the process of developing a machine learning product from science to engineering. It begins with defining the business problem and objectives, then researching potential machine learning solutions through experimentation. Next, it covers evaluating solutions offline and defining metrics before integrating the model. Engineering aspects like serialization, APIs, pipelines and monitoring are also discussed. The goal is to share an overview of a machine learning project lifecycle and highlight connections between business needs and technical implementation.
This was my presentation at the World Congress for Project Managers and Business Analysts in Orlando (2013). While the title/teaser was simply playing with our fascination with the book "From good to great", the subject was very serious and pragmatic: using process simulations to learn, understand, explore, and ultimately decide.
Drifting Away: Testing ML Models in ProductionDatabricks
Deploying machine learning models has become a relatively frictionless process. However, properly deploying a model with a robust testing and monitoring framework is a vastly more complex task. There is no one-size-fits-all solution when it comes to productionizing ML models, oftentimes requiring custom implementations utilising multiple libraries and tools. There are however, a set of core statistical tests and metrics one should have in place to detect phenomena such as data and concept drift to prevent models from becoming unknowingly stale and detrimental to the business.
Combining our experiences from working with Databricks customers, we do a deep dive on how to test your ML models in production using open source tools such as MLflow, SciPy and statsmodels. You will come away from this talk armed with knowledge of the key tenets for testing both model and data validity in production, along with a generalizable demo which uses MLflow to assist with the reproducibility of this process.
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Rodney Joyce
Number 2 in the Data Science for Dummies series - We'll predict Titanic survival with Databricks, python and MLSpark.
These are the slides only (excuse the Powerpoint animation issues) - check out the actual tech talk on YouTube: https://rodneyjoyce.home.blog/2019/05/03/data-science-for-dummies-machine-learning-with-databricks-python-sparkml-tech-talk-1-of-7/)
If you have not used Databricks before check out the first talk - Databricks for Dummies.
Here's the rest of the series: https://rodneyjoyce.home.blog/tag/data-science-for-dummies/
1) Data Science overview with Databricks
2) Titanic survival prediction with Azure Machine Learning Studio + Kaggle
3) Data Engineering with Titanic dataset + Databricks + Python
4) Titanic with Databricks + Spark ML
5) Titanic with Databricks + Azure Machine Learning Service
6) Titanic with Databricks + MLS + AutoML
7) Titanic with Databricks + MLFlow
8) Titanic with .NET Core + ML.NET
9) Deployment, DevOps/MLOps and Productionisation
This document discusses using a hackathon to explore problems in digital and retail behavior analysis, propensity marketing, and promotion modeling using case studies, analytics strategies, and machine learning/data mining techniques. It provides examples of long tail analysis from search engines, describes visualizing query length data in real-time, and reviews using demographic data and web logs for targeting. The document emphasizes structuring hackathon problems, iterating on visualizations before machine learning, and partnering with retailers rather than just releasing APIs.
Toolkits and tips for UX analytics CRO by Craig SullivanUXPA UK
This document summarizes Craig Sullivan's presentation on UX and conversion rate optimization (CRO). Some of the key topics and tools discussed include: conducting guerrilla usability testing outside the office; using session replay tools to analyze user behavior; leveraging the voice of the customer through surveys and customer feedback; acting like a private investigator to research competitors; and experimenting with split testing tools and analytics to optimize conversions and revenue. Specific examples are provided of split testing people in photos and the effectiveness of different images in TV advertising. The presentation emphasizes the importance of blending UX techniques with analytics, testing, and data to create successful, customer-delighted products and services.
How to Apply Machine Learning by Lyft Senior Product ManagerProduct School
This document describes courses offered by Product School to help product managers gain skills in areas like product management, coding, data analytics, digital marketing, UX design, and product leadership. It also provides an overview of a talk on applying machine learning given by a Lyft senior product manager. The talk explains what machine learning is, the different types of machine learning problems, and how product managers can identify opportunities, define problems, and guide machine learning solutions and teams. Examples are provided around replacing cash bail and automating food delivery order disputes.
The document discusses agile estimation and planning techniques. It defines estimation as measuring the effort and time required to complete tasks and user stories. Some benefits of estimation include allowing effective decision making and prioritization. Techniques covered include relative estimation using story points, planning poker for estimating story points, and using velocity to estimate what a team can complete per sprint. The document also discusses agile planning at the release, sprint, and daily levels.
This presentation introduces the concept of Machine Learning and then discusses how Machine Learning is being used in the Predictive Maintenance domain.
Choosing the right process improvement tool for your project.
Learn how an experienced engineer decides when simulation is the right tool for his projects,
and when it isn't.
With the evolution of process improvement software, it can be difficult to decide the right tool for the job. Using something too powerful and complex can be a lengthy and unnecessary process, but underestimating the depth of analysis required and choosing something too simplistic early in a project can result in repeated work later.
The document summarizes a data science project on bank marketing data using various tools in IBM Watson Studio. The project followed a standard methodology of data exploration, feature engineering, model selection, training and evaluation. Random forest, XGBoost, LightGBM and deep learning models were tested. LightGBM performed best with a 95.1% ROC AUC score from AutoAI hyperparameter tuning. The best model was deployed to IBM Watson Machine Learning for production use. Overall, the project demonstrated the effectiveness of the Watson Studio platform and tools in developing performant models from structured data.
The Machine Learning Workflow with AzureIvo Andreev
This document provides an overview of real world machine learning using Azure. It discusses the machine learning workflow including data understanding, preprocessing, feature engineering, model selection, evaluation and tuning. It then describes various Azure machine learning tools for building, testing and deploying machine learning models including Azure ML Workbench, Studio, Experimentation Service and Model Management Service. It concludes with an upcoming demo of predictive maintenance using Azure ML Studio.
Simulation involves developing a model of a real-world system over time to analyze its behavior and performance. The key aspects covered in this document include defining simulation as modeling the operation of a system over time through artificial history generation and observation. Simulation models can be used as analysis and design tools to predict the effects of changes to a system before actual implementation. Discrete event simulation is discussed as a common technique that models systems with state changes occurring at discrete points in time. The document also outlines the steps in a typical simulation study including problem formulation, model conceptualization, experimentation and analysis.
Sample Codes: https://github.com/davegautam/dotnetconfsamplecodes
Presentation on How you can get started with ML.NET. If you are existing .NET Stack Developer and Wanna use the same technology into Machine Learning, this slide focuses on how you can use ML.NET for Machine Learning.
This document discusses how data can be leveraged for product management. It outlines how data can be used as the core of a product, to optimize unit economics by ensuring lifetime value exceeds customer acquisition costs, for marketing optimizations by testing channels and optimizing return on marketing investment, and for product optimizations through A/B testing. However, it notes that most A/B tests are not statistically valid or impactful. It also discusses using data for personalization, including basic personalization triggers not requiring data science. Overall, the document advocates using data to understand customers, test changes, and optimize performance.
Travis Cox, Kathy Applebaum, and Kevin McClusky from Inductive Automation will discuss key concepts and best practices, show demos, and answer questions from the audience, to help you start integrating ML into your day-to-day processes.
Learn more about:
• Practical ways to use ML in your factory or facility
• What you'll need to get started
• Existing ML tools and platforms
• And more
The Power of Auto ML and How Does it WorkIvo Andreev
Automated ML is an approach to minimize the need of data science effort by enabling domain experts to build ML models without having deep knowledge of algorithms, mathematics or programming skills. The mechanism works by allowing end-users to simply provide data and the system automatically does the rest by determining approach to perform particular ML task. At first this may sound discouraging to those aiming to the “sexiest job of the 21st century” - the data scientists. However, Auto ML should be considered as democratization of ML, rather that automatic data science.
In this session we will talk about how Auto ML works, how is it implemented by Microsoft and how it could improve the productivity of even professional data scientists.
computer organization and assembly language : its about types of programming language along with variable and array description..https://www.nfciet.edu.pk/
Telangana State, India’s newest state that was carved from the erstwhile state of Andhra
Pradesh in 2014 has launched the Water Grid Scheme named as ‘Mission Bhagiratha (MB)’
to seek a permanent and sustainable solution to the drinking water problem in the state. MB is
designed to provide potable drinking water to every household in their premises through
piped water supply (PWS) by 2018. The vision of the project is to ensure safe and sustainable
piped drinking water supply from surface water sources
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsContify
AI competitor analysis helps businesses watch and understand what their competitors are doing. Using smart competitor intelligence tools, you can track their moves, learn from their strategies, and find ways to do better. Stay smart, act fast, and grow your business with the power of AI insights.
For more information please visit here https://www.contify.com/
Thingyan is now a global treasure! See how people around the world are search...Pixellion
We explored how the world searches for 'Thingyan' and 'သင်္ကြန်' and this year, it’s extra special. Thingyan is now officially recognized as a World Intangible Cultural Heritage by UNESCO! Dive into the trends and celebrate with us!
How iCode cybertech Helped Me Recover My Lost Fundsireneschmid345
I was devastated when I realized that I had fallen victim to an online fraud, losing a significant amount of money in the process. After countless hours of searching for a solution, I came across iCode cybertech. From the moment I reached out to their team, I felt a sense of hope that I can recommend iCode Cybertech enough for anyone who has faced similar challenges. Their commitment to helping clients and their exceptional service truly set them apart. Thank you, iCode cybertech, for turning my situation around!
[email protected]
2. • A Data Science competitions platform
(there are others: DataScience.net in France)
• 332,000 data scientists
• Today: 192 competitions, 18 active
+ 516 "in class", 12 active
• Prestigious clients: Axa, CERN, Caterpillar, Facebook, GM, Microsoft, Yandex…
What is Kaggle?
3. • Prize pool?
• $325,000 to be made as of August 31st
• Good luck with that!
• Not a good hourly wage
• Today: 192 competitions, 18 active
Understand:
• Lots of datasets about approximately every DS topic
• Lots of winners' solutions, tips and tricks, etc.
• Lots of "beat the benchmark" scripts for beginners
I discovered/tested there: GBT, xgboost, Keras, word2vec, BeautifulSoup, hyperopt, ...
Why should I join?
4. Most of the time:
• You have a train set with labels and a test set without labels.
• You need to learn a model using the train features and predict the test set labels
• Your prediction is evaluated using a specific metric
• The best prediction wins
What is a Data Science Competition?
5. Most of the time:
• You have a train set with labels and a test set without labels.
• You need to learn a model using the train features and predict the test set labels
• Your prediction is evaluated using a specific metric
• The best prediction wins
What is a Data Science Competition?
Questions on the slide: Why AUC? F1 score? Log loss? Could that depend on my train/test split? Where do they come from? Do you always have some? Why is the split this way? Random? Time?
6. What you don't learn on Kaggle (or in class?):
• How to model a business question as an ML problem.
• How to manage/create labels (proxy / missing…)
• How to evaluate a model:
• How to choose your metric
• How to design your train/test split
• How to account for this in feature engineering
Understanding this actually helps you in Kaggle competitions:
• How to design your cross-validation scheme (and not overfit)
• How to create relevant features
• Hacks and tricks (leak exploitation :) )
What is a Data Science Competition?
9. • Introduction
• Labels?
• Train and test split?
• Feature Engineering?
• Evaluation Metric?
Introduction
10. • Introduction
• Labels?
• Train and test split?
• Feature Engineering?
• Evaluation Metric?
Introduction
The newcomer disillusion
The production bad surprise
The business obfuscation
11. • Senior Data Scientist at Dataiku
(worked on churn prediction, fraud detection, bot detection, recommender systems,
graph analytics, smart cities,…)
• (More than) Occasional Kaggle competitor
• Twitter @prrgutierrez
Who I am
14. • Fraud is everywhere
E-business, telco, Medicare,…
• Easily defined as a classification problem
• Is the target well defined?
• E-business: yes, with a lag
• Elsewhere: checks are needed,
labels are expensive
Fraud Detection
15. • Wikipedia:
“Churn rate (sometimes called attrition rate), in its broadest sense, is a measure of the
number of individuals or items moving out of a collective group over a specific period of
time”
= Customer leaving
Churn
16. • Subscription models:
• Telco
• E-gaming (WoW)
• Ex: Coyote -> 1-year subscription
-> you know when someone leaves
• Non-subscription models:
• E-business (Amazon, Price Minister, Vente Privée)
• E-gaming (Candy Crush, free MMORPGs)
-> you approximate when someone leaves
Candy Crush: days / weeks
MMORPG: 2 months (holidays)
Price Minister: months
Two types of Churn
17. • Predict whether a vehicle / machine / part is going to fail
• Classification problem:
• Given a future horizon and a failure type, will this happen for a given vehicle?
-> 2 parameters describe the target
• Varying the target a lot -> spurious correlations
• Just choose it from the exact business need
Predictive Maintenance
18. • Target is "will like" or "will buy"
• Target is often a proxy of real interest (implicit feedback)
Recommender System
19. • Can you model the problem as an ML problem?
• Ex: predictive maintenance
• Ask the right question from a business point of view,
not what you know how to do.
• Is your target a proxy?
• Ex: recommendation systems
• May need bandit algorithms
• Is it easy to get labels?
• Ex: fraud detection
• Can be expensive
• Mechanical Turk can be the answer
Summary on Labels
20. • Random split
• Just like in school
• When and why?
-> When each line is independent from the rest (not that common!)
e.g. image / document classification, sentiment analysis ("but aha is the new lol")
-> When you want to quickly iterate / benchmark: "is it even possible?"
-> When you want to sell something to your boss
Train / test split
21. • Column / group based
• Ex: Caterpillar challenge
• Predict a price for each tube id
• Tube ids in train and test are different
• Objective: being able to generalize to other tubes!
Train / test split
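The column/group-based split above can be sketched in a few lines. This is a minimal illustration on hypothetical rows keyed by a `tube` id (the function name and data are made up); scikit-learn's `GroupShuffleSplit` / `GroupKFold` provide the same behavior off the shelf:

```python
import random

def group_split(rows, group_key, test_frac=0.3, seed=0):
    """Split rows so that no group (e.g. tube id) appears in both
    train and test: the model must generalize to unseen groups."""
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, int(len(groups) * test_frac))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test
```

A random row-level split would put rows of the same tube on both sides and overestimate performance; splitting by group avoids that.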
22. • Time based
• Simply separate train and test on a time variable
• When and why?
-> When you want a model that "predicts the future"
-> When things evolve with time! (most problems!)
-> Examples:
ad click prediction, churn prediction, e-business fraud detection, predictive maintenance,…
Train / test split
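A time-based split is simple to implement: pick a cutoff date and never let the model train on anything at or after it. A minimal sketch with made-up click data (names are illustrative):

```python
from datetime import date

def time_split(rows, date_key, cutoff):
    """Train on everything strictly before the cutoff, test on the rest,
    so the model is always evaluated on 'future' data."""
    train = [r for r in rows if r[date_key] < cutoff]
    test = [r for r in rows if r[date_key] >= cutoff]
    return train, test

rows = [
    {"day": date(2016, 1, 5), "clicked": 0},
    {"day": date(2016, 2, 1), "clicked": 1},
    {"day": date(2016, 3, 12), "clicked": 0},
    {"day": date(2016, 4, 2), "clicked": 1},
]
train, test = time_split(rows, "day", date(2016, 3, 1))
# train holds the two earliest rows, test the two latest
```

scikit-learn's `TimeSeriesSplit` generalizes this to several expanding train/test folds.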
23. • Non-subscription example
• Target: 4 months without buying
• Features?
Train / test split: Churn example
24. Ex : Train and predict scheme
Timeline (present time T):
• Data before T – 4 months is used for feature generation.
• Data from T – 4 months to T is used for target creation: activity during the last 4 months.
• Train the model using features and target.
• Use the model to predict future churn.
25. Ex : Train, Evaluation and Predict Scheme
Timeline (present time T):
• Training: data before T – 8 months is used for feature generation; data from T – 8 months to T – 4 months is used for target creation (activity during the last 4 months).
• Validation set: data before T – 4 months is used for feature generation; data from T – 4 months to T is used for target creation. Evaluate on the target of the validation set.
• Use the model to predict future churn.
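The windowing scheme above can be sketched as follows, assuming a hypothetical per-customer purchase log (function and field names are made up): features only use data strictly before the reference date, and the target is "no purchase in the following 4 months". Sliding the reference date back by 4 months turns the same code into the training set builder:

```python
from datetime import date, timedelta

def make_example(purchases, ref_date, window_days=120):
    """Build one (features, target) pair at a reference date:
    features come only from purchases strictly before ref_date,
    the target 'churned' = no purchase in the window after ref_date."""
    past = [d for d in purchases if d < ref_date]
    future = [d for d in purchases
              if ref_date <= d < ref_date + timedelta(days=window_days)]
    features = {
        "n_past_purchases": len(past),
        "days_since_last": (ref_date - max(past)).days if past else None,
    }
    churned = int(len(future) == 0)
    return features, churned
```

Keeping feature and target windows strictly separated is what prevents the "production bad surprise": the model never sees data it would not have at prediction time.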
26. • More complex designs
• Graph sampling (fraud rings?)
• Random sampling in a client's / machine's life
• Mix of column-based and time-based…
• The rule:
1) What is the problem?
2) To what would I like to generalize my model?
The future? Other individuals? …
3) => Train / test split
Train / test split
27. • Predictive maintenance problem
• Objective: predict failures in the next 3 days.
• The metric is proportional to accuracy (and 0.57 is the best score!)
• Link to data:
https://www.phmsociety.org/events/conference/phm/14/data-challenge
Ex: PHM Society (a fail example)
31. • How to design the evaluation scheme?
• What is the probability that an asset fails in the next 3 days from now?
-> classification problem
-> time-based split
-> but how do I create a train and a test set?
• Choose a date and evaluate what happens 3 days later?
-> problem: not enough failures happening
• Choose several dates for each asset?
-> beware of asset over-fitting
• In the challenge: random selection of (asset, date) pairs in the future + oversampling of failures.
Ex: PHM Society
37. • Beware of the distribution of your features!
• Is there a time dependency?
• Ex: counts, sums, … that will only increase with time
• -> Calculate counts and sums rescaled by time / in moving windows instead.
• Can be found in churn, fraud detection, ad click prediction,…
• A categorical variable dependency?
• Ex: email flag in fraud detection
• Is there a network dependency?
• Ex: fraud / bot detection (network features can be useful but leaky)
Feature Engineering
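The moving-window idea can be illustrated with a small helper (the event log and names are hypothetical): instead of a lifetime count, which only grows with time and leaks how old a row is, count events in a fixed window before the reference date:

```python
from datetime import date, timedelta

def windowed_count(events, ref_date, window_days=30):
    """Count events in the last `window_days` before ref_date.
    Unlike a lifetime count, this feature has the same scale for
    old and recent rows in a time-based train/test split."""
    lo = ref_date - timedelta(days=window_days)
    return sum(1 for d in events if lo <= d < ref_date)
```

With `window_days` large enough to cover all history, this degrades back to the leaky lifetime count, which is exactly what the moving window is meant to avoid.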
38. • Final trick:
- Stack train and test and add an is_test boolean
- Try to predict is_test
- Check whether the model is able to predict it
- If so:
- check the feature importances
- remove / modify features and iterate
Feature Engineering
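This trick is often called adversarial validation. A sketch using scikit-learn (the function name and data are illustrative, not the speaker's actual code): stack train and test, label which rows are which, and see whether a classifier can tell them apart. A cross-validated AUC well above 0.5 means the split is predictable, and the feature importances point at the offending features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

def adversarial_validation(X_train, X_test, feature_names):
    """Try to predict whether a row comes from train (0) or test (1).
    Returns the cross-validated AUC and features ranked by importance."""
    X = np.vstack([X_train, X_test])
    y = np.r_[np.zeros(len(X_train)), np.ones(len(X_test))]
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    scores = cross_val_predict(clf, X, y, cv=3, method="predict_proba")[:, 1]
    auc = roc_auc_score(y, scores)
    clf.fit(X, y)  # refit on everything to read feature importances
    ranked = sorted(zip(feature_names, clf.feature_importances_),
                    key=lambda t: -t[1])
    return auc, ranked
```

An AUC near 0.5 means train and test are indistinguishable, which is what you want before trusting your cross-validation scores.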
39. • Final trick:
• Back to the PHM example:
Feature Engineering
Huge time leak!
40. • "Threshold dependent"
• Accuracy
• Precision and Recall
• F1 score
• "Threshold independent"
• AUC
• Log loss
• Others (mean average precision)…
Evaluation metric: Classification
41. • "Threshold dependent"
• Accuracy
• Precision and Recall
• F1 score
• "Threshold independent"
• AUC
• Log loss
• Others (mean average precision)…
• Customs
Evaluation metric: Classification
Annotations on the slide: "Not good if unbalanced target", "When you have an order problem", "When you are going stochastic", "When you need to stick to business", "Accuracy alternative".
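All of these metrics are available in scikit-learn. A small example (the toy labels and scores are made up) showing which ones need a hard threshold on the scores and which ones work on the scores directly:

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]            # model probabilities
y_pred = [int(s >= 0.5) for s in y_score]  # threshold-dependent metrics need a cut

print(accuracy_score(y_true, y_pred))  # threshold dependent
print(f1_score(y_true, y_pred))        # threshold dependent
print(roc_auc_score(y_true, y_score))  # threshold independent: uses the ranking
print(log_loss(y_true, y_score))       # threshold independent: uses probabilities
```

Note that moving the 0.5 cut changes accuracy and F1 but leaves AUC and log loss untouched, which is exactly the "threshold dependent / independent" distinction of the slide.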
42. • Custom metrics
• Cost based
• Ex fraud:
• Mean loss of $50 per missed fraud (FN)
• Mean loss of $20 per wrongly cancelled transaction (FP)
• F1 score often used in papers
• In practice, you often have a business cost
Evaluation metric: Classification
(confusion matrix on the slide: TP, FN, TN, FP)
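The cost-based metric of the fraud example fits in a few lines (a sketch using the $50 / $20 figures above; the function name is illustrative):

```python
def business_cost(y_true, y_pred, fn_cost=50.0, fp_cost=20.0):
    """Cost-based metric: $50 per missed fraud (false negative) and
    $20 per wrongly cancelled transaction (false positive). Lower is better."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return fn * fn_cost + fp * fp_cost

y_true = [1, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0]  # one missed fraud, one false alarm
print(business_cost(y_true, y_pred))  # 70.0
```

Unlike F1, this metric is expressed in dollars, so the optimal threshold can be read directly off the business trade-off.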
43. • Custom metrics
• Fraud example 1:
• "I have fraudsters on my e-business website"
• I generate a score for each transaction
• Transactions with a score higher than a threshold are handled manually
• I have 1 person doing this full time, able to deal with 100 transactions / day
• The rest is automatically accepted
-> AUC is not bad
-> Recall in the top 100 transactions / day
-> Total money blocked in the top 100 transactions / day
In practice AUC is more stable… but the money metric can also be used for communication.
Evaluation metric: Classification
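The "recall in the top 100 transactions / day" metric is a recall-at-k style measure; a minimal sketch (names are illustrative, and k is shrunk for the toy example):

```python
def recall_at_k(y_true, y_score, k=100):
    """Share of all frauds caught when the analyst reviews only the
    k highest-scored transactions of the day."""
    ranked = sorted(zip(y_score, y_true), reverse=True)
    caught = sum(label for _, label in ranked[:k])
    total = sum(y_true)
    return caught / total if total else 0.0
```

Because it only looks at the top of the ranking, this metric matches the analyst's actual workload, whereas AUC averages over all thresholds.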
44. • Custom metrics
• Fraud example 2:
• "I have fraudsters on my e-business website"
• I generate a score for each transaction
• All transactions with a score higher than a threshold are automatically blocked
-> AUC is not bad… but doesn't give the threshold value.
-> F1 score?
-> Cost based is better
Evaluation metric: Classification
45. • My cheat sheet
Evaluation metric: Classification

Metric | Optimized by ML model? | Threshold dependent? | Application example
Accuracy | YES | YES | image classification, NLP, …
F1-score | NO | YES | papers?
AUC | NO | NO | fraud detection, churn, healthcare, …
Log loss | YES | NO | ad click prediction
Custom metric | NO | ? | all?
46. • The business question dictates the evaluation scheme!
• test set design
• evaluation metric
• Indirectly impacts feature engineering
• Indirectly impacts label quality
• Think (but not too much) before coding
• Don't try to optimize the wrong problem!
Conclusion