Make Sense Out of Data with Feature Engineering

Xavier Conort
Chief Data Scientist at DataRobot
@Melbourne Data Science Initiative 2016

Agenda
Preamble
2 examples: diabetes diagnosis prediction and MOOC dropout prediction
Key takeaways

Automation is an integral part of human civilization

Key elements for a successful car journey
Car
Destination
Crude oil
Refined oil: process oil into more useful products such as gasoline
A successful journey

Key elements for a successful data science journey
Car = Modelling engine
Machine learning solutions increasingly replace traditional statistical
approaches. They can automate the modelling process and produce
world-class predictive accuracy without much effort
Destination = Outcome
A well-defined outcome to predict and a well-defined
process for using it to solve business problems
Crude oil = Raw data
Increased data volume and the capacity to handle
terabytes of data
Refined oil = Feature engineering
The talent to extract from raw data
information that can be used by models
Also shown: open source programming, social network of coders, automated solutions

Refined oil for Machine Learning:
Flat File Data Format
© DataRobot, Inc. All rights reserved. Confidential
● 1 record per prediction event
● 1 column for each predictive field / feature
● 1 column for the value to be predicted
(training data only)
ID  Feature 1  Feature 2  Feature 3  Feature 4  Feature 5  Target
1    0.73      Female       5.28     Thursday    37.52     Yes
2    0.20      Male         4.20     Tuesday     35.04     Yes
3    1.82      Male        14.71     Friday       7.02     Yes
4   -0.69      Female      11.82     Sunday      -3.29     No
5    1.07      Male        16.55     Monday      12.59     Yes
6   -0.27      Male        10.87     Thursday    -8.19     Yes
7    2.88      Male         5.24     Wednesday  -21.67     No
8    1.35      Female       9.40     Tuesday      9.70     Yes
9    0.73      Female       1.04     Sunday      26.60     Yes
10   0.02      Female      -9.79     Saturday   -14.47     Yes
11   3.43      Male        11.59     Thursday    27.48     No
12   2.56      Female     -13.25     Saturday    12.41     No

Feature Engineering that we will cover today
● Variables you should not use
● Dealing with high dimensional features
● Using external data to add valuable information
● Dealing with transactional data

Example 1: predicting diabetes diagnoses
● Hosted by Practice Fusion, a cloud-based electronic health record
platform for doctors and patients
● Challenge: given a de-identified data set of patient electronic health
records, build a model to determine who has a diabetes diagnosis
● Data:
○ 17 tables containing 4 years of medical record history

Think of variables you should not use
Feature Engineering the YearOfBirth Value
● We expect that as a patient gets older their risk of
diabetes will increase, yet their YearOfBirth value
remains static
● We need a feature that changes as the patient gets
older
● The true predictor of diabetes is more likely to be age
than year of birth
● The data is extracted at the end of 2012
● Age = 2012 - YearOfBirth
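The Age derivation above can be sketched as follows. This is a minimal illustration, not the original solution: the records are made up, and dropping YearOfBirth once Age exists is an assumption suggested by the slide title.

```python
# The slides state that the data was extracted at the end of 2012.
EXTRACTION_YEAR = 2012

# Made-up records; only the field name YearOfBirth comes from the slides.
patients = [
    {"PatientGuid": "A", "YearOfBirth": 1950},
    {"PatientGuid": "B", "YearOfBirth": 1987},
]

def add_age(record, extraction_year=EXTRACTION_YEAR):
    """Return a copy of the record with Age added and YearOfBirth dropped."""
    out = {key: value for key, value in record.items() if key != "YearOfBirth"}
    out["Age"] = extraction_year - record["YearOfBirth"]
    return out

patients_with_age = [add_age(p) for p in patients]
```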

Learn to deal with high dimensionality
● Add characteristics of levels of categorical
features to see how similar they are
● Use location for regional categories to see how
close they are
● Group hierarchical categories together based
on those hierarchies
● Text mine the descriptions
● Use the overall ordinal frequency ranking as a
feature
● Use likelihood / credibility features, as top Kagglers do
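Two of the ideas above, the frequency ranking and the likelihood / credibility feature, might look like the sketch below. The slides give no formula, so the credibility weighting n / (n + k) and the constant k are assumptions, and the rows are made up.

```python
from collections import Counter, defaultdict

# Made-up rows: (category level, binary target).
rows = [("NY", 1), ("NY", 0), ("NY", 1), ("CA", 0), ("CA", 0), ("TX", 1)]

def frequency_rank(rows):
    """Map each level to its rank by frequency (1 = most frequent)."""
    counts = Counter(level for level, _ in rows)
    ordered = [level for level, _ in counts.most_common()]
    return {level: rank for rank, level in enumerate(ordered, start=1)}

def likelihood_encoding(rows, k=5.0):
    """Shrink each level's target mean toward the global mean.

    credibility = n / (n + k), so rare levels stay close to the global mean.
    k is a hypothetical smoothing constant that would need tuning.
    """
    sums, counts = defaultdict(float), Counter()
    for level, target in rows:
        sums[level] += target
        counts[level] += 1
    global_mean = sum(target for _, target in rows) / len(rows)
    encoded = {}
    for level, n in counts.items():
        credibility = n / (n + k)
        encoded[level] = credibility * (sums[level] / n) + (1 - credibility) * global_mean
    return encoded

ranks = frequency_rank(rows)
likelihood = likelihood_encoding(rows)
```

Because the encoding uses the target, it is usually computed on rows other than the one being encoded (for example out-of-fold) to avoid leaking the target into the feature.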

Case study: engineer state variable
● So that the machine learning algorithms can
know which states are near (and possibly
similar to) each other
● Centroid for each of the 51 states
http://dev.maxmind.com/geoip/legacy/codes/state_latlon/
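As a rough sketch of the idea above, each state code can be replaced by its centroid coordinates so that a model can see which states are close. The field name State and the helper names are hypothetical, and the three centroids are approximate values for illustration; the full table would come from the source cited in the slide.

```python
import math

# Approximate (latitude, longitude) centroids for three states, illustration only.
STATE_CENTROIDS = {
    "CA": (36.17, -119.75),
    "NV": (39.33, -116.63),
    "NY": (42.15, -74.94),
}

def add_state_coordinates(record):
    """Return a copy of the record with centroid latitude and longitude added."""
    lat, lon = STATE_CENTROIDS.get(record["State"], (None, None))
    return {**record, "StateLat": lat, "StateLon": lon}

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))
```

With numeric coordinates, tree-based models can split on latitude and longitude directly, which is one way for them to group neighbouring states.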

Case study: engineer diagnosis
● Use the ICD9 code groupings
https://en.wikipedia.org/wiki/List_of_ICD-9_codes
○ So that the machine learning algorithms
can know which diagnoses are similar to
each other
○ Count the observations in each group
● Use the ICD9 descriptions
○ Do text mining on the descriptions to
find words or phrases within the
descriptions
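The grouping and counting steps above can be sketched as follows. The ranges are the top-level ICD-9 chapters; the short labels and function names are my own and not taken from the slides.

```python
from collections import Counter

# Top-level ICD-9 chapters as (first code, last code, short label).
ICD9_CHAPTERS = [
    (1, 139, "infectious"), (140, 239, "neoplasms"), (240, 279, "endocrine"),
    (280, 289, "blood"), (290, 319, "mental"), (320, 389, "nervous"),
    (390, 459, "circulatory"), (460, 519, "respiratory"), (520, 579, "digestive"),
    (580, 629, "genitourinary"), (630, 679, "pregnancy"), (680, 709, "skin"),
    (710, 739, "musculoskeletal"), (740, 759, "congenital"), (760, 779, "perinatal"),
    (780, 799, "symptoms"), (800, 999, "injury"),
]

def icd9_group(code):
    """Map an ICD-9 code string such as '454.1' to its chapter label."""
    code = code.strip().upper()
    if code.startswith("V"):
        return "supplementary_v"
    if code.startswith("E"):
        return "external_e"
    major = int(code.split(".")[0])
    for first, last, label in ICD9_CHAPTERS:
        if first <= major <= last:
            return label
    return "unknown"

def group_counts(codes):
    """Count one patient's diagnoses per chapter."""
    return Counter(icd9_group(code) for code in codes)
```

Each chapter count then becomes one column in the flat file, for example group_counts(["391", "401", "463"]) yields two circulatory diagnoses and one respiratory diagnosis.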

Case Study: engineer drugs
● Use drug databases
http://www.fda.gov/drugs/informationondrugs/ucm142438.htm
● To enable the machine learning algorithm
to know which drugs are similar:
○ Replace proprietary brand names with
generic medication names
○ Text mine the list of pharmaceutical
classes
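A minimal sketch of the two steps above, assuming the drug database has already been reduced to two lookup tables. Both dictionaries are small hypothetical stand-ins, not an extract of the FDA data, and the function names are illustrative.

```python
# Hypothetical lookup tables standing in for a drug database.
BRAND_TO_GENERIC = {
    "glucophage": "metformin",
    "lipitor": "atorvastatin",
    "zocor": "simvastatin",
}
GENERIC_TO_CLASS = {
    "metformin": "biguanide",
    "atorvastatin": "hmg-coa reductase inhibitor",
    "simvastatin": "hmg-coa reductase inhibitor",
}

def normalize_drug(name):
    """Replace a brand name with its generic name; unknown names pass through."""
    key = name.strip().lower()
    return BRAND_TO_GENERIC.get(key, key)

def class_tokens(name):
    """Split the pharmaceutical class of a drug into tokens for text mining."""
    return GENERIC_TO_CLASS.get(normalize_drug(name), "").split()
```

Two different brands in the same class then share tokens, which is what lets a model treat them as similar.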

But What About Relational Databases?
● Challenge: many records per patient
○ 9,948 patients
○ 196,290 transcripts
○ 142,741 diagnostics
○ 66,487 medications
○ 3,030 lab results

Deal with one to many relationships
● Create predictive fields using summary
statistics
○ e.g. averages of last 24 hours / week /
month / year
○ e.g. variance or standard deviation
○ e.g. entropy
○ e.g. maximum or minimum values
○ e.g. counts
○ e.g. most frequent value
○ e.g. sequences of events
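Several of the summary statistics listed above can be computed per patient as in this sketch. The rows, the column meanings and the feature names are made up for illustration.

```python
import statistics
from collections import Counter, defaultdict

# Made-up transactional rows: (patient id, visit specialty, systolic blood pressure).
visits = [
    ("p1", "general", 120), ("p1", "cardiology", 140), ("p1", "general", 130),
    ("p2", "general", 118),
]

def summarize(rows):
    """Collapse many rows per patient into one feature dict per patient."""
    by_patient = defaultdict(list)
    for patient_id, specialty, value in rows:
        by_patient[patient_id].append((specialty, value))
    features = {}
    for patient_id, events in by_patient.items():
        values = [value for _, value in events]
        specialties = [specialty for specialty, _ in events]
        features[patient_id] = {
            "count": len(values),
            "mean": statistics.mean(values),
            # Population standard deviation is defined even for a single row.
            "std": statistics.pstdev(values),
            "min": min(values),
            "max": max(values),
            "most_frequent": Counter(specialties).most_common(1)[0][0],
        }
    return features

features = summarize(visits)
```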

Case Study: Relationship Between a Patient
and Diagnosis
● One patient to many diagnoses
● Uniquely joined via PatientGuid

Case Study: Build sequences string
Feature Engineering the ICD9 Codes
● One patient can have from 1 to 75 diagnoses
● We need to compress this data to a single record
per patient
● One way is to concatenate the sequence of
diagnoses into a string and build text mining models
that use n-grams on that string
● It often helps to remove consecutive duplicates
● Sometimes it is useful to know the first and last
events, and the times of those events
Patient ID  ICD9Code
12345       391
12345       401
12345       401
12345       454.1
12346       410.3
12346       463
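The sequence recipe above can be sketched on the rows from the table. This assumes the rows are already in chronological order for each patient; the function names are illustrative.

```python
from itertools import groupby

# Rows from the table above: (patient id, ICD9 code), assumed chronological.
diagnoses = [
    ("12345", "391"), ("12345", "401"), ("12345", "401"), ("12345", "454.1"),
    ("12346", "410.3"), ("12346", "463"),
]

def drop_consecutive_duplicates(codes):
    """Collapse runs of the same code, e.g. 401 401 becomes 401."""
    return [code for code, _ in groupby(codes)]

def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sequence_features(rows, n=2):
    """Build one record per patient: sequence string, n-grams, first and last code."""
    sequences = {}
    for patient_id, code in rows:
        sequences.setdefault(patient_id, []).append(code)
    features = {}
    for patient_id, codes in sequences.items():
        deduped = drop_consecutive_duplicates(codes)
        features[patient_id] = {
            "sequence": " ".join(deduped),
            "ngrams": ngrams(deduped, n),
            "first": codes[0],
            "last": codes[-1],
        }
    return features

features = sequence_features(diagnoses)
```

The sequence string can then be passed to any text vectorizer, so that each n-gram becomes a column in the flat file.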

Case Study: Count and entropy
Feature Engineering the ICD9 Codes
per patient
● Sometimes it is useful to know
○ The number of events
○ The most common event type
○ The level of variety of event types, e.g. the
entropy (as my colleague Owen Zhang did
for the KDD Cup 2015)
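Patient ID  ICD9Code
12345       391
12345       401
12345       401
12345       454.1
12346       410.3
12346       463
Entropy term for one code: -propn * ln(propn)
e.g. code 391 for patient 12345: -0.25 * ln(0.25) = 0.347
The patient's entropy is the sum of these terms over the distinct codes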
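The three statistics above can be computed together as in this sketch, using the natural logarithm as in the worked term. The function name is illustrative.

```python
import math
from collections import Counter

def event_stats(codes):
    """Number of events, most common event type and entropy of the event types."""
    counts = Counter(codes)
    total = len(codes)
    # Sum of -p * ln(p) over the distinct codes, where p is the share of each code.
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return {
        "n_events": total,
        "most_common": counts.most_common(1)[0][0],
        "entropy": entropy,
    }

# Patient 12345 from the table: shares are 0.25, 0.5 and 0.25.
stats = event_stats(["391", "401", "401", "454.1"])
```

For this patient the three terms sum to about 1.04, while a patient whose diagnoses all share one code has an entropy of 0.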

Case Study: Timing stats
Feature Engineering the Timing of Diagnoses
per patient
● Sometimes it is useful to know information about
the timing of events
○ The range of event times e.g. mean, median,
maximum, minimum, quantiles
○ The amount of time between events e.g. mean,
median, maximum, minimum, quantiles
○ The regularity of the timing of events e.g.
variance
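A sketch of timing features for one patient follows. The dates are made up and the feature names are illustrative; quantiles could be added in the same way.

```python
import statistics
from datetime import date

def timing_features(dates):
    """Summarise when events happened and how regularly they were spaced."""
    ordered = sorted(dates)
    # Days between consecutive events.
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return {
        "first": ordered[0],
        "last": ordered[-1],
        "span_days": (ordered[-1] - ordered[0]).days,
        "mean_gap_days": statistics.mean(gaps) if gaps else None,
        "max_gap_days": max(gaps) if gaps else None,
        # Low variance of the gaps indicates regular visits.
        "gap_variance": statistics.pvariance(gaps) if gaps else None,
    }

features = timing_features([date(2011, 1, 10), date(2011, 3, 1), date(2011, 1, 20)])
```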

Example 2: predicting MOOC dropout
● Hosted by XuetangX, a Chinese MOOC learning platform initiated by Tsinghua
University
● Challenge: predict whether a user will drop a course within the next 10 days based on his or
her prior activities
● Data:
○ enrollment_train (120K rows) / enrollment_test (80K rows)
columns: enrollment_id, username, course_id
○ log_train / log_test
columns: enrollment_id, time, source, event, object
○ object
columns: course_id, module_id, category, children, start
○ truth_train
columns: enrollment_id, dropped_out

We applied the same recipes to the log data (5890 objects)
and generated a flat file with 100s of features!

Techniques we used in
… to describe courses, enrollments and students from log
data:
● counts
● time statistics (min, mean, max, diff)
● entropy
● sequences treated as text, on which we ran
SVD and logistic regression on 3-grams
● first 20 components of SVD on the user x object matrix
More can be found at http://www.slideshare.net/DataRobot/featurizing-log-data-before-xgboost
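As a rough illustration of the counts and time statistics applied to log data, the sketch below builds one feature row per enrollment from rows shaped like the log_train columns listed for Example 2. The rows, the timestamp format and the feature names are assumptions, not the features actually used in the competition.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Made-up rows: (enrollment_id, time, source, event, object).
log_rows = [
    (1, "2014-06-14T09:38:29", "server", "navigate", "obj_a"),
    (1, "2014-06-14T09:40:02", "browser", "video", "obj_b"),
    (1, "2014-06-19T18:05:41", "browser", "video", "obj_b"),
    (2, "2014-06-15T11:00:00", "server", "access", "obj_a"),
]

def enrollment_features(rows, time_format="%Y-%m-%dT%H:%M:%S"):
    """Collapse log rows into one feature dict per enrollment_id."""
    grouped = defaultdict(list)
    for enrollment_id, time, _source, event, obj in rows:
        grouped[enrollment_id].append((datetime.strptime(time, time_format), event, obj))
    features = {}
    for enrollment_id, events in grouped.items():
        times = sorted(t for t, _, _ in events)
        row = {
            "n_events": len(events),
            "n_objects": len({obj for _, _, obj in events}),
            "n_active_days": len({t.date() for t in times}),
            "span_seconds": (times[-1] - times[0]).total_seconds(),
        }
        # One count column per event type seen for this enrollment.
        event_counts = Counter(event for _, event, _ in events)
        row.update({f"count_{event}": n for event, n in event_counts.items()})
        features[enrollment_id] = row
    return features

features = enrollment_features(log_rows)
```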

Key takeaways
Machine Learning (ML) can automatically generate world-class
predictive accuracy
But feature engineering is still an art that requires a lot of creativity,
business insight, curiosity and effort
Be careful! An infinite number of features can be generated. Start with
winning recipes (steal them from others and make up your own),
then iterate with new recipes, ideas and external data. Stop when
you no longer get much additional accuracy
