Make Sense Out of Data
with Feature Engineering
Xavier Conort
Chief Data Scientist at DataRobot
@Melbourne Data Science Initiative 2016
Agenda
Preamble
2 examples:
Key takeaways
Automation is an integral part of human civilization
Key elements for a successful car journey
Car
Destination
Crude oil
Refined oil: process oil into more useful products such as gasoline
A successful journey
Car = Modelling engine
Machine learning solutions are increasingly replacing traditional statistical
approaches; they can automate the modelling process and produce
world-class predictive accuracy without much effort
Destination = Outcome
a well-defined outcome to predict and a well-defined
process for using the predictions to address business problems
Crude Oil = Raw Data
increased data volume, and the capacity to handle
terabytes of data
Refined oil = Feature Engineering
the talent to extract from raw data
the information that models can use
open source programming
social network of coders
automated solutions
Key elements for a successful data science journey
Refined oil for Machine Learning:
Flat File Data Format
© DataRobot, Inc. All rights reserved. Confidential
● 1 record per prediction event
● 1 column for each predictive field / feature
● 1 column for the value to be predicted
(training data only)
ID   Feature 1   Feature 2   Feature 3   Feature 4   Feature 5   Target
1    0.73        Female      5.28        Thursday    37.52       Yes
2    0.20        Male        4.20        Tuesday     35.04       Yes
3    1.82        Male        14.71       Friday      7.02        Yes
4    -0.69       Female      11.82       Sunday      -3.29       No
5    1.07        Male        16.55       Monday      12.59       Yes
6    -0.27       Male        10.87       Thursday    -8.19       Yes
7    2.88        Male        5.24        Wednesday   -21.67      No
8    1.35        Female      9.40        Tuesday     9.70        Yes
9    0.73        Female      1.04        Sunday      26.60       Yes
10   0.02        Female      -9.79       Saturday    -14.47      Yes
11   3.43        Male        11.59       Thursday    27.48       No
12   2.56        Female      -13.25      Saturday    12.41       No
Feature Engineering that we will cover today
● Variables you should not use
● Dealing with high dimensional features
● Using external data to add valuable information
● Dealing with transactional data
Example 1:
● Hosted by Practice Fusion, a cloud-based electronic health record
platform for doctors and patients
● Challenge: Given a de-identified data set of patient electronic health
records, build a model to determine who has a diabetes diagnosis
● Data:
○ 17 tables containing 4 years of medical record history
Think of variables you should not use
Feature Engineering the YearOfBirth Value
● We expect that as a patient gets older their risk of
diabetes will increase, yet their YearOfBirth value
remains static
● We need a feature that changes as the patient gets
older
● The true predictor of diabetes is more likely to be age
than year of birth
● The data is extracted at the end of 2012
● Age = 2012 - YearOfBirth
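The age derivation above can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the patient records are made up, and only the YearOfBirth and PatientGuid column names and the 2012 extraction year come from the slides.

```python
# Extraction year stated in the slides: data extracted at the end of 2012.
EXTRACTION_YEAR = 2012


def age_at_extraction(year_of_birth, extraction_year=EXTRACTION_YEAR):
    """Return the patient's approximate age in years at extraction time."""
    return extraction_year - year_of_birth


# Hypothetical patient records, for illustration only.
patients = [
    {"PatientGuid": "A", "YearOfBirth": 1950},
    {"PatientGuid": "B", "YearOfBirth": 1987},
]

# Add the derived Age feature next to the raw YearOfBirth value.
for patient in patients:
    patient["Age"] = age_at_extraction(patient["YearOfBirth"])
```

If the same model were later scored on data extracted in another year, the extraction year would need to be passed in explicitly rather than left at 2012.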
Learn to deal with high dimensionality
● Add characteristics of levels of categorical
features to see how similar they are
● Use location for regional categories to see how
close they are
● Group hierarchical categories together based
on those hierarchies
● Text mine the descriptions
● Use the overall ordinal frequency ranking as a
feature
● Use likelihood / credibility features, as top Kagglers do
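Two of the ideas in this list, frequency ranking and likelihood / credibility features, can be sketched as follows. This is one common formulation and not necessarily the one used in the talk: the smoothing strength k, the function names, and the toy data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def frequency_rank(values):
    """Map each category to its rank by frequency (1 = most frequent)."""
    counts = Counter(values)
    # Ties are broken alphabetically so the ranking is deterministic.
    ordered = sorted(counts, key=lambda v: (-counts[v], str(v)))
    return {v: i + 1 for i, v in enumerate(ordered)}


def credibility_encoding(values, targets, k=5.0):
    """Shrink each category's mean target towards the global mean.

    k is a hypothetical smoothing strength: a category seen n times puts
    weight n / (n + k) on its own mean and k / (n + k) on the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), Counter()
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    return {v: (sums[v] + k * global_mean) / (counts[v] + k) for v in counts}


# Toy data: a categorical feature and a binary target.
states = ["CA", "CA", "CA", "NY", "NY", "TX"]
has_diabetes = [1, 0, 1, 0, 0, 1]

ranks = frequency_rank(states)
encoded = credibility_encoding(states, has_diabetes, k=2.0)
```

Because this encoding uses the target, it is usually computed out-of-fold (on rows other than the one being encoded) to avoid leaking the target into the feature.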
Case study: engineer state variable
● Use the latitude and longitude of the centroid of each of the 51 states
http://dev.maxmind.com/geoip/legacy/codes/state_latlon/
○ So that the machine learning algorithms can
know which states are near (and possibly
similar to) each other
Case study: engineer diagnosis
● Use the ICD9 code groupings
https://en.wikipedia.org/wiki/List_of_ICD-9_codes
○ So that the machine learning algorithms
can know which diagnoses are similar to
each other
○ Count the observations in each group
● Use the ICD9 descriptions
○ Do text mining on the descriptions to
find words or phrases within the
descriptions
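The grouping idea can be sketched by mapping each ICD-9 code to its chapter and counting observations per chapter. The chapter ranges follow the standard ICD-9 numeric ranges; the abbreviated chapter names and the function name are choices made here for illustration, not part of the talk.

```python
from collections import Counter

# ICD-9 chapters for numeric codes, as (inclusive upper bound, short name).
ICD9_CHAPTERS = [
    (139, "infectious and parasitic"),
    (239, "neoplasms"),
    (279, "endocrine, nutritional, metabolic, immunity"),
    (289, "blood"),
    (319, "mental disorders"),
    (389, "nervous system and sense organs"),
    (459, "circulatory"),
    (519, "respiratory"),
    (579, "digestive"),
    (629, "genitourinary"),
    (679, "pregnancy and childbirth"),
    (709, "skin"),
    (739, "musculoskeletal"),
    (759, "congenital anomalies"),
    (779, "perinatal conditions"),
    (799, "symptoms and ill-defined conditions"),
    (999, "injury and poisoning"),
]


def icd9_chapter(code):
    """Return the chapter name for an ICD-9 code given as a string."""
    code = code.strip().upper()
    if code.startswith(("E", "V")):
        return "supplementary (E/V codes)"
    major = int(code.split(".")[0])  # "454.1" -> 454
    for upper, name in ICD9_CHAPTERS:
        if major <= upper:
            return name
    raise ValueError(f"unrecognized ICD-9 code: {code}")


# Count the observations in each group for one patient's diagnoses.
group_counts = Counter(icd9_chapter(c) for c in ["410.3", "463"])
```

The per-chapter counts then become columns in the flat file, one column per chapter.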
Case Study: engineer drugs
● Use drug databases
http://www.fda.gov/drugs/informationondrugs/ucm142438.htm
● To enable the machine learning algorithm
to know which drugs are similar:
○ Replace proprietary brand names with
generic medication names
○ Text mine the list of pharmaceutical
classes
But What About Relational Databases?
Challenge: many records per patient
○ 9,948 patients
○ 196,290 transcripts
○ 142,741 diagnoses
○ 66,487 medications
○ 3,030 lab results
Deal with one to many relationships
● Create predictive fields using summary
statistics
○ e.g. averages over the last 24 hours / week /
month / year
○ e.g. variance or standard deviation
○ e.g. entropy
○ e.g. maximum or minimum values
○ e.g. counts
○ e.g. most frequent value
○ e.g. sequences of events
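A minimal sketch of collapsing a one-to-many table into one row per entity with summary statistics. The lab values and the "value" field are hypothetical; only the PatientGuid key name comes from the slides.

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarize(records, key, value):
    """Aggregate many records per key into one dict of summary statistics."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[key]].append(record[value])
    return {
        k: {
            "count": len(v),
            "mean": mean(v),
            "std": pstdev(v),  # population standard deviation
            "min": min(v),
            "max": max(v),
        }
        for k, v in grouped.items()
    }


# Hypothetical lab results: several rows per patient.
lab_results = [
    {"PatientGuid": "A", "value": 5.0},
    {"PatientGuid": "A", "value": 7.0},
    {"PatientGuid": "A", "value": 9.0},
    {"PatientGuid": "B", "value": 4.0},
]

flat = summarize(lab_results, "PatientGuid", "value")
```

Each key of `flat` corresponds to one row of the flat file, and each statistic to one feature column.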
Case Study: Relationship Between a Patient
and Diagnosis
● One patient to many diagnoses
● Uniquely joined via PatientGuid
Case Study: Build sequences string
Feature Engineering the ICD9 Codes
● One patient can have from 1 to 75 diagnoses
● We need to compress this data to a single record
per patient
● One way is to concatenate the sequence of
diagnoses into a string and fit text-mining models
that use n-grams on that string
● It often helps to remove consecutive duplicates
● Sometimes it is useful to know the first and last
events, and the times of those events
Patient ID ICD9Code
12345 391
12345 401
12345 401
12345 454.1
12346 410.3
12346 463
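The sequence-string recipe can be sketched on the table above. This assumes the rows are already in chronological order for each patient (the table shows no date column); the function names are illustrative.

```python
from itertools import groupby


def diagnosis_sequences(rows):
    """Collect (patient_id, code) rows, in the order given, per patient."""
    sequences = {}
    for patient_id, code in rows:
        sequences.setdefault(patient_id, []).append(code)
    return sequences


def drop_consecutive_duplicates(sequence):
    """Keep one copy of each run of identical consecutive codes."""
    return [code for code, _ in groupby(sequence)]


def ngrams(sequence, n):
    """Return the n-grams of a sequence as space-joined strings."""
    return [" ".join(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


rows = [
    ("12345", "391"), ("12345", "401"), ("12345", "401"), ("12345", "454.1"),
    ("12346", "410.3"), ("12346", "463"),
]

sequences = {
    pid: drop_consecutive_duplicates(seq)
    for pid, seq in diagnosis_sequences(rows).items()
}
# One string per patient, ready for a text-mining model.
text = {pid: " ".join(seq) for pid, seq in sequences.items()}
bigrams = ngrams(sequences["12345"], 2)
```

The first and last elements of each sequence give the first and last events mentioned in the bullets above.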
Case Study: Count and entropy
Feature Engineering the ICD9 Codes
● One patient can have from 1 to 75 diagnoses
● We need to compress this data to a single record
per patient
● Sometimes it is useful to know
○ The number of events
○ The most common event type
○ The level of variety of event types e.g. the
entropy (as my colleague Owen Zhang did
for the KDD Cup 2015)
Patient ID ICD9Code
12345 391
12345 401
12345 401
12345 454.1
12346 410.3
12346 463
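Entropy contribution of one code with proportion p
(e.g. a code that accounts for 1 of a patient's 4 diagnoses):
-p * ln(p) = -0.25 * ln(0.25) ≈ 0.347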
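Summing that contribution over all distinct codes gives the patient-level entropy. The sketch below computes the three features from this slide for the patients in the table; the function name and output keys are illustrative.

```python
from collections import Counter
from math import log


def event_summary(codes):
    """Return the count, most common code and entropy of a list of codes."""
    counts = Counter(codes)
    total = len(codes)
    # Entropy = sum over distinct codes of -p * ln(p), with p = count / total.
    entropy = -sum((c / total) * log(c / total) for c in counts.values())
    return {
        "count": total,
        "most_common": counts.most_common(1)[0][0],
        "entropy": entropy,
    }


patient_12345 = event_summary(["391", "401", "401", "454.1"])
patient_12346 = event_summary(["410.3", "463"])
```

For patient 12345 the proportions are 0.25, 0.5 and 0.25, so the entropy is about 1.04; for patient 12346, with two equally frequent codes, it is ln(2), about 0.69.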
Case Study: Timing stats
Feature Engineering the Timing of Diagnoses
● One patient can have from 1 to 75 diagnoses
● We need to compress this data to a single record
per patient
● Sometimes it is useful to know information about
the timing of events
○ The range of event times e.g. mean, median,
maximum, minimum, quantiles
○ The amount of time between events e.g. mean,
median, maximum, minimum, quantiles
○ The regularity of the timing of events e.g.
variance
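A sketch of timing features for one patient's events. The visit dates and the reference date are made up for illustration, and the feature names are assumptions; the statistics themselves are the ones listed above.

```python
from datetime import date
from statistics import mean, median, pvariance


def timing_features(event_dates, reference_date):
    """Summarize when events happened and how regularly they occurred."""
    dates = sorted(event_dates)
    # Days between consecutive events.
    gaps = [(b - a).days for a, b in zip(dates, dates[1:])]
    features = {
        "days_since_first": (reference_date - dates[0]).days,
        "days_since_last": (reference_date - dates[-1]).days,
        "span_days": (dates[-1] - dates[0]).days,
    }
    if gaps:  # gap statistics need at least two events
        features.update(
            gap_mean=mean(gaps),
            gap_median=median(gaps),
            gap_min=min(gaps),
            gap_max=max(gaps),
            gap_variance=pvariance(gaps),  # regularity of the timing
        )
    return features


# Hypothetical diagnosis dates for one patient.
visits = [date(2012, 1, 10), date(2012, 1, 20), date(2012, 3, 1)]
features = timing_features(visits, reference_date=date(2012, 12, 31))
```

A patient with a single event gets only the first three features, so the missing gap statistics would need to be filled or left as missing values in the flat file.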
Example 2:
● Hosted by XuetangX, a Chinese MOOC learning platform initiated by Tsinghua
University
● Challenge: predict whether a user will drop a course within the next 10 days based on his or
her prior activities
● Data:
○ enrollment_train (120K rows) / enrollment_test (80K rows)
Columns: enrollment_id, username, course_id
○ log_train / log_test
Columns: enrollment_id, time, source, event, object
○ object
Columns: course_id, module_id, category, children, start
○ truth_train
Columns: enrollment_id, dropped_out
We applied the same recipes to log data (5,890 objects)
and generated a flat file with hundreds of features!
Techniques we used in
… to describe course, enrollment and students from log
data:
● counts
● time statistics (min, mean, max, diff)
● entropy
● sequences treated as text, on which we ran
SVD and logistic regression on 3-grams
● first 20 components of an SVD on the user x object matrix
More can be found in http://www.slideshare.net/DataRobot/featurizing-log-data-before-xgboost
Key takeaways
● Machine Learning (ML) can automatically generate world-class
predictive accuracy
● But feature engineering is still an art that requires a lot of creativity,
business insight, curiosity and effort
● Be careful! An infinite number of features can be generated. Start with
winning recipes (steal them from others and make up your own),
then iterate with new recipes, ideas and external data. Stop when
you no longer get much additional accuracy
Ad

More Related Content

What's hot (20)

10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions
DataRobot
 
Machine Learning for Dummies
Machine Learning for DummiesMachine Learning for Dummies
Machine Learning for Dummies
Venkata Reddy Konasani
 
kaggle_meet_up
kaggle_meet_upkaggle_meet_up
kaggle_meet_up
Marios Michailidis
 
General Tips for participating Kaggle Competitions
General Tips for participating Kaggle CompetitionsGeneral Tips for participating Kaggle Competitions
General Tips for participating Kaggle Competitions
Mark Peng
 
Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018
HJ van Veen
 
L11. The Future of Machine Learning
L11. The Future of Machine LearningL11. The Future of Machine Learning
L11. The Future of Machine Learning
Machine Learning Valencia
 
Introduction to Machine learning
Introduction to Machine learningIntroduction to Machine learning
Introduction to Machine learning
Knoldus Inc.
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
HJ van Veen
 
Machine learning_ Replicating Human Brain
Machine learning_ Replicating Human BrainMachine learning_ Replicating Human Brain
Machine learning_ Replicating Human Brain
Nishant Jain
 
Online Machine Learning: introduction and examples
Online Machine Learning:  introduction and examplesOnline Machine Learning:  introduction and examples
Online Machine Learning: introduction and examples
Felipe
 
H2O World - Ensembles with Erin LeDell
H2O World - Ensembles with Erin LeDellH2O World - Ensembles with Erin LeDell
H2O World - Ensembles with Erin LeDell
Sri Ambati
 
A Friendly Introduction to Machine Learning
A Friendly Introduction to Machine LearningA Friendly Introduction to Machine Learning
A Friendly Introduction to Machine Learning
Haptik
 
Machine Learning presentation.
Machine Learning presentation.Machine Learning presentation.
Machine Learning presentation.
butest
 
machine learning
machine learningmachine learning
machine learning
Mounisha A
 
Feature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive models
Gabriel Moreira
 
PyData Global: Thrifty Machine Learning
PyData Global: Thrifty Machine LearningPyData Global: Thrifty Machine Learning
PyData Global: Thrifty Machine Learning
Rebecca Bilbro
 
The Incredible Disappearing Data Scientist
The Incredible Disappearing Data ScientistThe Incredible Disappearing Data Scientist
The Incredible Disappearing Data Scientist
Rebecca Bilbro
 
Session 06 machine learning.pptx
Session 06 machine learning.pptxSession 06 machine learning.pptx
Session 06 machine learning.pptx
bodaceacat
 
Machine learning the next revolution or just another hype
Machine learning   the next revolution or just another hypeMachine learning   the next revolution or just another hype
Machine learning the next revolution or just another hype
Jorge Ferrer
 
Predicting Tweet Sentiment
Predicting Tweet SentimentPredicting Tweet Sentiment
Predicting Tweet Sentiment
Lucinda Linde
 
10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions
DataRobot
 
General Tips for participating Kaggle Competitions
General Tips for participating Kaggle CompetitionsGeneral Tips for participating Kaggle Competitions
General Tips for participating Kaggle Competitions
Mark Peng
 
Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018
HJ van Veen
 
Introduction to Machine learning
Introduction to Machine learningIntroduction to Machine learning
Introduction to Machine learning
Knoldus Inc.
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
HJ van Veen
 
Machine learning_ Replicating Human Brain
Machine learning_ Replicating Human BrainMachine learning_ Replicating Human Brain
Machine learning_ Replicating Human Brain
Nishant Jain
 
Online Machine Learning: introduction and examples
Online Machine Learning:  introduction and examplesOnline Machine Learning:  introduction and examples
Online Machine Learning: introduction and examples
Felipe
 
H2O World - Ensembles with Erin LeDell
H2O World - Ensembles with Erin LeDellH2O World - Ensembles with Erin LeDell
H2O World - Ensembles with Erin LeDell
Sri Ambati
 
A Friendly Introduction to Machine Learning
A Friendly Introduction to Machine LearningA Friendly Introduction to Machine Learning
A Friendly Introduction to Machine Learning
Haptik
 
Machine Learning presentation.
Machine Learning presentation.Machine Learning presentation.
Machine Learning presentation.
butest
 
machine learning
machine learningmachine learning
machine learning
Mounisha A
 
Feature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive models
Gabriel Moreira
 
PyData Global: Thrifty Machine Learning
PyData Global: Thrifty Machine LearningPyData Global: Thrifty Machine Learning
PyData Global: Thrifty Machine Learning
Rebecca Bilbro
 
The Incredible Disappearing Data Scientist
The Incredible Disappearing Data ScientistThe Incredible Disappearing Data Scientist
The Incredible Disappearing Data Scientist
Rebecca Bilbro
 
Session 06 machine learning.pptx
Session 06 machine learning.pptxSession 06 machine learning.pptx
Session 06 machine learning.pptx
bodaceacat
 
Machine learning the next revolution or just another hype
Machine learning   the next revolution or just another hypeMachine learning   the next revolution or just another hype
Machine learning the next revolution or just another hype
Jorge Ferrer
 
Predicting Tweet Sentiment
Predicting Tweet SentimentPredicting Tweet Sentiment
Predicting Tweet Sentiment
Lucinda Linde
 

Viewers also liked (20)

HackerEarth helping a startup hire developers - The Practo Case Study
HackerEarth helping a startup hire developers - The Practo Case StudyHackerEarth helping a startup hire developers - The Practo Case Study
HackerEarth helping a startup hire developers - The Practo Case Study
HackerEarth
 
How hackathons can drive top line revenue growth
How hackathons can drive top line revenue growthHow hackathons can drive top line revenue growth
How hackathons can drive top line revenue growth
HackerEarth
 
How to assess & hire Java developers accurately?
How to assess & hire Java developers accurately?How to assess & hire Java developers accurately?
How to assess & hire Java developers accurately?
HackerEarth
 
DataRobot R Package
DataRobot R PackageDataRobot R Package
DataRobot R Package
DataRobot
 
Open Innovation - A Case Study
Open Innovation - A Case StudyOpen Innovation - A Case Study
Open Innovation - A Case Study
HackerEarth
 
Tda presentation
Tda presentationTda presentation
Tda presentation
HJ van Veen
 
HackerEarth Sourcing Solution
HackerEarth Sourcing SolutionHackerEarth Sourcing Solution
HackerEarth Sourcing Solution
HackerEarth
 
Data science at the command line
Data science at the command lineData science at the command line
Data science at the command line
Sharat Chikkerur
 
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Spark Summit
 
Intra company hackathons using HackerEarth
Intra company hackathons using HackerEarthIntra company hackathons using HackerEarth
Intra company hackathons using HackerEarth
HackerEarth
 
Druva Casestudy - HackerEarth
Druva Casestudy - HackerEarthDruva Casestudy - HackerEarth
Druva Casestudy - HackerEarth
HackerEarth
 
USC LIGHT Ministry Introduction
USC LIGHT Ministry IntroductionUSC LIGHT Ministry Introduction
USC LIGHT Ministry Introduction
Jeong-Yoon Lee
 
Leverage Social Media for Employer Brand and Recruiting
Leverage Social Media for Employer Brand and RecruitingLeverage Social Media for Employer Brand and Recruiting
Leverage Social Media for Employer Brand and Recruiting
HackerEarth
 
Kill the wabbit
Kill the wabbitKill the wabbit
Kill the wabbit
Joe Kleinwaechter
 
Menstrual Health Reader - mEo
Menstrual Health Reader - mEoMenstrual Health Reader - mEo
Menstrual Health Reader - mEo
HackerEarth
 
6 rules of enterprise innovation
6 rules of enterprise innovation6 rules of enterprise innovation
6 rules of enterprise innovation
HackerEarth
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language Processing
Ted Xiao
 
Data Science Competition
Data Science CompetitionData Science Competition
Data Science Competition
Jeong-Yoon Lee
 
Wapid and wobust active online machine leawning with Vowpal Wabbit
Wapid and wobust active online machine leawning with Vowpal Wabbit Wapid and wobust active online machine leawning with Vowpal Wabbit
Wapid and wobust active online machine leawning with Vowpal Wabbit
Antti Haapala
 
HackerEarth helping a startup hire developers - The Practo Case Study
HackerEarth helping a startup hire developers - The Practo Case StudyHackerEarth helping a startup hire developers - The Practo Case Study
HackerEarth helping a startup hire developers - The Practo Case Study
HackerEarth
 
How hackathons can drive top line revenue growth
How hackathons can drive top line revenue growthHow hackathons can drive top line revenue growth
How hackathons can drive top line revenue growth
HackerEarth
 
How to assess & hire Java developers accurately?
How to assess & hire Java developers accurately?How to assess & hire Java developers accurately?
How to assess & hire Java developers accurately?
HackerEarth
 
DataRobot R Package
DataRobot R PackageDataRobot R Package
DataRobot R Package
DataRobot
 
Open Innovation - A Case Study
Open Innovation - A Case StudyOpen Innovation - A Case Study
Open Innovation - A Case Study
HackerEarth
 
Tda presentation
Tda presentationTda presentation
Tda presentation
HJ van Veen
 
HackerEarth Sourcing Solution
HackerEarth Sourcing SolutionHackerEarth Sourcing Solution
HackerEarth Sourcing Solution
HackerEarth
 
Data science at the command line
Data science at the command lineData science at the command line
Data science at the command line
Sharat Chikkerur
 
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Spark Summit
 
Intra company hackathons using HackerEarth
Intra company hackathons using HackerEarthIntra company hackathons using HackerEarth
Intra company hackathons using HackerEarth
HackerEarth
 
Druva Casestudy - HackerEarth
Druva Casestudy - HackerEarthDruva Casestudy - HackerEarth
Druva Casestudy - HackerEarth
HackerEarth
 
USC LIGHT Ministry Introduction
USC LIGHT Ministry IntroductionUSC LIGHT Ministry Introduction
USC LIGHT Ministry Introduction
Jeong-Yoon Lee
 
Leverage Social Media for Employer Brand and Recruiting
Leverage Social Media for Employer Brand and RecruitingLeverage Social Media for Employer Brand and Recruiting
Leverage Social Media for Employer Brand and Recruiting
HackerEarth
 
Menstrual Health Reader - mEo
Menstrual Health Reader - mEoMenstrual Health Reader - mEo
Menstrual Health Reader - mEo
HackerEarth
 
6 rules of enterprise innovation
6 rules of enterprise innovation6 rules of enterprise innovation
6 rules of enterprise innovation
HackerEarth
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language Processing
Ted Xiao
 
Data Science Competition
Data Science CompetitionData Science Competition
Data Science Competition
Jeong-Yoon Lee
 
Wapid and wobust active online machine leawning with Vowpal Wabbit
Wapid and wobust active online machine leawning with Vowpal Wabbit Wapid and wobust active online machine leawning with Vowpal Wabbit
Wapid and wobust active online machine leawning with Vowpal Wabbit
Antti Haapala
 
Ad

Similar to Make Sense Out of Data with Feature Engineering (20)

The Data Scientist’s Toolkit: Key Techniques for Extracting Value
The Data Scientist’s Toolkit: Key Techniques for Extracting ValueThe Data Scientist’s Toolkit: Key Techniques for Extracting Value
The Data Scientist’s Toolkit: Key Techniques for Extracting Value
pallavichauhan2525
 
Building successful and secure products with AI and ML
Building successful and secure products with AI and MLBuilding successful and secure products with AI and ML
Building successful and secure products with AI and ML
Simon Lia-Jonassen
 
How can a data scientist expert solve real world problems?
How can a data scientist expert solve real world problems? How can a data scientist expert solve real world problems?
How can a data scientist expert solve real world problems?
priyanka rajput
 
DATI, AI E ROBOTICA @POLITO
DATI, AI E ROBOTICA @POLITODATI, AI E ROBOTICA @POLITO
DATI, AI E ROBOTICA @POLITO
MarcoMellia
 
FlorenceAI: Reinventing Data Science at Humana
FlorenceAI: Reinventing Data Science at HumanaFlorenceAI: Reinventing Data Science at Humana
FlorenceAI: Reinventing Data Science at Humana
Databricks
 
Data Science and Analysis.pptx
Data Science and Analysis.pptxData Science and Analysis.pptx
Data Science and Analysis.pptx
PrashantYadav931011
 
CTMS Data Migration by Krishnaveni Rapuru
CTMS Data Migration  by Krishnaveni RapuruCTMS Data Migration  by Krishnaveni Rapuru
CTMS Data Migration by Krishnaveni Rapuru
MuraliRaj M
 
Training and deploying an image classification model
Training and deploying an image classification modelTraining and deploying an image classification model
Training and deploying an image classification model
Knoldus Inc.
 
Maximize Your Understanding of Operational Realities in Manufacturing with Pr...
Maximize Your Understanding of Operational Realities in Manufacturing with Pr...Maximize Your Understanding of Operational Realities in Manufacturing with Pr...
Maximize Your Understanding of Operational Realities in Manufacturing with Pr...
Bigfinite
 
Applying Capability Modelling in the Genomics Diagnosis Domain: Lessons Learned
Applying Capability Modelling in the Genomics Diagnosis Domain: Lessons Learned Applying Capability Modelling in the Genomics Diagnosis Domain: Lessons Learned
Applying Capability Modelling in the Genomics Diagnosis Domain: Lessons Learned
CaaS EU FP7 Project
 
20150814 Wrangling Data From Raw to Tidy vs
20150814 Wrangling Data From Raw to Tidy vs20150814 Wrangling Data From Raw to Tidy vs
20150814 Wrangling Data From Raw to Tidy vs
Ian Feller
 
Hybrid approach for safer medical diagnosis using Privacy preserving Machine ...
Hybrid approach for safer medical diagnosis using Privacy preserving Machine ...Hybrid approach for safer medical diagnosis using Privacy preserving Machine ...
Hybrid approach for safer medical diagnosis using Privacy preserving Machine ...
dwaynedogra
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data Science
Pouria Amirian
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data Science
Pouria Amirian
 
Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Predictive Model and Record Description with Segmented Sensitivity Analysis (...Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Greg Makowski
 
Comparison Between WEKA and Salford System in Data Mining Software
Comparison Between WEKA and Salford System in Data Mining SoftwareComparison Between WEKA and Salford System in Data Mining Software
Comparison Between WEKA and Salford System in Data Mining Software
Universitas Pembangunan Panca Budi
 
Predicting Medical Test Results using Driverless AI
Predicting Medical Test Results using Driverless AIPredicting Medical Test Results using Driverless AI
Predicting Medical Test Results using Driverless AI
Sri Ambati
 
Data Analytics Using R - Report
Data Analytics Using R - ReportData Analytics Using R - Report
Data Analytics Using R - Report
Akanksha Gohil
 
Philly ETE 2016: Securing Software by Construction
Philly ETE 2016: Securing Software by ConstructionPhilly ETE 2016: Securing Software by Construction
Philly ETE 2016: Securing Software by Construction
jxyz
 
Data science guide
Data science guideData science guide
Data science guide
gokulprasath06
 
The Data Scientist’s Toolkit: Key Techniques for Extracting Value
The Data Scientist’s Toolkit: Key Techniques for Extracting ValueThe Data Scientist’s Toolkit: Key Techniques for Extracting Value
The Data Scientist’s Toolkit: Key Techniques for Extracting Value
pallavichauhan2525
 
Building successful and secure products with AI and ML
Building successful and secure products with AI and MLBuilding successful and secure products with AI and ML
Building successful and secure products with AI and ML
Simon Lia-Jonassen
 
How can a data scientist expert solve real world problems?
How can a data scientist expert solve real world problems? How can a data scientist expert solve real world problems?
How can a data scientist expert solve real world problems?
priyanka rajput
 
DATI, AI E ROBOTICA @POLITO
DATI, AI E ROBOTICA @POLITODATI, AI E ROBOTICA @POLITO
DATI, AI E ROBOTICA @POLITO
MarcoMellia
 
FlorenceAI: Reinventing Data Science at Humana
FlorenceAI: Reinventing Data Science at HumanaFlorenceAI: Reinventing Data Science at Humana
FlorenceAI: Reinventing Data Science at Humana
Databricks
 
CTMS Data Migration by Krishnaveni Rapuru
CTMS Data Migration  by Krishnaveni RapuruCTMS Data Migration  by Krishnaveni Rapuru
CTMS Data Migration by Krishnaveni Rapuru
MuraliRaj M
 
Training and deploying an image classification model
Training and deploying an image classification modelTraining and deploying an image classification model
Training and deploying an image classification model
Knoldus Inc.
 
Maximize Your Understanding of Operational Realities in Manufacturing with Pr...
Maximize Your Understanding of Operational Realities in Manufacturing with Pr...Maximize Your Understanding of Operational Realities in Manufacturing with Pr...
Maximize Your Understanding of Operational Realities in Manufacturing with Pr...
Bigfinite
 
Applying Capability Modelling in the Genomics Diagnosis Domain: Lessons Learned
Applying Capability Modelling in the Genomics Diagnosis Domain: Lessons Learned Applying Capability Modelling in the Genomics Diagnosis Domain: Lessons Learned
Applying Capability Modelling in the Genomics Diagnosis Domain: Lessons Learned
CaaS EU FP7 Project
 
20150814 Wrangling Data From Raw to Tidy vs
20150814 Wrangling Data From Raw to Tidy vs20150814 Wrangling Data From Raw to Tidy vs
20150814 Wrangling Data From Raw to Tidy vs
Ian Feller
 
Hybrid approach for safer medical diagnosis using Privacy preserving Machine ...
Hybrid approach for safer medical diagnosis using Privacy preserving Machine ...Hybrid approach for safer medical diagnosis using Privacy preserving Machine ...
Hybrid approach for safer medical diagnosis using Privacy preserving Machine ...
dwaynedogra
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data Science
Pouria Amirian
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data Science
Pouria Amirian
 
Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Predictive Model and Record Description with Segmented Sensitivity Analysis (...Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Greg Makowski
 
Comparison Between WEKA and Salford System in Data Mining Software
Comparison Between WEKA and Salford System in Data Mining SoftwareComparison Between WEKA and Salford System in Data Mining Software
Comparison Between WEKA and Salford System in Data Mining Software
Universitas Pembangunan Panca Budi
 
Predicting Medical Test Results using Driverless AI
Predicting Medical Test Results using Driverless AIPredicting Medical Test Results using Driverless AI
Predicting Medical Test Results using Driverless AI
Sri Ambati
 
Data Analytics Using R - Report
Data Analytics Using R - ReportData Analytics Using R - Report
Data Analytics Using R - Report
Akanksha Gohil
 
Philly ETE 2016: Securing Software by Construction
Philly ETE 2016: Securing Software by ConstructionPhilly ETE 2016: Securing Software by Construction
Philly ETE 2016: Securing Software by Construction
jxyz
 
Ad

Make Sense Out of Data with Feature Engineering

  • 1. Make Sense Out of Data with Feature Engineering. Xavier Conort, Chief Data Scientist at DataRobot. @Melbourne Data Science Initiative 2016
  • 3. Automation is an integral part of human civilization
  • 4. Key elements for a successful car journey
      ● Car
      ● Destination
      ● Crude oil
      ● Refined oil: crude oil processed into more useful products such as gasoline
  • 5. Key elements for a successful data science journey
      ● Car = Modelling engine: Machine Learning solutions increasingly replace traditional statistical approaches; they can automate the modelling process and produce world-class predictive accuracy without much effort
      ● Destination = Outcome: a well-defined outcome to predict and a well-defined process for using it to optimize business problems
      ● Crude oil = Raw Data: increased volume, and the capacity to handle terabytes of data
      ● Refined oil = Feature Engineering: the talent to extract from raw data the information that models can use
      ● Also on the slide: open source programming, social network of coders, automated solutions
  • 6. Refined oil for Machine Learning: Flat File Data Format
      ● 1 record per prediction event
      ● 1 column for each predictive field / feature
      ● 1 column for the value to be predicted (training data only)

      ID  Feature 1  Feature 2  Feature 3  Feature 4  Feature 5  Target
      1   0.73       Female     5.28       Thursday   37.52      Yes
      2   0.20       Male       4.20       Tuesday    35.04      Yes
      3   1.82       Male       14.71      Friday     7.02       Yes
      4   -0.69      Female     11.82      Sunday     -3.29      No
      5   1.07       Male       16.55      Monday     12.59      Yes
      6   -0.27      Male       10.87      Thursday   -8.19      Yes
      7   2.88       Male       5.24       Wednesday  -21.67     No
      8   1.35       Female     9.40       Tuesday    9.70       Yes
      9   0.73       Female     1.04       Sunday     26.60      Yes
      10  0.02       Female     -9.79      Saturday   -14.47     Yes
      11  3.43       Male       11.59      Thursday   27.48      No
      12  2.56       Female     -13.25     Saturday   12.41      No

      © DataRobot, Inc. All rights reserved. Confidential
  • 7. Feature Engineering that we will cover today
      ● Variables you should not use
      ● Dealing with high dimensional features
      ● Using external data to add valuable information
      ● Dealing with transactional data
  • 8. Example 1:
      ● Hosted by Practice Fusion, a cloud-based electronic health record platform for doctors and patients
      ● Challenge: given a de-identified data set of patient electronic health records, build a model to determine who has a diabetes diagnosis
      ● Data:
          ○ 17 tables containing 4 years' history of medical records!
  • 9. Think of variables you should not use
      Feature Engineering the YearOfBirth Value
      ● We expect that as a patient gets older their risk of diabetes will increase, yet their YearOfBirth value remains static
      ● We need a feature that changes as the patient gets older
      ● The true predictor of diabetes is more likely to be age than year of birth
      ● The data is extracted at the end of 2012
      ● Age = 2012 - YearOfBirth
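The age formula on this slide can be sketched in a few lines of Python. This is only an illustration: the 2012 extract year and the `YearOfBirth` field come from the slide, while the patient records and the helper name `age_at_extract` are made up for the example.

```python
EXTRACT_YEAR = 2012  # the slide states the data was extracted at the end of 2012


def age_at_extract(year_of_birth, extract_year=EXTRACT_YEAR):
    """Replace the static YearOfBirth with an age relative to the extract date."""
    return extract_year - year_of_birth


# Hypothetical patient records; only the YearOfBirth field name is from the slide.
patients = [
    {"PatientGuid": "A", "YearOfBirth": 1950},
    {"PatientGuid": "B", "YearOfBirth": 1987},
]

for patient in patients:
    patient["Age"] = age_at_extract(patient["YearOfBirth"])

print([(p["PatientGuid"], p["Age"]) for p in patients])
```

Passing the extract year as a parameter keeps the feature meaningful if the same recipe is later applied to data extracted at a different date.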
  • 10. Learn to deal with high dimensionality
      ● Add characteristics of the levels of categorical features to see how similar they are
      ● Use location for regional categories to see how close they are
      ● Group hierarchical categories together based on those hierarchies
      ● Text mine the descriptions
      ● Use the overall ordinal frequency ranking as a feature
      ● Top Kagglers' likelihood / credibility features
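Two of the recipes above, the frequency ranking and the likelihood / credibility feature, can be sketched as follows. This is one common way to build them, not necessarily what the speaker used: the shrinkage formula, the `prior_weight` parameter and the toy data are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def frequency_rank(values):
    """Map each category level to its frequency rank (1 = most frequent)."""
    counts = Counter(values)
    # Sort by descending count, then by level name so ties are deterministic.
    ordered = sorted(counts, key=lambda level: (-counts[level], level))
    return {level: rank for rank, level in enumerate(ordered, start=1)}


def smoothed_likelihood(values, targets, prior_weight=10.0):
    """Target rate per level, shrunk towards the global rate.

    Rare levels get an estimate close to the global rate; frequent levels
    keep an estimate close to their own observed rate.
    """
    global_rate = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = Counter()
    for value, target in zip(values, targets):
        sums[value] += target
        counts[value] += 1
    return {
        level: (sums[level] + prior_weight * global_rate) / (counts[level] + prior_weight)
        for level in counts
    }


# Toy data, made up for the example.
states = ["CA", "CA", "CA", "NY", "NY", "TX"]
has_diabetes = [1, 0, 1, 0, 0, 1]

ranks = frequency_rank(states)
likelihoods = smoothed_likelihood(states, has_diabetes, prior_weight=2.0)
print(ranks)
print(likelihoods)
```

In practice a likelihood feature like this is usually computed out-of-fold, so that a row's own target value does not leak into its feature.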
  • 11. Case study: engineer state variable
      ● So that the machine learning algorithms can know which states are near (and possibly similar to) each other
      ● Centroid for each of the 51 states http://dev.maxmind.com/geoip/legacy/codes/state_latlon/
  • 12. Case study: engineer diagnosis
      ● Use the ICD9 code groupings https://en.wikipedia.org/wiki/List_of_ICD-9_codes
          ○ So that the machine learning algorithms can know which diagnoses are similar to each other
          ○ Count the observations in each group
      ● Use the ICD9 descriptions
          ○ Do text mining on the descriptions to find words or phrases within the descriptions
  • 13. Case Study: engineer drugs
      ● Use drug databases http://www.fda.gov/drugs/informationondrugs/ucm142438.htm
      ● To enable the machine learning algorithm to know which drugs are similar:
          ○ Replace proprietary brand names with generic medication names
          ○ Text mine the list of pharmaceutical classes
  • 14. But What About Relational Databases?
      Challenge: many records per patient
          ○ 9,948 patients
          ○ 196,290 transcripts
          ○ 142,741 diagnostics
          ○ 66,487 medications
          ○ 3,030 lab results
  • 15. Deal with one to many relationships
      ● Create predictive fields using summary statistics
          ○ e.g. averages of last 24 hours / week / month / year
          ○ e.g. variance or standard deviation
          ○ e.g. entropy
          ○ e.g. maximum or minimum values
          ○ e.g. counts
          ○ e.g. most frequent value
          ○ e.g. sequences of events
      ● The slide repeats the flat file table from slide 6 as the target format
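As a rough illustration of the first recipe (averages over the last week / month / year), the sketch below collapses one patient's time-stamped values into a handful of window means. The event values, the reference date `as_of` and the window lengths are made up for the example; the slides do not specify them.

```python
from datetime import datetime, timedelta
from statistics import mean


def window_mean(events, as_of, window):
    """Mean of the values whose timestamp falls in (as_of - window, as_of]."""
    values = [value for ts, value in events if as_of - window < ts <= as_of]
    return mean(values) if values else None


# Hypothetical (timestamp, value) pairs for a single patient.
events = [
    (datetime(2012, 12, 30), 6.0),
    (datetime(2012, 12, 10), 8.0),
    (datetime(2012, 6, 1), 10.0),
    (datetime(2010, 1, 1), 20.0),
]
as_of = datetime(2012, 12, 31)

features = {
    "mean_last_week": window_mean(events, as_of, timedelta(days=7)),
    "mean_last_month": window_mean(events, as_of, timedelta(days=30)),
    "mean_last_year": window_mean(events, as_of, timedelta(days=365)),
}
print(features)
```

Returning `None` for an empty window keeps "no events in this period" distinguishable from a genuine average of zero.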
  • 16. Case Study: Relationship Between a Patient and Diagnosis
      ● One patient to many diagnoses
      ● Uniquely joined via PatientGuid
  • 17. Case Study: Build sequences string
      Feature Engineering the ICD9 Codes
      ● One patient can have from 1 to 75 diagnoses
      ● We need to compress this data to a single record per patient
      ● One way is to concatenate the sequence of diagnoses into a string and fit text mining models that use ngrams on that string
      ● It often helps to remove consecutive duplicates
      ● Sometimes it is useful to know the first and last events, and the times of those events

      Patient ID  ICD9Code
      12345       391
      12345       401
      12345       401
      12345       454.1
      12346       410.3
      12346       463
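The sequence recipe can be sketched with the six rows shown on this slide. The sketch assumes the rows are already in chronological order for each patient, which the slide does not state; the function names are illustrative.

```python
from collections import defaultdict
from itertools import groupby


def collapse_repeats(codes):
    """Drop consecutive duplicates, e.g. 401, 401 -> 401."""
    return [code for code, _ in groupby(codes)]


def ngrams(tokens, n):
    """All contiguous runs of n tokens, each joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# (Patient ID, ICD9Code) rows from the slide, assumed to be in time order.
rows = [
    ("12345", "391"), ("12345", "401"), ("12345", "401"), ("12345", "454.1"),
    ("12346", "410.3"), ("12346", "463"),
]

codes_by_patient = defaultdict(list)
for patient_id, code in rows:
    codes_by_patient[patient_id].append(code)

sequences = {pid: collapse_repeats(codes) for pid, codes in codes_by_patient.items()}
sequence_strings = {pid: " ".join(seq) for pid, seq in sequences.items()}
bigrams = {pid: ngrams(seq, 2) for pid, seq in sequences.items()}

print(sequence_strings)
print(bigrams)
```

Each patient now has a single string that a text-mining model can consume, and the first and last elements of each sequence give the first and last events mentioned on the slide.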
  • 18. Case Study: Count and entropy
      Feature Engineering the ICD9 Codes
      ● One patient can have from 1 to 75 diagnoses
      ● We need to compress this data to a single record per patient
      ● Sometimes it is useful to know
          ○ The number of events
          ○ The most common event type
          ○ The level of variety of event types, e.g. the entropy (as my colleague Owen Zhang did for the KDD Cup 2015)

      Patient ID  ICD9Code
      12345       391
      12345       401
      12345       401
      12345       454.1
      12346       410.3
      12346       463

      Entropy term for a code with proportion 0.25: -propn * ln(propn) = -0.25 * ln(0.25) = 0.347
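The count, most common event and entropy features can be sketched on the same six rows. The entropy here is the sum of -propn * ln(propn) over the distinct codes of a patient, so the 0.347 on the slide is one such term (a code with proportion 0.25). The function and key names are illustrative.

```python
from collections import Counter, defaultdict
from math import log


def event_summary(codes):
    """Count, most common code and entropy (natural log) of one patient's codes."""
    counts = Counter(codes)
    total = len(codes)
    entropy = -sum((c / total) * log(c / total) for c in counts.values())
    return {
        "n_events": total,
        "most_common": counts.most_common(1)[0][0],
        "entropy": entropy,
    }


# (Patient ID, ICD9Code) rows from the slide.
rows = [
    ("12345", "391"), ("12345", "401"), ("12345", "401"), ("12345", "454.1"),
    ("12346", "410.3"), ("12346", "463"),
]

codes_by_patient = defaultdict(list)
for patient_id, code in rows:
    codes_by_patient[patient_id].append(code)

summaries = {pid: event_summary(codes) for pid, codes in codes_by_patient.items()}
print(summaries)
```

For patient 12345 the proportions are 0.25, 0.5 and 0.25, and each of the three terms happens to equal about 0.347, giving a total entropy of about 1.04.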
  • 19. Case Study: Timing stats
      Feature Engineering the Timing of Diagnoses
      ● One patient can have from 1 to 75 diagnoses
      ● We need to compress this data to a single record per patient
      ● Sometimes it is useful to know information about the timing of events
          ○ The range of event times, e.g. mean, median, maximum, minimum, quantiles
          ○ The amount of time between events, e.g. mean, median, maximum, minimum, quantiles
          ○ The regularity of the timing of events, e.g. variance
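A minimal sketch of timing features for one patient follows. The diagnosis dates and the reference date are invented for the example, and the particular set of statistics is just one reasonable selection from the list on the slide.

```python
from datetime import date
from statistics import mean, median, pvariance


def timing_features(dates, as_of):
    """Summarise when events happened and how regularly they were spaced."""
    ordered = sorted(dates)
    # Days between consecutive events; empty when there is a single event.
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return {
        "days_since_first": (as_of - ordered[0]).days,
        "days_since_last": (as_of - ordered[-1]).days,
        "mean_gap": mean(gaps) if gaps else None,
        "median_gap": median(gaps) if gaps else None,
        "max_gap": max(gaps) if gaps else None,
        "min_gap": min(gaps) if gaps else None,
        "gap_variance": pvariance(gaps) if gaps else None,
    }


# Hypothetical diagnosis dates for a single patient.
diagnosis_dates = [date(2012, 1, 11), date(2012, 1, 1), date(2012, 1, 31)]
stats = timing_features(diagnosis_dates, as_of=date(2012, 12, 31))
print(stats)
```

A low gap variance indicates regularly spaced visits, while a large maximum gap can flag a long period without contact.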
  • 20. Example 2:
      ● Hosted by XuetangX, a Chinese MOOC learning platform initiated by Tsinghua University
      ● Challenge: predict whether a user will drop a course within the next 10 days based on his or her prior activities
      ● Data:
          ○ enrollment_train (120K rows) / enrollment_test (80K rows). Columns: enrollment_id, username, course_id
          ○ log_train / log_test. Columns: enrollment_id, time, source, event, object
          ○ object. Columns: course_id, module_id, category, children, start
          ○ truth_train. Columns: enrollment_id, dropped_out
  • 21. We applied the same recipes to the log data (5890 objects) and generated a flat file with 100s of features!!!
  • 22. Techniques we used in … to describe course, enrollment and students from log data:
      ● counts
      ● time statistics (min, mean, max, diff)
      ● entropy
      ● sequences treated as text, on which we ran SVD and logistic regression on 3grams
      ● first 20 components of SVD on user x object
      More can be found in http://www.slideshare.net/DataRobot/featurizing-log-data-before-xgboost
  • 23. Key takeaways
      ● Machine Learning (ML) can automatically generate world-class predictive accuracy
      ● But feature engineering is still an art that requires a lot of creativity, business insight, curiosity and effort
      ● Be careful! An infinite number of features can be generated…
      ● Start with winning recipes (steal them from others and make up your own), then iterate with new recipes, ideas, external data...
      ● Stop when you no longer get much additional accuracy