Data Preparation for Data Science
Casey Stella
@casey_stella
2016
Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
Table of Contents
Preliminaries
Demo
Questions
Introduction
Hi, I’m Casey Stella!
Garbage In ⇒ Garbage Out
“80% of the work in any data project is in cleaning the data.”
- DJ Patil in Data Jujitsu
Data Cleansing ⇒ Data Understanding
There are two ways to understand your data
• Syntactic Understanding
• Semantic Understanding
If you hope to get anything out of your data, you have to have a handle on both.
Syntactic Understanding: True Types
A true type is a label applied to data points x_i such that the x_i are mutually comparable.
• Schema type != true data type
• A specific column can have many different types
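One way to see the gap between schema type and true type is to profile the values of a single string column. The sketch below is only an illustration: the labels ("integer", "float", "date", "string", "missing"), the ISO date format, and the sample column are assumptions, not something taken from the talk.

```python
from collections import Counter
from datetime import datetime


def true_type(value: str) -> str:
    """Guess a coarse true-type label for one raw string value."""
    text = value.strip()
    if not text:
        return "missing"
    try:
        int(text)
        return "integer"
    except ValueError:
        pass
    try:
        float(text)  # note: also accepts spellings such as "nan" and "inf"
        return "float"
    except ValueError:
        pass
    try:
        # Assumed date format; real data usually needs several patterns.
        datetime.strptime(text, "%Y-%m-%d")
        return "date"
    except ValueError:
        pass
    return "string"


def type_profile(column):
    """Count how many values in the column fall under each true type."""
    return Counter(true_type(v) for v in column)


# Hypothetical column that a schema would declare as a single string type.
column = ["42", "3.14", "2016-06-28", "N/A", "", "17"]
profile = type_profile(column)
print(profile)
```

A column whose profile has more than one well-populated label is a candidate for splitting or cleaning before any comparison across its values.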
Syntactic Understanding: Density
Data density is an indication of how data is clumped together.
• For numerical data, distributions and statistical characteristics are informative
• For non-numeric data, counts and distinct counts of a canonical representation are
extremely useful.
Canonical representations are representations which give you an idea at a glance of the
data format
• Replacing digits with the character ‘d’
• Stripping whitespace
• Normalizing punctuation
Data density is an assumption underlying any conclusions drawn from your data.
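The three canonicalization steps listed above can be sketched in a few lines. This is a minimal illustration under assumptions: "stripping whitespace" is taken to mean removing all whitespace, "normalizing punctuation" is taken to mean folding a few Unicode dash and quote variants onto ASCII, and the phone-like sample values are made up.

```python
import re
from collections import Counter

# Assumed punctuation normalization: fold some Unicode dash and quote
# variants onto ASCII. The slides do not say which rules the talk used.
PUNCTUATION_MAP = str.maketrans({
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
})


def canonical(value: str) -> str:
    """Reduce a raw value to a representation that shows its format."""
    text = re.sub(r"\s+", "", value)        # strip whitespace
    text = re.sub(r"\d", "d", text)         # replace digits with 'd'
    return text.translate(PUNCTUATION_MAP)  # normalize punctuation


def format_counts(column):
    """Count values per canonical representation."""
    return Counter(canonical(v) for v in column)


# Hypothetical non-numeric column.
values = ["555-1234", "555 - 9876", "555\u20130000", "(216) 555-1234", "n/a"]
counts = format_counts(values)
print(counts.most_common())   # counts per format
print(len(counts))            # distinct count of formats
```

The counts and the distinct count of these representations are the density summary for a non-numeric column: a handful of dominant formats plus a long tail usually points at the values that need cleaning.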
Syntactic Understanding: Density over Time
ΔDensity/Δt is how data clumps change over time.
This kind of analysis can show
• Problems in the data pipeline
• Whether the assumptions of your analysis are violated
ΔDensity/Δt ⇒
• Automation
• Outlier Alerting
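As a rough sketch of how density over time could drive automated alerting, the example below compares the distribution of values in each batch with the previous batch and flags large shifts. The choice of total variation distance, the 0.25 threshold, and the monthly batches are all assumptions made for illustration; the talk does not prescribe a particular metric.

```python
from collections import Counter


def proportions(values):
    """Turn raw values into a {value: share of batch} mapping."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: n / total for key, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def density_alerts(batches, threshold=0.25):
    """Return (period, distance) for batches that drift past the threshold."""
    alerts = []
    previous = None
    for period, values in batches:
        current = proportions(values)
        if previous is not None:
            distance = total_variation(previous, current)
            if distance > threshold:
                alerts.append((period, distance))
        previous = current
    return alerts


# Made-up batches of canonical formats, one per month.
batches = [
    ("2016-01", ["ddd-dddd"] * 9 + ["n/a"]),
    ("2016-02", ["ddd-dddd"] * 9 + ["n/a"]),
    ("2016-03", ["ddd-dddd"] * 4 + ["n/a"] * 6),
]
alerts = density_alerts(batches)
print(alerts)
```

In this toy data only the third batch is flagged, because the share of "n/a" values jumps. A jump like that could mean an upstream pipeline problem or a violated modeling assumption, and either way it is worth a look.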
Semantic Understanding: “Do what I mean, not what I say”
Semantic understanding is understanding based on how the data is used rather than
how it is stored.
• Finding equivalences based on semantic understanding is often context sensitive.
• May come from humans (e.g. domain experience and ontologies)
• May come from machine learning (e.g. analyzing usage patterns to find synonyms)
Semantic understanding does not imply SkyNet
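To make "analyzing usage patterns to find synonyms" concrete, here is a deliberately small sketch: each term is described by the contexts it appears with, and two terms are treated as candidate equivalents when their context vectors are similar. The observations, the "zip:" and "state:" context labels, and the use of cosine similarity are assumptions for illustration, not the method used in the talk.

```python
import math
from collections import Counter, defaultdict


def context_vectors(observations):
    """Build {term: Counter of contexts the term was seen with}."""
    vectors = defaultdict(Counter)
    for term, context in observations:
        vectors[term][context] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical (term, context) pairs, e.g. a city field and the other
# field values it was recorded alongside.
observations = [
    ("NYC", "zip:10001"), ("NYC", "zip:10002"), ("NYC", "state:NY"),
    ("New York City", "zip:10001"), ("New York City", "zip:10002"),
    ("New York City", "state:NY"),
    ("Cleveland", "zip:44101"), ("Cleveland", "state:OH"),
]
vectors = context_vectors(observations)
similar = cosine(vectors["NYC"], vectors["New York City"])
different = cosine(vectors["NYC"], vectors["Cleveland"])
print(similar, different)
```

Terms used in the same contexts score close to 1 and terms with no shared contexts score 0. Whether a high score really means "same thing" still depends on context, which is where domain experts come back in.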
DEMO
Implications for Team Structure
To be successful,
• Your data science teams have to be integrally involved in the data transformation
and understanding.
• Your data science teams have to be willing to get their hands dirty
• Your data science teams have to be allowed to get their hands dirty
• Your data science teams need software engineering chops.
Questions
Thanks for your attention! Questions?
• Code & scripts for this talk available on my github presentation page: http://github.com/cestella/presentations/
• Find me at http://caseystella.com
• Twitter handle: @casey_stella
• Email address: cstella@hortonworks.com
Backup and Disaster Recovery in Hadoop
DataWorks Summit/Hadoop Summit
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
DataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
DataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
DataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
DataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
DataWorks Summit/Hadoop Summit
 

Data Preparation for Data Science

  • 1. Data Preparation for Data Science Casey Stella @casey_stella 2016 Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
  • 2. Table of Contents
      • Preliminaries
      • Demo
      • Questions
  • 3. Introduction
      Hi, I’m Casey Stella!
  • 4. Garbage In ⇒ Garbage Out
      “80% of the work in any data project is in cleaning the data.” (D.J. Patil, Data Jujitsu)
  • 5. Data Cleansing ⇒ Data Understanding
      There are two ways to understand your data:
      • Syntactic Understanding
      • Semantic Understanding
      If you hope to get anything out of your data, you have to have a handle on both.
  • 6. Syntactic Understanding: True Types
      A true type is a label applied to data points x_i such that the x_i are mutually comparable.
      • Schema type != true data type
      • A specific column can have many different types
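The gap between schema type and true type on slide 6 can be made concrete with a small sketch. It assumes a column whose schema type is string and assigns each value a true-type label by trial parsing. The label names, the list of "missing" markers, and the `%Y-%m-%d` date format are illustrative assumptions, not something the talk specifies.

```python
from collections import Counter
from datetime import datetime

# Hypothetical markers that we treat as "no value present".
MISSING_MARKERS = {"", "null", "none", "n/a"}


def true_type(value):
    """Return an illustrative true-type label for one raw string value."""
    text = value.strip()
    if text.lower() in MISSING_MARKERS:
        return "missing"
    try:
        int(text)
        return "integer"
    except ValueError:
        pass
    try:
        float(text)
        return "float"
    except ValueError:
        pass
    try:
        # Assumed date format; real data usually needs several candidates.
        datetime.strptime(text, "%Y-%m-%d")
        return "date"
    except ValueError:
        return "string"


def true_type_counts(column):
    """Count how many values in a column fall under each true type."""
    return Counter(true_type(v) for v in column)


# One schema type (string), several true types.
column = ["42", "3.14", "2016-06-28", "N/A", "unknown", "7"]
counts = true_type_counts(column)
print(counts)
```

A column that reports more than one dominant true type is a hint that values in it are not mutually comparable without further cleaning.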
  • 11. Syntactic Understanding: Density
      Data density is an indication of how data is clumped together.
      • For numerical data, distributions and statistical characteristics are informative.
      • For non-numeric data, counts and distinct counts of a canonical representation are extremely useful.
      Canonical representations are representations which give you an idea at a glance of the data format.
      • Replacing digits with the character ‘d’
      • Stripping whitespace
      • Normalizing punctuation
      Data density is an assumption underlying any conclusions drawn from your data.
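The three canonicalization steps on slide 11 can be sketched in a few lines. This is a minimal illustration, not the code from the demo: the function name `canonicalize`, the choice to remove all whitespace (rather than only leading and trailing), and the use of `.` as the single stand-in for punctuation are assumptions.

```python
import re
from collections import Counter


def canonicalize(value):
    """Reduce a raw string to a format signature.

    Note: a literal letter 'd' in the input becomes indistinguishable
    from a digit, which is acceptable for a quick look at formats.
    """
    text = re.sub(r"\s+", "", value)    # strip whitespace
    text = re.sub(r"\d", "d", text)     # replace every digit with 'd'
    text = re.sub(r"[^\w]", ".", text)  # normalize punctuation to '.'
    return text


# Hypothetical phone-number column with inconsistent formatting.
phones = ["555-123-4567", "555.123.4567", " 555 123 4567 ", "5551234567", "n/a"]

format_counts = Counter(canonicalize(p) for p in phones)
distinct_formats = len(format_counts)

for fmt, count in format_counts.most_common():
    print(fmt, count)
```

Counting canonical forms instead of raw values shrinks a high-cardinality column down to a handful of format signatures, which is what makes the counts and distinct counts readable at a glance.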
  • 16. Syntactic Understanding: Density over Time
      ∆Density/∆t is how data clumps change over time.
      This kind of analysis can show
      • Problems in the data pipeline
      • Whether the assumptions of your analysis are violated
      ∆Density/∆t ⇒
      • Automation
      • Outlier Alerting
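One way to turn the density-over-time idea into an automated alert is to compare the density of a column in two time windows and flag large shifts. The talk does not name a metric, so the sketch below assumes total variation distance between the two empirical distributions; the function names and the `ALERT_THRESHOLD` value are hypothetical.

```python
from collections import Counter

# Hypothetical threshold; in practice it would be tuned per column.
ALERT_THRESHOLD = 0.2


def density(values):
    """Empirical distribution of the values seen in one time window."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: count / total for key, count in counts.items()}


def density_change(before, after):
    """Total variation distance between two windows (0 = same, 1 = disjoint)."""
    p, q = density(before), density(after)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Two windows of a categorical (or canonicalized) column.
last_week = ["ok"] * 90 + ["missing"] * 10
this_week = ["ok"] * 60 + ["missing"] * 40

change = density_change(last_week, this_week)
alert = change > ALERT_THRESHOLD
print(f"density change: {change:.2f}, alert: {alert}")
```

Here the share of missing values jumps from 10% to 40%, which is the kind of shift that may point to a problem in the data pipeline or to a violated modelling assumption.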
  • 21. Semantic Understanding: “Do what I mean, not what I say”
      Semantic understanding is understanding based on how the data is used rather than how it is stored.
      • Equivalences based on semantic understanding are often context sensitive.
      • May come from humans (e.g. domain experience and ontologies)
      • May come from machine learning (e.g. analyzing usage patterns to find synonyms)
      Semantic understanding does not imply SkyNet.
  • 22. DEMO
  • 31. Implications for Team Structure
      To be successful:
      • Your data science teams have to be integrally involved in the data transformation and understanding.
      • Your data science teams have to be willing to get their hands dirty.
      • Your data science teams have to be allowed to get their hands dirty.
      • Your data science teams need software engineering chops.
  • 32. Questions
      Thanks for your attention! Questions?
      • Code & scripts for this talk are available on my GitHub presentation page [1]
      • Find me at http://caseystella.com
      • Twitter handle: @casey_stella
      • Email address: [email protected]
      [1] http://github.com/cestella/presentations/